A genome-wide scan using tree-based association analysis for candidate loci related to fasting plasma glucose levels



In the analysis of complex traits such as fasting plasma glucose levels, researchers often adjust the trait for important covariates before assessing genetic susceptibility, and may encounter confounding between the covariates and the susceptibility genes. Tree-based methods have previously been employed to accommodate the heterogeneity in complex traits. In this study, we performed a genome-wide screen on fasting glucose levels in the offspring generation of the Framingham Heart Study, using data provided by the Genetic Analysis Workshop 13. We defined one quantitative trait and converted it to a dichotomous trait based on a predetermined cut-off value, and performed association analyses using regression and classification trees for the two traits, respectively. A marker was interpreted as positive if at least one of its alleles exhibited association in both analyses. Our purpose was to identify candidate genes related to fasting glucose levels in the presence of other covariates. The covariates entered in the analysis included sex, body mass index, and lipids (total plasma cholesterol, high density lipoprotein cholesterol, and triglycerides) of the subjects, and those of their parents.


Four of the seven positive regions identified in our analyses, on chromosomes 1, 2, 6, 11, 16, 18, and 19, harbored or were very close to previously reported diabetes-related genes or potential candidate genes.


This screening method, which employs tree-based association analysis, showed promise for identifying candidate loci in the presence of covariates in genome scans for complex traits.


Problem 1 of Genetic Analysis Workshop 13 (GAW13) provided the data from the Framingham Heart Study. We focused on the offspring cohort because of the rate of missing data in the parental cohort.

Because the history of medical intervention, including lifestyle adjustment and the use of anti-diabetic medications, was not available, we chose the highest fasting plasma glucose level over the course of follow-up as the target quantitative trait to indicate the potential risk of abnormal glucose disposal. According to the American Diabetes Association, impaired fasting glucose (IFG; fasting plasma glucose between 110 and 125 mg/dl) is a risk factor for type 2 diabetes mellitus (T2DM) [1]. We therefore used the lower limit of IFG (≥110 mg/dl) as the cut-off to transform this quantitative trait into a dichotomous one. In this way, subjects with one or more occurrences of elevated fasting plasma glucose were placed in the higher-glucose group. We then performed association analyses using regression and classification trees for the two traits, respectively. A marker was considered positive if at least one of its alleles showed association in both analyses.
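The trait definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject identifiers, the readings, and the handling of missing examinations (coded as `None`) are hypothetical and are not taken from the GAW13 data.

```python
# Hypothetical sketch of the two trait definitions; all data are made up.
IFG_LOWER_LIMIT = 110  # mg/dl, the cut-off stated in the text (>= 110)

def highest_glucose(readings):
    """Quantitative trait: highest observed fasting glucose across exams."""
    observed = [r for r in readings if r is not None]  # drop missing exams
    return max(observed) if observed else None

def dichotomize(max_glucose, cutoff=IFG_LOWER_LIMIT):
    """Dichotomous trait: 1 if the highest level reaches the cut-off, else 0."""
    if max_glucose is None:
        return None  # no observed exam, trait undefined
    return 1 if max_glucose >= cutoff else 0

# Illustrative fasting glucose readings (mg/dl) per subject.
subjects = {
    "s1": [95, 102, None, 99],
    "s2": [101, 110, 108],
    "s3": [None, None],
}

traits = {}
for subject_id, readings in subjects.items():
    quantitative = highest_glucose(readings)
    traits[subject_id] = (quantitative, dichotomize(quantitative))
```

A single reading at or above the cut-off is enough to place a subject in class 1, which matches the "one or more occurrences" definition in the text.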

Our purpose was to identify candidate genes related to the fasting glucose levels in the presence of covariates. We found a few interesting markers that are closely linked with some potential candidate genes biologically relevant to glucose metabolism.


Data processing

For the phenotype measurements, the corresponding covariates were created using their cross-sectional means. The covariates entered in the analysis included sex, body mass index, and lipids (total plasma cholesterol, high density lipoprotein cholesterol, and triglycerides) for each subject. To control for potential familial correlations, the cross-sectional means of the maternal and paternal phenotype measurements were also included as covariates.

For the genotypic data, an allele was chosen to enter the analyses if its frequency was at least 10%. Alleles from the same marker with frequencies below 10% were pooled into a single "incognito" allele. The allelic covariates were created using the technique proposed by Zhang and Bonney [2].
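A minimal sketch of the allele filtering step is given below. It assumes a presence/absence indicator per retained allele; this encoding is an assumption made for illustration and may differ from the technique of Zhang and Bonney [2]. The function names, the `"pooled"` label, and the genotypes are hypothetical.

```python
from collections import Counter

MIN_FREQ = 0.10    # frequency threshold stated in the text
POOLED = "pooled"  # hypothetical label for the pooled rare alleles

def allele_frequencies(genotypes):
    """Relative frequency of each allele over all genotyped chromosomes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def allelic_covariates(genotypes, min_freq=MIN_FREQ):
    """Return (columns, rows) of 0/1 indicators, one column per kept allele.

    Assumption: a covariate is 1 if the subject carries the allele at least
    once. Alleles rarer than min_freq share a single pooled column.
    """
    freqs = allele_frequencies(genotypes)
    common = sorted(a for a, f in freqs.items() if f >= min_freq)
    has_rare = len(common) < len(freqs)
    columns = common + ([POOLED] if has_rare else [])
    rows = []
    for genotype in genotypes:
        recoded = [a if a in common else POOLED for a in genotype]
        rows.append({col: int(col in recoded) for col in columns})
    return columns, rows

# Made-up genotypes at one marker (allele sizes are illustrative only).
genotypes = (
    [(266, 270)] * 4 + [(266, 266)] * 3
    + [(270, 274), (266, 278), (270, 270)]
)
columns, rows = allelic_covariates(genotypes)
```

In this toy marker, alleles 274 and 278 each have a frequency of 5% and therefore share the pooled column.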

Association analysis using classification trees

The classification tree (CT) and regression tree (RT) methods are both built on recursive partitioning; they can be used to partition a study population into homogeneous, disjoint subgroups. The optimal tree is created by a growing procedure followed by a pruning procedure. The maximal tree is built by splitting each node into two child nodes until the terminal nodes are pure. In splitting, the best pair of child nodes is the one that minimizes the entropy impurity function. Pruning proceeds over each binary class j in the subtree τ until the unconditional misclassification rate is attained, where c(j|i) is the cost that a class j is classified as a class i and IP is the entropy impurity function. In general, the choice of cost depends on the severity of the misclassification. In this study, equal costs were chosen for both misclassifications, i.e., c(1|0) = c(0|1), because this choice frequently gives the most satisfactory analyses [3]. The optimal tree in RT is obtained in a similar way to that in CT but with a different impurity function, i.e., the within-node variance in the tree τ. More details of CT, RT, and the corresponding splitting criteria are described elsewhere [3–5].
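The two impurity functions named above, and the way a single binary split is chosen, can be sketched as follows. This is a simplified illustration using the textbook definitions of entropy impurity and within-node variance; it is not the QUEST or RT implementation, it tries only observed covariate values as thresholds, and it omits pruning. The function names and the data are hypothetical.

```python
import math

def entropy_impurity(labels):
    """Entropy impurity of a node: -sum_j p_j * log(p_j) over classes j."""
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        impurity -= p * math.log(p)
    return impurity

def variance_impurity(values):
    """Within-node variance, the impurity used for a regression tree."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def best_split(x, y, impurity):
    """Find the threshold t for the rule 'x < t' that minimizes the
    size-weighted impurity of the two child nodes.

    Returns (threshold, weighted_child_impurity), or None if x is constant.
    """
    n = len(x)
    best = None
    for t in sorted(set(x))[1:]:  # skip the smallest value: empty left child
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        score = (len(left) * impurity(left) + len(right) * impurity(right)) / n
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Made-up data: one covariate, a dichotomous trait, and a quantitative trait.
bmi = [22, 24, 25, 27, 29, 31]
high_glucose = [0, 0, 0, 1, 1, 1]
glucose = [90, 92, 94, 110, 112, 114]

ct_split = best_split(bmi, high_glucose, entropy_impurity)
rt_split = best_split(bmi, glucose, variance_impurity)
```

Growing a full tree would apply `best_split` recursively over all covariates in each child node; the same search serves both CT and RT, and only the impurity function changes.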

Tree-based association analysis was implemented by using genotype measurements such as allelic covariates and related phenotype measurements to construct binary trees. An allele shows association with the trait if its corresponding covariate is included in the optimal tree.

To illustrate the tree construction, a portion of an optimal tree created by CT is presented in Figure 1. First, a total of 1667 subjects (the offspring generation) were divided into two groups according to whether their averaged BMI was less than 26.35 or not (node 1 to nodes 2 and 8). Those with averaged BMI higher than 26.35 were further subdivided according to their hypertension (HBP) status (node 8 to nodes 9 and 15). The 314 subjects in node 9 were further divided into node 10 (or 14) if their averaged maternal triglyceride level was lower (or higher) than 135.5 mg/dl. Finally, if allele 266 of D16S2620 was absent from the genotype, the subject was likely to have fasting glucose levels lower than 110 mg/dl. In summary, allele 266 of D16S2620 was associated with fasting glucose levels for those with higher BMI (>26.35), no HBP, and lower maternal triglyceride levels (<135.5 mg/dl).

Figure 1

A portion of an optimal tree from the classification tree analysis. Decision node criteria: 1. BMI: cross-sectional average BMI; 2. HBP: ever hypertensive in any section; 3. MTG: average maternal triglycerides; 4. Allele 266 of D16S2620. Definition of classes: Class 0, fasting glucose levels <110 mg/dl in all sections; Class 1, at least one observed fasting glucose level ≥110 mg/dl.

Genome-wide screen

In this study, we conducted a genome-wide screen to identify candidate genes in the presence of a set of specified covariates. We performed RT- and CT-based association analyses on the quantitative and dichotomous traits, respectively. A marker was interpreted as positive if at least one of its alleles showed association in both analyses. The allelic covariates from the same chromosome were entered in the analyses simultaneously, so the genome-wide screen consisted of 22 such analyses, one for each autosome. The computer programs QUEST [6] and RT [7] were used to construct the binary trees for the CT and RT analyses, respectively.
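The rule for calling a marker positive can be sketched as a set intersection. The sketch assumes the stricter reading of the rule, namely that the same allele must appear in both optimal trees; if any allele of the marker in each tree were sufficient, the intersection would be taken over marker names instead. The marker names, allele sizes, and function name are hypothetical.

```python
def positive_markers(ct_alleles, rt_alleles):
    """Markers with at least one allele selected in both optimal trees.

    Each argument is a collection of (marker, allele) pairs whose allelic
    covariate was included in the optimal CT or RT tree for a chromosome.
    """
    shared = set(ct_alleles) & set(rt_alleles)  # same allele in both trees
    return sorted({marker for marker, _allele in shared})

# Made-up selections for one chromosome.
ct_selected = {("M1", 266), ("M2", 101), ("M3", 150)}
rt_selected = {("M1", 266), ("M2", 105), ("M4", 150)}

positives = positive_markers(ct_selected, rt_selected)
```

Here only `M1` is positive: `M2` is selected in both trees but through different alleles, and `M3` and `M4` each appear in only one tree. Running the function once per autosome would reproduce the 22-step screen described above.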

Web-searching for candidate genes

The map positions were defined using the Ensembl Genome Server at the Sanger Institute. For the candidate gene search, we used Online Mendelian Inheritance in Man at the National Center for Biotechnology Information or euGene.


Table 1 shows that the seven candidate regions, comprising nine markers that were positive in both analyses, were on chromosomes 1p, 2p, 6q, 11p, 16q, 18p, and 19q. Four of these seven regions, covering the four markers D1S1665, D6S474, D11S1981, and D19S254, were closely linked to genes previously reported to be relevant to glucose metabolism or diabetes mellitus (details listed in Table 1).

Table 1 Positive markers found in the analyses using classification and regression trees

Discussions and Conclusions

In this study, the intent of our screening method was to identify candidate markers rather than to pinpoint susceptibility alleles, although it can also be applied to detect allelic or non-allelic heterogeneity. The cut-off value used in CT in this analysis was chosen for a biological reason. However, the analysis was sensitive to the choice of cut-off because the subjects were largely clustered around the cut-off point (110 mg/dl). Only three regions, on 1p, 16q, and 18p, were consistently positive at neighboring cut-offs from 100 to 120 mg/dl.

Although covariates such as BMI and HBP, which are associated with fasting glucose level, were included in our analyses, the cut-offs of these covariates were not the same across our final 22 optimal trees. Further studies are needed to examine the impact of different cut-offs and the associated alleles.

From a different point of view, our method used the RT analysis on the quantitative trait to validate the results from CT, so that the positive markers showed association in both analyses. Notably, four of the seven candidate regions harbored previously reported genes related to glucose metabolism or diabetes mellitus. In conclusion, our screening method shows promise for identifying candidate loci in genome scans for complex traits.


  1. The Expert Committee on the Diagnosis and Classification of Diabetes Mellitus: Report of the Expert Committee on the Diagnosis and Classification of Diabetes Mellitus. Diabetes Care. 1997, 20: 1183-1197.

  2. Zhang Z, Bonney GE: Use of classification trees for association studies. Genet Epidemiol. 2000, 19: 323-332. 10.1002/1098-2272(200012)19:4<323::AID-GEPI4>3.0.CO;2-5.

  3. Breiman L, Friedman JH, Olshen RA, Stone CJ: Classification and Regression Trees. New York, Chapman and Hall. 1989.

  4. Chang CJ, Fann CSJ: Using data mining to address heterogeneity in the Southampton data. Genet Epidemiol. 2001, 21: S180-S185.

  5. Fann CSJ, Shugart YY, Lachman H, Collins A, Chang CJ: The impact of redefining affection status for alcoholism on affected-sib-pair analysis. Genet Epidemiol. 1999, 17: S151-S156. 10.1002/(SICI)1098-2272(1999)17:2<151::AID-GEPI5>3.3.CO;2-B.

  6. Department of Mathematics, National Chung Cheng University, Taiwan: Quest User Manual. Version 1.8.8. Taiwan. 2000.

  7. Torgo L: RT 4.1 User's Manual. University of Porto, Porto, Portugal. 2001.


This study was partially supported by the National Science Council in Taiwan (NSC 91-3112-B-001-006-M51) and National Taiwan University Hospital (NTUH-91A15).

Author information

Corresponding author

Correspondence to Cathy SJ Fann.

About this article

Cite this article

Chen, CH., Chang, C.J., Yang, WS. et al. A genome-wide scan using tree-based association analysis for candidate loci related to fasting plasma glucose levels. BMC Genet 4 (Suppl 1), S65 (2003).
