Skip to main content
  • Methodology article
  • Open access
  • Published:

A systematic search for SNPs/haplotypes associated with disease phenotypes using a haplotype-based stepwise procedure



Genotyping technologies enable us to genotype multiple Single Nucleotide Polymorphisms (SNPs) within selected genes/regions, providing data for haplotype association analysis. While haplotype-based association analysis is powerful for detecting untyped causal alleles in linkage-disequilibrium (LD) with neighboring SNPs/haplotypes, the inclusion of extraneous SNPs could reduce its power by increasing the number of haplotypes with each additional SNP.


Here, we propose a haplotype-based stepwise procedure (HBSP) to eliminate extraneous SNPs. To evaluate its properties, we applied HBSP to both simulated and real data, generated from a study of genetic associations of the bactericidal/permeability-increasing (BPI) gene with pulmonary function in a cohort of patients following bone marrow transplantation.


Under the null hypothesis, use of the HBSP gave results that retained the desired false positive error rates when multiple comparisons were considered. Under various alternative hypotheses, HBSP had adequate power to detect modest genetic associations in case-control studies with 500, 1,000 or 2,000 subjects. In the current application, HBSP led to the identification of two specific SNPs with a positive validation.


These results demonstrate that HBSP retains the essence of haplotype-based association analysis while improving analytic power by excluding extraneous SNPs. Minimizing the number of SNPs also enables simpler interpretation and more cost-effective applications.


Genotyping technology now enables population researchers to genotype dozens to thousands of SNPs within any selected candidate gene or within any genomic region. Such SNP data are increasingly collected in disease association studies, using a case-control study design [1, 2], with the analytic objective of assessing association between SNP genotypes and a disease phenotype of interest. While traditional analyses have involved correlating phenotypes with individual SNP genotypes [3], a complementary approach involves inferring haplotypes of SNPs or their distributions, then assessment of haplotypic associations with the disease phenotype [47].

Haplotype-based association analysis has several advantages over association analysis with single SNPs. First, haplotypes of multiple SNPs can reduce the number of comparisons to be made during the analysis. Typically, SNPs within a narrow region are in high LD, i.e., adjacent SNP alleles on the same chromosome are highly correlated. Consequently, with K such SNPs, the total number of haplotypes formed by these SNPs is generally much smaller than the number of all possible haplotypes (= 2K-1). For a typical gene, the number of common haplotypes, even with variable numbers of SNPs, is on the order of 10–15 [8], with a few notable exceptions such as the major histocompatibility (MHC) genes [9]. Secondly, a haplotype is naturally interpreted as genetic polymorphisms of SNP alleles on the same chromosome. After filling in the non-SNP nucleotides between SNPs, one has a fully phased DNA sequence. This sequence, if it lies in the coding region of a gene, can be converted into an amino acid sequence, and thus haplotypic variations may result in protein variations, an important biological context to consider. Third, haplotypes themselves tend to be conserved and shaped by evolutionary processes. Recent population genetic studies of the human genome have suggested that recombination processes, together with other population genetic forces, have created long-range haplotype blocks [10, 11]. These block structures are also useful for reducing the number of statistical comparisons, as well as for interpretation of disease associations with common extended haplotypes. Additionally, haplotype-based associations are useful for mapping unknown disease mutations. As opposed to assuming a direct relationship between a phenotype and an individual SNP, one or more disease-causing mutations may be in high LD with adjacent SNPs; hence, extended haplotypes formed by these known SNPs may serve as markers for disease-causing mutations yet to be discovered [12]. Indeed, a haplotype of multiple SNPs may be thought of as an allele at a multi-allelic marker locus, and increasing polymorphism with multiple haplotypes improves the power to detect disease associations.

There are also disadvantages when using haplotype-based association analyses: the main disadvantage is that haplotype-based association analyses may have reduced power in detecting SNP-level associations. If in truth, the disease association is with a single functional SNP, the haplotype-based association can be less efficient due to the fact that including irrelevant SNPs effectively divides the samples into multiple haplotype groups, hence reducing sample sizes and consequently decreasing the power of the study. Moreover, dividing a single SNP association into multiple haplotype associations will also incur the penalty associated with multiple comparisons. This loss of statistical efficiency is exacerbated if the haplotype analysis includes many SNPs, resulting in an excessively large number of haplotypes.

To retain the advantages of haplotype-based analysis and overcome its disadvantages, we propose a HBSP to systematically search for a subset of SNPs whose haplotypes associate with the disease phenotype. Following the principle of stepwise regression methodology, for example, forward or backward selection [13], we have developed forward and backward haplotype-based stepwise procedures. For example, in the backward procedure, one gradually de-selects one SNP at a time, provided that the exclusion of individual SNPs does not alter the observed haplotypic association. In this report, we introduce the methodology of the HBSP, report our results from simulation studies with finite sample sizes, and illustrate its clinical applicability by using the HBSP to select functional SNPs within the BPI gene, which has been independently shown to be significantly associated with pulmonary function in post-transplant patients.


BPI and Pulmonary Function among Transplant Patients

Given the importance of innate immunity in protection from diseases of the airway, we conducted a genetic association study using a candidate gene approach to determine if polymorphisms in genes of the innate immune pathway are associated with the development of hematopoietic stem cell transplant- (HCT-) related airflow obstruction (AFO), the details of which have been published elsewhere [14]. This two-tiered (including discovery and validation phases) case-control genetic association study selected tagging SNPs from 15 genes from the innate immunity pathway. Cases were defined as patients who experienced an annual decrease of the forced expiratory volume in the first second (FEV1) > 5%, with their lowest post-transplant ratio of FEV1 to forced vital capacity < 0.8. This study discovered and validated the association of multiple tagging SNP haplotypes on the BPI gene with the AFO phenotype. In the analyses below, eight tagging SNPs from the BPI gene are used for illustrative purposes.

Notation, Model and Estimation Procedures


Consider a case-control study with n subjects (i = 1,2,..., n), with cases denoted by d i = 1 and controls denoted as d i = 0. Let x i = (xi 1, ...x ic )' denote a vector of c covariates, such as clinical, demographical, and lifestyle variables. Also obtained from the i th subject is a biological sample, which is genotyped for multiple SNPs. Let g i = (gi 1, gi 2, ...,g iq ) denote genotypes of linearly ordered SNPs within a well-defined genomic region, such as a candidate gene region. Let g ij = a ij : a ˙ i j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyyaeMbaiaadaWgaaWcbaGaemyAaKMaemOAaOgabeaaaaa@300D@ denote a pair of alleles at the j th locus in the i th individual, where bi-allelic SNP alleles a ij and a ˙ i j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyyaeMbaiaadaWgaaWcbaGaemyAaKMaemOAaOgabeaaaaa@300D@ take a value of 0 and 1, corresponding to the major and minor allele, respectively. Due to the nature of the genotyping technology, the parental origin (or phase) of individual alleles is unknown. Let Ω i = (Ωi 1, Ωi 2, ..., Ω iq ) denote a vector of phase indicators: Ω ij = 0 implies that the first allele at the j th locus for the i th subject a ij is inherited from the father, with the other allele, a ˙ i j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyyaeMbaiaadaWgaaWcbaGaemyAaKMaemOAaOgabeaaaaa@300D@ , from the mother. In contrast, Ω ij = 1 implies that a ij is inherited from the mother and a ˙ i j MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmyyaeMbaiaadaWgaaWcbaGaemyAaKMaemOAaOgabeaaaaa@300D@ from the father. When phases are known, (g i , Ω i ) defines two haplotypes called a diplotype, denoted as H i : H ˙ i MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafmisaGKbaiaadaWgaaWcbaGaemyAaKgabeaaaaa@2E7E@ . Each haplotype consists of q SNPs and may be written as H i = ai 1ai 2ai 3...aiq-1a iq . For q SNPs, there are r possible haplotypes, denoted as (h1, h2, ..., h r ).

The penetrance of haplotypes and covariates to the disease phenotype is quantified through a logistic regression model. The logistic penetrance function can be formally written as

Pr ( d i = 1 | H i , H ˙ i , x i ) = 1 1 + exp { α β ˜ ' [ K ( H i ) + K ( H ˙ i ) ] γ ˜ ' x i } , MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGagiiuaaLaeiOCaiNaeiikaGIaemizaq2aaSbaaSqaaiabdMgaPbqabaGccqGH9aqpcqaIXaqmcqGG8baFcqWGibasdaWgaaWcbaGaemyAaKgabeaakiabcYcaSiqbdIeaizaacaWaaSbaaSqaaiabdMgaPbqabaGccqGGSaalcqWG4baEdaWgaaWcbaGaemyAaKgabeaakiabcMcaPiabg2da9KqbaoaalaaabaGaeGymaedabaGaeGymaeJaey4kaSIagiyzauMaeiiEaGNaeiiCaaNaei4EaSNaeyOeI0IaeqySdeMaeyOeI0IafqOSdiMba4bacqGGNaWjcqGGBbWwcqWGlbWscqGGOaakcqWGibasdaWgaaqaaiabdMgaPbqabaGaeiykaKIaey4kaSIaem4saSKaeiikaGIafmisaGKbaiaadaWgaaqaaiabdMgaPbqabaGaeiykaKIaeiyxa0LaeyOeI0Iafq4SdCMba4bacqGGNaWjcqWG4baEdaWgaaqaaiabdMgaPbqabaGaeiyFa0haaOGaeiilaWcaaa@683F@

which takes values between 0 and 1, quantifying the probability of being diseased. The K(·) is a vector of (r-1) indicator functions, i.e K(H i ) = (I(H i = h2), I(H i = h3), ..., I(H i = h r ))' where the haplotype h1 is treated as a reference. The unknown regression coefficient vector = (β1, β2,..., βr-1)' quantifies haplotype-specific penetrance to the phenotype and is estimated from β ˜ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMba4baaaa@2D9C@ the data, along with other unknown regression parameters, the coefficient of intercept α and the coefficients of the other covariates γ ˜ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafq4SdCMba4baaaa@2DA2@ . The regression coefficient β j is also referred as the logarithm of odds ratio (OR). The estimation algorithm for parameters and their covariates were developed elsewhere [4].

A Wald Test Statistic

In our proposed HBSP described below, we either add (in forward selection) or remove (in backward selection) one SNP at each time. Suppose that there are q SNP loci with r different haplotypes denoted as (h1, h2, ..., h r ) considered in the above logistic regression model. The estimations of β ˜ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMba4baaaa@2D9C@ and its covariance Σ β ˜ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeu4Odm1aaSbaaSqaaiqbek7aIzaaEaaabeaaaaa@2F4C@ are denoted by β ^ ˜ = ( β ^ 1 , β ^ 2 , β ^ 3 , ... , β ^ r 1 ) MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaKGba4bacqGH9aqpcqGGOaakcuaHYoGygaqcamaaBaaaleaacqaIXaqmaeqaaOGaeiilaWIafqOSdiMbaKaadaWgaaWcbaGaeGOmaidabeaakiabcYcaSiqbek7aIzaajaWaaSbaaSqaaiabiodaZaqabaGccqGGSaalcqGGUaGlcqGGUaGlcqGGUaGlcqGGSaalcuaHYoGygaqcamaaBaaaleaacqWGYbGCcqGHsislcqaIXaqmaeqaaOGaeiykaKcaaa@444B@ and Σ ^ β ˜ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafu4OdmLbaKaadaWgaaWcbaGafqOSdiMba4baaeqaaaaa@2F5C@ .

To examine the contribution of a particular SNP to the overall haplotypic association, we remove one SNP at a time from the haplotypes. For example, if the q th SNP is removed, some of the haplotypes may be merged if their haplotypic differences were due to allelic difference at the q th SNP. To assess the contribution of the q th SNP to the disease association, it is equivalent to test if the coefficients of merged haplotypes are equal. Suppose s unique haplotypes are observed after removing the q th SNP. So (r-s) haplotypes are merged with one of s unique haplotypes, thus number of equalities to be tested is (r-s).

Under the null hypothesis that the q th SNP has no contribution to the haplotype-based association, we thus compute haplotype-based parameters for s haplotypes with (q-1). Further, under the null hypothesis, one could use estimated s haplotype-based parameters to assign haplotype-based parameters for all r haplotypes, as if q SNPs been included. Let β ˜ ˜ = ( β ˜ 1 , β ˜ 2 , ... , β ˜ r 1 ) MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMba4HbaGaacqGH9aqpcqGGOaakcuaHYoGygaacamaaBaaaleaacqaIXaqmaeqaaOGaeiilaWIafqOSdiMbaGaadaWgaaWcbaGaeGOmaidabeaakiabcYcaSiabc6caUiabc6caUiabc6caUiabcYcaSiqbek7aIzaaiaWaaSbaaSqaaiabdkhaYjabgkHiTiabigdaXaqabaGccqGGPaqkaaa@408C@ denote such haplotype-based parameters obtained under the null hypothesis with the q th SNP removed.

To test whether the q th SNP contributes significantly to the haplotype-based association, we construct the following Wald statistic:

χ r s 2 = ( β ^ ˜ β ˜ ˜ ) ' Σ β ˜ ˜ 1 ( β ^ ˜ β ˜ ˜ ) , MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Xdm2aa0baaSqaaiabdkhaYjabgkHiTiabdohaZbqaaiabikdaYaaakiabg2da9iabcIcaOiqbek7aIzaajyaaEaGaeyOeI0IafqOSdiMbaGGba4bacqGGPaqkcqGGNaWjcqqHJoWudaqhaaWcbaGafqOSdiMbaGGba4baaeaacqGHsislcqaIXaqmaaGccqGGOaakcuaHYoGygaqcgaGhaiabgkHiTiqbek7aIzaaiyaaEaGaeiykaKIaeiilaWcaaa@4785@

where Σ β ˜ ˜ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeu4Odm1aaSbaaSqaaiqbek7aIzaaiyaaEaaabeaaaaa@2F5A@ is the estimated covariance matrix of coefficients under the null hypothesis. By the Central Limit theorem, one can show that the above Wald statistic has an asymptotic chi-square distribution with r-s degrees of freedom, under the null hypothesis. With finite sample sizes, our simulation results support the approximations by the stated chi-square distribution (not shown). Based upon this distribution, one can estimate the probability that quantifies the statistical significance in removing the q th SNP.

Two exceptional cases are worth mentioning. The first case is that if the q th SNP is in perfect LD with the remaining SNPs, removing that SNP would not alter the distribution of haplotypes. Consequently, estimated log ORs from the reduced haplotype analysis will be exactly the same as those in the full haplotype analysis, i.e., β ˜ = β ^ MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGafqOSdiMbaGaacqGH9aqpcuaHYoGygaqcaaaa@303D@ . Naturally, the above Wald-statistic equals zero, with zero degrees of freedom. In such a case, the SNP is removed without requiring a test. The second special case is that when only one SNP is in the model, the Wald statistic [2] degenerates to χ 1 2 = ( β ^ / σ ^ ) 2 MathType@MTEF@5@5@+=feaagaart1ev2aqatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeq4Xdm2aa0baaSqaaiabbgdaXaqaaiabikdaYaaakiabg2da9iabcIcaOiqbek7aIzaajaGaei4la8Iafq4WdmNbaKaacqGGPaqkdaahaaWcbeqaaiabikdaYaaaaaa@37E0@ .

A Forward Selection Procedure

Consider a haplotype analysis with Q SNPs from a gene or region. A forward selection procedure can be used to evaluate haplotypic associations with a single SNP, two SNPs, and progressively increasing numbers of SNPs. This procedure will be terminated when the minimum p-value exceeds a pre-set threshold (Figure 1A). Within this procedure, the threshold value c f is chosen to control false positive error rates. Here, we control the overall false positive error rate less than a pre-fixed rate, say 5% (further explored below).

Figure 1
figure 1

Outline of the three computational algorithms for the stepwise selection of SNPs. (A) Forward selection procedure. (B) Backward selection procedure. (C) Hybrid Selection Procedure.

A Backward Selection Procedure

While computationally efficient, the forward selection procedure may miss significant haplotypic associations due to variable haplotypic domains, i.e., the parameter domains are not necessarily hierarchical when a SNP is progressively reduced [6]. The desire to overcome this limitation motivates the backward selection procedure as an alternative to the forward selection procedure. The basic idea is to start with the haplotypes of all Q SNPs and then to eliminate irrelevant SNPs one SNP at a time (Figure 1B). The topic of choosing the threshold value c b for the backward selection is discussed further below.

A Hybrid Selection Procedure

While being generally preferred, the backward selection procedure can be time-consuming and prohibitive when the number of SNPs to be analyzed is large and each haplotype-based association analysis requires a substantial computation. To overcome this challenge, one may consider a hybrid selection procedure, i.e., combining both forward and backward selection procedures, patterning after the usual stepwise regression approach. One possible strategy is to initially use the forward selection procedure to add SNPs into haplotypes, and then to de-select those "selected SNPs" from established haplotypes with the backward selection procedure. The hybrid procedure, involving forward and backward selection threshold values c f and c b , stems directly from those described above (Figure 1C).

Permutation-Based Assessment of False Positive Error Rates

As noted above, the threshold values c f and c b for forward and backward selection procedures, respectively, are closely connected with false positive error rates, and stringent threshold values correspond to low false positive error rates. Choosing threshold values can be a challenge due to multiple comparisons with a series of highly correlated chi-square tests. The correlatedness among these tests can not be easily quantified due to varying levels of LD within a gene or within a specific region. However, a simple Bonferroni correction ignoring the correlation could lead to excessively conservative results.

We propose to use a permutation-based assessment to evaluate the false positive error rate (FPER), based on which the corresponding threshold value is chosen. Without requiring any distributional assumptions, the basic idea is to permute disease phenotypes across all subjects to create samples that could arise from the null hypothesis, in which SNPs have no associations with the disease phenotype. Thus, analysis of the permuted data, utilizing either the forward or backward procedure or a hybrid of both, would yield relevant statistics, in particular, p-values. Following analysis of the permuted data, the number of false positive errors is counted based on the pre-chosen threshold values (c f or c b ). Repeating the permutation, say, 1000 times results in a sample of false-positive error counts. The average value over all permutations is taken as an estimate of the FPER. Thus, the threshold value (c f or c b ) is chosen in such a way that the ultimate FPER is controlled at the pre-fixed rate.

Simulation Studies

Simulation studies were conducted under the null hypothesis and also under alternative hypotheses. Assuming a typical coalescent process, we simulated 15 SNPs in varying degrees of LD, which were then used for the procedure (Figure 2). Under the null hypothesis, the simulated phenotype had no association with any simulated SNPs. For alternative hypotheses, we considered two different scenarios: 1) the phenotype was associated with a single SNP, and 2) the phenotype was associated with a haplotype of two SNPs.

Figure 2
figure 2

LD patterns among 15 simulated SNPs and their haplotype block boundaries, indicated by solid lines: (a) Standard D'/LOD pattern is shown, and (b) the LD pattern with r2 using confidence intervals.

Using a coalescent model, we generated genotype data under a typical evolutionary process, based upon Hudson's coalescent simulation program [15]. For the simulation, we specified a scaled recombination rate of 10, i.e., 4× (number of generations) × (recombination rate) to generate 2,000 haplotypes and with 150 segregation sites. From the 150 segregation sites, we selected 15 SNPs that had a minor allele frequency greater than 5%. Based upon 2,000 simulated haplotypes, we computed the haplotype frequencies of the 15 SNPs to represent the "true" haplotype frequencies in the study population. In this particular population, we observed 19 unique haplotypes with frequencies exceeding 0.1% (Table 1).

Table 1 Distribution of 19 simulated haplotypes

The following procedure was used to simulate the study population of one million people: we randomly drew a pair of haplotypes from the above haplotypic distribution to form individual diplotypes, but stripped away the phase information during this process. Using the penetrance function [1], with parameters specified corresponding to the null hypothesis and various alternative hypotheses, we computed the probability of an individual developing the disease. Based upon computed probabilities, we then simulated a binary disease status by the Bernoulli process. This simulation process was repeated one million times, resulting in the targeted study population. We also simulated two demographic variables: gender and age. For gender, we assumed that men and women were equally represented in the population. We assumed age to be uniformly distributed from 20 to 80 years. Under these assumptions, we randomly assigned gender and age to all individuals in the simulated study population

From the simulated study population, we randomly drew equal numbers of cases and controls to form case-control data sets. We considered different sample sizes in the simulation to test sample size effects. For each configuration of parameters, we repeated the simulations 1,000 times. To ensure the validity of the simulated results irrespective of SNP choices, we randomly selected one or two adjacent SNPs as functional SNPs in each replication. For each simulated data set, we used the HBSP to identify the functional SNP or haplotype and verified the finding. If the finding matched the functional element, it was a true positive finding. The percentages of true findings among the 1,000 replicates were recorded to quantify the true discovery rate (TDR), i.e., the percentage of true SNP associations identified. Since some SNPs were in LD with each other, we treated the discovery of SNP association, if it was at high LD with the true functional SNP (D' ≥ 0.8), as an acceptable discovery. To quantify this imperfect discovery, we introduced a statistic, the imperfect true discovery rate (iTDR). If the discovered SNP was neither the causal SNP nor at high LD with the causal SNP, it was counted as a false positive finding. Ideally, the rate of such false positive errors would be largely comparable to the FPER, which was controlled as described above.


Simulated Data

Null Hypothesis

Under the null hypothesis, log ORs related to the haplotypes in the penetrance function [1] are set to 0. For gender and age, the coefficients were set at 0 and 0.01, respectively, with an intercept of 1. Sample sizes varied from 500 to 2000, with an equal number of cases and controls. Results of the simulation are reported in the first row of Table 2. The estimated FPER did not significantly deviate from 0.05, which was the chosen threshold, across the three sample sizes.

Table 2 False positive error rates are estimated under the null hypothesis.

Alternative Hypothesis with a Single Functional SNP

Under the alternative hypothesis with a single but randomly chosen SNP, we specified related ORs as 1.10, 1.20, 1.30, 1.40, 1.50, 2.00, 3.00 and 5.00 in the penetrance function [1], to correspond to associations ranging from weak to strong. Table 2 lists the estimated TDR, imperfect true discovery rate iTDR, and FPER percentages for each value. As expected, the TDR increased with increasing sample size, 57.6% to 77.8% and then to 83.0% to detect OR = 1.5, as the sample size increased from 500 to 1000 and to 2000, respectively. Similarly, the TDR increased with the increasing OR values, 13.6% to 77.8% and then to 96.3% to detect ORs = 1.1, 1.5 and 5.0, respectively, with the sample size fixed at 1000. As expected, estimated FPER values were largely around 5%, with a few exceptions. Some minor elevations in FPER were probably due to weak LD with the functional SNP.

Alternative Hypothesis with Two Functional SNPs

Under the alternative hypothesis with two associated SNPs, we randomly chose two adjacent SNPs; they form four possible haplotypes (00, 01, 10, 11) with 00 as the reference haplotype. We created five different scenarios assuming different ORs for different haplotypes. Under the first scenario, the OR corresponding to haplotype 10 took values ranging from 1.3 to 5.0, which is similar to the single SNP situation. As expected, the TDR, iTDR and FPER estimates were comparable to those under the single SNP alternative hypotheses. Specifically, a case-control study of 500 subjects would have a 57.5% chance of detecting the true functional SNP with an OR of 2.0. At an increased sample size of 2000, the study would be able to detect an OR of 2.0 with nearly 80% TDR. The greater iTDR was mostly due to more than two SNPs being at high LD with these two functional SNPs. FPER values were generally less than 5%, with a few exceptions, partly because some SNPs had weak LD with the functional SNPs.

Under the second scenario, the OR corresponding to haplotype 11 varied from 1.3 to 5.0. While FPER estimates were around 5%, the TDR of detecting such associations was quite low, ranging from 10% to 36% for detecting OR of 2.0 with sample size varying from 500 to 2000, respectively. The primary reason for the reduced power was that the haplotype frequency for haplotype 11 was much lower than the others. Again, iTDR values were much greater than comparable TDR values, for the same reason stated above. When we increased the OR associated with haplotype 10 to 1.3 (scenario 3) and 1.5 (scenario 4), the TDR appreciably increased.

Under the fifth scenario, three OR values for three haplotypes deviated from 1.0, and both the TDR and the iTDR for detecting such genetic associations by the HBSP became very powerful. Even with a sample size of 500 subjects, the study had a 50~54% TDR and an 82~90% iTDR for detection of the true SNP-haplotype associations with the stated OR values.

BPI Clinical Data

In our earlier analysis of a discovery cohort (N = 393), we identified BPI as an important candidate gene for the development of HCT-related AFO [14]. BPI was tagged by eight SNPs (C2738G, G7258A, G9083C, A10214G, G17016G, C23356T, A33065G, G36045A). Three haplotypes, tagged by these SNPs, were found to be significantly associated with the phenotype. Repeat analysis of these tagging SNPs in an independent validation cohort (N = 209) again revealed that multiple tagging SNP haplotypes were significantly associated with the AFO phenotype [14]. In the interest of reducing the number of SNP markers necessary to identify at-risk patients in clinical practice, we applied the proposed algorithm to find the most informative tagging SNPs. We applied the forward, backward and hybrid procedures to the discovery cohort, while adjusting for clinical covariates that were previously identified as important clinical risk factors for HCT-related AFO (age at transplant, pretransplant one-second forced expiratory volume, extent of graft versus host disease, and duration of follow-up). All three models identified the same two SNPs (C23356T and A33065G) as the most significantly associated with the disease phenotype (Table 3). With TG as the reference haplotype, the CA, CG, and TA haplotypes had ORs ranging from 1.52, 1.75 and 3.36, respectively (p-values 0.027, 0.014 and < 0.001, respectively). In the validation cohort, analysis of these SNPs again confirmed the statistical significance, with the exception of the TA haplotype. All ORs were comparable to those in the discovery cohort. These results confirmed that our approach can be applied to clinical data to identify the most informative SNP markers across a genetic region of significance.

Table 3 Estimated haplotype frequencies of two SNPs (C23356T and A33065G), estimated log odds ratios and their standard errors for all common haplotypes formed by identified SNPs


A complementary approach to the HBSP is the direct application of the stepwise regression approach to assess disease associations with SNP alleles or genotypes at multiple SNP loci, as described by Clayton and his colleagues [3, 16]. Basically, this approach treats individual SNP alleles or genotypes as covariates and then assesses their associations with the disease phenotype via the logistic regression model [1]. To identify functional SNPs, they propose using the usual stepwise regression technique to systematically analyze all SNP genotypes and their cross products. Those cross-product terms, if significant, are surrogates for possible haplotypic associations. While its key advantage includes the simplicity and familiarity to most population researchers, the interpretation of cross-product terms as possible haplotypic associations is not straightforward. Moreover, such an approach does not take full advantage of extended common haplotypes across many SNPs, because one has to numerate all cross products to detect a high-order interaction; e.g., eight SNPs will create 256 (= 28) allelic cross products or 6,561 (= 38) genotypic cross products.

To compare both stepwise approaches, we utilized simulation studies to assess the TDR, iTDR and FPER for the stepwise regression approach. We conducted the simulation studies under both null and alternative hypothesis and used the same 15 simulated SNPs from the simulation study population generated for our proposed approach. The simulation results are reported in Table 4. Under scenario 1, the OR corresponding to haplotype 10 takes values ranging from 1.3 to 5.0, and the FPER and iTDR estimates from the usual approach were comparable to those from the HBSP. However, the TDRs for detecting associations with the usual stepwise regression technique were lower than those obtained with HBSP (Table 2). Under scenario 2, the OR corresponding to haplotype 11 varies from 1.3 to 5.0. While the FPER estimates were around 5%, the TDR for detecting associations was quite low, even though the sample size reached 2000. Thus, compared to HBSP, the usual stepwise regression technique has less power for detecting true genetic associations and compatible power in discovering imperfect true genetic associations.

Table 4 False positive error rate under the null hypothesis, and true discovery rate (false positive error rate) under alternative hypotheses 1 when the Clayton's stepwise approach is used to select a subset of SNPs.

For illustrative purposes, we have also applied stepwise logistic regression models to the BPI discovery cohort. The forward stepwise selection procedures did not detect any SNPs at the significance level of α = 0.05. The backward stepwise elimination procedure detected the joint effect of the two SNPs (C23365T and A33065G), the results from which were consistent to those obtained by HBSP.

Recently, Cheng and colleagues have described another method of mapping functional sites with SNP-haplotypes [17], which shares a similar scientific objective to the HBSP. The key idea underlying their approach is to compute an overall statistic that summarizes all associations within a sliding window of multiple SNPs. Systematically covering genes/regions with sliding and overlapping windows, their method can pinpoint one or more sites that are functionally associated with the disease phenotype. The end result of this approach is to identify functional sites, which is complementary to that of the HBSP. In fact, one can apply the HBSP to produce the overall statistic for each sliding window, as a way of detecting haplotypic associations in an efficient manner.

While the HBSP performed well for the studies reported in this paper, its function can also be extended to other types of analysis. The HBSP can be adapted for other phenotypes, such as continuous, censored or categorical, via their corresponding link functions. As presented above, the key statistic needed to construct the Wald statistic [2] is the covariance matrix of all estimated parameters, which is typically obtainable from the maximum likelihood or estimating equations. In addition, the current method is structured to detect major genetic associations via the assumed penetrance model [1], and is not designed to detect gene-environmental or gene-gene interactions. To achieve those objectives, we can extend the penetrance model by including those interactions, which have been elaborated elsewhere [4].


The HBSP described above is effective in selecting a subset of SNPs whose haplotypes are significantly associated with a disease phenotype by eliminating SNPs with random polymorphisms. The HBSP retains the advantages of haplotype-based analysis while minimizing the deficiencies associated with typical haplotype-based analysis that includes extraneous SNPs. Simulation studies indicated that our permutation scheme effectively controls the false positive error rate while HBSP has adequate power to identify those functional SNPs/haplotypes. The illustrative example with SNP data from the BPI gene in a transplant cohort demonstrated the success of the HBSP in identifying two consecutive SNPs out of eight SNPs from the discovery cohort and in validating their associations with the disease phenotype from clinical data.


  1. Cheng R, Ma JZ, Elston RC, Li MD: Fine mapping functional sites or regions from case-control data using haplotypes of multiple linked SNPs. Ann Hum Genet. 2005, 69 (Pt 1): 102-12. 10.1046/j.1529-8817.2004.00140.x.

    Article  CAS  PubMed  Google Scholar 

  2. Neale BM, Sham PC: The future of association studies: gene-based analysis and replication. Am J Hum Genet. 2004, 75 (3): 353-62. 10.1086/423901.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  3. Cordell HJ, Clayton DG: A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: application to HLA in type 1 diabetes. Am J Hum Genet. 2002, 70 (1): 124-41. 10.1086/338007.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Zhao LP, Li SS, Khalid N: A method for the assessment of disease associations with single- nucleotide polymorphism haplotypes and environmental variables in case- control studies. Am J Hum Genet. 2003, 72 (5): 1231-50. 10.1086/375140.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Zhao H, Pfeiffer R, Gail MH: Haplotype analysis in population genetics and association studies. Pharmacogenomics. 2003, 4 (2): 171-8. 10.1517/phgs.

    Article  PubMed  Google Scholar 

  6. Niu T: Algorithms for inferring haplotypes. Genet Epidemiol. 2004, 27 (4): 334-47. 10.1002/gepi.20024.

    Article  PubMed  Google Scholar 

  7. Clark AG: The role of haplotypes in candidate gene studies. Genet Epidemiol. 2004, 27 (4): 321-33. 10.1002/gepi.20025.

    Article  PubMed  Google Scholar 

  8. Stephens JC, Schneider JA, Tanguay DA, Choi J, Acharya T, Stanley SE, Jiang R, Messer CJ, Chew A, Han JH, Duan J, Carr JL, Lee MS, Koshy B, Kumar AM, Zhang G, Newell WR, Windemuth A, Xu C, Kalbfleisch TS, Shaner SL, Arnold K, Schulz V, Drysdale CM, Nandabalan K, Judson RS, Ruano G, Vovis GF: Haplotype variation and linkage disequilibrium in 313 human genes. Science. 2001, 293 (5529): 489-93. 10.1126/science.1059431.

    Article  CAS  PubMed  Google Scholar 

  9. Cullen M, Noble J, Erlich H, Thorpe K, Beck S, Klitz W, Trowsdale J, Carrington M: Characterization of recombination in the HLA class II region. Am J Hum Genet. 1997, 60 (2): 397-407.

    PubMed Central  CAS  PubMed  Google Scholar 

  10. Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES: High-resolution haplotype structure in the human genome. Nat Genet. 2001, 29 (2): 229-32. 10.1038/ng1001-229.

    Article  CAS  PubMed  Google Scholar 

  11. Wall JD, Pritchard JK: Haplotype blocks and linkage disequilibrium in the human genome. Nat Rev Genet. 2003, 4 (8): 587-97. 10.1038/nrg1123.

    Article  CAS  PubMed  Google Scholar 

  12. Lin S, Chakravarti A, Cutler DJ: Exhaustive allelic transmission disequilibrium tests as a new approach to genome-wide association studies. Nat Genet. 2004, 36 (11): 1181-8. 10.1038/ng1457.

    Article  CAS  PubMed  Google Scholar 

  13. Snedecor GWCWG: Statistical methods. 1989, Iowa: The Iowa State University Press

    Google Scholar 

  14. Chien JW, Zhao LP, Hansen JA, Fan WH, Parimon T, Clark JG: Genetic variation in bactericidal/permeability-increasing protein influences the risk of developing rapid airflow decline after hematopoietic cell transplantation. Blood. 2006, 107 (5): 2200-7. 10.1182/blood-2005-06-2338.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  15. Hudson RR: Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics. 2002, 18 (2): 337-8. 10.1093/bioinformatics/18.2.337.

    Article  CAS  PubMed  Google Scholar 

  16. Clayton D, Chapman J, Cooper J: Use of unphased multilocus genotype data in indirect association studies. Genet Epidemiol. 2004, 27 (4): 415-28. 10.1002/gepi.20032.

    Article  PubMed  Google Scholar 

  17. Cheng C, Kimmel R, Neiman P, Zhao LP: Array rank order regression analysis for the detection of gene copy-number changes in human cancer. Genomics. 2003, 82 (2): 122-9. 10.1016/S0888-7543(03)00122-8.

    Article  CAS  PubMed  Google Scholar 

Download references


This work was supported in part by grants from the NCI (5R01 CA106320-05, 5 P01 CA53996-24, 1R01 CA112512-01), NHLBI (K23HL69860), an American Lung Association of Washington Research Grant, and the National Marrow Donor Program (Amy Strelzer Manasevit Research Award).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Lue Ping Zhao.

Additional information

Authors' contributions

YY carried out numerical calculations and simulations, SL implemented the methodology and simulation strategy, JC contributed BPI data and helped to interpret results, JA participated in the manuscript preparation, LZ conceived the idea, led the development and simulation, and participated in the manuscript preparation. All authors read and approved the final manuscript.

Yin Yang, Shuying Sue Li contributed equally to this work.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Yang, Y., Li, S.S., Chien, J.W. et al. A systematic search for SNPs/haplotypes associated with disease phenotypes using a haplotype-based stepwise procedure. BMC Genet 9, 90 (2008).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: