The completion of the HapMap project has stimulated further development of haplotype-based methodologies for disease associations. A key aspect of such development is the statistical inference of individual diplotypes from unphased genotypes. Several methodologies for inferring haplotypes have been developed, but they have not been evaluated extensively to determine which method not only performs well, but also can be easily incorporated in downstream haplotype-based association analyses. In this paper, we attempt to do so. Our evaluation was carried out by comparing the two leading Bayesian methods, implemented in PHASE and HAPLOTYPER, and the two leading empirical methods, implemented in PL-EM and HPlus. We used these methods to analyze real data, namely the dense genotypes on X-chromosome of 30 European and 30 African trios provided by the International HapMap Project, and simulated genotype data. Our conclusions are based on these analyses.
All programs performed very well on X-chromosome data, with an average similarity index of 0.99 and an average prediction rate of 0.99 for both European and African trios. On simulated data with approximation of coalescence, PHASE implementing the Bayesian method based on the coalescence approximation outperformed other programs on small sample sizes. When the sample size increased, other programs performed as well as PHASE. PL-EM and HPlus implementing empirical methods required much less running time than the programs implementing the Bayesian methods. They required only one hundredth or thousandth of the running time required by PHASE, particularly when analyzing large sample sizes and large umber of SNPs.
For large sample sizes (hundreds or more), which most association studies require, the two empirical methods might be used since they infer the haplotypes as accurately as any Bayesian methods and can be incorporated easily into downstream haplotype-based analyses such as haplotype-association analyses.
The completion of the HapMap project has stimulated further interest in haplotype inference from un-phased single nucleotide polymorphism (SNP) genotypes . Recent evidence indicates the human genome has hot spots and cold spots for recombination, and it is divided into multiple haplotype blocks, each of which has only a limited number of haplotypes [2–5]. Such haplotype block structure in the human genome suggests that haplotype-based methods may play an important role in genetic association studies [6, 7]. Haplotypes can be generated experimentally by dissecting out single chromosomes and inserting the entire chromosome into a yeast artificial chromosome  or by using rodent-human hybrid techniques to physically separate two chromosomes . However, both technologies are experimentally challenging and cost prohibitive for use in population research at this time. The most commonly used technologies generate un-phased SNP genotypes. One way to resolve individual haplotypes is via family data, which is expensive to collect . Another option is to resolve individual haplotypes statistically. Clark's heuristic algorithm is probably among the first statistical methods for inferring haplotypes from genotypes of unrelated individuals .
Many maximum likelihood methods have been developed, and almost all share the same scientific objectives and likelihood framework. The fundamental difference among these methods is a prior assumption for the distribution of haplotypes: methods with prior assumption for distribution of haplotypes are referred to as Bayesian methods, and the methods without any prior assumption are called empirical methods.
Estimation-maximization (EM) algorithm, a maximum likelihood method, was first introduced to infer haplotypes from unrelated individuals [11–13], but those earlier works were computationally demanding when processing large number of SNPs. More recently Qin et al.  discussed a new strategy for estimating haplotype frequencies using the EM algorithm, which largely improved performance, especially when analyzing data with large numbers of SNPs. Li et al.  have applied the estimating equation (EE) technique and further improved the statistical and computational efficiency in the estimation of haplotype frequencies and their standard errors. Both EM and EE methods are empirical methods.
Stephens et al. were probably among the first groups to propose a model-based Bayesian method  under the assumption of coalescence of haplotypes. Later it was modified to improve statistical and computational efficiency [17, 18]. Niu et al.  took a Bayesian approach but chose a Dirichlet distribution for haplotypes as their prior, and published a computational algorithm to handle a large number of SNPs, which was referred to as partition-ligation (PL).
Which method performs the best? Several papers have attempted to address this question and their conclusions are not without controversy [14, 19–22]. These papers compared the performances of haplotyping methods based on a limited number of available haplotype data sets and some simulated data. Recently, Marroni et al.  used genotype data provided by Illumina and Affymetrix for Genetic Analysis Workshop 14. The data include genotypes of 104 mother-son pairs with Caucasian ancestry on 313 SNPs of the X-chromosome. Because males have only one copy of the X-chromosome, mother haplotypes can be resolved from their sons' genotypes. Instead of evaluating the performances of the methods for analyzing SNPs within haplotype blocks, Marroni et al. investigated the 14 series of unphased genotypes of 5 or 10 SNPs with different values of linkage disequilibrium (LD). In this paper, we used the dense genotypes on the X-chromosome of 30 European and 30 African trios provided by the International HapMap project. The X-chromosome was chosen because mother haplotypes can be unambiguously resolved from the genotypes of trios. Resolved haplotypes were divided into blocks using the Haploview program , providing abundant haplotype data sets. Among the identified haplotype blocks, we randomly selected 500 blocks to evaluate haplotyping method performances. To evaluate the performances of haplotype methods on the data with larger sample sizes, we conducted some simulation studies under different scenarios. In our first set of simulations, we generated haplotypes based on real data on the X-chromosome. In our second set of simulations, we generated haplotypes using Hudson's coalescent program  to investigate how much efficiency is gained by assuming coalescence prior in PHASE compared to empirical methods without assuming a prior. Programs used for comparisons are PHASE (version 2.1)  for the model-based Bayesian method , HAPLOTYPER  for the empirical Bayesian method , PL-EM  for the EM method , and HPlus  for the EE method . Because the accurate estimation of haplotype frequencies and inference of individual haplotypes are both critical in assessing haplotype association with disease phenotypes [30–34], our comparisons focus on evaluating method performance from these two angles.
HapMap Trio Data
We used X-chromosome genotype data of 30 European and 30 African trios from the HapMap project . With trio data, mother haplotypes can be resolved unambiguously from her offspring's and the father of her offspring's genotype data. The mother's two chromosomes are separated at each locus as transmitted or not transmitted to her child. If the child is male, the child's X-chromosome is transmitted from his mother, therefore mother's allele that matches the child's allele is the transmitted allele. If the child is female, one chromosome is from her mother and the other is from her father. Using the father's allele on the X-chromosome, we can deduce which allele is transmitted from the mother. Hence, 30 mother haplotype pairs are determined by sorting out transmitted alleles from untransmitted alleles, and these sixty (= 30 × 2) represent the true haplotypes, which are not readily obtainable for any autosome chromosomes.
Applying this procedure to the HapMap Phase II data (July 2005 release), we obtained phase-resolved X-chromosome SNP data. We further divided these SNPs into haplotype blocks using Haploview software , by specifying for the method described in , the parameters of confidence interval minima 0.8 and 0.5 for strong LD, upper confidence interval maximum 0.6 for strong recombination, the fraction of strong LD in informative comparisons to be at least 0.95, and excluding the SNPs with MAF less than 0.05. We chose the parameters to be less stringent than default values to get larger blocks that are still in high LD. The block identification was done separately for European and African mother haplotypes. Within each population, we randomly chose 500 haplotype blocks to compare haplotyping methods. Among the 500 European mother haplotype blocks, the number of SNPs ranges from 2 to 195 with mean of 13 and median of 7. Among the 500 African mother haplotype blocks, the number of SNPs ranges from 2 to 33 with mean of 5 and median of 3. It had been shown that African populations have shorter haplotype blocks than European populations .
Consider a sample of n unrelated individuals from a study population. From each individual, we observe q SNP-genotypes on a specific region in the genome. Let = (gi 1, …, giq) denote the q SNP-genotypes for the i th individual. When genotype gijis heterozygous, the phase (parental origin of the two alleles) becomes ambiguous and has two solutions denoted by pij. Let = (pi 1, …, piq) denote the phase of . Given phase , genotype uniquely determines a diplotype (a pair of compatible haplotypes), Hi= (Hi 1, Hi 2), i.e. | = (Hi 1, Hi 2). Therefore, for a genotype with m heterozygous loci, there are 2m-1possible resolutions for phase and diplotypes. Let = (θ1, θ2, …, θT) denote population haplotype frequencies where T is the total number of haplotypes.
All methods compared in this paper use the maximum likelihood approach or its variation. They all shared the same likelihood function of haplotype frequency = (θ1, θ2, …, θT), which can be written as
where f(Hij|) is the probability of haplotype Hijgiven the population's haplotype frequencies; f() is the prior probability of phase.
The estimation-maximization (EM) algorithm was used to obtain maximum likelihood estimates of haplotype frequencies, . To avoid trapping in a local maximum, the programs implementing EM algorithm require multiple initial values to ensure the global maximum. Excoffier and Slatkin  used bootstrapping to estimate standard errors of estimates of haplotype frequencies, and implemented the method in ARLEQIN. Qin et al.  implemented Louis' method  and implemented the method in PL-EM. Applying estimation equation technique, Li et al.  efficiently estimated the haplotype frequencies and their standard errors and implemented the method in HPlus.
Different from the empirical approaches described above, the model-based Bayesian method  further assumes that haplotypes are coalescent so future-sampled individual haplotypes Hiis assumed to be more similar to the previously sampled haplotypes, H-i. This Bayesian method was implemented in PHASE software program. Another Bayesian method  assumes that prior distribution of haplotype frequency follows a Dirichlet distribution with hyperparameter = (β1, …, βT. Using Gibbs sampling algorithm, Niu et al. sampled a pair of compatible haplotypes for each individual and estimate the haplotype frequencies, and this method was implemented in HAPLOTYPER.
Accurately estimating haplotype frequencies and inferring individual haplotypes are both critical in assessing haplotype association with disease phenotypes [30–34]. Here we consider two measures to evaluate the accuracy of haplotype frequency estimates and the inferred individual haplotypes. The first measure is the similarity index  defined as , where θjand are the true and the estimated frequency of the j th haplotype, to measure the overall similarity between the estimated and the sample haplotype frequencies and the value of the similarity index ranges from zero to one. The second measure is the prediction rate that measures the percent of correct predictions for all haplotypes from their genotypes compared to the sampled haplotypes. Since HAPLOTYPER gives only one pair of compatible haplotypes for each individual, we calculated the prediction rate based on the best prediction for each individual. The prediction rate weighted by the posterior probability of a pair of inferred haplotypes was also used to evaluate other programs. The results are similar between the two prediction rates. Running time is recorded to measure the computational efficiency of the implemented algorithms. All computer programs were run under their default or recommended settings on computer with a dual Pentium III 800 MHz with 2 GB RAM.
Genotype data on the X-chromosome from the International HapMap project
All four programs inferred haplotypes with high accuracy from the genotypes on the X-chromosome in the 500 selected European haplotype blocks, but HAPLOTYPER failed to resolve any results in 18 blocks and PHASE performed poorly in one of the haplotype blocks with similarity index of 0.29 and prediction rate of 0.28. The mean similarity index and the mean prediction rate are 0.99 and the medians are 1.0 for all programs (Table 1). The standard deviation of the similarity index is 0.029, 0.024, 0.040, and 0.24 and the range is (0.73, 1.0), (0.83, 1.0), (0.29, 1.0), and (0.73, 1.0) for PL-EM, HPlus, PHASE, and HAPLOTYPER, respectively. The standard deviation of the prediction rate is 0.029, 0.025, 0.040, and 0.25, respectively, and the range is (0.73, 1.0), (0.83, 1.0), (0.28, 1.0), and (0.73, 1.0) for PL-EM, HPlus, PHASE, and HAPLOTYPER, respectively. The running time is 98, 20, 9935, and 422 seconds for PL-EM, HPlus, PHASE, and HAPLOTYPER, respectively.
All four programs performed similarly on the 500 African haplotype blocks as they did on the 500 European haplotype blocks (Table 2), except that HAPLOTYPER failed to converge in 2 blocks. Since African populations have shorter blocks than European populations, all programs required less running time; PL-EM, HPlus, PHASE, and HAPLOTYPER took 15, 7, 1174, and 37 seconds, respectively.
In general, the performances of all programs are affected by the percentage of heterozygous individuals, since heterozygosis at multiple loci indicates the uncertainty of individual haplotypes. To investigate the impact of this factor, we examined its relationship with similarity index and prediction rate in analyzing the 500 European haplotype blocks (Figure 1). It appears that all programs tended to perform better for the data with a lower percentage of uncertainty. Even with high percentage of uncertainty in the genotype data, all programs still performed with high accuracy. The other impact factor is the LD of SNPs in the blocks that had been investigated in recent paper . Figure 2 shows the relation between performances with the LD of haplotype blocks. Since our focus is to evaluate program performance on haplotype blocks, SNPs are in high LD within blocks. With high LD, all programs perform well, except PHASE, which had a poor performance on one block. Multi-locus LD is measured using the formulation derived in the paper .
In the first set of simulations, we randomly selected three series of SNPs with low LD. For each selected series of SNPs, we estimated frequencies from the 60 haplotypes of the 30 European mothers. Based on these frequencies, we randomly drew haplotypes to form genotypes of individuals. The number of individuals was 100, 150, 200, 250, and 300, respectively. For a given sample size, we then generated 100 replicated data sets and analyzed each data set using all four programs. Table 3 shows the average performance of each program over 100 replicates. The similarity index and prediction rate from analyzing the original data are presented in the first row of each block. It is clear that PHASE is superior to the other programs with respect to performance indices for a small sample size, but when the sample size increases, the other programs, especially PL-EM and HPlus, performed as well as PHASE and sometimes (e.g. in the second selected block) outperformed it on the prediction rate.
We also used Hudson's coalescent program to generate phase-resolved haplotype data. We generated data sets with a mutation rate of 4 (= 4Neθ) and sample sizes of 100, 150, 200, and 250, respectively. For each sample size, we repeated simulations 100 times. Table 4 lists the average performance indices for each program and sample size. For all sample sizes, PHASE consistently performed better than all other programs with respect to both similarity index and prediction rate. This result supports the notion that when the modeling assumption is valid, PHASE is more efficient than other methods (empirical Bayesian or empirical methods). However, it is important to note that the differences between PHASE and others become less marked with larger sample sizes. This result was expected because the gain by PHASE due to the coalescent assumption diminishes and the likelihood methods approach their full efficiency with increased sample sizes. We also conducted simulation studies with different coalescent model parameters, and the results (not shown) are largely comparable to those shown in Table 4.
In both Tables 3 and 4, average running times are recorded on the far right for comparison purposes. For all simulations, PHASE requires much more computational time than others and HPlus requires the least computational time among the four.
The key difference between the Bayesian and empirical methods compared in this paper is the use of priors (the approximate coalescent prior by PHASE and the Dirichlet prior by HAPLOTYPER). If the prior approximates real haplotype data, Bayesian methods gain some efficiency using the prior. On the other hand, efficiency may be lost because of a wrong prior. The influence of the prior is non-negligible when the sample size is small. In this case, because the real haplotypes tend to coalesce, PHASE using the approximate coalescent prior is likely to produce more efficient estimates, and HAPLOTYPER using the Dirichlet prior may produce less efficient estimates than the empirical methods. This phenomenon was observed when inferring haplotypes from the simulated genotypes using Hudson's coalescent program . However, the superior performance of PHASE diminishes when the sample size increases (Table 4). For genotype data with low LD, PHASE using the approximate coalescent prior would gain some efficiency when the sample size is small, such as 30, which we investigated here, and 104, which Marroni et al.  investigated. However, when the sample increases to 150 or larger, PL-EM and HPlus implementing empirical methods can perform as well as PHASE (Table 3).
Recently, Kimmel and Shamir  developed a new likelihood method to infer haplotypes and identify haplotype blocks. Their likelihood uses not only the parameters of haplotype frequencies but also the parameters of the probability of observing a variant allele in each locus and each haplotype. Using the EM algorithm, they estimated haplotype frequencies. It deserves debate whether this new likelihood is better than the one used in most methods. In terms of performance, Kimmel and Shamir  claimed that PHASE performed slightly better using its default setting and was a hundred times slower than GERBIL implementing the new likelihood method. Their results comparing PHASE and the empirical methods are similar to ours.
The recent advent of genotyping technologies is rapidly transforming genetic association studies by providing more SNPs (more than 500,000 SNPs) on arrays and by reducing the cost of genotyping individual samples (around $500~1000 per sample). The next-generation genome wide studies will likely use several hundreds or thousands of SNPs on hundreds or thousands of individuals. To gain both statistical and computational efficiency, haplotype-based analyses will be increasingly used, especially for those regions with high LD. With such massive data, as we show in this paper, the empirical methods such as EM and EE can infer haplotypes as accurately as a time-consuming method, such as the model-based Bayesian method that PHASE implement, and they can be easily incorporated into downstream haplotype-based analyses. The empirical methods had already been used in many haplotype-based association methods [32, 34, 40].
The International HapMap Consortium: A haplotype map of the human genome. Nature. 2005, 437 (27): 1299-1320. 10.1038/nature04226.
Green ED, Cox DR, Myers RM: The human genome project and its impact on the study of human disease. The genetic basis of human cancer. Edited by: Vogelstein B, Kinzler KW. 1998, New York , McGraw-Hill, Health professional division, 33-63.
Qin ZS, Niu T, Liu JS: Partition-ligation-expectation-maximization algorithm for haplotype inference with single-nucleotide polymorphisms. American Journal of Human Genetics. 2002, 71: 1242-1247. 10.1086/344207.
Li SS, Khalid N, Carlson C, Zhao LP: Estimating Haplotype Frequencies and Standard Errors for Multiple Single Nucleotide Polymorphisms. Biostatistics. 2003, 4 (4): 513-522. 10.1093/biostatistics/4.4.513.
Marroni F, Toni C, Pennato B, Tsai YY, Duggal P, Bailey-Wilson J, Presciuttini S: Haplotypic structure of the X chromosome in the COGA population sample and the quality of its reconstruction by extant software packages. BMC Genetics. 2005, 6(Suppl I):S77: 1-5.
Fallin D, Cohen A, Essioux L, Chumakov I, Blumenfeld M, Cohen D, Schork NJ: Genetic analysis of case/control data using estimated haplotype frequencies: application to APOE locus variation and Alzheimer's disease. Genome Research. 2001, 11 (1): 143-151. 10.1101/gr.148401.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score Tests for Association between Traits and Haplotypes when Linkage Phase Is Ambiguous. American Journal of Human Genetics. 2002, 70: 425-434. 10.1086/338688.
Zhao LP, Li SS, Khalid N: A Method for Assessing Disease Associations with SNP Haplotypes and Environmental Variables in Case-Control Studies. American Journal of Human Genetics. 2003, 72: 1231-1250. 10.1086/375140.
Stram DO, Pearce CL, Bretsky P, Freedman M, Hirschhorn JN, Altschuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and E-M estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Human Heredity. 2003, 55: 179-190. 10.1159/000073202.
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M, Liu-Cordero SN, Rotimi C, Adeyemo A, Cooper R, Ward R, Lander ES, Daly MJ, Altshuler D: The structure of haplotype blocks in the human genome. Science. 2002, 296 (5576): 2225-2229. 10.1126/science.1069424.
Authors thank Dr. Peter Donnelly for suggesting X-chromosome genotypes from the HapMap Project as the basis for comparison, and thank Mike Feolo for initial discussions on the need for comparing different methods. This work is supported in part by grants from National Institute of Health (5RO1 CA106320-05, 1RO1 CA112512-01, 2 P01 DK053369-07A1, and 5 PO1 CA53996-24).
Authors and Affiliations
Division of Public Health, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Shuying Sue Li & Lue Ping Zhao
Quality Indicator Project, Maryland Hospital Association, MD, USA
The author(s) declare that they have no competing interests.
SSL did the study design, directed the analyses, and drafted the manuscript. JJC ran all the analyses and simulations. LPZ contributed to the study design and the draft of the manuscript. All authors approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.