 Proceedings
 Open Access
How to quantify information loss due to phase ambiguity in haplotype case-control studies
BMC Genetics volume 6, Article number: S108 (2005)
Abstract
Assigning haplotypes in a case-control study is a challenging problem. We proposed a method to quantify the information loss due to missing phase information. We determined which individuals were responsible for the information loss, and calculated how much information could be gained if the ambiguous individuals were resolved by adding parental information.
Background
Currently the majority of association studies using single-nucleotide polymorphism (SNP) markers for complex diseases are case-control disease-marker studies. In this paper, we consider a limited number of SNPs within a candidate region, with the aim of estimating haplotype frequencies and haplotype effects on disease status. This approach requires information about how to assign haplotypes from the observed genotypes. This phase information can be inferred using statistical procedures such as the expectation-maximization (EM) algorithm.
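As an illustration of how phase can be inferred, the following is a minimal sketch of an EM iteration for haplotype frequencies under HWE. It is not the implementation used in this paper: the function names, the genotype coding ('1', '2' for homozygotes, 'H' for a heterozygote) and the fixed number of iterations are assumptions made for the example.

```python
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a genotype such as 'H1H'."""
    het = [i for i, g in enumerate(genotype) if g == "H"]
    pairs = set()
    for alleles in product("12", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for pos, a in zip(het, alleles):
            h1[pos] = a                          # allele on the first haplotype
            h2[pos] = "2" if a == "1" else "1"   # complementary allele on the second
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=100):
    """Estimate haplotype frequencies by EM, assuming HWE (illustrative only)."""
    m = len(genotypes[0])
    haps = ["".join(h) for h in product("12", repeat=m)]
    freq = {h: 1.0 / len(haps) for h in haps}    # uniform starting values
    pair_lists = [compatible_pairs(g) for g in genotypes]
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for pairs in pair_lists:
            # E-step: posterior weight of each compatible pair. The HWE factor
            # 2 for unequal haplotypes is common to all pairs of one genotype
            # and therefore cancels in the normalization.
            w = [freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by the 2n chromosomes
        n_chrom = 2.0 * len(genotypes)
        freq = {h: c / n_chrom for h, c in counts.items()}
    return freq

if __name__ == "__main__":
    est = em_haplotype_freqs(["111", "111", "222", "222", "HHH"])
    for hap, f in sorted(est.items()):
        print(hap, round(f, 4))
```

In this toy sample the ambiguous HHH individual is assigned almost entirely to the pair (111, 222), because those are the only haplotypes seen in the unambiguous individuals.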
As Hodge et al. [1] showed, the probability of not being able to assign haplotypes with certainty generally increases with the number of loci and as the allele frequencies approach 0.5. Accepting the "best" haplotype configuration as the "real" one without critically examining the data might lead to misleading results. It might therefore be useful to screen the data beforehand using some measure of uncertainty.
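The dependence on the number of loci and on the allele frequencies can be illustrated with a small calculation. The sketch below assumes HWE at each locus and, as a simplifying assumption not made in the text, independence between loci (linkage equilibrium); an individual is phase-ambiguous when heterozygous at two or more loci. The function name is hypothetical.

```python
def prob_phase_ambiguous(allele_freqs):
    """P(heterozygous at >= 2 loci), assuming HWE and independent loci."""
    het = [2.0 * p * (1.0 - p) for p in allele_freqs]  # per-locus heterozygosity
    # Probability of no heterozygous locus at all
    p_none = 1.0
    for h in het:
        p_none *= 1.0 - h
    # Probability of exactly one heterozygous locus
    p_one = 0.0
    for i, h in enumerate(het):
        term = h
        for j, g in enumerate(het):
            if j != i:
                term *= 1.0 - g
        p_one += term
    return 1.0 - p_none - p_one

if __name__ == "__main__":
    for m in (2, 3, 5):
        for p in (0.1, 0.3, 0.5):
            print(m, p, round(prob_phase_ambiguous([p] * m), 3))
```

Under these assumptions the probability grows with the number of loci and is largest when the allele frequencies equal 0.5.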
Some software offers an option to print out all possible haplotype configurations with their corresponding posterior probabilities. We wondered whether this extra information could be used to settle some of the current issues in haplotype analysis: how do we determine which individuals are responsible for the information loss, and how much information would we gain if parental genotypes were available?
With these issues in mind, we first defined the information loss as the complete-data information (without uncertainty) minus the observed information [2]. Under the assumption of Hardy-Weinberg equilibrium (HWE), we then considered the information content of each individual according to the diagonal elements of the information matrix. To account for the correlation between haplotypes, we employed D-optimality [3], which maximizes the determinant of the observed information matrix. With this measure, forward stepwise selection was applied to select the individuals that potentially yield the largest gain in information.
Methods
Suppose we have a sample of n unrelated individuals from a population. For each individual we observe a multilocus genotype at m SNPs. Under HWE, the distribution of haplotypes is assumed to be multinomial, and the joint distribution of the paired haplotypes is equal to the product of the two marginal distributions. A haplotype is described by a k (= 2^{m}) dimensional vector H with elements 0 or 1, and P(H_{i} = 1) = π_{i} denotes the frequency of haplotype i ∈ {1, ..., k}. If there is no uncertainty, then the (ordered) haplotype pair (H_{1}, H_{2}) of individual j may be described by the k-vector H_{ind, j} = H_{1} + H_{2}, with elements in {0, 1, 2}, the so-called haplotype dosage. Let C denote the covariance matrix of H, which for a multinomial indicator vector is
C = diag(π) − ππ^{T}.
Using the natural parameterization α = ln π, the total information with no phase ambiguity in the data is I_{tot} = 2nC, and the covariance of the estimated π is C/2n. In the case of uncertain haplotypes, the total information from n individuals that is contained in the observed data is given by I_{obs} = I_{tot} − Σ_{i=1}^{n} L_{i}, where L_{i} denotes the individual information loss due to phase uncertainty. As Louis [2] observed, this can be interpreted as "observed information = complete data information − missing information". Since we lose information, the covariance of the estimates will increase, because it is (approximately) governed by the inverse of the observed rather than the total information. So when we have no ambiguities in our data, L_{i} = 0, and the covariance becomes simply C/2n.
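The quantities in this paragraph can be written down in a few lines. The sketch below is illustrative only: it assumes the standard multinomial covariance C = diag(π) − ππ^T for the indicator vector H, and all function names are hypothetical. The loss matrices L_i are taken as given inputs; how they are computed is not shown here.

```python
from itertools import product

def haplotype_labels(m):
    """All k = 2^m haplotypes over m biallelic SNPs, alleles coded '1'/'2'."""
    return ["".join(h) for h in product("12", repeat=m)]

def dosage(h1, h2, labels):
    """Haplotype dosage: k-vector with the number of copies (0, 1 or 2)."""
    return [int(h == h1) + int(h == h2) for h in labels]

def multinomial_cov(pi):
    """Assumed form C = diag(pi) - pi pi^T for one multinomial draw."""
    k = len(pi)
    return [[(pi[i] if i == j else 0.0) - pi[i] * pi[j] for j in range(k)]
            for i in range(k)]

def total_information(pi, n):
    """Complete-data information I_tot = 2 n C for n individuals."""
    return [[2.0 * n * c for c in row] for row in multinomial_cov(pi)]

def observed_information(i_tot, losses):
    """I_obs = I_tot - sum_i L_i, with one loss matrix L_i per individual."""
    obs = [row[:] for row in i_tot]
    for loss in losses:
        for r in range(len(obs)):
            for c in range(len(obs)):
                obs[r][c] -= loss[r][c]
    return obs

if __name__ == "__main__":
    labels = haplotype_labels(2)
    print(labels)
    print(dosage("11", "22", labels))
    print(total_information([0.5, 0.5], n=100))
```

Note that C has rank k − 1 because the frequencies sum to one, so any determinant-based criterion would have to be evaluated on a reduced set of k − 1 parameters.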
We first investigated the diagonal elements of L_{i} in cases. Although using the trace of the L-matrix (A-optimality [3]) is an intuitive way to select individuals who need additional information, it does not account for possible correlations between the parameters. Instead, we propose to maximize the determinant of the information matrix, based on D-optimality [3].
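The difference between the two criteria can be sketched with toy numbers. In the example below, resolving an individual is assumed to add its loss matrix L_i back to the observed information; the A-type ranking uses the trace of L_i, while the D-type forward selection greedily maximizes the determinant. The matrices and all names are hypothetical and are not taken from the data analysed in this paper.

```python
def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in a]
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d                      # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def forward_select(i_obs, losses, n_select):
    """Greedy D-type selection: at each step resolve the individual whose
    recovered loss L_i gives the largest det(current information + L_i)."""
    current = [row[:] for row in i_obs]
    remaining = dict(losses)
    order = []
    for _ in range(min(n_select, len(remaining))):
        best = max(remaining, key=lambda k: det(add(current, remaining[k])))
        current = add(current, remaining.pop(best))
        order.append(best)
    return order, current

# Toy example: two parameters, two ambiguous individuals "a" and "b".
i_obs = [[4.0, 0.0], [0.0, 1.0]]
losses = {"a": [[2.0, 0.0], [0.0, 0.0]], "b": [[0.0, 0.0], [0.0, 1.0]]}

if __name__ == "__main__":
    by_trace = sorted(losses, key=lambda k: trace(losses[k]), reverse=True)
    by_det, _ = forward_select(i_obs, losses, 2)
    print("A-type order:", by_trace)
    print("D-type order:", by_det)
```

Here the trace ranks "a" first because its loss is larger, whereas the determinant prefers "b", whose loss falls on the parameter that is currently least well determined.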
Finally, the real interest lies in quantifying the information loss due to haplotype ambiguities in the setting of case-control studies. This can be achieved by considering cases and controls separately as a two-independent-sample problem, and by combining the results using a (multiplicative) disease model: for example, by minimizing
, where · denotes the determinant.
Results
After performing a linkage analysis for the microsatellite markers, we analyzed SNP packet 153, including the microsatellite marker D03S0127 and 19 SNPs. Our example case-control data consist of 200 unrelated subjects and three loci. The case population consists of 100 affected offspring selected from each family of the Danacaa population, replicate 8. To select a suitable subregion for our purpose we employed the sliding scores [4], and decided to study three-locus haplotypes based on B03T3056, B03T3057, and B03T3058. The computations were done with the programming language R [5].
We quantified the information loss per haplotype by A-optimality (Table 1). For the "rare" haplotype 212 in cases, the information loss reaches almost 54% relative to the situation of no uncertainty. Note that the relative information loss compared to the maximum information (%) can be interpreted as (1 − R^{2}) × 100, where R^{2} is the haplotype uncertainty measure of Stram et al. [6]. We can already detect different patterns of missing information between haplotypes.
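The relation between the relative loss and R^{2} is simple arithmetic, sketched below. The inputs are hypothetical diagonal elements of the loss matrix and of the complete-data information, not values from Table 1, and the function names are assumptions.

```python
def relative_loss_percent(loss_diag, total_diag):
    """Per-haplotype information loss as a percentage of the maximum."""
    return [100.0 * l / t for l, t in zip(loss_diag, total_diag)]

def implied_r_squared(loss_diag, total_diag):
    """R^2 implied by: relative loss (%) = (1 - R^2) * 100."""
    return [1.0 - l / t for l, t in zip(loss_diag, total_diag)]

if __name__ == "__main__":
    # Hypothetical values: a loss of 27 out of a maximum information of 50
    print(relative_loss_percent([27.0], [50.0]))
    print(implied_r_squared([27.0], [50.0]))
```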
The next question is: if there is an option to solve the phase problem by collecting additional family information, which individual should be selected first? Using A-optimality we calculated the information loss per individual and per haplotype for cases (Table 2). We grouped individuals with identical genotypes, the order of the group identifiers being determined by the trace of L_{i} (the column "Tot. loss"). The characters of the group identifiers denote the genotypes at the SNPs, where 1 and 2 stand for the homozygotes 1/1 and 2/2, respectively, and H denotes a heterozygote 1/2. The values in the last row give the information loss per haplotype as in Table 1. The first group (HHH) is the one with the highest loss, and therefore potentially the highest information gain. Hence, applying A-optimality, the order of groups to be selected is: HHH, H1H, HH1, etc.
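To make the group identifiers concrete, the sketch below enumerates the haplotype pairs compatible with a genotype code such as H1H and computes their posterior probabilities under HWE for given haplotype frequencies. The function names and the frequencies in the example are hypothetical.

```python
from itertools import product

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with a code such as 'H1H'
    ('1' = 1/1, '2' = 2/2, 'H' = heterozygote 1/2)."""
    het = [i for i, g in enumerate(genotype) if g == "H"]
    pairs = set()
    for alleles in product("12", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for pos, a in zip(het, alleles):
            h1[pos] = a
            h2[pos] = "2" if a == "1" else "1"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)

def posterior_probs(genotype, freq):
    """Posterior probability of each compatible pair under HWE.
    Assumes at least one compatible pair has positive prior probability."""
    pairs = compatible_pairs(genotype)
    weights = [freq.get(a, 0.0) * freq.get(b, 0.0) * (1.0 if a == b else 2.0)
               for a, b in pairs]
    total = sum(weights)
    return {pair: w / total for pair, w in zip(pairs, weights)}

if __name__ == "__main__":
    print(compatible_pairs("H1H"))
    print(len(compatible_pairs("HHH")))
    # Hypothetical haplotype frequencies, for illustration only
    freq = {"111": 0.4, "212": 0.1, "112": 0.2, "211": 0.3}
    print(posterior_probs("H1H", freq))
```

A genotype with h heterozygous loci has 2^(h−1) compatible pairs for h ≥ 1, so H1H has two configurations and HHH has four, while genotypes with at most one heterozygous locus are unambiguous.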
Figure 1 shows the forward selection of individuals using the D-optimality criterion. The groups on the y-axis are ordered as in Table 2. Applying D-optimality, we clearly see that the potentially most informative persons are those with genotype H1H, and not the group of persons with three heterozygous loci, HHH. Hence, Figure 1 also illustrates the discrepancies between the two criteria. Heuristically we might explain this as follows. In Table 2, the haplotype 111 has the largest information loss. Within 111, the individuals contributing the largest loss are of type H1H. Selecting (or resolving) one individual in this group will change the table, and we repeat the procedure. While Table 2 only represents the diagonal elements, Figure 1 gives a more complete representation of the structure of the loss matrix. Specifically, the jumps between the groups are caused by the correlations between the parameters. Moreover, at the beginning of the selection procedure we gain more information than at the end.
Observe that the above results are valid under the assumption that we could completely resolve the ambiguous haplotypes. When we actually added the parental information for these data, we could resolve about 71% of the ambiguous individuals (number of cases = 100). Because this depends heavily on the structure of the data, for general use we calculated the expected loss conditional on all possible parental genotypes. Using A-optimality, approximately 65% of the information loss could be recovered on average.
Conclusions and Discussions
The expected loss, taken over all possible (and compatible) parental genotypes, does not differ much between the genotypic groups; it does not matter whether the individual is heterozygous at two loci or at three. For example, an individual heterozygous at all three loci (HHH) might have two heterozygous parents (both HHH), or two homozygous parents (father of type 111, mother of type 222). The expected loss clearly depends on the allele frequencies, and hence on the structure of the data. Our ongoing investigation shows that the selection patterns also depend strongly on the question asked; that is, whether we are interested in each group, in pooled groups, or in haplotype risks in terms of "minimizing error" or "maximizing power".
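The two parental scenarios mentioned above can be checked with a small sketch. It lists the haplotype pairs of a child that remain possible once the parental genotypes are known, assuming no recombination within the region and no genotyping errors; the function names are hypothetical.

```python
from itertools import product

def can_transmit(parent_genotype, haplotype):
    """A parent can transmit a haplotype if, at every locus, the parent is
    heterozygous ('H') or homozygous for the allele on that haplotype."""
    return all(g == "H" or g == a for g, a in zip(parent_genotype, haplotype))

def pairs_given_parents(child, father, mother):
    """Unordered child haplotype pairs compatible with both parents."""
    het = [i for i, g in enumerate(child) if g == "H"]
    kept = set()
    for alleles in product("12", repeat=len(het)):
        pat, mat = list(child), list(child)
        for pos, a in zip(het, alleles):
            pat[pos] = a                          # paternal allele
            mat[pos] = "2" if a == "1" else "1"   # maternal allele
        pat, mat = "".join(pat), "".join(mat)
        if can_transmit(father, pat) and can_transmit(mother, mat):
            kept.add(tuple(sorted((pat, mat))))
    return sorted(kept)

if __name__ == "__main__":
    print(pairs_given_parents("HHH", "HHH", "HHH"))
    print(pairs_given_parents("HHH", "111", "222"))
```

With two HHH parents all four configurations of an HHH child remain possible, so nothing is gained, whereas with parents 111 and 222 only the pair (111, 222) remains and the phase is fully resolved.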
Although selecting the informative individuals based on A-optimality is not as accurate as the method based on D-optimality, it is an intuitive way to understand the structure of the uncertainty in the data. However, in situations where the correlations between the parameters cannot be ignored, our proposed method might give more insight into the data. In our future work, we will investigate haplotype effects on disease status and some other extensions: focusing on "interesting" haplotypes, including missing data, or studying the behavior with an increasing number of SNPs.
Abbreviations
EM: Expectation-maximization
HWE: Hardy-Weinberg equilibrium
SNP: Single-nucleotide polymorphism
References
1. Hodge SE, Boehnke M, Spence MA: Loss of information due to ambiguous haplotyping. Nat Genet. 1999, 21: 360-361. 10.1038/7687.
2. Louis T: Finding the observed information matrix when using the EM algorithm. J Roy Stat Soc B Met. 1982, 44: 226-233.
3. Fedorov VV: Theory of Optimal Experiments. 1972, New York: Academic Press.
4. Clayton D, Jones H: Transmission/disequilibrium tests for extended marker haplotypes. Am J Hum Genet. 1999, 65: 1161-1169. 10.1086/302566.
5. R Development Core Team: R: A language and environment for statistical computing. R Found Stat Comput. Vienna, Austria, ISBN 3-900051-00-3.
6. Stram DO, Leigh Pearce C, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and EM estimation of haplotype-specific relative risks from genotype data for a case-control study of unrelated individuals. Hum Hered. 2003, 55: 179-190. 10.1159/000073202.
Acknowledgements
This paper originates from the GENOMEUTWIN project, which is supported by the European Union Contract No. QLG2-CT-2002-01254.
Authors' contributions
HWU performed the analyses and wrote the manuscript. HWU and JJHD carried out the preliminary linkage analyses. All authors participated in the development of the methods, interpreted the results of the analysis, read the manuscript, and approved the final manuscript.
Uh, HW., Houwing-Duistermaat, J.J., Putter, H. et al. How to quantify information loss due to phase ambiguity in haplotype case-control studies. BMC Genet 6, S108 (2005). https://doi.org/10.1186/1471-2156-6-S1-S108
Keywords
 Information Loss
 Information Matrix
 Parental Genotype
 Phase Ambiguity
 Haplotype Effect