Bayesian estimates of linkage disequilibrium

Background The maximum likelihood estimator of D' – a standard measure of linkage disequilibrium – is biased toward disequilibrium, and the bias is particularly evident in small samples and rare haplotypes. Results This paper proposes a Bayesian estimation of D' to address this problem. The reduction of the bias is achieved by using a prior distribution on the pair-wise associations between single nucleotide polymorphisms (SNP)s that increases the likelihood of equilibrium with increasing physical distances between pairs of SNPs. We show how to compute the Bayesian estimate using a stochastic estimation based on MCMC methods, and also propose a numerical approximation to the Bayesian estimates that can be used to estimate patterns of LD in large datasets of SNPs. Conclusion Our Bayesian estimator of D' corrects the bias toward disequilibrium that affects the maximum likelihood estimator. A consequence of this feature is a more objective view about the extent of linkage disequilibrium in the human genome, and a more realistic number of tagging SNPs to fully exploit the power of genome wide association studies.


Background
Single nucleotide polymorphisms (SNPs) are an invaluable resource to identify regions of the human genome that may be associated with disease. A key to this process is linkage disequilibrium (LD) that is defined as the non-random association between the alleles of SNPs [1]. Although LD may occur between SNPs that are not in linkage but are associated, we will focus on the LD due to the spatial structure of the genome. In this situation, the nonrandom association implies that pairs of alleles in the same haplotype occur differently from what we would expect in a random pairing and several measures of LD have been proposed to capture the departure from independent pairing of the alleles of SNPs [2]. In this paper we will limit attention to D, its normalized version D', and the well known bias of the Maximum Likelihood Estimate (MLE) of D' toward disequilibrium [2,3]. This bias is particularly large in small samples and SNPs with rare alleles to the point that SNPs whose alleles occur independently may be inferred to be in strong LD [4]. However, relying on small samples to identify patterns of LD is not unusual: for example, the International HapMap Project (IHMP) aims to establish genome-wide patterns of LD using genotype data of at most 30 trios or 45 unrelated individuals [5]. The genotype data typed in this small number of samples are used to describe the extent of LD in the human genome, and derive a map of the haplotypes and the SNPs that are sufficient to tag the human genome. These results will have a deep impact on genome wide association studies and, in particular, inferring larger blocks of LD than the real ones may lead to the selection of an insufficient number of SNPs and hence decrease the power of genome wide association studies. In this scenario, biasing the estimate of LD toward equilibrium appears to be a safer alternative.
Several solutions have been proposed to reduce the bias of the MLE of D' toward disequilibrium [4]. A pragmatic solution is to impose some "ad hoc" threshold on the minimum allele frequency (MAF) of those SNPs that can be used to infer the pattern of LD [6]. Imposing this threshold leads to a non-random selection of SNPs and may introduce ascertainment bias [7,8]. The thought behind our approach is that the bias of the MLEs of D and D' is due to the lack of information in the data to discriminate between equilibrium and different magnitude of disequilibrium, and any attempt to correct this bias is due to fail as it was acknowledged in [3]. However, it is known that, on average, the strength of LD due to linkage decreases as the physical distance between SNPs increases [9,10]. Therefore, we propose a Bayesian estimator of D' that allows us to integrate data with prior information about the pattern of LD decay. To this end, we use a prior distribution on the pairwise dependencies between different SNPs that is a decreasing function of their physical distance. We show how to compute the posterior estimate of D' using Markov Chain Monte Carlo methods, and provide a numerical approximation that can be used for fast estimation of LD in large regions of the human genome. As we show in simulated and real data from the IHMP, the effect of the prior distribution is to drastically reduce the bias toward disequilibrium even in small samples, and to remove the need of arbitrary thresholds on the MAF. We also show that, compared to the MLE, our estimators lead to infer patterns of LD decay that are much closer to published results [10], and confirms the existence of haplotype blocks as regions of low recombination. The method is implemented in a computer program called Bayesian Linkage (BLink) [11].

Results and discussion
The traditional D and D' Given two SNPs L 1 and L 2 , with alleles A/a and B/b, A and B the major alleles, we define the probability of the haplotype ij by p ij = p (L 1 = i, L 2 = j), i = A, a, j = B, b. As in [12], we assume the relation p A ≥ p B on the probabilities p A = p(L 1 = A), p B = p(L 2 = B) from which the inequality p Ab ≥ p aB follows.
The two SNPs are in linkage equilibrium when the co-occurrence of two alleles on the same haplotype is random, e.g. p ij = p i p j for all i = A, a, j = B, b. On the other hand, LD implies some form of dependency in the alleles on the same haplotype and hence departure from independence of the probabilities p ij . Although there are many ways to measure departure from independence in a 2 × 2 table [13], a widely used measure of LD is the parameter D defined by Because the domain of D is a function of the allele frequencies, different normalization methods have been proposed to facilitate the interpretation [2]. The most common one is the measure that was suggested by Lewontin [14] and is defined as D/max D: is defined in the interval [-1, 1], with = ± 1 describing perfect disequilibrium and = 0 describing equilibrium. It is also common to take the absolute value of , say D' = | | to have a measure in the interval [0,1]. This is for example the default measure of LD in the popular program HaploView [6].

Maximum likelihood estimation
Suppose now that we have a data set of N individuals and n = 2N known haplotypes for the two SNPs (we assume here known phase for all haplotypes and discuss the phasing issue at the end of this section). We denote by n ij (i = A, a, j = B, b) the frequencies of the four haplotypes, and by n i and n j the allele frequencies with n A ≥ n B . Assuming that the four haplotypes follow a multinomial distribution with probabilities p ij , the likelihood function can be written as: is the indicator function defined as I(x ∈ X) = 1 if x ∈ X and 0 otherwise. Note that: • whenever n ab = 0, so that = -1 and = 1; • whenever n aB = 0, so that = 1 and = 1.
These two facts determine the bias toward disequilibrium of and D' that can lead to infer that two SNPs are in disequilibrium when they are actually in equilibrium. For example, the expected number of haplotypes ab in a sample of n haplotypes between two SNPs in equilibrium is np a p b . If both p a and p b are smaller than 0.1, then the minimum sample size n that yields an expected number of haplotypes ab greater than 1 is 100, and this number goes up to 400 when both p a and p b are 0.05. Therefore, data from small samples and rare alleles do not provide information to discriminate between equilibrium and disequilibrium, while the MLE of D' returns its maximum value consistent with perfect disequilibrium. This situation is well known and supported by simulation studies. Teare and coauthors [4] showed the extent of this bias through extensive simulations in which they examined the effect of sample size, MAF and strength of LD on the MLE of D'. Their study suggested that the bias is severe in small samples (less than 100 subjects), when both alleles are rare (MAF less than 0.05), and the two SNPs are in equilibrium. An "ad hoc" solution consists of disregarding those SNPs with a MAF below some threshold. Although this reduces the bias of the MLEs, it introduces an ascertainment bias that may impact the inferred pattern of LD [7].

Bayesian approach
Our Bayesian estimator is based on the following intuition: on average, the magnitude of disequilibrium between two SNPs decreases at exponential rate with their physical distance. We use this information to build a conjugate prior distribution on the parameters p ij with the property that, a priori, the larger the distance between two SNPs, the more likely the two SNPs are in linkage equilibrium. The standard conjugate prior to a multinomial distribution is a Dirichlet distribution with density function defined as: Given data n ij , the posterior distribution is still a Dirichlet distribution with density function: in which the prior hyper-parameters α ij are updated into α ij + n ij . The prior means of the parameters p ij are α ij /α T , where α T = ∑ ij α ij . The posterior means become E(p ij |n) = (α ij + n ij )/(α T + n) and can be used as point estimates of the parameters. Furthermore, the posterior distributions of the marginal probabilities p A and p B follow Beta distributions with hyper-parameters (α A + n A , α a + n a ) and The inference on the parameters D, D s and D' is more complex. First, we note that we can write these parameters as follows: Equations (9) and (10) define the parameters D s and D' as mixtures of two components, with weights p(D < 0) and p(D ≥ 0). The two components are non linear functions of the parameters p ij , as is the parameter D, and make the exact inference on these parameters intractable. However, we can resort on Markov Chain Monte Carlo methods to generate a sample of values of either parameters from their posterior distribution that can be used for further inference. In Figure 1, we provide a model description that can be used in Winbugs 1.4 to generate samples from the posterior distributions of the parameters D, D s and D'.

Choice of the prior distribution
To complete the specification of the Bayesian model, we need to provide values for the hyper-parameters. Because we wish to encode the information that departure from equilibrium of any two SNPs is a function of their physical distance, we define: with α > 0, and the parameter d that represents the physical distance between the two SNPs in Mb (1 Mb = 1, 000 nucleotide bases). With this choice of hyper-parameters, the prior means of the probabilities p ij are E(p ij ) = 1/4 for all i = A, a, j = B, b, so that, a priori, the two SNPs are expected to be in equilibrium. On the other hand, the posterior means of p ij can be written as: and, hence, as a weighted average of n ij /n (the MLE estimates of p ij ) and the prior probabilities 1/4. The first weight n/(n + 4α(1 -exp(-d))) is an increasing function of the sample size n, and a decreasing function of α and d, while the second weight 4α(1 -exp(-d))/(n + 4α(1 -exp(d))) is an increasing function of α and d, and a decreasing function of n. Therefore, for large sample sizes, the posterior means of p ij approach the MLEs. This is consistent with the fact that, in large samples, the effect of the prior distribution on the posterior distribution becomes negligible. However, when the distance d decreases, the function 1 -exp(-d) approaches 0, and the weight n/(n + 4α(1 -exp(-d))) approaches 1, so that the Bayesian estimate becomes closer to the MLE. In the limiting case d = 0, or α = 0, the two estimates are identical. For fixed α and increasing distance (essentially d > 0.5Mb), the second weight approaches its maximum value 4α/(n + 4α), and larger values of α further increase the weight given to the prior mean. To contain the effect of the prior distribution, we use α = 1 and simulation studies that are described in the next section show that this choice produces a good trade-off between robustness and bias.
In the absence of a closed form expression for the prior distributions of the parameters D, and D', we investi-  SNPs is d = 0.001 Mb. The prior density peaks at the extreme value D' = 0 that represents equilibrium between the two SNPs, and D' = 1 that represents perfect disequilibrium. Therefore, for small distance, this bimodal distribution makes these opposite situations almost equally likely. The effect of larger α is to shift the density toward equilibrium: for example the probability that D' < 0.5 is 0.54 when α = 1 and becomes 0.57 when α = 4. The effect of increasing α appears negligible in this situation, but it is more evident when the two SNPs are at a larger distance. For example, the two plots in the second panel depict the prior density of D' when d = 0.1 Mb. Compared to the densities in the top panel, now the prior densities are slightly skewed toward 0, thus making disequilibrium less likely.
Once again, greater values of α (right panel) increase the skewness toward equilibrium. The two plots in the bottom panel show the prior densities when the distance between the two SNPs is 0.5 Mb and confirm the increasing weight given to equilibrium for larger distance d and larger α values: the prior density now assigns probability 0.68 to the event D' < 0.5, when α = 1, and probability 0.84 to the same event when α = 4. It is interesting to observe that, as the distance between SNPs decreases, then the prior hyperparameters approach 0, and the Bayesian estimates approach the MLE estimates. Accordingly, the prior distribution moves mass from D' = 0 to D' = 1 as the distance decreases and this explains the bimodal shape of the prior density in the top two panels.
As an example, Table 1 displays the frequencies of the four haplotypes AB, Ab, aB and ab that were observed between SNPs S1 and S2 Chromosome 22, at the positions 15040669, 15043944. These are real data that were derived from the thirty trios of the CEPH population (Utah residents with ancestry from northern and western Europe) who provided the DNA samples for the IHMP [5]. The observed haplotype frequencies are consistent with the hypothesis of linkage equilibrium, because the expected number of haplotypes ab is 0.5 under equilibrium and the assumption that the population allele frequencies equal the marginal estimates p a = 0.03 and p b = 0.14. However, the lack of observed haplotypes ab could be due to perfect LD between each pair of SNPs. Given that the physical distance between S1 and S2 is 0.0032 Mb, and the average D' in chromosome 22 ranges between 0.8 and 1 for SNPs that are within 0.01 Mb, and becomes less than 0.5 for SNPs that are distant more than 0.1 Mb [5], it is likely that S1 and S2 are in disequilibrium. Consider now a third SNP S3 in the position 15405264 of chromosome 22. The frequencies of the four haplotypes between S1 and S3 is the same as in Table 1 but now the physical distance between these two SNPs is 0.364 Mb. Given the extent of LD, equilibrium is more likely between S1 and S3, although the haplotype frequencies are the same. The MLE of D' is 1 in both cases, with the same confidence interval, while the Bayesian estimate of D' changes with the distance between the two SNPs. The plots in Figure 3 show the prior distribution of D' (plots on the left) and the posterior distribution (plots on the right) between the closest SNPs (first row) and the first and third SNP (second row). A priori, disequilibrium and equilibrium are equally likely when the two SNPs are very close, but the posterior distribution is dominated by the data and the left skewness is consistent with the hypothesis of disequilibrium between the two SNPs. The point estimate of D' is 0.96 with 95% credible interval (0.37, 1). The effect of the same data on the distribution of D' for SNPs that are further apart is however more contained: the posterior distribution of D' remains skewed to the right, the point estimates of D' is 0.39, and the 95% credible interval is (0, 0.97) showing the large uncertainty.

Approximate estimates
In practical applications, we have computed the Bayesian estimate of D' for regions with at most 200 SNPs that would correspond to examining a block of approximately 500 kb assuming one SNP every 2.5 kb. However, if the focus is generating a point estimate of the parameters to be able to display LD over large regions or an entire chromosome as we have shown in [15], resort to MCMC methods may become unfeasible. It is possible to compute the exact posterior mean of D, and from this we can derive approximate estimates of D s and D' based on a Taylor expansion. To this end, we replace the weights p(D < 0) and p(D ≥ 0) in Equations (9)   the approximation is close to the MLE and therefore suffers of some bias toward disequilibrium. However, we will show with results of simulations in the next section that this bias is smaller. Because of this similarity with the MLE, we will refer to these approximate estimates as the maximum a posteriori (MAP).

Unknown phase
When the genotype data are unphased, the ML estimation uses the EM algorithm to infer the unknown phase given the distribution of known haplotypes [16]. We adopt the same procedure for the calculation of the MAP estimates. Given the frequencies of known haplotypes, n ij , i = A, a, j = B, b, the algorithm first computes the MAP estimates of the haplotype frequencies p ij , and then alternates an expectation step to replace the unphased haplotypes by their expected phase and a maximization step to compute the MAP estimates using observed and expected haplotypes. The algorithm typically converges in less than 4 steps. Unknown haplotypes are regarded as missing values in the stochastic analysis, so that they become param-eters of the model and are estimated within the Gibbs sampling algorithm. We also note that, when the genotype data are from trios, we use all phased haplotypes to compute the initial frequencies, regardless of whether they are transmitted from parents to offspring.

The method is implemented in the computer program
BLink that is developed in C++ and is available from the supplementary web site [11]. The software accepts genotype data from either unrelated individuals or nuclear families consisting of two parents and one child.

Evaluation
We examined the performance of the Bayesian estimator in three groups of simulated data and a real data set derived from the IHMP. All data used in this evaluation are available from the supplementary web site [11].
Examples of prior to posterior analysis using the data in Table 2  Figure 3 Examples of prior to posterior analysis using the data in Table 2. The x-axes display D' and the y-axes display the empirical estimate of the density function inferred with the program WinBugs 1.4. Figure (a) is the prior density of D' to measure the LD between the SNPs S1 and S2 that are at a distance of 3.2 kb. Figure (b) displays the posterior density, given the data in Table 1, and shows that data overwhelm the prior distribution when the distance is relatively small. Figure (c) is the prior density of D' to measure the LD between the SNPs S1 and S3 that are now more distant (364 kb). Figure (d) shows the posterior distribution, given the same data in Table 1, and shows now the relatively smaller effect of the data on the prior density.

Group 1
The objectives of the first simulation study were (1) to compare the performance of the Bayesian estimates and the MLE for different sample sizes and small values of the MAF, and (2) to assess the accuracy of the MAP approximation to the stochastic estimates of D s and D'. We generated samples of 60, 120, 240 haplotypes, in which we modeled the true D' as D' = exp(-d) for a distance d ranging from 0 to 0.5 Mb. For each value of D' and each sample size, we generated 1,000 samples of haplotypes by using the joint probability of haplotypes defined by Equation (1), with p B generated from a uniform distribution in the interval [0.5; 0.9) and p A generated from a uniform distribution in the interval [p B ; 0.95). In each simulated sample, we computed the MLE, and the MAP estimate of D s , as well as the stochastic estimate of D s using Gibbs sampling. To compute the stochastic estimates we run the chain for an initial burn-in of 1, 000 iterations and then based the inference on a second sample of 1, 000 iterations. We used as point estimate the median value of the simulated sample and α = 1 in each analysis.

Group 2
In this second set, we generated a sample of 1, 000 individuals in a region of 0.5 Mb with the program MS that simulates genotype data under a variety of neutral models [17]. We considered a population of 1 million individuals, a mutation rate of 10E -9 per base pair, and a recombination rate of 8 × 10E -9 between adjacent base pairs per generation. Only 10% of the 8080 SNPs in the sample of 1, 000 individuals were randomly selected and, from this sample, we randomly generated subsamples of sizes 60, 120, 240 and 480 haplotypes. In the absence of "true" values for D', we studied the decay of LD inferred by the MLE and the MAP estimator for increasing physical distances, versus the LD decay inferred in the original sample of 1, 000 individuals. Each point in the plot is the average estimate of D' for all the SNPs within a physical distance of d ± 0.01 Mb. By averaging the LD between pairs of SNPs at increasing distance, these plots are used to summarize the decay of LD over large regions [18,10]. Ascertainment bias was assessed by repeating the analysis with these thresholds on the MAF: 0, 0.05, 0.1, 0.2. Sensitivity to the prior distribution was assessed by repeating the analysis for α = 0.25, 1, 2, 4.

Group 3
To examine the robustness of the MAP estimators, we also generated data under a different model of allele frequency, linkage disequilibrium and population differentiation that is implemented in the software COSI [19]. We simulated a sample of 1, 000 individuals under the calibrated model for the European population that considers bottlenecks, migration and recombination hotspots spac-ing 0.085 Mb [19]. We randomly selected 10% of the generated 32452 SNPs and from this sample we randomly selected subsamples of 60, 120, 240 and 480 haplotypes. We produced LD decay plots using the thresholds 0, 0.05, 0.1, 0.2 on the MAF and the range of α values that were used for the analysis of the simulated data in group 2.

Real data
Real data were obtained from the first phase of the IHMP [5]. We used genotype data of the 30 trios of the CEPH and Yoruba in Ibadan, Nigeria and chromosome 22 because its pattern of LD has been widely studied [9]. This chromosome was genotyped in 19120 and 19854 SNPs in the CEPH and Yoruba samples. We produced LD decay plots using the thresholds on the MAF and the range of α values that we used for the analysis of the simulated data in group 2. We also produced more informative graphical displays of pairwise LD, by generating bi-dimensional maps similar to those generated by the program Haploview, but with a lower resolution to enable the display of LD over larger regions. The maps were generated with the program BMapBuilder [15] using the MLE and the MAP estimate of D'.   appears to achieve a good balance between robustness and bias reduction. These results provide evidence of a block structure of the human genome that does not appear to be an artifact of low SNP density [5]. However, they also suggest the presence of smaller blocks of LD that may impact on the minimum number of tag SNPs needed to have powerful genome wide association studies.

Results
LD decay plots for the data generated in Group 3, with high LD, MAF> 0.05, and sporadic hotspots

Conclusion
A good estimation of D' is crucial for a better understanding of patterns of LD, a robust identification of haplotype blocks, more accurate algorithms for haplotype reconstruction, and better reproducibility of genetic studies. The popular MLE of D' is biased toward disequilibrium, and requires the use of thresholds on the MAF that have been shown to introduce ascertainment bias. By using an informative prior that models the LD between SNPs based on their physical distance, we define a Bayesian estimator that outperforms the MLE without increasing computational complexity. Our estimator is slightly biased toward equilibrium, but this bias tends to disappear quickly with increasing sample sizes, and at a faster rate than the bias toward disequilibrium of the MLE. Furthermore, our eval-uation shows that the MAP estimator does not require any thresholds on the MAF.
There are several limitations to this work. The probability distribution of the haplotypes is modelled using a multinomial distribution with a Dirichlet prior, and this assumption can be relaxed to include more general models. Also, the prior distribution does not take into account recombination hotspots. We have assessed the impact of this assumption in our simulations, but more evaluation is needed. Our analysis is now limited to biallelic SNPs, however our Bayesian model can be extended to include measures of LD for multi-allelic SNPs. For example, a firstorder approximation of the average estimator of D' suggested in [2] can be computed by averaging the MAP estimates. Some more work is needed to examine the effect of the prior hyper-parameters. In future work we will extend our results to other measures of LD, particularly r 2 = D/ (p A p a p B p b ). Some preliminary results that are posted in our