Testing for homogeneity of gametic disequilibrium across strata

Yin, Xiaolin; Ma, Wenqing; Tang, Manlai; Guo, Jianhua

doi:10.1186/1471-2156-8-85

Methodology article
Open access
Published: 20 December 2007


BMC Genetics volume 8, Article number: 85 (2007)


Abstract

Background

Assessing the non-random associations of alleles at different loci, or gametic disequilibrium, can provide clues about aspects of population histories and mating behavior and can be useful in locating disease genes. For gametic data which are available from several strata with different allele probabilities, it is necessary to verify that the strata are homogeneous in terms of gametic disequilibrium.

Results

Using the likelihood score theory generalized to nuisance parameters we derive a score test for homogeneity of gametic disequilibrium across several independent populations. Simulation results demonstrate that the empirical type I error rates of our score homogeneity test perform satisfactorily in the sense that they are close to the pre-chosen 0.05 nominal level. The associated power and sample size formulae are derived. We illustrate our test with a data set from a study of the cystic fibrosis transmembrane conductance regulator gene.

Conclusion

We propose a large-sample homogeneity test on gametic disequilibrium across several independent populations based on the likelihood score theory generalized to nuisance parameters. Our simulation results show that our test is more reliable than the traditional test based on Fisher's test of homogeneity among correlation coefficients.

Background

Measuring gametic disequilibrium can provide important information about aspects of population histories and mating behavior [1] and can be useful in locating disease genes [2]. The term gametic disequilibrium is used in this article instead of the traditional term linkage disequilibrium to describe the extent of non-random association because such non-random association may be present between unlinked loci [3]. Various measures of gametic disequilibrium have been proposed [4–6], for models ranging from pairs of diallelic loci to multiple multiallelic loci. In this article, we consider the gametic disequilibrium defined as the difference between the gametic probability and its expected probability under the assumption of no statistical association of alleles, and the gametic disequilibrium calculations are based on a two-allele, two-locus model [7].

Consider two loci, A and B, each having two possible alleles (A₀, A₁) and (B₀, B₁), respectively. With two loci and two alleles, there are four possible gametes, namely, A₀B₀, A₀B₁, A₁B₀ and A₁B₁. The gametic disequilibrium between the two loci is defined by

D = p_{A_{1} B_{1}} - p_{A_{1}} p_{B_{1}},

where $p_{A_{i}}$ and $p_{B_{j}}$ denote the allele probabilities of A_i and B_j, and $p_{A_{i} B_{j}}$ denotes the gamete probability of A_iB_j, i, j = 0, 1. Suppose that the gametic data are available from K strata and let p_ijk denote the probability of gamete A_iB_j in the k-th stratum, i, j = 0, 1; k = 1,...,K, with ∑_i,j p_ijk = 1 for each k. From the relationship between allelic and gametic probabilities, the allele probabilities of A₀, A₁, B₀ and B₁ in stratum k are p_0+k, p_1+k, p_+0k and p_+1k, respectively. Here "+" denotes summation over 0 and 1; for example, p_0+k = p_00k + p_01k. For stratum k (k = 1,...,K), the gametic disequilibrium is calculated as

D_k = p_11k - p_1+k p_+1k.

It is easy to show that D_k is bounded by

D_k,min ≤ D_k ≤ D_k,max,

where D_k,min = -min{p_1+k p_+1k, p_0+k p_+0k} and D_k,max = min{p_1+k p_+0k, p_0+k p_+1k}. Testing for the homogeneity of gametic disequilibrium among strata can be informative in discriminating among the evolutionary agents generating it in natural populations [8]. Detecting gametic disequilibrium can be informative in gene mapping and can provide meaningful clues about population evolution. Combining the evidence of gametic disequilibrium across several strata may support such clues more strongly than analyzing each stratum separately. In that case, it is crucial to test the homogeneity of gametic disequilibrium across strata before combining the data. For this purpose, we consider the following hypotheses:

H₀ : D₁ = ⋯ = D_K versus H₁ : D_i ≠ D_j for at least one pair i ≠ j. (1)
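The definitions of D_k and its bounds can be illustrated with a minimal Python sketch. The function name `gametic_disequilibrium` and the example probabilities are hypothetical and chosen only for illustration; they do not come from the article's data.

```python
def gametic_disequilibrium(p00, p01, p10, p11):
    """Return (D, D_min, D_max) for one stratum from its four gametic probabilities."""
    if abs(p00 + p01 + p10 + p11 - 1.0) > 1e-9:
        raise ValueError("gametic probabilities must sum to 1")
    p1_plus = p10 + p11          # allele probability of A1
    p_plus1 = p01 + p11          # allele probability of B1
    p0_plus = 1.0 - p1_plus      # allele probability of A0
    p_plus0 = 1.0 - p_plus1      # allele probability of B0
    d = p11 - p1_plus * p_plus1
    d_min = -min(p1_plus * p_plus1, p0_plus * p_plus0)
    d_max = min(p1_plus * p_plus0, p0_plus * p_plus1)
    return d, d_min, d_max


# Hypothetical stratum with gametic probabilities (p00, p01, p10, p11).
d, d_min, d_max = gametic_disequilibrium(0.35, 0.15, 0.15, 0.35)
print(d, d_min, d_max)
```

For this hypothetical stratum both allele probabilities are 0.5, so D lies between -0.25 and 0.25.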

Weir [9] recommended a homogeneity test on gametic disequilibrium based on Fisher's test of homogeneity among correlation coefficients [10]. In his method, the gametic disequilibrium D_k is first transformed to a correlation coefficient r_k by r_k = D_k / $\sqrt{p_{1 + k} p_{0 + k} p_{+ 1 k} p_{+ 0 k}}$; r_k is then transformed to an approximately normal variable z_k by Fisher's z transformation; finally, a weighted sum of squares of the z values, which has a χ² distribution with K - 1 degrees of freedom, is used to test homogeneity of gametic disequilibrium. As pointed out by Zapata and Alvarez [8], this test is actually for homogeneity of r values instead of D values, and the two may not be equivalent when the allele probabilities differ across strata. Instead, Zapata and Alvarez [8] suggested the use of the normalized difference D' [11]. Specifically, ${D^{'}}_{k}$ is the ratio of D_k to D_k,max when D_k > 0, or the ratio of D_k to -D_k,min when D_k < 0. Zapata and Alvarez obtained a bias-corrected confidence interval for the D' value of each stratum via the bootstrap method, so that acceptance or rejection of homogeneity of D' values can be judged by comparing the resulting confidence intervals. For the example considered in Zapata and Alvarez [8], the confidence intervals obtained from the strata have no intersection, which gives evidence to reject the null hypothesis of homogeneity. Unfortunately, Zapata and Alvarez [8] did not discuss decision rules for cases in which the intervals intersect but to different extents. Hence, no rigorous rule based on this confidence interval approach was proposed, which makes their method less practicable. It should be noted that a homogeneity test of either r values or D' values is not equivalent to a homogeneity test of D values. In particular, the transformation to D' only guarantees that the range of D' is [-1, 1]; difficulties remain in interpreting the value of D'. Lewontin [11] noted that values of D' at different loci and in different populations tend to vary with the values of the allele probabilities, so that the problem of cross-locus and cross-population comparisons is not fully overcome by the use of D'. In this article, without any transformation, we develop an asymptotic homogeneity test based directly on D values via the score method.
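To make the distinction between D, r and D' concrete, the following Python sketch computes r_k and D'_k from D_k and the allele probabilities, using the definitions quoted above. The function name `r_and_d_prime` and the two example strata are hypothetical; the example only shows that two strata sharing the same D can have different r and D' values when their allele probabilities differ.

```python
import math


def r_and_d_prime(d, p1_plus, p_plus1):
    """Return (r, D') for one stratum given D and the allele probabilities of A1 and B1."""
    p0_plus, p_plus0 = 1.0 - p1_plus, 1.0 - p_plus1
    r = d / math.sqrt(p1_plus * p0_plus * p_plus1 * p_plus0)
    d_max = min(p1_plus * p_plus0, p0_plus * p_plus1)
    d_min = -min(p1_plus * p_plus1, p0_plus * p_plus0)
    if d > 0:
        d_prime = d / d_max
    elif d < 0:
        d_prime = d / (-d_min)
    else:
        d_prime = 0.0
    return r, d_prime


# Two hypothetical strata with the same D = 0.04 but different allele probabilities.
print(r_and_d_prime(0.04, 0.5, 0.5))
print(r_and_d_prime(0.04, 0.1, 0.1))
```

In this illustration homogeneity of D holds exactly, yet neither the r values nor the D' values agree across the two strata.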

Methods

Homogeneity test

Let x_ijk (i, j = 0, 1 and k = 1,⋯,K) be the number of gametes A_iB_j in the k-th stratum, with the total number of gametes being n_k = x_00k + x_01k + x_10k + x_11k. Let M(n_k, {p_ijk}) denote the quadrinomial distribution with parameter vector (p_00k, p_01k, p_10k, p_11k)'. Thus, we have {x_ijk: i, j = 0, 1} ~ M(n_k, {p_ijk}) for k = 1,...,K. The homogeneity hypothesis in (1) is of interest in this article. Here, we assume that K is fixed and n_k is sufficiently large for k = 1, 2,...,K. Noticing that p_00k = p_0+k p_+0k + D_k, p_01k = p_0+k p_+1k - D_k, p_10k = p_1+k p_+0k - D_k and p_11k = p_1+k p_+1k + D_k, the log-likelihood for the k-th stratum can be expressed in terms of D_k, p_1+k and p_+1k (k = 1,...,K). That is,

\begin{aligned} l_{k} (D_{k}, p_{1 + k}, p_{+ 1 k}) = {} & x_{00 k} \ln (p_{0 + k} p_{+ 0 k} + D_{k}) + x_{01 k} \ln (p_{0 + k} p_{+ 1 k} - D_{k}) \\ & + x_{10 k} \ln (p_{1 + k} p_{+ 0 k} - D_{k}) + x_{11 k} \ln (p_{1 + k} p_{+ 1 k} + D_{k}), \end{aligned}

where p_0+k = 1 - p_1+k and p_+0k = 1 - p_+1k. Let D denote the common gametic disequilibrium under H₀, and let p₁₊ = (p₁₊₁,...,p_1+K)' and p₊₁ = (p₊₁₁,...,p_+1K)' denote the nuisance parameter vectors. Under H₀, the total log-likelihood for all K strata is given by

l (D, p_{1 +}, p_{+ 1}) = \sum_{k = 1}^{K} l_{k} (D, p_{1 + k}, p_{+ 1 k}) .

Hence, the efficient scores for the k-th stratum (i.e., the first order derivatives of l_k(D, p_1+k, p_+1k) with respect to D, p_1+k and p_+1k) are given by

\begin{aligned}
S_{k D} (D, p_{1 + k}, p_{+ 1 k}) &= \frac{\partial l_{k}}{\partial D} \\
&= \frac{x_{00 k}}{p_{0 + k} p_{+ 0 k} + D} - \frac{x_{01 k}}{p_{0 + k} p_{+ 1 k} - D} - \frac{x_{10 k}}{p_{1 + k} p_{+ 0 k} - D} + \frac{x_{11 k}}{p_{1 + k} p_{+ 1 k} + D}, \\
S_{k p_{1 + k}} (D, p_{1 + k}, p_{+ 1 k}) &= \frac{\partial l_{k}}{\partial p_{1 + k}} \\
&= - \frac{x_{00 k} p_{+ 0 k}}{p_{0 + k} p_{+ 0 k} + D} - \frac{x_{01 k} p_{+ 1 k}}{p_{0 + k} p_{+ 1 k} - D} + \frac{x_{10 k} p_{+ 0 k}}{p_{1 + k} p_{+ 0 k} - D} + \frac{x_{11 k} p_{+ 1 k}}{p_{1 + k} p_{+ 1 k} + D}, \\
S_{k p_{+ 1 k}} (D, p_{1 + k}, p_{+ 1 k}) &= \frac{\partial l_{k}}{\partial p_{+ 1 k}} \\
&= - \frac{x_{00 k} p_{0 + k}}{p_{0 + k} p_{+ 0 k} + D} - \frac{x_{10 k} p_{1 + k}}{p_{1 + k} p_{+ 0 k} - D} + \frac{x_{01 k} p_{0 + k}}{p_{0 + k} p_{+ 1 k} - D} + \frac{x_{11 k} p_{1 + k}}{p_{1 + k} p_{+ 1 k} + D} .
\end{aligned}
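The score expressions can be checked numerically against the log-likelihood. The Python sketch below codes l_k and its three first derivatives for one stratum and compares them with central finite differences. The function names, the shorthand a = p_1+k and b = p_+1k, and the counts are assumptions made for this illustration only.

```python
import math


def cell_probs(d, a, b):
    """Gametic probabilities (p00, p01, p10, p11) from D, a = p_{1+k}, b = p_{+1k}."""
    return ((1 - a) * (1 - b) + d, (1 - a) * b - d, a * (1 - b) - d, a * b + d)


def log_lik(x, d, a, b):
    """Log-likelihood l_k for counts x = (x00, x01, x10, x11), up to an additive constant."""
    return sum(xi * math.log(pi) for xi, pi in zip(x, cell_probs(d, a, b)))


def scores(x, d, a, b):
    """First derivatives of l_k with respect to D, p_{1+k} and p_{+1k}."""
    x00, x01, x10, x11 = x
    p00, p01, p10, p11 = cell_probs(d, a, b)
    s_d = x00 / p00 - x01 / p01 - x10 / p10 + x11 / p11
    s_a = -x00 * (1 - b) / p00 - x01 * b / p01 + x10 * (1 - b) / p10 + x11 * b / p11
    s_b = -x00 * (1 - a) / p00 - x10 * a / p10 + x01 * (1 - a) / p01 + x11 * a / p11
    return s_d, s_a, s_b


# Hypothetical counts and parameter values; compare with central differences.
x, d, a, b, h = (30, 20, 25, 25), 0.02, 0.45, 0.40, 1e-6
numeric_s_d = (log_lik(x, d + h, a, b) - log_lik(x, d - h, a, b)) / (2 * h)
print(scores(x, d, a, b)[0], numeric_s_d)
```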

If $\hat{D}$ , ${\hat{p}}_{1 +}$ and ${\hat{p}}_{+ 1}$ are the maximum likelihood estimates (MLEs) of D, p₁₊ and p₊₁ under H₀, respectively, then they satisfy the following 2K + 1 equations:

\left\{ \begin{array}{ll} \sum_{k = 1}^{K} S_{k D} (\hat{D}, {\hat{p}}_{1 + k}, {\hat{p}}_{+ 1 k}) = 0, & \\ S_{k p_{1 + k}} (\hat{D}, {\hat{p}}_{1 + k}, {\hat{p}}_{+ 1 k}) = 0, & k = 1, 2, \dots, K, \\ S_{k p_{+ 1 k}} (\hat{D}, {\hat{p}}_{1 + k}, {\hat{p}}_{+ 1 k}) = 0, & k = 1, 2, \dots, K . \end{array} \right.

Variances and covariances for the efficient scores are given by

\begin{aligned}
I_{k D D} &= \mathrm{Var} (S_{k D} (D, p_{1 + k}, p_{+ 1 k})) \\
&= n_{k} \left[ \frac{p_{0 + k}}{(p_{0 + k} p_{+ 0 k} + D) (p_{0 + k} p_{+ 1 k} - D)} + \frac{p_{1 + k}}{(p_{1 + k} p_{+ 0 k} - D) (p_{1 + k} p_{+ 1 k} + D)} \right], \\
I_{k p_{1 + k} p_{1 + k}} &= \mathrm{Var} (S_{k p_{1 + k}} (D, p_{1 + k}, p_{+ 1 k})) \\
&= n_{k} \left[ \frac{p_{+ 0 k}^{3}}{(p_{0 + k} p_{+ 0 k} + D) (p_{1 + k} p_{+ 0 k} - D)} + \frac{p_{+ 1 k}^{3}}{(p_{0 + k} p_{+ 1 k} - D) (p_{1 + k} p_{+ 1 k} + D)} \right], \\
I_{k p_{+ 1 k} p_{+ 1 k}} &= \mathrm{Var} (S_{k p_{+ 1 k}} (D, p_{1 + k}, p_{+ 1 k})) \\
&= n_{k} \left[ \frac{p_{0 + k}^{3}}{(p_{0 + k} p_{+ 0 k} + D) (p_{0 + k} p_{+ 1 k} - D)} + \frac{p_{1 + k}^{3}}{(p_{1 + k} p_{+ 0 k} - D) (p_{1 + k} p_{+ 1 k} + D)} \right], \\
I_{k D p_{1 + k}} &= \mathrm{Cov} (S_{k D} (D, p_{1 + k}, p_{+ 1 k}), S_{k p_{1 + k}} (D, p_{1 + k}, p_{+ 1 k})) \\
&= n_{k} \left[ \frac{p_{+ 1 k}^{2}}{(p_{0 + k} p_{+ 1 k} - D) (p_{1 + k} p_{+ 1 k} + D)} - \frac{p_{+ 0 k}^{2}}{(p_{0 + k} p_{+ 0 k} + D) (p_{1 + k} p_{+ 0 k} - D)} \right], \\
I_{k D p_{+ 1 k}} &= \mathrm{Cov} (S_{k D} (D, p_{1 + k}, p_{+ 1 k}), S_{k p_{+ 1 k}} (D, p_{1 + k}, p_{+ 1 k})) \\
&= n_{k} \left[ \frac{p_{1 + k}^{2}}{(p_{1 + k} p_{+ 0 k} - D) (p_{1 + k} p_{+ 1 k} + D)} - \frac{p_{0 + k}^{2}}{(p_{0 + k} p_{+ 0 k} + D) (p_{0 + k} p_{+ 1 k} - D)} \right], \\
I_{k p_{1 + k} p_{+ 1 k}} &= \mathrm{Cov} (S_{k p_{1 + k}} (D, p_{1 + k}, p_{+ 1 k}), S_{k p_{+ 1 k}} (D, p_{1 + k}, p_{+ 1 k})) \\
&= - n_{k} D \left[ \frac{p_{0 + k}}{(p_{0 + k} p_{+ 0 k} + D) (p_{0 + k} p_{+ 1 k} - D)} + \frac{p_{1 + k}}{(p_{1 + k} p_{+ 0 k} - D) (p_{1 + k} p_{+ 1 k} + D)} \right] .
\end{aligned}

Denote

I_{k D | p_{1 + k} p_{+ 1 k}} = I_{k D D} - (I_{k D p_{1 + k}}, I_{k D p_{+ 1 k}}) {(\begin{matrix} I_{k p_{1 + k} p_{1 + k}} & I_{k p_{1 + k} p_{+ 1 k}} \\ I_{k p_{1 + k} p_{+ 1 k}} & I_{k p_{+ 1 k} p_{+ 1 k}} \end{matrix})}^{- 1} (I_{k D p_{1 + k}}, I_{k D p_{+ 1 k}})^{'} .
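As an illustration of how this quantity can be evaluated, the Python sketch below codes the six information terms for one stratum and forms I_{kD|p_1+k p_+1k} by inverting the 2 × 2 nuisance block explicitly. The function names and the shorthand a = p_1+k, b = p_+1k are assumptions of this sketch, and the parameter values are hypothetical.

```python
def information_terms(n, d, a, b):
    """Return (I_DD, I_aa, I_bb, I_Da, I_Db, I_ab) for one stratum, with a = p_{1+k}, b = p_{+1k}."""
    a0, b0 = 1 - a, 1 - b
    p00, p01, p10, p11 = a0 * b0 + d, a0 * b - d, a * b0 - d, a * b + d
    i_dd = n * (a0 / (p00 * p01) + a / (p10 * p11))
    i_aa = n * (b0 ** 3 / (p00 * p10) + b ** 3 / (p01 * p11))
    i_bb = n * (a0 ** 3 / (p00 * p01) + a ** 3 / (p10 * p11))
    i_da = n * (b ** 2 / (p01 * p11) - b0 ** 2 / (p00 * p10))
    i_db = n * (a ** 2 / (p10 * p11) - a0 ** 2 / (p00 * p01))
    i_ab = -n * d * (a0 / (p00 * p01) + a / (p10 * p11))
    return i_dd, i_aa, i_bb, i_da, i_db, i_ab


def info_d_given_nuisance(n, d, a, b):
    """I_DD minus the quadratic form in the inverse of the 2x2 nuisance block."""
    i_dd, i_aa, i_bb, i_da, i_db, i_ab = information_terms(n, d, a, b)
    det = i_aa * i_bb - i_ab ** 2
    quad = (i_da ** 2 * i_bb - 2 * i_da * i_db * i_ab + i_db ** 2 * i_aa) / det
    return i_dd - quad


# Hypothetical stratum: n = 1, D = 0.02, p_{1+k} = 0.3, p_{+1k} = 0.2.
print(info_d_given_nuisance(1, 0.02, 0.3, 0.2))
```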

Hence, the likelihood score test for the homogeneity hypothesis H₀ : D₁ = ⋯ = D_K is given by

X^{2} = \sum_{k = 1}^{K} \frac{S_{k D}^{2} (\hat{D}, {\hat{p}}_{1 + k}, {\hat{p}}_{+ 1 k})}{I_{k D | p_{1 + k} p_{+ 1 k}} (\hat{D}, {\hat{p}}_{1 + k}, {\hat{p}}_{+ 1 k})},

which asymptotically follows the chi-square distribution with K - 1 degrees of freedom under H₀.

Unfortunately, $\hat{D}$ , ${\hat{p}}_{1 +}$ and ${\hat{p}}_{+ 1}$ cannot be expressed in a closed form and this makes the likelihood score test X² less appealing in practice. To overcome this issue, applying the theory of homogeneity score test extended to nuisance parameters [12] we propose the following modified score statistic

X^{2 *} = \sum_{k = 1}^{K} \frac{S_{k D}^{2} (D^{*}, p_{1 + k}^{*}, p_{+ 1 k}^{*})}{I_{k D | p_{1 + k} p_{+ 1 k}} (D^{*}, p_{1 + k}^{*}, p_{+ 1 k}^{*})} - \frac{{[\sum_{k = 1}^{K} S_{k D} (D^{*}, p_{1 + k}^{*}, p_{+ 1 k}^{*})]}^{2}}{\sum_{k = 1}^{K} I_{k D | p_{1 + k} p_{+ 1 k}} (D^{*}, p_{1 + k}^{*}, p_{+ 1 k}^{*})},

(2)

where D*, $p_{1 +}^{*}$ and $p_{+ 1}^{*}$ are any consistent estimators of D, p₁₊ and p₊₁, respectively. To this end, we choose D* to be $\sum_{k = 1}^{K} (\frac{x_{00 k} x_{11 k}}{x_{01 k} x_{10 k}} - 1) / \sum_{k = 1}^{K} \frac{n_{k}^{2}}{x_{01 k} x_{10 k}}$ , and $p_{1 + k}^{*}$ and $p_{+ 1 k}^{*}$ to be the solutions to the following equations

\left\{ \begin{array}{l} S_{k p_{1 + k}} (D^{*}, p_{1 + k}, p_{+ 1 k}) \equiv - \frac{x_{00 k} p_{+ 0 k}}{p_{0 + k} p_{+ 0 k} + D^{*}} - \frac{x_{01 k} p_{+ 1 k}}{p_{0 + k} p_{+ 1 k} - D^{*}} + \frac{x_{10 k} p_{+ 0 k}}{p_{1 + k} p_{+ 0 k} - D^{*}} + \frac{x_{11 k} p_{+ 1 k}}{p_{1 + k} p_{+ 1 k} + D^{*}} = 0, \\ S_{k p_{+ 1 k}} (D^{*}, p_{1 + k}, p_{+ 1 k}) \equiv - \frac{x_{00 k} p_{0 + k}}{p_{0 + k} p_{+ 0 k} + D^{*}} - \frac{x_{10 k} p_{1 + k}}{p_{1 + k} p_{+ 0 k} - D^{*}} + \frac{x_{01 k} p_{0 + k}}{p_{0 + k} p_{+ 1 k} - D^{*}} + \frac{x_{11 k} p_{1 + k}}{p_{1 + k} p_{+ 1 k} + D^{*}} = 0, \end{array} \right.

or equivalently the following quartic polynomial equations,

\left\{ \begin{array}{l} a_{0} + a_{1} p_{+ 1 k} + a_{2} p_{+ 1 k}^{2} + a_{3} p_{+ 1 k}^{3} + a_{4} p_{+ 1 k}^{4} = 0, \\ b_{0} + b_{1} p_{1 + k} + b_{2} p_{1 + k}^{2} + b_{3} p_{1 + k}^{3} + b_{4} p_{1 + k}^{4} = 0, \end{array} \right.

where

\begin{aligned}
a_{0} &= [x_{+ 0 k} (p_{1 + k} - D^{*}) - x_{10 k}] {(D^{*})}^{2}, \\
a_{1} &= (n_{k} + x_{+ 0 k}) D^{*} p_{1 + k}^{2} - [2 (n_{k} + x_{+ 0 k}) D^{*} + n_{k} + 2 x_{10 k}] D^{*} p_{1 + k} + [n_{k} {(D^{*})}^{2} + (n_{k} + 2 x_{10 k}) D^{*} + x_{10 k}] D^{*}, \\
a_{2} &= n_{k} p_{1 + k}^{3} - [(4 n_{k} + x_{+ 0 k}) D^{*} + n_{k} + x_{1 + k}] p_{1 + k}^{2} + [3 n_{k} {(D^{*})}^{2} + (3 n_{k} + 4 x_{10 k} + 2 x_{11 k}) D^{*} + x_{1 + k}] p_{1 + k} - [(n_{k} + x_{1 + k}) D^{*} + 2 x_{10 k} + x_{11 k}] D^{*}, \\
a_{3} &= - 2 n_{k} p_{1 + k}^{3} + [3 n_{k} D^{*} + 2 (n_{k} + x_{1 + k})] p_{1 + k}^{2} - 2 [(n_{k} + x_{1 + k}) D^{*} + x_{1 + k}] p_{1 + k} + x_{1 + k} D^{*}, \\
a_{4} &= n_{k} p_{1 + k}^{3} - (n_{k} + x_{1 + k}) p_{1 + k}^{2} + x_{1 + k} p_{1 + k}, \\
b_{0} &= [x_{0 + k} (p_{+ 1 k} - D^{*}) - x_{01 k}] {(D^{*})}^{2}, \\
b_{1} &= (n_{k} + x_{0 + k}) D^{*} p_{+ 1 k}^{2} - [2 (n_{k} + x_{0 + k}) D^{*} + n_{k} + 2 x_{01 k}] D^{*} p_{+ 1 k} + [n_{k} {(D^{*})}^{2} + (n_{k} + 2 x_{01 k}) D^{*} + x_{01 k}] D^{*}, \\
b_{2} &= n_{k} p_{+ 1 k}^{3} - [(4 n_{k} + x_{+ 1 k}) D^{*} + n_{k} + x_{+ 1 k}] p_{+ 1 k}^{2} + [3 n_{k} {(D^{*})}^{2} + (3 n_{k} + 4 x_{01 k} + 2 x_{11 k}) D^{*} + x_{+ 1 k}] p_{+ 1 k} - [(n_{k} + x_{+ 1 k}) D^{*} + 2 x_{01 k} + x_{11 k}] D^{*}, \\
b_{3} &= - 2 n_{k} p_{+ 1 k}^{3} + [3 n_{k} D^{*} + 2 (n_{k} + x_{+ 1 k})] p_{+ 1 k}^{2} - 2 [(n_{k} + x_{+ 1 k}) D^{*} + x_{+ 1 k}] p_{+ 1 k} + x_{+ 1 k} D^{*}, \\
b_{4} &= n_{k} p_{+ 1 k}^{3} - (n_{k} + x_{+ 1 k}) p_{+ 1 k}^{2} + x_{+ 1 k} p_{+ 1 k} .
\end{aligned}

Here, D* is analogous to the well-known Mantel-Haenszel estimator [13]. It is a consistent estimator of D but, in general, not an efficient one. The proof of consistency and the conditions under which D* achieves asymptotic efficiency are presented in the Appendix. We notice that the calculation of $I_{k D | p_{1 + k} p_{+ 1 k}}$ in (2) is quite tedious. Nonetheless, it is easy to show that $I_{k D | p_{1 + k} p_{+ 1 k}}$ is simply given by n_k/w_k(D, p_1+k, p_+1k) with $w_{k} (D, p_{1 + k}, p_{+ 1 k}) = p_{11 k} p_{00 k}^{2} + p_{10 k} p_{01 k}^{2} + p_{01 k} p_{10 k}^{2} + p_{00 k} p_{11 k}^{2} - 4 D^{2}$ (see the Appendix for the proof). It can be shown that X^2* has an asymptotic chi-square distribution with K - 1 degrees of freedom under H₀. Therefore, the homogeneity hypothesis H₀ is rejected at level α when X^2* ≥ $χ_{K - 1, (1 - α)}^{2}$ , where $χ_{K - 1, (1 - α)}^{2}$ is the 100 × (1 - α) percentile point of the chi-square distribution with K - 1 degrees of freedom. Finally, it is noteworthy that if the consistent estimators of D, p₁₊ and p₊₁ are the constrained MLEs under H₀, then the second term of (2) vanishes, since $\sum_{k = 1}^{K} S_{k D} (D^{*}, p_{1 + k}^{*}, p_{+ 1 k}^{*}) = 0$ , and (2) reduces to the likelihood score statistic.
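The computation of X^2* in (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: instead of solving the quartic equations, the hypothetical helper `solve_nuisance` finds $p_{1 + k}^{*}$ and $p_{+ 1 k}^{*}$ by Fisher scoring on the two nuisance score equations with D fixed at D*, starting from the sample allele proportions. All function names and the example counts are invented for illustration, and strata with x_01k x_10k = 0 are not handled.

```python
def cells(d, a, b):
    """Gametic probabilities (p00, p01, p10, p11) from D, a = p_{1+k}, b = p_{+1k}."""
    return ((1 - a) * (1 - b) + d, (1 - a) * b - d, a * (1 - b) - d, a * b + d)


def d_star(tables):
    """Pooled estimator D* from the counts (x00, x01, x10, x11) of each stratum."""
    num = sum(x00 * x11 / (x01 * x10) - 1 for x00, x01, x10, x11 in tables)
    den = sum((x00 + x01 + x10 + x11) ** 2 / (x01 * x10) for x00, x01, x10, x11 in tables)
    return num / den


def nuisance_scores(x, d, a, b):
    """Scores of one stratum with respect to a = p_{1+k} and b = p_{+1k}."""
    x00, x01, x10, x11 = x
    p00, p01, p10, p11 = cells(d, a, b)
    s_a = -x00 * (1 - b) / p00 - x01 * b / p01 + x10 * (1 - b) / p10 + x11 * b / p11
    s_b = -x00 * (1 - a) / p00 - x10 * a / p10 + x01 * (1 - a) / p01 + x11 * a / p11
    return s_a, s_b


def solve_nuisance(x, d, max_iter=500, tol=1e-10):
    """Fisher scoring for the two nuisance score equations with D held fixed (assumed solver)."""
    n = sum(x)
    a, b = (x[2] + x[3]) / n, (x[1] + x[3]) / n  # start at the sample allele proportions
    if min(cells(d, a, b)) <= 0:
        raise ValueError("D is not feasible for this stratum at the starting point")
    for _ in range(max_iter):
        s_a, s_b = nuisance_scores(x, d, a, b)
        if abs(s_a) + abs(s_b) < tol * n:
            break
        p00, p01, p10, p11 = cells(d, a, b)
        a0, b0 = 1 - a, 1 - b
        i_aa = n * (b0 ** 3 / (p00 * p10) + b ** 3 / (p01 * p11))
        i_bb = n * (a0 ** 3 / (p00 * p01) + a ** 3 / (p10 * p11))
        i_ab = -n * d * (a0 / (p00 * p01) + a / (p10 * p11))
        det = i_aa * i_bb - i_ab ** 2
        step_a = (i_bb * s_a - i_ab * s_b) / det
        step_b = (i_aa * s_b - i_ab * s_a) / det
        scale = 1.0
        while min(cells(d, a + scale * step_a, b + scale * step_b)) <= 0:
            scale /= 2  # shrink the step until every cell probability stays positive
        a, b = a + scale * step_a, b + scale * step_b
    return a, b


def w(d, a, b):
    """w_k, so that the information for D given the nuisance parameters is n_k / w_k."""
    p00, p01, p10, p11 = cells(d, a, b)
    return p11 * p00 ** 2 + p10 * p01 ** 2 + p01 * p10 ** 2 + p00 * p11 ** 2 - 4 * d ** 2


def x2_star(tables):
    """Modified score statistic (2) for a list of strata counts."""
    d = d_star(tables)
    s_list, i_list = [], []
    for x in tables:
        a, b = solve_nuisance(x, d)
        p00, p01, p10, p11 = cells(d, a, b)
        s_list.append(x[0] / p00 - x[1] / p01 - x[2] / p10 + x[3] / p11)
        i_list.append(sum(x) / w(d, a, b))
    first = sum(s * s / i for s, i in zip(s_list, i_list))
    return first - sum(s_list) ** 2 / sum(i_list)


# Hypothetical counts: two strata whose sample D values are 0.1 and 0.
print(x2_star([(350, 150, 150, 350), (250, 250, 250, 250)]))
```

The returned value would be compared with the 100 × (1 - α) percentile of the chi-square distribution with K - 1 degrees of freedom, as described above.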

Asymptotic power and sample size

We now present the asymptotic power and sample size formulae based on X^2* [14]. For this purpose, we assume n_k = n a_k for some n and a_k > 0. Let ${\bar{D}}_{k}$ , ${\bar{p}}_{1 + k}$ and ${\bar{p}}_{+ 1 k}$ be the true parameter values for D_k, p_1+k and p_+1k under H₁, where k = 1, 2,⋯,K and ${\bar{D}}_{k} \neq {\bar{D}}_{j}$ for at least one pair k ≠ j. Thus, the asymptotic power for the homogeneity score test X^2* at level α is given by

\Pr (X^{2 *} \geq χ_{K - 1, (1 - α)}^{2} | H_{1}) = \Pr (χ_{K - 1}^{2} (Δ) \geq χ_{K - 1, (1 - α)}^{2}),

where $χ_{K - 1}^{2} (Δ)$ denotes the non-central chi-square distribution with K - 1 degrees of freedom with the non-centrality parameter being

\begin{aligned}
Δ = n \Bigg\{ & \sum_{k = 1}^{K} \frac{a_{k} {\left( \frac{{\bar{p}}_{0 + k} {\bar{p}}_{+ 0 k} + {\bar{D}}_{k}}{p_{0 + k} p_{+ 0 k} + d} - \frac{{\bar{p}}_{0 + k} {\bar{p}}_{+ 1 k} - {\bar{D}}_{k}}{p_{0 + k} p_{+ 1 k} - d} - \frac{{\bar{p}}_{1 + k} {\bar{p}}_{+ 0 k} - {\bar{D}}_{k}}{p_{1 + k} p_{+ 0 k} - d} + \frac{{\bar{p}}_{1 + k} {\bar{p}}_{+ 1 k} + {\bar{D}}_{k}}{p_{1 + k} p_{+ 1 k} + d} \right)}^{2}}{1 / w_{k} (d, p_{1 + k}, p_{+ 1 k})} \\
& - \frac{{\left[ \sum_{k = 1}^{K} a_{k} \left( \frac{{\bar{p}}_{0 + k} {\bar{p}}_{+ 0 k} + {\bar{D}}_{k}}{p_{0 + k} p_{+ 0 k} + d} - \frac{{\bar{p}}_{0 + k} {\bar{p}}_{+ 1 k} - {\bar{D}}_{k}}{p_{0 + k} p_{+ 1 k} - d} - \frac{{\bar{p}}_{1 + k} {\bar{p}}_{+ 0 k} - {\bar{D}}_{k}}{p_{1 + k} p_{+ 0 k} - d} + \frac{{\bar{p}}_{1 + k} {\bar{p}}_{+ 1 k} + {\bar{D}}_{k}}{p_{1 + k} p_{+ 1 k} + d} \right) \right]}^{2}}{\sum_{k = 1}^{K} [a_{k} / w_{k} (d, p_{1 + k}, p_{+ 1 k})]} \Bigg\},
\end{aligned}

where $d = \sum_{k = 1}^{K} [\frac{({\bar{p}}_{0 + k} {\bar{p}}_{+ 0 k} + {\bar{D}}_{k}) ({\bar{p}}_{1 + k} {\bar{p}}_{+ 1 k} + {\bar{D}}_{k})}{({\bar{p}}_{0 + k} {\bar{p}}_{+ 1 k} - {\bar{D}}_{k}) ({\bar{p}}_{1 + k} {\bar{p}}_{+ 0 k} - {\bar{D}}_{k})} - 1] / \sum_{k = 1}^{K} \frac{1}{({\bar{p}}_{0 + k} {\bar{p}}_{+ 1 k} - {\bar{D}}_{k}) ({\bar{p}}_{1 + k} {\bar{p}}_{+ 0 k} - {\bar{D}}_{k})}$ , ${\bar{p}}_{0 + k} = 1 - {\bar{p}}_{1 + k}$ , ${\bar{p}}_{+ 0 k} = 1 - {\bar{p}}_{+ 1 k}$ , and p_1+k and p_+1k are the solutions of the following equations

\left\{ \begin{array}{l} {\bar{a}}_{0} + {\bar{a}}_{1} p_{+ 1 k} + {\bar{a}}_{2} p_{+ 1 k}^{2} + {\bar{a}}_{3} p_{+ 1 k}^{3} + {\bar{a}}_{4} p_{+ 1 k}^{4} = 0, \\ {\bar{b}}_{0} + {\bar{b}}_{1} p_{1 + k} + {\bar{b}}_{2} p_{1 + k}^{2} + {\bar{b}}_{3} p_{1 + k}^{3} + {\bar{b}}_{4} p_{1 + k}^{4} = 0, \end{array} \right.

where

\begin{aligned}
{\bar{a}}_{0} &= [{\bar{p}}_{+ 0 k} (p_{1 + k} - d) - {\bar{p}}_{10 k}] d^{2}, \\
{\bar{a}}_{1} &= (1 + {\bar{p}}_{+ 0 k}) d p_{1 + k}^{2} - [2 (1 + {\bar{p}}_{+ 0 k}) d + 1 + 2 {\bar{p}}_{10 k}] d p_{1 + k} + [d^{2} + (1 + 2 {\bar{p}}_{10 k}) d + {\bar{p}}_{10 k}] d, \\
{\bar{a}}_{2} &= p_{1 + k}^{3} - [(4 + {\bar{p}}_{+ 0 k}) d + 1 + {\bar{p}}_{1 + k}] p_{1 + k}^{2} + [3 d^{2} + (3 + 4 {\bar{p}}_{10 k} + 2 {\bar{p}}_{11 k}) d + {\bar{p}}_{1 + k}] p_{1 + k} - [(1 + {\bar{p}}_{1 + k}) d + 2 {\bar{p}}_{10 k} + {\bar{p}}_{11 k}] d, \\
{\bar{a}}_{3} &= - 2 p_{1 + k}^{3} + [3 d + 2 (1 + {\bar{p}}_{1 + k})] p_{1 + k}^{2} - 2 [(1 + {\bar{p}}_{1 + k}) d + {\bar{p}}_{1 + k}] p_{1 + k} + {\bar{p}}_{1 + k} d, \\
{\bar{a}}_{4} &= p_{1 + k}^{3} - (1 + {\bar{p}}_{1 + k}) p_{1 + k}^{2} + {\bar{p}}_{1 + k} p_{1 + k}, \\
{\bar{b}}_{0} &= [{\bar{p}}_{0 + k} (p_{+ 1 k} - d) - {\bar{p}}_{01 k}] d^{2}, \\
{\bar{b}}_{1} &= (1 + {\bar{p}}_{0 + k}) d p_{+ 1 k}^{2} - [2 (1 + {\bar{p}}_{0 + k}) d + 1 + 2 {\bar{p}}_{01 k}] d p_{+ 1 k} + [d^{2} + (1 + 2 {\bar{p}}_{01 k}) d + {\bar{p}}_{01 k}] d, \\
{\bar{b}}_{2} &= p_{+ 1 k}^{3} - [(4 + {\bar{p}}_{+ 1 k}) d + 1 + {\bar{p}}_{+ 1 k}] p_{+ 1 k}^{2} + [3 d^{2} + (3 + 4 {\bar{p}}_{01 k} + 2 {\bar{p}}_{11 k}) d + {\bar{p}}_{+ 1 k}] p_{+ 1 k} - [(1 + {\bar{p}}_{+ 1 k}) d + 2 {\bar{p}}_{01 k} + {\bar{p}}_{11 k}] d, \\
{\bar{b}}_{3} &= - 2 p_{+ 1 k}^{3} + [3 d + 2 (1 + {\bar{p}}_{+ 1 k})] p_{+ 1 k}^{2} - 2 [(1 + {\bar{p}}_{+ 1 k}) d + {\bar{p}}_{+ 1 k}] p_{+ 1 k} + {\bar{p}}_{+ 1 k} d, \\
{\bar{b}}_{4} &= p_{+ 1 k}^{3} - (1 + {\bar{p}}_{+ 1 k}) p_{+ 1 k}^{2} + {\bar{p}}_{+ 1 k} p_{+ 1 k} .
\end{aligned}

The sample size n required to attain power 1 - β at nominal level α, with ${\bar{D}}_{k}$ , ${\bar{p}}_{1 + k}$ and ${\bar{p}}_{+ 1 k}$ being the true parameter values for D_k, p_1+k and p_+1k under the alternative H₁, can be found from the relation

χ_{K - 1, β}^{2} (Δ) = χ_{K - 1, (1 - α)}^{2},

(3)

where $χ_{K - 1, β}^{2} (Δ)$ is the 100 × β percentile point of the non-central chi-square distribution with K - 1 degrees of freedom and non-centrality parameter Δ. The sample size n can be readily obtained by solving the above equation.
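Because Δ is proportional to n, equation (3) can be solved by a simple search over n once the noncentrality per unit of n is known. The Python sketch below is a standard-library illustration under stated assumptions: the chi-square and noncentral chi-square distribution functions are evaluated by series, the argument `delta_per_unit` (the value of Δ at n = 1) is assumed to have been computed separately from the formula for Δ, and all function names and the numerical inputs are hypothetical.

```python
import math


def gamma_p(a, x):
    """Regularized lower incomplete gamma function P(a, x), via its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    denom = a
    while term > total * 1e-17:
        denom += 1.0
        term *= x / denom
        total += term
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_cdf(x, df):
    """Central chi-square distribution function."""
    return gamma_p(df / 2.0, x / 2.0)


def chi2_quantile(q, df):
    """Quantile of the central chi-square distribution, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < q:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def ncx2_cdf(x, df, nc):
    """Noncentral chi-square CDF as a Poisson mixture of central chi-square CDFs."""
    if nc <= 0:
        return chi2_cdf(x, df)
    half, total, j = nc / 2.0, 0.0, 0
    while True:
        weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        total += weight * chi2_cdf(x, df + 2 * j)
        if j > half and weight < 1e-16:
            return total
        j += 1


def power(n, delta_per_unit, df, alpha=0.05):
    """Asymptotic power when the noncentrality parameter is n * delta_per_unit."""
    crit = chi2_quantile(1.0 - alpha, df)
    return 1.0 - ncx2_cdf(crit, df, n * delta_per_unit)


def sample_size(delta_per_unit, df, alpha=0.05, target=0.90):
    """Smallest n whose asymptotic power reaches the target."""
    crit = chi2_quantile(1.0 - alpha, df)
    n = 1
    while 1.0 - ncx2_cdf(crit, df, n * delta_per_unit) < target:
        n += 1
    return n


# Hypothetical input: K = 3 strata (df = 2) and a noncentrality of 0.1 per unit of n.
print(sample_size(0.1, df=2))
```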

Availability and requirements

We have implemented the test procedures for computing our score statistic X^2* in a Matlab project. Project name: gametic disequilibrium homogeneity score test (GDHST); Project home page: http://math.nenu.edu.cn/jhguo/program.htm; Operating system: Windows XP; Programming language: Matlab 6.1; Licence: GNU GPL.

Results

Simulation results

To evaluate the performance of our proposed homogeneity score test, we include the homogeneity test recommended by Weir [9] in our comparison study. The corresponding test statistic for homogeneity is given by

T^{2} = \sum_{k = 1}^{K} (n_{k} - 3) {(z_{k} - \bar{z})}^{2},

where K is the total number of strata, n_k is the total number of gametes in stratum k, $z_{k} = \frac{1}{2} \ln (\frac{1 + r_{k}}{1 - r_{k}})$ is Fisher's z transformation with $r_{k} = \frac{n_{k} x_{11 k} - x_{1 + k} x_{+ 1 k}}{\sqrt{x_{0 + k} x_{+ 0 k} x_{1 + k} x_{+ 1 k}}}$ , (x_00k, x_01k, x_10k, x_11k)' are the gametic counts in the k-th stratum, and $\bar{z}$ is the average of the z_k values.
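A minimal Python sketch of T² is given below. One assumption is made explicit: $\bar{z}$ is computed as the mean of the z_k values weighted by n_k - 3, which is the usual choice in Fisher's homogeneity test and coincides with the plain average when all n_k are equal, as in the simulations reported here. The function name `weir_t2` and the counts are hypothetical.

```python
import math


def weir_t2(tables):
    """T^2 for a list of strata, each given as counts (x00, x01, x10, x11)."""
    zs, weights = [], []
    for x00, x01, x10, x11 in tables:
        n = x00 + x01 + x10 + x11
        x0p, x1p = x00 + x01, x10 + x11   # margins of locus A
        xp0, xp1 = x00 + x10, x01 + x11   # margins of locus B
        r = (n * x11 - x1p * xp1) / math.sqrt(x0p * xp0 * x1p * xp1)
        zs.append(0.5 * math.log((1 + r) / (1 - r)))   # Fisher's z transformation
        weights.append(n - 3)
    # Assumption: weighted mean with weights n_k - 3 (equals the plain mean for equal n_k).
    z_bar = sum(wt * z for wt, z in zip(weights, zs)) / sum(weights)
    return sum(wt * (z - z_bar) ** 2 for wt, z in zip(weights, zs))


# Hypothetical counts: sample correlations 0.4 and 0 in two strata of 1000 gametes each.
print(weir_t2([(350, 150, 150, 350), (250, 250, 250, 250)]))
```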

We investigate the performance of X^2* and T² in terms of type I error rate and power. For type I error rates, we consider both equal and unequal allele probabilities varying from 0.1 to 0.5 across strata (K = 3 and 5) with equal sample sizes (n_k = 50, 100 and 200 for k = 1,...,K) and common disequilibrium (D = $\frac{1}{2}$ D_min, 0 and $\frac{1}{2}$ D_max), where D_min = max{D_1,min,...,D_K,min} and D_max = min{D_1,max,...,D_K,max}.

Monte Carlo simulations with 5,000 repetitions at the 0.05 nominal level are summarized in Tables 1 to 4. Table 1 shows the empirical type I error rates of X^2* and T² with equal allele probabilities across K = 3 strata. We observe the following.
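The simulation settings can be reproduced in outline with the Python sketch below. It computes the range of a common D that is feasible in every stratum and draws one quadrinomial sample for a stratum. The function names, the seed and the use of Python's `random` module are assumptions made for illustration; the article's own simulations were presumably run in Matlab.

```python
import random


def common_d_range(p1_plus, p_plus1):
    """Return (D_min, D_max) for a common D, given per-stratum allele probabilities."""
    d_mins, d_maxs = [], []
    for a, b in zip(p1_plus, p_plus1):
        d_mins.append(-min(a * b, (1 - a) * (1 - b)))
        d_maxs.append(min(a * (1 - b), (1 - a) * b))
    return max(d_mins), min(d_maxs)


def sample_counts(rng, n, d, a, b):
    """Draw quadrinomial counts (x00, x01, x10, x11) for one stratum of n gametes."""
    probs = [(1 - a) * (1 - b) + d, (1 - a) * b - d, a * (1 - b) - d, a * b + d]
    counts = [0, 0, 0, 0]
    for cell in rng.choices(range(4), weights=probs, k=n):
        counts[cell] += 1
    return tuple(counts)


# Setting with equal allele probabilities 0.1 in K = 3 strata and D = D_max / 2.
d_min, d_max = common_d_range([0.1, 0.1, 0.1], [0.1, 0.1, 0.1])
rng = random.Random(2007)  # arbitrary seed, for reproducibility only
print(0.5 * d_max, sample_counts(rng, 50, 0.5 * d_max, 0.1, 0.1))
```

With allele probabilities of 0.1 in every stratum, half of D_max is 0.045, up to floating-point rounding.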

Table 1 Empirical type I error rates for X^2* and T² for equal allele probabilities across K = 3 strata under H₀


Table 2 Empirical type I error rates for X^2* and T² for unequal allele probabilities across K = 3 strata under H₀


Table 3 Empirical type I error rates for X^2* and T² for equal allele probabilities across K = 5 strata under H₀


Table 4 Empirical type I error rates for X^2* and T² for unequal allele probabilities across K = 5 strata under H₀


1. When D is large (i.e., $\frac{1}{2}$ D_max), both tests generally appear to be quite liberal (e.g., empirical size being 10 times the nominal level), especially for small sample sizes (e.g., n_k = 50) and small allele probabilities (e.g., p₁₊ = p₊₁ = (0.1, 0.1, 0.1)'). This inflation of the empirical size is more severe for T² than for our asymptotic homogeneity test X^2*, and for X^2* it is substantially alleviated as the sample size increases. In contrast, increasing the sample size does little to alleviate the liberality of T². In fact, even with n_k = 3200 for k = 1, 2, 3, T² is still very liberal for D = 0.045, with an empirical type I error rate of 0.456 (data not shown).

2. For other settings, both tests perform quite satisfactorily in the sense that their empirical sizes are well controlled around the pre-chosen nominal level. In general, the larger the sample size, the closer the empirical type I error rate to the pre-chosen nominal level.

Table 2 reports the empirical sizes of X^2* and T² for unequal allele probabilities across K = 3 strata. We observe phenomena similar to those above. However, our asymptotic homogeneity test X^2* performs quite well in all settings under consideration for moderate to large sample sizes (i.e., n_k = 100 and 200), which is not the case for T². For T², the resultant empirical type I error rate can be extremely inflated even for large-sample designs (e.g., more than 17 times the nominal level when n_k = 200 (for k = 1, 2, 3), p₁₊ = p₊₁ = (0.5, 0.3, 0.1)', and D = 0.045).

Tables 3 and 4 show the empirical type I error rates of X^2* and T² for K = 5. The parameter settings are similar to those of Tables 1 and 2. According to the simulation results, the liberality becomes more serious and larger sample sizes are required to attain similar performance when K increases from 3 to 5 under similar parameter settings.

Since many type I error rates for X^2* and T² are liberal in Tables 1 to 4, a two-sided t-test is conducted to determine whether an empirical type I error rate is significantly different from the nominal level of 0.05. The t-test statistic is

\sqrt{m - 1} \frac{W - 0.05}{\sqrt{W (1 - W)}},

where m = 5000 and W represents the empirical type I error rate of X^2* or T². Here, the t-test is almost identical to the z-test because m is very large. The empirical type I error rates that are significantly different from the nominal level of 0.05 are underlined in Tables 1 to 4. In Table 1, the total numbers of rates significantly different from the nominal level of 0.05 for X^2* and T² are 28 and 38, respectively. The pair (28, 38) can be further decomposed into (14, 14), (8, 13) and (6, 11) for n = 50, 100 and 200, respectively. For X^2*, the proportion of empirical type I error rates that are significantly different from the nominal level of 0.05 decreases by 14/18 - 6/18 = 44.4% as n increases from 50 to 200, while the corresponding decrease for T² is 14/18 - 11/18 = 16.7%. Hence, X^2* becomes less liberal than T² as the sample size increases.
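The significance check on an empirical type I error rate is a one-line computation, sketched here in Python. The function name `size_t_statistic`, the cutoff of 1.96 used for a two-sided 0.05-level decision, and the example rates are assumptions chosen for illustration; they are not values from Tables 1 to 4.

```python
import math


def size_t_statistic(w, m=5000, nominal=0.05):
    """t statistic comparing an empirical type I error rate w (from m replicates) with the nominal level."""
    return math.sqrt(m - 1) * (w - nominal) / math.sqrt(w * (1 - w))


# Hypothetical empirical rates; with m this large the normal cutoff 1.96 is used.
for rate in (0.052, 0.060):
    t = size_t_statistic(rate)
    print(rate, round(t, 3), abs(t) > 1.96)
```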

For Table 2, the total numbers of rates significantly different from the nominal level of 0.05 for X^2* and T² are 17 and 33, respectively. The pair (17, 33) can again be decomposed, into (10, 12), (6, 11) and (1, 10) for n = 50, 100 and 200, respectively. For X^2*, the proportion of empirical type I error rates that are significantly different from the nominal level of 0.05 decreases by 10/18 - 1/18 = 50.0% as n increases from 50 to 200, while the corresponding decrease for T² is 12/18 - 10/18 = 11.1%. The decrease for X^2* is again larger than that for T².

In Tables 3 and 4, the number of strata increases from 3 to 5. However, the decreases in the number of empirical type I error rates that are significantly different from the nominal level of 0.05 in Tables 3 and 4 are very close to those in Tables 1 and 2, respectively. Therefore, we have reason to believe that this decrease is not greatly affected by the number of strata.

Table 5 summarizes the empirical powers of X^2* and T². Here, {D_k} are specified under H₁ and we set D_k = D₀ + δ(k - 1). For K = 3, we consider: (i) D₀ = -0.03, δ = 0.03 and (ii) D₀ = -0.05, δ = 0.05. For K = 5, we consider: (i) D₀ = -0.06, δ = 0.03 and (ii) D₀ = -0.1, δ = 0.05. From the simulation results, we observe that X^2* and T² perform similarly under the designed parameter settings. In general, power increases with n and δ.

Table 5 Empirical powers for X^2* and T²


In view of the above results, we prefer the proposed homogeneity test X^2* to the traditional T², which is based on Fisher's test of homogeneity among correlation coefficients.

Real and hypothetical examples

It is reported that mutations at the cystic fibrosis transmembrane conductance regulator gene (CFTR) cause cystic fibrosis, the most prevalent severe genetic disorder in individuals of European descent. Mateu [15] conducted a worldwide genetic analysis of the CFTR region and analyzed normal allele and haplotype variation at two single-nucleotide polymorphisms (SNPs), namely the T854/Ava II (2694 T/G) and TUB20/PVU II (4006-200 G/A). The T854 and TUB20 markers can be used to define the core haplotypes since they are diallelic, have presumably much lower mutation rates than the other polymorphisms and the ancestral state can be inferred for them.

Mateu [15] reported the T854-TUB20 haplotype frequencies for 18 populations. After communicating with one of their coauthors (Prof. Kenneth, pers. comm. 1996), it was found that their reported gametic frequencies were actually the maximum likelihood estimates of the gametic probabilities obtained from HAPLO, a program that can handle missing data. In other words, all individuals with results for at least one of the two markers were included in estimating the gametic frequencies, and no actual gametic counts were available. To create the gametic counts for each population, we first estimate the total number of participants in each population by the number of individuals who yielded results for at least one of the two markers. The reported gametic frequencies of each population given in Mateu [15] are then multiplied by the estimated number of participants of this population, and the closest integers are taken to be the estimated gametic counts. The estimated gametic counts across the 18 populations are reported in Table 6, which is adopted as the real data in all subsequent analyses.
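The rounding step described above can be sketched in Python as follows. The function name `estimated_counts`, the frequencies and the number of participants are hypothetical and are not taken from Mateu [15] or Table 6. Note that counts rounded in this way need not sum exactly to the estimated number of participants.

```python
import math


def estimated_counts(freqs, n_participants):
    """Multiply reported gametic frequencies by the estimated total and round to the closest integers."""
    # floor(x + 0.5) rounds halves upward, avoiding Python's round-half-to-even rule.
    return tuple(int(math.floor(f * n_participants + 0.5)) for f in freqs)


# Hypothetical frequencies for (A0B0, A0B1, A1B0, A1B1) and 62 participants.
print(estimated_counts((0.10, 0.47, 0.05, 0.38), 62))
```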

Table 6 T854-TUB20 haplotype counts by 18 populations and some related statistics


We notice that the gametic counts for the Japanese (14th) and Surui (18th) populations are (0, 32, 0, 12)' and (0, 7, 0, 35)', respectively, so that their estimated gametic disequilibrium D_k and the bounds D_k,min and D_k,max are all equal to zero. Therefore, we exclude these two populations from the subsequent homogeneity tests. We consider the following scenarios.

(i) Homogeneity of gametic disequilibrium among the 16 populations (i.e., excluding Japanese and Surui). The statistic value of our proposed X^2* is 121.35 with p-value being less than 0.0001 while that of T² yields 99.64 with p-value being less than 0.0001. In this case, both tests reject the homogeneity hypothesis at the 0.05 nominal level.

(ii) Homogeneity of gametic disequilibrium among those populations with the same numbers of participants for both markers T854 and TUB20 (i.e., Mbuti, Yemenites, Druze, Adygei, Catalans, Basques, Chinese, and Nasioi).

Our proposed statistic X^2* yields 50.56 with p-value being less than 0.0001 while T² gives 39.72 with p-value being less than 0.0001. Again, both tests suggest rejection of the homogeneity hypothesis at the 0.05 nominal level. Suppose that another research team wants to repeat the same genetic analysis. In this regard, it is sensible to ask, "How large a sample size is needed for each population in order to achieve, say, 90% power at the 0.05 nominal level?" Based on the present study, we have $\bar{D}$ = (-0.064, -0.066, -0.116, -0.125, -0.135, -0.087, -0.012, 0.012)', ${\bar{p}}_{1 +}$ = (0.576, 0.225, 0.222, 0.286, 0.325, 0.296, 0.488, 0.512)' and ${\bar{p}}_{+ 1}$ = (0.849, 0.850, 0.810, 0.796, 0.747, 0.824, 0.977, 0.977)'. By solving equation (3), n = 157 subjects are required for each of the eight populations under the balanced design.

(iii) Homogeneity of gametic disequilibrium among those populations in Europe.

Our statistic X^2* yields 7.48 with a p-value of 0.11, and T² yields 7.26 with a p-value of 0.12. Neither test rejects the homogeneity hypothesis at the 0.05 nominal level. In this case, the data provide no evidence against a common level of gametic disequilibrium among the populations in Europe.
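As a rough check of the p-values in scenario (iii), the sketch below evaluates the upper tail of a chi-square distribution. Two assumptions are made that are not stated in this section: both statistics are referred to a chi-square distribution with K - 1 degrees of freedom, and K = 5 for the European populations. The helper name is hypothetical, and the closed form used is valid for even degrees of freedom only.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with an
    even number of degrees of freedom, using the finite Poisson-sum form."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j      # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total


# Assumed degrees of freedom: K - 1 = 4.
p_score = chi2_sf_even_df(7.48, 4)
p_fisher = chi2_sf_even_df(7.26, 4)
print(round(p_score, 2), round(p_fisher, 2))
```

With these assumed degrees of freedom the two tail probabilities round to the reported 0.11 and 0.12.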

To end this section, we analyze the hypothetical example of gametic disequilibrium between two loci (A, B) in ten populations described in Zapata and Alvarez [8]. Here, the gametic counts are obtained simply by multiplying the haplotype frequencies given in Zapata and Alvarez [8] by 1000. The data are reproduced in Table 7. Clearly, the r values are homogeneous across the ten populations. For the D' values, Zapata and Alvarez [8] used the bias-corrected nonparametric bootstrap method to obtain a 95% confidence interval for each D' value. Observing that the resulting confidence intervals do not intersect, they concluded that the D' values are heterogeneous. They suggested that tests for homogeneity of gametic disequilibrium should be based on D', whose range does not depend on the allele probabilities, rather than on r. Although the D values in Table 7 appear to be homogeneous, our homogeneity score test yields X^2* = 33.44 with a p-value less than 0.0001. Our test procedure therefore also rejects the homogeneity of gametic disequilibrium across the ten populations, and it reaches the same conclusion as Zapata and Alvarez [8].

Table 7 Hypothetical example of gametic disequilibrium between two loci (A, B) with two alleles (A₀, A₁ and B₀, B₁, respectively) across ten populations

Discussion

Verifying the homogeneity of gametic disequilibrium across several populations is crucial in gametic disequilibrium analysis. The traditional homogeneity test for gametic disequilibrium is based on Fisher's test of homogeneity among correlation coefficients. However, our simulations demonstrate that this traditional test may not perform satisfactorily. Specifically, it can be very conservative or very liberal in almost all cases in which the common true gametic disequilibrium D is bounded away from zero. Most importantly, this conservativeness or liberality cannot be effectively alleviated by increasing the sample sizes.

Our proposed large-sample homogeneity score test on gametic disequilibrium across several independent populations requires haplotype counts as input. In practice, only genotype data are available in most situations. To apply our method, one can use haplotyping software such as PHASE or HAPLOTYPER to resolve the genotype data into haplotype data. This approach separates haplotype phasing from the homogeneity test of gametic disequilibrium. Naturally, it would be preferable to extend our method so that it handles genotype data directly, with the model assumptions formulated in terms of genotype data. However, because of the haplotype phase uncertainty of double heterozygotes, gametic disequilibrium cannot be expressed directly in terms of genotype data, even when Hardy-Weinberg equilibrium is assumed. This may severely complicate the derivation of the corresponding score test. Extending our method to handle genotype data is therefore an avenue we intend to explore in the future.

Conclusion

In this article, we propose a large-sample homogeneity test on gametic disequilibrium across several independent populations based on the likelihood score theory generalized to nuisance parameters. Our simulation results show that our test is more reliable than the traditional test based on Fisher's test of homogeneity among correlation coefficients. Although our test may also be conservative or liberal in some cases, unlike the traditional test these issues can be effectively resolved by increasing the sample sizes. For design purposes, a sample size formula that controls power is derived.

Appendix

Consistency and the condition to attain asymptotic efficiency for D*

Let n_k = nb_k, with b_k > 0 and k = 1, 2,...,K. The asymptotic properties of D* are obtained under the assumptions that K is fixed and n approaches infinity (i.e., is sufficiently large). The Mantel-Haenszel-type estimator D* can be rewritten as

$$D^{*} = \sum_{k = 1}^{K} \frac{n_{k}^{2}}{x_{01 k} x_{10 k}} {\hat{D}}_{k} \Big/ \sum_{k = 1}^{K} \frac{n_{k}^{2}}{x_{01 k} x_{10 k}},$$

where ${\hat{D}}_{k} = x_{11 k} / n_{k} - x_{1 + k} x_{+ 1 k} / n_{k}^{2}$. By the Central Limit Theorem, $\sqrt{n} (y_{k} - g_{k})$ has an asymptotic normal distribution N(0, Σ_k/b_k), where y_k = (x_00k, x_01k, x_10k, x_11k)'/n_k, g_k = (p_00k, p_01k, p_10k, p_11k)' and $Σ_{k} = \mathrm{diag}(g_{k}) - g_{k} {g^{'}}_{k}$. Let $c_{k} = \frac{\partial {\hat{D}}_{k}}{\partial y_{k}} |_{y_{k} = g_{k}}$. By the δ method, $\sqrt{n} ({\hat{D}}_{k} - D_{k})$ follows an asymptotic normal distribution $N (0, {c^{'}}_{k} Σ_{k} c_{k} / b_{k})$. It is easy to calculate that ${c^{'}}_{k} Σ_{k} c_{k} = w_{k} (D_{k}, p_{1 + k}, p_{+ 1 k})$. Since D_k ≡ D under H₀ for k = 1, 2,...,K, and D* is a weighted average of the consistent estimates ${\hat{D}}_{k}$, we can conclude that D* is a consistent estimate of D. Let w_k = w_k(D, p_1+k, p_+1k) and v_k = 1/(p_01k p_10k). Thus, the asymptotic variance of D* under H₀ is given by

$$\mathrm{AsyVar} (D^{*}) = \frac{\sum_{k = 1}^{K} w_{k} v_{k}^{2} / b_{k}}{n {(\sum_{k = 1}^{K} v_{k})}^{2}} .$$
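The quantities entering this derivation can be computed numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the gradient c_k is obtained by differentiating D = p_11 - (p_10 + p_11)(p_01 + p_11) with respect to the four gametic probabilities, and the counts used at the end are made-up values rather than data from Table 6.

```python
def delta_method_variance(g):
    """w = c' Sigma c with Sigma = diag(g) - g g', where g = (p00, p01, p10, p11)
    and c is the gradient of D = p11 - p1+ * p+1 evaluated at g."""
    p00, p01, p10, p11 = g
    p1_plus = p10 + p11
    p_plus1 = p01 + p11
    c = (0.0, -p1_plus, -p_plus1, 1.0 - p1_plus - p_plus1)
    mean_c = sum(gi * ci for gi, ci in zip(g, c))
    # c' diag(g) c - (g' c)^2
    return sum(gi * ci * ci for gi, ci in zip(g, c)) - mean_c ** 2


def mh_type_estimate(strata):
    """Mantel-Haenszel-type estimate D* with weights n_k^2 / (x01k * x10k).
    Each stratum is (x00, x01, x10, x11); x01 and x10 must be positive."""
    numerator = 0.0
    denominator = 0.0
    for x00, x01, x10, x11 in strata:
        n = x00 + x01 + x10 + x11
        weight = n * n / (x01 * x10)
        d_hat = x11 / n - (x10 + x11) * (x01 + x11) / n ** 2
        numerator += weight * d_hat
        denominator += weight
    return numerator / denominator


# Made-up counts for two strata sharing the same estimated D.
d_star = mh_type_estimate([(40, 10, 10, 40), (80, 20, 20, 80)])
# Gametic probabilities under independence with p1+ = 0.3 and p+1 = 0.6.
w_independent = delta_method_variance((0.28, 0.42, 0.12, 0.18))
print(d_star, w_independent)
```

In the independence example, w_k multiplied by v_k = 1/(p_01k p_10k) equals one, which is the situation referred to when D = 0.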

Let the information matrix with respect to D, p₁₊ and p₊₁ under H₀ be

$$I = \begin{pmatrix}
\sum_{k = 1}^{K} I_{k D D} & I_{1 D p_{1+1}} & I_{1 D p_{+11}} & \dots & I_{K D p_{+1K}} \\
I_{1 D p_{1+1}} & I_{1 p_{1+1} p_{1+1}} & I_{1 p_{1+1} p_{+11}} & \dots & I_{K p_{1+1} p_{+1K}} \\
I_{1 D p_{+11}} & I_{1 p_{1+1} p_{+11}} & I_{1 p_{+11} p_{+11}} & \dots & I_{K p_{+11} p_{+1K}} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
I_{K D p_{+1K}} & \dots & \dots & \dots & I_{K p_{+1K} p_{+1K}}
\end{pmatrix} .$$

By inverting the information matrix I, we can obtain the asymptotic variance of $\bar{D}$ , that is,

$$\mathrm{AsyVar} (\bar{D}) = \frac{1}{n} {\left(\sum_{k = 1}^{K} b_{k} / w_{k}\right)}^{- 1} .$$

By the Cauchy-Schwarz inequality ${(\sum_{k = 1}^{K} v_{k})}^{2} \leq (\sum_{k = 1}^{K} b_{k} / w_{k}) (\sum_{k = 1}^{K} w_{k} v_{k}^{2} / b_{k})$, we have AsyVar( $\bar{D}$ ) ≤ AsyVar(D*). From this we obtain the necessary and sufficient condition for the asymptotic efficiency of D*, that is, w_kv_k = c, k = 1, 2,...,K, where c is a constant independent of all parameters. When D = 0, the condition is satisfied. It follows that D* is inefficient in general.
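The ordering of the two asymptotic variances can be illustrated numerically. The sketch below only evaluates the two variance formulas given above; the function names and all input values (w_k, v_k, b_k and n) are hypothetical.

```python
def asy_var_d_star(w, v, b, n):
    """Asymptotic variance of D*: sum(w_k v_k^2 / b_k) / (n * (sum v_k)^2)."""
    numerator = sum(wk * vk ** 2 / bk for wk, vk, bk in zip(w, v, b))
    return numerator / (n * sum(v) ** 2)


def asy_var_mle(w, b, n):
    """Asymptotic variance obtained from the information matrix:
    1 / (n * sum(b_k / w_k))."""
    return 1.0 / (n * sum(bk / wk for wk, bk in zip(w, b)))


# Hypothetical inputs for three strata in a balanced design.
w = [0.05, 0.04, 0.03]
v = [30.0, 20.0, 50.0]
b = [1.0, 1.0, 1.0]
n = 100

var_star = asy_var_d_star(w, v, b, n)
var_mle = asy_var_mle(w, b, n)
print(var_mle <= var_star)

# Balanced case with w_k * v_k constant: the two variances coincide.
var_star_eq = asy_var_d_star([0.05, 0.04], [20.0, 25.0], [1.0, 1.0], n)
var_mle_eq = asy_var_mle([0.05, 0.04], [1.0, 1.0], n)
```

The second pair of calls uses a balanced design in which w_k v_k = 1 for both strata, so the bound is attained.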

A simple expression for $I_{k D | p_{1 + k} p_{+ 1 k}}$

For the k-th stratum, denote the information matrix with respect to D_k, p_1+k and p_+1k by

$$I_{k} = \begin{pmatrix} I_{k D_{k} D_{k}} & I_{k D_{k} p_{1 + k}} & I_{k D_{k} p_{+ 1 k}} \\ I_{k D_{k} p_{1 + k}} & I_{k p_{1 + k} p_{1 + k}} & I_{k p_{1 + k} p_{+ 1 k}} \\ I_{k D_{k} p_{+ 1 k}} & I_{k p_{1 + k} p_{+ 1 k}} & I_{k p_{+ 1 k} p_{+ 1 k}} \end{pmatrix} .$$

By the standard property of the inverse of a partitioned matrix, $I_{k D | p_{1 + k} p_{+ 1 k}}$ (D_k, p_1+k, p_+1k) is equal to the reciprocal of the (1, 1) element of $I_{k}^{- 1}$. By the properties of MLEs, we have

$$\sqrt{n_{k}} ({\hat{D}}_{k} - D_{k}, {\hat{p}}_{1 + k} - p_{1 + k}, {\hat{p}}_{+ 1 k} - p_{+ 1 k})^{'} \xrightarrow{d} N (0, n_{k} I_{k}^{- 1} (D_{k}, p_{1 + k}, p_{+ 1 k})),$$

where ${\hat{D}}_{k}$, ${\hat{p}}_{1 + k} = x_{1 + k} / n_{k}$ and ${\hat{p}}_{+ 1 k} = x_{+ 1 k} / n_{k}$ are the MLEs of D_k, p_1+k and p_+1k, respectively. Hence, the asymptotic variance of $\sqrt{n_{k}} {\hat{D}}_{k}$ is $n_{k} / I_{k D | p_{1 + k} p_{+ 1 k}} (D_{k}, p_{1 + k}, p_{+ 1 k})$. On the other hand, by the Central Limit Theorem, $\sqrt{n_{k}} (y_{k} - g_{k})$ follows an asymptotic normal distribution N(0, Σ_k). By the δ method, we immediately get that $\sqrt{n_{k}} ({\hat{D}}_{k} - D_{k})$ follows an asymptotic normal distribution N(0, w_k(D_k, p_1+k, p_+1k)). Equating the two expressions for the asymptotic variance, we obtain the exact expression $I_{k D | p_{1 + k} p_{+ 1 k}}$ (D_k, p_1+k, p_+1k) = n_k/w_k(D_k, p_1+k, p_+1k). The expression for $I_{k D | p_{1 + k} p_{+ 1 k}}$ (D, p_1+k, p_+1k) is then obtained from $I_{k D | p_{1 + k} p_{+ 1 k}}$ (D_k, p_1+k, p_+1k) by substituting D for D_k.
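For readers who wish to verify the δ-method step, the gradient and the resulting quadratic form can be written out explicitly. The display below is a worked sketch that assumes x_1+k = x_10k + x_11k and x_+1k = x_01k + x_11k, so that ${\hat{D}}_{k}$ is a function of y_k = (y_00k, y_01k, y_10k, y_11k)'; the component notation g_ik and c_ik is introduced here for illustration only.

```latex
\hat{D}_k = y_{11k} - (y_{10k} + y_{11k})(y_{01k} + y_{11k}),
\qquad
c_k = \left. \frac{\partial \hat{D}_k}{\partial y_k} \right|_{y_k = g_k}
    = \bigl(0,\; -p_{1+k},\; -p_{+1k},\; 1 - p_{1+k} - p_{+1k}\bigr)',
\\[4pt]
w_k = c_k' \Sigma_k c_k
    = \sum_{i} g_{ik} c_{ik}^{2} - \Bigl(\sum_{i} g_{ik} c_{ik}\Bigr)^{2},
\qquad
\Sigma_k = \mathrm{diag}(g_k) - g_k g_k' .
```

When D_k = 0, substituting the product form of the gametic probabilities into this expression gives w_k = p_1+k(1 - p_1+k)p_+1k(1 - p_+1k).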

References

1. Lewontin RC: The genetic basis of evolutionary change. 1974, New York: Columbia University Press
2. Jorde LB: Linkage disequilibrium as a gene mapping tool. Am J Hum Genet. 1995, 56: 11-14.
3. Hedrick PW, Jain S, Holden L: Multilocus systems in evolution. Evol Biol. 1978, 11: 101-182.
4. Weir BS: Inferences about linkage disequilibrium. Biometrics. 1979, 35: 235-254. 10.2307/2529947.
5. Hedrick PW: Gametic disequilibrium measures: proceed with caution. Genetics. 1987, 117: 331-341.
6. Mueller JC: Linkage disequilibrium for different scales and applications. Brief Bioinform. 2004, 5: 355-364. 10.1093/bib/5.4.355.
7. Lewontin RC, Kojima K: The evolutionary dynamics of complex polymorphisms. Evolution. 1960, 14: 458-472. 10.2307/2405995.
8. Zapata C, Alvarez G: Testing for homogeneity of gametic disequilibrium among populations. Evolution. 1997, 51: 606-607. 10.2307/2411132.
9. Weir BS: Genetic Data Analysis II. 1996, Sunderland, Massachusetts: Sinauer Associates
10. Fisher RA: Statistical methods for research workers. 1925, New York: Oliver and Boyd
11. Lewontin RC: The interaction of selection and linkage. I. General considerations; heterotic models. Genetics. 1964, 49: 49-67.
12. Tarone RE: Homogeneity score tests with nuisance parameters. Commun Stat-Theor M. 1988, 17: 1549-1556. 10.1080/03610928808829697.
13. Mantel N, Haenszel W: Statistical aspects of the analysis of data from retrospective studies of disease. J Natl Cancer Inst. 1959, 22 (4): 719-748.
14. Guo JH, Ma YP, Shi NZ, Lau TS: Testing for homogeneity of relative difference under inverse sampling. Comput Stat Data An. 2004, 44: 613-624. 10.1016/S0167-9473(02)00262-1.
15. Mateu E, Calafell F, Lao O, Batsheva BT, Kidd JR, Pakstis A, Kidd KK, Bertranpetit J: Worldwide genetic analysis of the CFTR region. Am J Hum Genet. 2001, 68: 103-117. 10.1086/316940.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Grant Numbers 10431010 and 10701022), the National 973 Key Project of China (2007CB311002), NCET-04-0310, EYTP, the Jilin Distinguished Young Scholars Program (Grant Number 20030113) and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT, #IRT0519). The work of ML Tang was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region (Project no. HKBU261007).

Author information

Authors and Affiliations

Key Laboratory for Applied Statistics of MOE and School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, China
Xiaolin Yin, Wenqing Ma & Jianhua Guo
Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
Manlai Tang

Corresponding author

Correspondence to Jianhua Guo.

Additional information

Authors' contributions

JG initiated the study of the homogeneity score test of gametic disequilibrium across strata. XLY drafted the manuscript and conducted the simulation. WQM simplified the proof in the Appendix and discussed the work extensively with XLY. MLT found a real example to which the proposed method could be applied, offered many constructive comments and polished the manuscript. All authors have read and approved the final version of this paper.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yin, X., Ma, W., Tang, M. et al. Testing for homogeneity of gametic disequilibrium across strata. BMC Genet 8, 85 (2007). https://doi.org/10.1186/1471-2156-8-85

Received: 25 April 2007
Accepted: 20 December 2007
Published: 20 December 2007
DOI: https://doi.org/10.1186/1471-2156-8-85
