Interval estimation of genetic susceptibility for retrospective case-control studies

Background This article describes classical and Bayesian interval estimation of genetic susceptibility based on random samples with pre-specified numbers of unrelated cases and controls. Results Frequencies of genotypes in cases and controls can be estimated directly from retrospective case-control data. On the other hand, genetic susceptibility defined as the expected proportion of cases among individuals with a particular genotype depends on the population proportion of cases (prevalence). Given this design, prevalence is an external parameter and hence the susceptibility cannot be estimated based on only the observed data. Interval estimation of susceptibility that can incorporate uncertainty in prevalence values is explored from both classical and Bayesian perspective. Similarity between classical and Bayesian interval estimates in terms of frequentist coverage probabilities for this problem allows an appealing interpretation of classical intervals as bounds for genetic susceptibility. In addition, it is observed that both the asymptotic classical and Bayesian interval estimates have comparable average length. These interval estimates serve as a very good approximation to the "exact" (finite sample) Bayesian interval estimates. Extension from genotypic to allelic susceptibility intervals shows dependency on phenotype-induced deviations from Hardy-Weinberg equilibrium. Conclusions The suggested classical and Bayesian interval estimates appear to perform reasonably well. Generally, the use of exact Bayesian interval estimation method is recommended for genetic susceptibility, however the asymptotic classical and approximate Bayesian methods are adequate for sample sizes of at least 50 cases and controls.


Background
Association mapping of complex phenotypes in case-control samples involves analysis of tables of genotype/allele counts collected at a large number of genetic loci. Relating single-locus genotype and allele frequencies to the outcome is a basic analysis step even if complex interactions among loci are expected. Indeed, biologically realistic models that involve multiple interacting polymorphisms may induce considerable "marginal effects" associated with individual loci. Often the numbers of cases and controls are fixed in advance by the experimental design, and the multiple markers are typed. Then case/control proportions remain the same for all markers, whereas genotype and allele numbers in cases and controls are subject to the random sampling variation. The reverse is however of greater interest: what is the genetic susceptibility, or the probability that a random individual will have a particular outcome given that a particular genotype or an allele is observed at a locus? Unlike the odds ratio, this parameter is not invariant with respect to the prospective vs. retrospective sampling schemes. Therefore, the point estimate of this probability can only be obtained from genotype counts in cases and controls by assuming a particular value of the population prevalence. Another issue is the degree of uncertainty in the estimate, which is being investigated in this article via interval estimation. Classical, or "frequentist" confidence interval (CI) is a well established framework for interval estimation. Such an interval is a random quantity, and the probability statements are made about proportions of times a random CI covers the fixed population parameter. A more relevant question is often about the variability, or uncertainty associated with the estimate of the population value. In other words, we would like to be able to make statements about the lower and upper bounds for genetic susceptibility, thus interpreting the interval as fixed, and the susceptibility as random, where randomness is due to the limited amount of data about the parameter. Bayesian (e.g. "credible") intervals provide such interpretation, but sometimes are criticized for subjectivity associated with the choice of prior distributions. It is not generally the case that classical and Bayesian intervals should correspond to each other, however such correspondence is possible for certain combinations of the likelihood and prior distributions. For example, P. Altham [1] showed that when testing for the difference of two binomial proportions, frequentist Fisher's exact test can be viewed as a Bayesian test when assuming binomial likelihoods and Beta priors for the distributions of proportions. Thus, although no priors are explicitly assumed by Fisher's test, one can recover prior distributions that are indirectly implied by the test. Datta and Mukerjee [2] provide an extensive review of such matching probability problems and show that in most parametric cases suitable priors can be constructed that would match the Bayesian credible intervals of a given size to that of a frequentist confidence interval up to second order. As we show here, under a beta-binomial model there is a close connection between the two types of intervals for genetic susceptibility, which allows flexibility in the interpretation.

Susceptibility intervals
Association studies are often "retrospective" in the sense that samples of cases and controls are determined by recruitment ("fixed"), and the genetic variants, e.g. genotypes AA and (not-AA) at a genetic marker, are random. Estimated genotype frequencies among cases or controls, e.g. are obtained directly from the following table as .
Assume AA is the risk genotype. If the samples were random with respect to both rows and columns, then the estimated susceptibility (penetrance, or the positive predictive value), φ = Pr( | AA), would simply be estimated by x/(x + y). Because the rows are fixed, we need the population prevalence, w. Then, where w = Pr( ), p = Pr(AA | ), and q = Pr(AA | ). If the rows are assumed to arise independently from binomial distributions, the maximum likelihood estimates of the parameters p and q are given by = x/n, and = y/m, respectively. The population prevalence, w, has to be estimated externally, rather than from the data provided by retrospective case-control samples. Let denote such an estimate of w. Hence, a point estimate of φ can be obtained as .
The expression given above involves the ratio of random variables, y/x, and hence obtaining the exact sampling variance is problematic. Approximate methods have been suggested for the ratio of binomial variables [3,4]. An approximate variance using the logit transformation allows us to obtain a simple expression as well as to separate the terms for p, q and w in the variance expression. To see this, define

Classical interval
The first-order Taylor series approximation gives the variance of as where the first two terms refer to the variance of the relative risk of AA on the log scale. Pepe [5] presented extensive discussion on estimation of similar quantities in the context of biomedical research. The last term of (3) can be further approximated as and requires the knowledge of which must come from an external source. When the range (r = w u -w l ) of w is known, we may define , the range on the logit scale. Then can be approximated by (r*/6) 2 , since three standard deviations from the mean cover over 99% of the (centered) normal distribution, f (x, 0, σ), i.e.
, where f(x, 0, σ) denotes the probability density function of a Normal distribution with mean 0 and standard deviation σ.
Henceforth, unless otherwise mentioned we use . Browne [6] discussed this issue and similar approaches of relating the range to the standard deviation.
The CI for susceptibility is obtained by inverting the endpoints of the asymptotic normal interval for η, e.g. the upper point is inverted as . As we know that , and , the estimated variance becomes

Approximate Bayesian interval
The variance given by (6) is infinite when the observed value of either x or y is zero. Before describing a method of dealing with this, we note that the usual asymptotic variance formula for the log odds ratios, based on the sum of reciprocals of the 2 × 2 table counts, {n 11 , n 12 , n 21 , n 22 }, has a similar deficiency, i.e. it results in an infinite variance when one of the observations is zero. To avoid this problem, it is common to add 1/2 to each cell. Haldane [7], Gart and Zweifel [8] justified this on the basis of minimizing the bias in the estimation of the logit variance (also see discussion in [9]). The variance becomes We propose a similar modification of (6) and justify it by showing that it is an approximate Bayesian variance estimator. In passing, we note that (7) can also be obtained by using the same argument as in the derivation of the approximate posterior variance of that follows.
When the sampling distribution of genotypes is binomial, i.e. x|p ~ Bin(p, n) and y|q ~ Bin(q, m) and the prior distributions for p and q are independent Beta(γ 1 , β 1 ) and Beta(γ 2 , β 2 ) respectively, the posterior distributions of p and q are independent Beta's with p | x ~ Beta(x + γ 1 , n -x + β 1 ) q | y ~ Beta(y + γ 2 , m -y + β 2 ) (8) The binomial likelihood arises as a consequence of independence between genotypes in a random sample. Although a Beta prior for allele frequencies is sometimes justified from purely mathematical convenience, it does provide a sufficiently flexible shape. More importantly, certain population-genetic models result in Beta, or more generally Dirichlet distributions for allele frequencies. For example, population frequencies at mutation-drift equilibrium follow this distribution [10]. A reasonable assumption is that a typical value from the susceptibility distribution is around the prevalence value, which implies similarity in the prior distributions of p and q, i.e. γ 1 = γ 2 , β 1 = β 2 .
Different parameters would generally assume prior deviation of the typical susceptibility value away from the population prevalence.
Posterior expectations and variances of p and q are respectively given by,  (3), we obtain Again, the approximation (4) is appropriate when is available.
When γ 1 = γ 2 = β 1 = β 2 = 1/2 (Jeffreys' prior, [11]), Welch and Peers [12] established that in general Jeffreys' priors provide mathematical correspondence between frequentist and Bayesian intervals. In particular, these authors proved that under some regularity conditions (which are trivially satisfied for this problem) Jeffreys' prior is the unique prior for a parameter for which a Baye-sian credible interval and a frequentist confidence interval match up to the second order.
The asymptotic normal credible interval based on can be obtained as: where , and the endpoints, l and u, are inverted as and , respectively to produce an approximate (1 -α)% interval estimate of φ.

Exact Bayesian interval
It is also possible to obtain an "exact" Bayesian credible interval using Monte Carlo samples generated from the posterior distribution of φ. The sample is obtained by repeatedly drawing p and q from their posterior distributions given by (8) and calculating φ via (1) for each realization of p, q. This generates an empirical posterior distribution for φ and a (1 -α)% interval is given by (α/2), (1 -α/2) quantiles of this distribution. Values w can be drawn uniformly from the range of its possible values. A bell-shaped Beta distribution can also be assumed with parameters based on the reported range, w l -w u , as described in Methods, section 3. An algorithm is as follows: Then the α/2 and 1 -α/2 quantiles of the generated empirical posterior distribution of φ provide the (1 -α)% credible interval for the susceptibility. This algorithm is easily implemented in R, a language and environment for statistical computing and graphics available as Free Software at http://www.r-project.org/. For example, setting γ 1 = γ 2 = β 1 = β 2 = 0.5, the entire code using 100,000 samples, 95% credible interval, and w distributed uniformly between 0.04 and 0.06 is p <-rbeta(100000, x + 0.5, n -x + 0.5) q <-rbeta(100000, y + 0.5, m -y + 0.5) cat(c("Interval estimate", phiL, phiU), fill = T) Common beta priors on p and q (γ 1 = γ 2 , β 1 = β 2 ) may be interpreted as vague with respect to the prior distribution for φ. On the logit scale, this distribution is bell-shaped symmetric, and centered around ln [w/(1 -w)]. Its variance decreases as the common (γ, β) parameters for p and q increase. For large values of γ, β this variance can be obtained by the Taylor series approximation, 2β/[γ (γ + β + 1)]. A mixture of beta distributions can take an arbitrary shape with the great advantage that the resulting distribution is bounded between the lower and upper valueswhich is the case for p, q, and φ parameters. In Methods (section 1) we outline the sampling scheme for the mixture.

Interval estimation of allele or haplotype susceptibilities
Methods described so far could be applied without modifications to the estimation of allele or haplotype susceptibilities when the probability of obtaining a sample of alleles follows the multinomial likelihood. However, such usage would require the assumption of independence between alleles. This can be justified under the assumptions of Hardy-Weinberg equilibrium (HWE) in the population, as well as the multiplicative effects of haplotype susceptibilities [13]. When these assumptions are reasonable, the intervals are obtained in the same way as described above using counts of alleles/haplotypes instead of genotypes. Otherwise the variance of allele frequencies, is no longer binomial (x, y are now the counts of alleles A in cases and controls respectively, and n, m are the total numbers of alleles). It is instead given by where D p , D q are the Hardy-Weinberg disequilibrium (HWD) coefficients [14], defined as the deviation of the observed frequencies of AA from the expected under random union of alleles. For example, , and is estimated by using the observed frequencies, . These variances are most easily derived by re-coding the genotypes (AA, , ) as G = (-1, 0, 1), which keeps track of the number of alleles A minus one in a corresponding genotype, and taking expectations in V(G) = E(G 2 ) -(E(G)) 2 (see Appendix 1 in [15], or two alternative derivations in [14]).
These coefficients (quantifying HWD among cases or controls) may be non-zero even if the population is in HWE [13]. Then the variance (6) becomes with two new additional terms to account for HWD. Consider the effect of . If is negative, the resulting variance becomes smaller. If it is positive, the maximum possible value of [14] is The maximum values for the terms with HWD in (15)

Coverage probabilities
The intervals using (6, 10) and the exact Bayesian interval have good frequentist coverage probabilities. To illustrate this, we simulated population values of p and q as p B eta(1, 1) and q ~ Beta(1, 1) for each of 10,000 simulation runs. We used γ = β = 1/2 for the interval calculations and / ,ˆ/ p x n q y m assumed that w = 0.5 or w = 0.04 is known and fixed. To compare intervals based on (6) and (12) we only considered samples with x, y > 0, obtained as x ~ Bin(p, n), y B in(q, m). In addition, we considered binomial samples with at least one of x, y equal to zero to check coverage properties of (12). We used m = n = 10,50,100,500 and 90% coverage intervals. Frequentist intervals cannot be used with zero counts of n 11 or n 21 , which may be problematic only for sample sizes of 10 (17% of the samples). In this small percentage of cases, we resort to adding 1/2 to the counts as in (10). Agresti [9] considered a similar approach for calculation of CIs for odds ratios, that is to use Gart's formula (7) only when zero counts are encountered. Results for coverage probabilities are shown in Table 1. Note that the coverage of all three methods is around the nominal 90% for all sample sizes and both values of population prevalence. Results in Table 1 are averages across the distribution of p and q, however simi-lar coverage results are obtained when the population values of p, q are set to specific fixed values. We considered the following fixed sets of parameters: The coverage probabilities for all three methods were found to be sufficiently close to the nominal values of 90% and 95% for sample sizes of 50 or larger (Table 2).   Table 3 shows the mean length and the standard deviation of the interval length across 10,000 simulations used to produce Table 1. This standard deviation reflects estimated variability of the interval length and decreases with sample size, but not with the number of simulations. The corresponding standard error could be obtained by dividing the standard deviation by the square root of the number of simulations. Table 4 shows the same results among the intervals that include the population value of w. Numbers for each parameter combination that contrast three intervals in a given table are obtained using the same data sets. For example, once a sample of size n is obtained for a specific value of w, all three intervals are calculated using this sample. Therefore, comparisons between methods are quite precise given the number of simulations. Again, the mean and standard deviation are similar among three methods. Nevertheless, the classical CI is never the shortest of the three intervals for the parameter values that we have considered and typically it has the highest length variability. Using data summarized in Table 3, the exact interval was found to have the smallest length 5 times out of 8 considered parameter combinations and has the lowest variability 5 times out of 8. These numbers are respectively 5 out of 8 and 4 out of 8 for data used to construct Table 4. There is some dependency between results from Tables 3 and 4 due to the same combination of parameters.
We do not present numerical results based on mis-specification of w or the uncertainty in it, because the behavior of intervals is clear. When w is mis-specified, the intervals actively worsen as the sample size increases (in the sense of the probability of including the population parameter), since they shrink around the wrong value. On the other hand, the uncertainty in w widens the interval length. As the sample size becomes large, the minimum length simply reflects the variability in w. We conducted simulation experiments to illustrate this point, using the population value of w = 0.045 assuming the range 0.03 to 0.06 is reported. For the exact Bayesian interval, the distribution of w was modelled as Beta(ψ 1 = 77.31, ψ 2 = 1640.69) with parameters estimated by (29). These values are taken to reflect the reported prevalence of the hypersensitivity reaction to an antiretroviral drug abacavir, for which genetic predisposition has been described [16]. Table 5 reports results for the coverage probabilities and the interval length of 10% intervals, assuming p = 0.8 and q = 0.2, which corresponds to the population susceptibility value

Pharmacogenetic application
Hypersensitivity reaction to the antiretroviral drug abacavir affects approximately 4.5% of patients. The hypersensitivity symptoms are varied with fever and rash among the most common. A small number of fatalities have also been reported. The exact mechanism of hypersensitivity is not established, but the pattern of symptoms suggests that it is an immunological reaction, triggered by specific genetic polymorphisms. Hetherington et al. [16] studied the association of HLA-B57 haplotype with hypersensitivity. One or two copies of HLA-B57 haplotype were present in x = 39 of n = 84 individuals with hypersensitivity (cases) and in y = 4 out of m = 113 controls. The prevalence of the hypersensitivity reaction is estimated to be between 0.03 and 0.06 [16]. The estimate of φ using (1) and the midrange w (that is equal to 0.045) is 0.382. . Then the credible interval was obtained from this (empirical) distribution of φ by determining its 2.5% and 97.5% quantiles. It is not necessary to assume the uniformity of w. Indeed, it may be reasonable to assume that the midrange value is more plausible than the endpoints and a realistic distribution can be described as bell-shaped. Taking w B eta(57.8572,1227.86) generates a bell-shaped distribution with the mean at 0.045 and 99% of this distribution is contained between 0.03 and 0.06 (Methods, section 3). The resulting interval is somewhat smaller, 0.195 -0.653. Results for this example are summarized in Table 6.

Discussion
Asymptotic frequentist and Bayesian intervals for genetic susceptibility as well as the exact Bayesian interval are quite similar in properties. Given that the sampling from the posterior distribution can be done directly, there is no particular reason to resort to approximations and the exact Bayesian interval can be recommended. Nevertheless, approximations are useful in that they reveal connections between the confidence and credible intervals as well as allow for sample size and power calculations as we outline in Methods, section 2. The mean length and the interval variability appear somewhat smaller for the Bayesian intervals. The frequentist coverage of Bayesian intervals is satisfactory. One may even argue that the operational usage of CIs is more often Bayesian than it is not, because the practical issue is ultimately about the confi-  Caution should be taken when the inference is about allele or haplotype susceptibilities. When assumption of population HWE does not hold, or when there are substantial deviations from multiplicativity, the variance of susceptibility contains additional terms that account for population or model-induced deviations from Hardy-Weinberg equilibrium.

Conclusions
We found that the classical interval for genetic susceptibility can be considered as an approximate Bayesian interval under the beta-binomial model. This algebraic similarity between the proposed classical and approximate Bayesian intervals allows an appealing Bayesian interpretation of the usual confidence limits. Simulation studies also confirm similarities in coverage probabilities and interval lengths among asymptotic classical and approximate and exact Bayesian intervals.

Posterior sampling using the mixture of beta distributions
We derive explicitly the posterior distribution of θ under a Binomial sampling using a mixture of Beta priors. As Beta is a conjugate prior for the binomial likelihood, we show that mixture of Beta's is also conjugate. To see this, consider the problem where p(x| θ) denotes the sampling density of X = x given θ. Now consider a prior for θ given by a mixture of k beta distributions Beta(γ j , β j ) with weights v j , such that and v j ≥ 0 for all j. More explicitly, the prior density π(θ), of θ is given by,  the ratio under the alternative are denoted as p A , q A , and ∆ A , respectively. Once these values are specified, we can calculate the power of detecting the difference of the susceptibility from the population prevalence as well as the sample size required to achieve the required power under an acceptable type I error rate. We will illustrate the sample size calculation assuming the equal number of cases and controls (m = n). Power calculation given the fixed sample size is obtained similarly. We have where is the non-centrality parameter corresponding to 1 -β power under a two-sided level-α test. Therefore, the necessary sample size is

Estimating parameters of the Beta distribution for the prevalence from the range of reported values
Suppose the range of the prevalence values, w l -w u , is reported. We would like to model the uncertainty with a bell-shaped Beta(ψ 1 > 1, ψ 2 > 1) distribution, such that a certain percentage of the distribution (e.g. 99%) is between w l and w u with the expected value given by w l + (w u -w l )/2. The parameters of this distribution can be found as the solution for ψ 1 of where denotes the density of a Beta(ψ 1 , ψ 2 ) distribution and is the Beta function.
The second parameter for b(·) in (28) is given by the constraint that the expectation for w is ψ 1 /(ψ 1 + ψ 2 ).
As a sample calculation, consider an antiretroviral drug abacavir for which the hyper-sensitivity reaction range is reported as w l = 0.03 to w u = 0.06 [16], so that we can take = w l + (w u -w l )/2 = 0.045. If we assume that these bounds cover 99% of the prevalence distribution that results from sampling errors in estimation, possibly confounded with other factors such as genuine population heterogeneity of samples, then the resulting distribution reflecting the uncertainty in w is Beta(57.8572, 1227.86), using (28).
Much more simply calculated parameter estimates can be obtained by the normal approximation to the beta distribution. Let = w l + (w u -w l )/2 and approximate the vari- For the abacavir example, these estimates are found as = 77.31 and = 1640.69. Using these estimates, 99.7% of the resulting Beta(77.31, 1640.69) distribution is contained within the range 0.03 -0.06. Thus, the normal approximation yields a somewhat more condensed distribution.