Multinomial logistic regression approach to haplotype association analysis in population-based case-control studies

Background The genetic association analysis using haplotypes as basic genetic units is anticipated to be a powerful strategy towards the discovery of genes predisposing human complex diseases. In particular, the increasing availability of high-resolution genetic markers such as the single-nucleotide polymorphisms (SNPs) has made haplotype-based association analysis an attractive alternative to single marker analysis. Results We consider haplotype association analysis under the population-based case-control study design. A multinomial logistic model is proposed for haplotype analysis with unphased genotype data, which can be decomposed into a prospective logistic model for disease risk as well as a model for the haplotype-pair distribution in the control population. Environmental factors can be readily incorporated and hence the haplotype-environment interaction can be assessed in the proposed model. The maximum likelihood estimation with unphased genotype data can be conveniently implemented in the proposed model by applying the EM algorithm to a prospective multinomial logistic regression model and ignoring the case-control design. We apply the proposed method to the hypertriglyceridemia study and identifies 3 haplotypes in the apolipoprotein A5 gene that are associated with increased risk for hypertriglyceridemia. A haplotype-age interaction effect is also identified. Simulation studies show that the proposed estimator has satisfactory finite-sample performances. Conclusion Our results suggest that the proposed method can serve as a useful alternative to existing methods and a reliable tool for the case-control haplotype-based association analysis.


Background
Genetic association analysis aims to detect gene-disease association through linkage-disequilibrium of a disease susceptibility gene with adjacent genetic markers. Historically, association analysis was limited to single markers. It is anticipated that greater power may be gained by utilizing linkage-disequilibrium information from multiple markers simultaneously. This anticipation, together with recent advances of the availability of high-resolution genetic markers, in particular the single-nucleotide polymorphisms (SNPs), has motivated the use of haplotypes, which are specific combinations of closely linked genetic markers on a chromosome, as the basic genetic units for association analysis. In addition, the biological advantage for haplotype-based analysis is that it can directly identify unique chromosomal segments that contain disease sus-ceptibility genes by assessing the haplotype-specific risk for disease. Schaid [1] provided a detailed and excellent review for haplotype-based association analysis.
The population-based case-control study design has been popular for genetic association analysis due to its cost-efficiency in collecting the data. If the haplotype pair for each individual is directly observable, traditional logistic regression analysis can be applied to assess the haplotypedisease association, possibly adjusting for environmental/ demographical factors, and to evaluate the haplotypeenvironment interactions. According to Prentice and Pyke [2], with case-control data, maximum likelihood estimation of the association (odds-ratio) parameters in logistic regression model can be simply carried out by fitting a prospective logistic model and ignoring the case-control design. Note that, in traditional logistic regression analysis, no modeling assumptions are made on the distribution of the covariates (haplotype and/or environmental factors), that is, the covariate distribution is treated fully nonparametrically.
Usually, the haplotype information is not directly observed and is subject to ambiguity because we can only observe the "unphased" genotype data where the "phase information", i.e., the arrangement of alleles on each of the two chromosomes, is unavailable. There has been rich literature of haplotype inference in general populations, see for example the EM algorithms by Excoffier and Slatkin [3] and Li et al. [4], and the Bayesian methods in Stephens et al. [5] and Niu et al. [6]. To recover the phase information and to ensure the identifiability of the association parameters, in general we need to impose some modeling assumptions on the distribution of haplotype pairs; see Epstein and Satten [7] for related issues when no environmental covariates are considered. One common assumption for such a model is Hardy-Weinberg equilibrium (HWE) in the general population. When environmental covariates are considered, further assumptions regarding relationship between haplotype pairs and environmental factors may be required. One convenient and generally reasonable assumption for this is the haplotypeenvironment independence [8,9], which assumes that a subject's haplotype-pair features are independent of his/ her environmental exposures.
In this work, to assess the haplotype-environment associations with disease phenotype, we will first propose a novel modeling setup that is based on a multinomial logistic regression model, where the various combinations of disease and haplotype-pair categories are treated as multinomial outcomes, and the environmental/demographical factors are used as covariates. We show that the proposed multinomial logistic model can be decomposed into a prospective logistic disease model relating the hap-lotype and environmental factors with disease, as well as a parametric model for the haplotype-pair distribution, conditional on the environmental covariates, among the control population. Compared to some existing methods such as the one proposed by Spinka et al. [9], our proposal differs from theirs in the way to model the haplotype-pair distribution: In our proposal the model for the haplotypepair distribution is specified only for the control population, while in the method of Spinka et al. such a model is specified for the whole population. Epstein and Satten [7] uses the same modeling strategy as ours, but their proposal cannot allow for environmental covariates. Our proposal is equivalent to the method of Epstein and Satten in the absence of environmental covariates and extends their method to incorporate environmental covariates.
The major advantage of the proposed approach lies in its computational convenience; it can be simply implemented by an iterative reweighted fit of a multinomial logistic regression model. Note that, when the model for the haplotype-pair distribution is specified for the whole population as in the method of Spinka et al. [9], the true intercept parameter in the prospective disease model, which quantifies the baseline population disease risk, appears as a separate parameter and needs to be estimated. Since usually there is little information on this parameter in a case-control sample, Spinka et al. [9] suggested using external information on the population disease prevalence or the rare-disease approximation to avoid estimating the true intercept parameter. In contrast, in our proposal the true intercept parameter does not appear as a separate parameter and hence needs not be estimated even when the disease is common and the population disease prevalence is unknown. Hence the proposed method has wide applicability. Simulation results reveal that the finite sample properties of the proposed estimates are satisfactory.

Model
Let D denote the disease phenotype with D = 1 indicating disease and D = 0 indicating no disease, G the unphased multilocus genotype for a series tightly linked SNPs, and X the demographical/environmental covariates. Suppose that there are J possible haplotypes denoted by { 0 ,..., J-1 } with 0 indicating the "reference" haplotype (usually the most frequent haplotype). Let (H 1 , H 2 ) be the haplotype pair (diplotype) for a subject and label all possible haplotype pairs by H = 0, 1, ..., M -1 so that "0" represents the reference haplotype pair ( 0 , 0 ) and M is the number of all possible haplotype pairs in the sample. The multinomial logistic model is a useful tool for regression analysis with multinomial responses [10,11]. Considering the various combinations of (D, H) as multinomial responses, we propose to base the haplotype association analysis on the following multinomial logistic model: with β 0 = β 0X = 0 and m(0, X; γ) = 0 for all X, where m(H, X; γ) is a known but arbitrary function of (H, X). Throughout the paper, a prime denotes matrix transposition.
To see the meanings of the parameters involved, rewrite (1)-(3) as β 0 = β 0X = 0 and m(0, X; γ) = 0 for all X. We can see that the parameters α and β X quantify respectively the baseline disease risk and the effect of environmental exposures on the disease risk for subjects with the reference haplotype pair.
The function m(H, X; γ) specifies the distribution of the haplotype pairs among the control population given the covariates X. The parameters β h and β hX (h = 1,..., M -1) measure, in the retrospective odds-ratio scale, respectively the main effect of the haplotype pair h (relative to the reference pair) and the interaction between the haplotype pair h and the covariates X. Note that although β h and β hX are specified through the retrospective model Pr(H|X, D), they are equivalent to the corresponding parameters specified in the prospective disease model Pr(D|X, H). In fact, according to (1)-(3), the joint probability of (D, H) given X is obtained as where Λ ih (X; θ) = i (α + β h + X' β X + X' β hX ) + m(h, X; γ), θ = (α, β X , β h , β hX , γ; h = 0,..., M -1) with β 0 = β 0X = 0 and m(0, X; γ) = 0 for all X. The prospective disease model Pr(D|H, X) is then given by Therefore, all the association (odds ratio) parameters regarding the associations between haplotype-environment factors and disease phenotype in the proposed model are equivalent to those specified in a prospective disease risk model. Note that the models (4)-(6) are equivalent to the prospective disease risk model (8) together with the model (5) for the haplotype-pair distribution in the controls. More examples for the specification of the genetic effects can be found in Epstein and Satten [7] and Zhao et al. [12]. Similarly, by taking = X * β int we can evaluate the interactions β int between haplotypes and a chosen environmental covariate X * . Although the above examples focus on effects associated with one single haplotype, using a collection of Z h variables for multiple causal haplotypes we can fit extensive models with multiple causal haplotypes; see Results, Application to the hypertriglyceridemia study for an illustration.
Various models for the haplotype-pair distribution in the control population can also be incorporated into the proposed model with suitable specifications of the function m(H, X; γ). Note that a saturated model for the haplotypepair distribution is not identifiable from the unphased genotype data [7], so certain modeling constraints must be imposed to ensure identifiability. One convenient specification is to assume that, in the control population, the haplotype-pair distribution is independent of the covariates X and satisfies the Hardy Weinberg equilibrium (HWE), namely. log where π j is the marginal frequency of haplotype j in the control population. Such a distribution corresponds to the specification m(H, X; γ) = γ, where W H is a J-vector with the jth component given by I (H 1 = j ) + I (H 2 = j ), and the jth component of γ is related to π by γ j = log (π j /π 0 ), j = 0,..., J -1. A more general assumption is to allow the HWE holds only within each of the strata defined by some categorical covariate S, and the corresponding specifica- where W H is denned as above and γ S is a stratum-specific parameter with the strata determined by S. For example, when population stratification exists, then the strata S can be defined by the subpopulations or ethnicity groups to account for the violation of HWE caused by population stratification. Another way to relax the HWE assumption is to introduce the fixation parameter [13] into the model m(H, X) for Pr(H|X, D = 0); see Satten and Epstein [14] for details.

Maximum likelihood estimation
Let N 1 and N 0 be respectively the number of cases and controls in the sample, and N = N 1 + N 0 the total sample size. For the uth subject in the case-control sample, u = 1,..., N, suppose that the observed data are (D u , G u , X u ), including the disease status D u , the unphased genotype G u , and the environmental covariates X u . The haplotype data H u cannot be uniquely determined from the unphased genotype data G u if the uth subject is heterozygous at more than one SNP.
Let S(G) be the set of labels of haplotype pairs that are consistent with unphased genotype G. Using (7), the probability of (D, G) given X can be expressed as In the following discussions we will assume that models for β H , β HX and m(H, X; γ) are specified in such a way that all the parameters involved are identifiable from prospective studies. In Appendix 1 we provide identifiability conditions of β H and β HX for a given model m(H, X; γ).
The loglikelihood for the observed data in a case-control sample is given by where Pr(D, H|X; θ) is given by (7), p(X) denotes the marginal density function of X, and We leave p(X) fully unspecified. Using the profile likelihood approach to profile out the nuisance parameter p(X), the retrospective loglikelihood (10) can be translated into a prospective loglikelihood.

The profile likelihood
Define θ * as θ with α replaced by α * . Under the multinomial logistic model Pr(D, H|X; θ) given in (7), the retrospective loglikelihood for the unphased genotype data in a case-control sample can be shown to be equivalent to the prospective loglikelihood where Pr * (D, H|X; θ * ) is defined as Pr(D, H|X; θ) with α replaced by α * . The derivation is relegated to Appendix 2.
This result is parallel to the classic one given by Prentice and Pyke [2] when haplotype data are directly observed. Note that in the prospective likelihood L P the original intercept α is absorbed into α * and does not appear as a separate parameter. Hence α is not identifiable and estimable, unless supplementary information on the population disease prevalence Pr(D = 1) is utilized to separate α from α * . Further, following the derivation in Prentice and Pyke [2] or Scott and Wild [15], we can show that the estimated covariance matrix for all the parameters except α * can be obtained from the corresponding submatrix of the observed information matrix based on L P . Therefore, under the proposed modeling setup the maximum likelihood inference on all the parameters except α can be based on the prospective loglikelihood L P , with the casecontrol sampling design being ignored. Another consequence is that, using the result of Scott and Wild [15], using external information on Pr(D = 1) in our proposal only affects the inference of the intercept parameter and does not affect the efficiency of estimates for all the other parameters.

EM algorithm
The maximum likelihood estimation based on L P can be simply implemented by the EM algorithm. Write ∂L P /∂θ * = ∑ u Ψ u (θ * ), and let which is the uth subject's "complete-data" score function based on Pr * (D, H|X; θ * ); see Appendix 3 for its explicit expression. It can be seen that where the expectation in the last equation is with respect to the conditional probability Pr * (H u = h|D u , G u , X u ; θ * ) ≡ ρ hu (θ * ) that is defined as Note that with the proposed model Pr(H = h|D, G, X; θ) does not depend on α, hence can be readily evaluated even if α cannot be reliably estimated, which is usually the case in case-control studies. This property is not shared by other modeling strategies such as those in Spinka et al. [9] and Zhao [12], unless the rare-disease assumption is made.
Equation (11) suggests that the maximum likelihood estimate of θ * can be obtained by the following EM algorithm. Given the estimate of θ * from the rth iteration, denoted by *(r) , we first evaluate the posterior haplotype-pair distribution ρ hu (θ * ) at θ * = *(r) . Then we obtain the updated estimate *(r+1) by solving θ * from where ρ hu = ρ hu (θ *(r) ). The process is repeated until some convergence criterion is met. Thus, with the proposed EM algorithm, maximum likelihood estimation with unphased genotype data under the proposed model can be readily implemented by iteratively fitting a weighted multinomial logistic regression model, with the iterativeupdated weights ρ hu (θ * ) given by (12). It can be seen that the EM algorithm above can accommodate not only the unphased genotype but also the missing genotype data.
When solving (13), we apply the Newton-Raphson algo- with respective to θ * approximated by the simple positivedefinite matrix where V * (·|X) denotes the variance with respect to Pr * (D, H|X; θ * ); see Appendix 3 for the derivation. Note that C is also an approximation for the "complete-data" information matrix based on Pr * (D, H|X; θ * ).
Let * be the final estimate given by the EM algorithm, provided convergence is achieved. Then * is the maximum likelihood estimate of θ * , which is asymptotically normal. The covariance matrix for the maximum likelihood estimates of all the parameters, except the intercept parameter α * , can be estimated by the corresponding submatrix of the inverse of the observed information matrix based on L P . Following Louis [16], the observed information matrix based on L P can be obtained by evaluated at θ * = * .
The EM algorithm described above is found to be sensitive to the initial estimate of γ, the parameter associated with the haplotype-pair distribution in the control population, in that the algorithm may fail to converge when using inappropriate initial estimates of γ. To obtain an adequate initial estimate of γ, we can adopt an approach similar to the proposal in Zhao et al. [12] to obtain the initial estimate for γ based on the control data only. For example, when the haplotype distribution in the controls follows HWE and is independent of environmental covariates, we can obtain initial estimate for γ, or equivalently the haplotype frequencies π in the controls, by solving π j , j = 0,...,  and π (-1) denotes the solution of π in the previous iteration. Our numerical experiments reveal that, starting with assigning equal frequency to each haplotype, two iterations in (14) are sufficient to yield an adequate initial estimate of γ.
To ensure stable estimation during the EM algorithm, for a haplotype with very small estimated frequency (e.g. <e -10 ), we will fix its frequency to be 0 and drop the corresponding parameter from γ.
We have developed a general-purposed SAS Macro for implementing the proposed method, which is available upon request from the corresponding author.

Application to the hypertriglyceridemia study
The proposed methodology is applied to the hypertriglyceridemia study conducted at National Taiwan University Hospital (Kao el al. [17], Tzeng et al. [18]), where 303 healthy controls and 290 cases, defined as serum triglyceride level > 400 mg/dl, were recruited. One primary objective of this study is to assess the association between the haplotypes in apolipoprotein A5 gene (APOA5) and hypertriglyceridemia in humans, adjusting for the environmental covariates Age, Sex, and BMI, and to explore the potential haplotype-environment interactions. 2), the indicator of BMI being larger than its mean in the controls. By this model we allow the dependence between BMI and haplotypes; the dependence between other environmental factors (Age, Sex) and haplotypes is less significant in light of preliminary analysis, and hence is not considered in the final analysis.
The analysis results are displayed in Table 1. Relative to the common haplotype GGGCT, the three hapiotypes AGGCC, GGTCT, and GAGTT are associated with higher risk of hypertriglyceridemia; the former two were also identified elsewhere by haplotype-specific analysis (Pennacchio et al. [19], Kao et al. [17]), and here by joint analysis of multiple haplotype effects we further identify GAGTT as a potential risk haplotype. The significant interaction term suggests that the effect associated with the haplotype GGTCT is modified by age: older carriers of the GGTCT haplotype have decreased risk for hypertriglyceridemia than younger carriers. The estimates of γ 1 (data not shown) reveal that BMI and the haplotypes GGTCT and GGGCT are dependent in the control population.

Simulation studies
The first simulation study is to examine the finite sample performance of the proposed method. In each replication, data on the environmental variable X are simulated from a standard normal distribution. Given X, the haplotypepair distribution in the control population is assumed to be Pr(H 1 = j , H 2 = k |X, D = 0) = Pr( j |S, D = 0) Pr( k |S, D = 0) with S = I(X > 0), namely the distribution of H follows HWE within each of the strata defined by whether X > 0 or not. This specification corresponds to a situation where the hapiotypes and environment covariates are dependent and hence the gene-environment independence assumption does not hold in the control population; see Chatterjee and Carroll [8] for some examples where gene and environmental factors may be dependent. Following Satten and Epstein [14], we assume the haplotypes contain 5  tightly-linked SNPs, and the associated haplotype frequencies {Pr( j |S,D = 0), j = 0,..., J-1} in stratum S are listed in Table 2 for S = 0, 1. By convention, we choose the most common haplotype 0 = "10011" as the reference haplotype. Further, we choose the haplotype 4 = "01100" as a putative susceptibility haplotype. Data on the haplotype pair and disease phenotype are then generated according to the multinomial logistic model (7) A case-control sample with 415 controls and 796 cases is then selected from a larger set of data. When analyzing the data, we ignore the phase information for haplotype data and use only the unphased genotype data. Further, we allow the genotype data to be missing independently and randomly for SNPs 1-5 with probabilities 2.9, 5.6, 5.4, 4.5, and 2.3%, respectively.
In each setting considered, we set α = -3, β X = 0.3 and β hap = 0.1. The haplotype-environment interaction parameter β int , which is the main focus here, is set to 0, 0.15, or 0.3.
The results based on 500 replications are displayed in Table 3. The point and standard error estimates for the association parameters are essentially unbiased, and the coverage of the 95% confidence intervals is close to the nominal value. In particular, the size of the Wald test for testing H 0 : β int = 0 essentially attains its nominal value, and the associated power for detecting haplotype-environment interaction is satisfactory when β int = 0.3. The point and standard error estimates for the haplotype frequencies also match the true values well (results not shown). We also conduct analysis where the effects of rare haplotypes (with frequency 1%-5%) are estimated (data not shown). The point estimate essentially remains unbiased, while the standard-error estimate underestimates the true value. As commented by a referee, here a permutation-based procedure may be required for proper inference.
In the second simulation study, we compare the proposed method with some existing alternatives, including the methods by Zhao et al. [12] and by Spinka et al. [9]. To facilitate the comparison among these methods that are based on different assumptions, we conduct the simulation under a setting where the disease is rare, and the haplotype-pair distribution in the whole population follows HWE and is independent of the environmental covariate. All the methods considered can apply well under this setting. Specifically, in each simulation we first simulate the environmental covariate X from a standard normal distribution, and then simulate the haplotype pairs in the whole population according to HWE with the marginal haplotype frequencies given in the third column (S = 0) of Table 2. The disease phenotype is then generated by the logistic regression model information is ignored and only the unphased genotype information is used when analyzing the simulated data. We also allow the genotype data for SNPs 1-5 to be missing independently and randomly with probabilities as in the first simulation study.
When applying the methods of Zhao et al. [12] and Spinka et al. [9], we employ the same models used to generate the data, hence the model specification is fully correct. The method of Spinka et al. is implemented using either a grid-search method or the known value of Pr(D = 1) to estimate α. When applying our proposal, we simply assume the control haplotype-pair distribution is independent of X and satisfies HWE, which is in fact a moderately wrong specification in the current simulation setting. Table 4 exhibits the simulation results based on 500 replications. Although applied with a moderate model misspecification, the estimates from our proposal have small bias, and are more efficient than estimates from other approaches. The method of Spinka et al. [9] is less efficient than our proposal, even if the former further incorporates supplementary information on population disease prevalence. It is worth noting that, although the method of   [12], consistent with the finding of Satten and Epstein [14] that fully prospective analysis such as the method of Zhao et al. may lose considerable efficiency for haplotype association analysis in case-control studies.
To examine the sensitivity of the proposed method to the model assumptions, we conduct a simulation study to assess the bias of the estimates when the haplotype-pair distribution is wrongly assumed to follow HWE in the proposed method. Specifically, here we generate the haplotype data in the controls from the model where the fixation index parameter f = 0.1 or 0.2, and π j , j = 0,...,16, are haplotype frequencies given in the third column (S = 0) of Table 2; namely, the control haplotype-pair distribution does not follow HWE. However, we wrongly assume that HWE holds for this distribution in our analysis model. The environmental covariate X is generated from standard normal distribution, and the disease phenotype is generated according to (15). From the results shown in Table 5, we observe remarkable biases for the estimates of the main haplotype effect β hap when the genetic law is recessive or dominant, though the estimates for the interaction parameter β int is rather robust. This observation is consistent with that made by Satten and Epstein [14]: the methods based on the retrospective likelihood, such as our proposal, is generally less robust to the model assumptions such as HWE.

Discussion
In the absence of environmental covariates, Epstein and Satten [7] (see also Satten and Epstein [14]) constructed their likelihood using a fully retrospective parameterization based only on (5) and (6). Therefore, in the absence of environmental covariates, the retrospective likelihood in Epstein and Satten [7] is exactly equivalent to that considered in our proposal. Our proposal, though based on a retrospective likelihood, can be implemented as if it were a prospective likelihood by introducing a multinomial logistic model and applying the profile livelihood approach given in Methods section. This is a preferred fea-   Recently, based on the work of Chatterjee and Carroll [8], Spinka et al. [9] developed a profile likelihood approach to genetic association analysis with missing genetic information. In particular, for haplotype association analysis with unphased genotype data, their approach is based on a prospective logistic disease model Pr(D|H, X) together with a parametric model for Pr(H|X), the haplotype-pair distribution in the whole population given the environmental covariates. The implementation of the method of Spinka et al. generally requires estimating the true intercept parameter α in the prospective disease model (8).
Since in a case-control sample there would be little information on this parameter due to the nature of biased sampling, the information matrix would he nearly singular, causing computational problems. Spinka et al. [9] suggested using either a grid-search method or supplementary information on the population disease risk Pr(D = 1) to estimate α. They also suggested a rare-disease approximation thereby estimation of α is not needed. Note that    Although both the two methods have a prospective logistic model Pr(D|H, X) for the disease risk, our proposal specifies a model for Pr(H|X, D) = 0), the distribution of haplotype pairs in the control population given the covariates, while the method of Spinka et al. specifies the whole-population counterpart Pr(H|X).
In practice, although the assumption of HWE in the whole population may usually imply the same assumption hold in the controls, on the contrary, HWE in the controls may not necessarily imply HWE in the whole population when the disease is not rare. Owing to the nature of biased sampling, in a case-control sample it would be more plausible to check distributional assumptions for the control population than for the whole population, since in case-control studies estimating the control distribution Pr(H|X, D = 0) is more straightforward than estimating the population distribution Pr(H|X). Consequently, the proposed modeling strategy seems more suited to the case-control design.
When the disease is rare, Spinka et al. [9] suggested a raredisease approximation when the disease prevalence is unknown. Their approximated likelihood has the same form as our proposal. Note that the validity of our likelihood does not rely on any rare-disease assumption; it serves as an exact likelihood in its own right under the modeling setup we consider. This implies that, for a common disease with unknown disease prevalence, our proposal can still work well when a suitable model can be specified for the haplotype-pair distribution in the controls, while the proposal of Spinka et al. with the rare-disease approximation can result in severe bias in this case.
To illustrate this, we conduct a simulation with the same setting as in the second simulation study described in the previous section, except that a is now set to -0.5, corresponding to a common-disease situation. Zhao et al. [12] proposed an estimating equation approach to haplotype association analysis, which is based on the score of a prospective likelihood; a similar prospective-likelihood approach for haplotype-environment interaction is also proposed by Lake et al [21]. These approaches are very easy to implement, and are particularly suitable for the case where a larger number of SNPs are involved. Also, they are found to be quite robust to mis-specification of the haplotype-pair distribution. The main drawback for such approaches is that they may be remarkably inefficient as compared to the retrospective likelihood-based methods such as our proposal and the method of Spinka et al. [9]; see Satten and Epstein [14] for a comprehensive efficiency comparison.

Conclusion
To assess the association of haplotype and environmental factors with disease, we have proposed a new modeling setup that can be integrated into a multinomial logistic model, and can also be decomposed into a prospective logistic model relating the haplotype and environmental factors with disease, as well as a parametric model for the haplotype-pair distribution in the control population given the environmental covariates. The new proposal amounts to a natural extension of the method of Epstein and Satten [7] to further incorporate environmental covariates. The modeling strategy we adopt is very suited to the case-control design in the sense that, in contrast to the procedure proposed by Spinka et al. [9], the maximum likelihood estimation for the proposed model does not require any information on the population disease risk, which is usually lacking in a case-control sample. In fact, the maximum likelihood estimation for the proposed model with case-control data can be readily performed by applying a typical EM algorithm to a prospective multinomial logistic regression model. A SAS Macro implementing the proposed method is available from the corresponding author.
Note that the proposed method does not rely on specific modeling assumptions such as rare-disease, gene-environmental independence, and Hardy-Weinberg equilibrium assumptions, hence is applicable in very general settings, as long as the models involved are identifiable and appropriately specified. In addition, simulation results show that our proposal can achieve satisfactory efficiency. Accordingly, our proposal may serve as a useful tool for assessing the haplotype-environment associations with disease in population-based case-control studies. One limitation is that, unlike the prospective analysis, the proposed method, which is based on a retrospective likelihood, is sensitive to the model assumptions; namely, model mis-specification may lead to substantial bias, as commented by Epstein and Satten [7]. Therefore, the model specification is crucial in the proposed method to warrant valid analysis.
Some additional work is needed to strengthen the utility of the proposed method. First, if the main interest is in the association parameters and the haplotype-pair distribution is regarded as nuisance, then it would be desirable to improve the proposed method so that it can still yield valid estimates for the association parameters while allowing the model for the haplotype-pair distribution to be slightly misspecified.
Another important issue for haplotype-based association analysis involves the variable selection. When dealing with a larger number of haplotypes, efficient and effective methodologies for variable selection is crucial for finding the haplotypes contributing to liability for complex diseases [24,1]. Promising strategies include the step-wise selection [22], Lasso [23,24] and the false discovery rate (FDR) procedure [25,24]. How to incorporate appropriate variable selection procedures in the proposed multinomial logistic model with unphased genotype data deserves further investigation.