Model
Let D denote the disease phenotype with D = 1 indicating disease and D = 0 indicating no disease, G the unphased multilocus genotype for a series tightly linked SNPs, and X the demographical/environmental covariates. Suppose that there are J possible haplotypes denoted by {ħ0,..., ħJ-1} with ħ0 indicating the "reference" haplotype (usually the most frequent haplotype). Let (H1, H2) be the haplotype pair (diplotype) for a subject and label all possible haplotype pairs by H = 0, 1, ..., M - 1 so that "0" represents the reference haplotype pair (ħ0, ħ0) and M is the number of all possible haplotype pairs in the sample.
The multinomial logistic model is a useful tool for regression analysis with multinomial responses [10, 11]. Considering the various combinations of (D, H) as multinomial responses, we propose to base the haplotype association analysis on the following multinomial logistic model:
with β0 = β0X= 0 and m(0, X; γ) = 0 for all X, where m(H, X; γ) is a known but arbitrary function of (H, X). Throughout the paper, a prime denotes matrix transposition.
To see the meanings of the parameters involved, rewrite (1)-(3) as
β0 = β0X= 0 and m(0, X; γ) = 0 for all X. We can see that the parameters α and β
X
quantify respectively the baseline disease risk and the effect of environmental exposures on the disease risk for subjects with the reference haplotype pair. The function m(H, X; γ) specifies the distribution of the haplotype pairs among the control population given the covariates X. The parameters β
h
and β
hX
(h = 1,..., M - 1) measure, in the retrospective odds-ratio scale, respectively the main effect of the haplotype pair h (relative to the reference pair) and the interaction between the haplotype pair h and the covariates X. Note that although β
h
and β
hX
are specified through the retrospective model Pr(H|X, D), they are equivalent to the corresponding parameters specified in the prospective disease model Pr(D|X, H). In fact, according to (1)-(3), the joint probability of (D, H) given X is obtained as
where
Λ
ih
(X; θ) = i (α + β
h
+ X' β
X
+ X' β
hX
) + m(h, X; γ),
θ = (α, β
X
, β
h
, β
hX
, γ; h = 0,..., M - 1) with β0 = β0X= 0 and m(0, X; γ) = 0 for all X. The prospective disease model Pr(D|H, X) is then given by
Therefore, all the association (odds ratio) parameters regarding the associations between haplotype-environment factors and disease phenotype in the proposed model are equivalent to those specified in a prospective disease risk model. Note that the models (4)–(6) are equivalent to the prospective disease risk model (8) together with the model (5) for the haplotype-pair distribution in the controls.
The proposed modeling setup can allow flexible modeling of the effects of haplotype pairs. Take β
h
= β
hap
where Z
h
is a vector of coded values for the haplotype pair h according to a certain genetic law, and β
hap
is the vector of associated regression coefficients quantifying the marginal haplotype effects relative to the reference haplotype. For example, to evaluate the effect of a specific haplotype ħ*, we can set Z
h
= I (h1 = ħ *) I (h2 = ħ*) for a recessive genetic law, Z
h
= I (h1 = ħ*)+ I (h2 = ħ*) - I (hl= ħ*) I (h2 = ħ*) for a dominant law, and Z
h
= I (hl= ħ*) + I (h2 = ħ*) for a multiplicative law, where I (A) is the indicator function which equals one if the event A occurs and equals zero otherwise. More examples for the specification of the genetic effects can be found in Epstein and Satten [7] and Zhao et al. [12]. Similarly, by taking = X*β
int
we can evaluate the interactions β
int
between haplotypes and a chosen environmental covariate X*. Although the above examples focus on effects associated with one single haplotype, using a collection of Z
h
variables for multiple causal haplotypes we can fit extensive models with multiple causal haplotypes; see Results, Application to the hypertriglyceridemia study for an illustration.
Various models for the haplotype-pair distribution in the control population can also be incorporated into the proposed model with suitable specifications of the function m(H, X; γ). Note that a saturated model for the haplotype-pair distribution is not identifiable from the unphased genotype data [7], so certain modeling constraints must be imposed to ensure identifiability. One convenient specification is to assume that, in the control population, the haplotype-pair distribution is independent of the covariates X and satisfies the Hardy Weinberg equilibrium (HWE), namely.
Pr(H1 = h
j
, H2 = ħ
k
|X, D = 0) = Pr(H1 = h
j
, H2 = ħ
k
|D = 0) = π
j
π
k
, j, k = 0, ..., J - 1,
where π
j
is the marginal frequency of haplotype ħ
j
in the control population. Such a distribution corresponds to the specification m(H, X; γ) = γ, where W
H
is a J-vector with the j th component given by I (H1= ħ
j
) + I (H2 = ħ
j
), and the j th component of γ is related to π by γ
j
= log (π
j
/π0), j = 0,..., J - 1. A more general assumption is to allow the HWE holds only within each of the strata defined by some categorical covariate S, and the corresponding specification of m(H, X; γ) can be expressed as m(H, X; γ) = m(H, S; γ) = γ
S
, where W
H
is denned as above and γ
S
is a stratum-specific parameter with the strata determined by S. For example, when population stratification exists, then the strata S can be defined by the subpopulations or ethnicity groups to account for the violation of HWE caused by population stratification. Another way to relax the HWE assumption is to introduce the fixation parameter [13] into the model m(H, X) for Pr(H|X, D = 0); see Satten and Epstein [14] for details.
Maximum likelihood estimation
Let N1 and N0 be respectively the number of cases and controls in the sample, and N = N1 + N0 the total sample size. For the u th subject in the case-control sample, u = 1,..., N, suppose that the observed data are (D
u
, G
u
, X
u
), including the disease status D
u
, the unphased genotype G
u
, and the environmental covariates X
u
. The haplotype data H
u
cannot be uniquely determined from the unphased genotype data G
u
if the u th subject is heterozygous at more than one SNP.
Let S(G) be the set of labels of haplotype pairs that are consistent with unphased genotype G. Using (7), the probability of (D, G) given X can be expressed as
In the following discussions we will assume that models for β
H
, β
HX
and m(H, X; γ) are specified in such a way that all the parameters involved are identifiable from prospective studies. In Appendix 1 we provide identifiability conditions of β
H
and β
HX
for a given model m(H, X; γ).
The loglikelihood for the observed data in a case-control sample is given by
where Pr(D, H|X; θ) is given by (7), p(X) denotes the marginal density function of X, and We leave p(X) fully unspecified. Using the profile likelihood approach to profile out the nuisance parameter p(X), the retrospective loglikelihood (10) can be translated into a prospective loglikelihood.
The profile likelihood
Let α* = α + log(ν
1
/ν0) with ν
i
= N
i
/{N Pr(D = i)}, i = 0,1. Define θ* as θ with α replaced by α*. Under the multinomial logistic model Pr(D, H|X; θ) given in (7), the retrospective loglikelihood for the unphased genotype data in a case-control sample can be shown to be equivalent to the prospective loglikelihood
where Pr*(D, H|X; θ*) is defined as Pr(D, H|X; θ) with α replaced by α*. The derivation is relegated to Appendix 2.
This result is parallel to the classic one given by Prentice and Pyke [2] when haplotype data are directly observed. Note that in the prospective likelihood L
P
the original intercept α is absorbed into α* and does not appear as a separate parameter. Hence α is not identifiable and estimable, unless supplementary information on the population disease prevalence Pr(D = 1) is utilized to separate α from α*. Further, following the derivation in Prentice and Pyke [2] or Scott and Wild [15], we can show that the estimated covariance matrix for all the parameters except α* can be obtained from the corresponding submatrix of the observed information matrix based on L
P
. Therefore, under the proposed modeling setup the maximum likelihood inference on all the parameters except α can be based on the prospective loglikelihood L
P
, with the case-control sampling design being ignored. Another consequence is that, using the result of Scott and Wild [15], using external information on Pr(D = 1) in our proposal only affects the inference of the intercept parameter and does not affect the efficiency of estimates for all the other parameters.
EM algorithm
The maximum likelihood estimation based on L
P
can be simply implemented by the EM algorithm. Write ∂ L
P
/∂ θ* = ∑
u
Ψ
u
(θ*), and let
which is the u th subject's "complete-data" score function based on Pr*(D, H|X; θ*); see Appendix 3 for its explicit expression. It can be seen that
where the expectation in the last equation is with respect to the conditional probability Pr*(H
u
= h|D
u
, G
u
, X
u
; θ*) ≡ ρ
hu
(θ*) that is defined as
Note that with the proposed model Pr(H = h|D, G, X; θ) does not depend on α, hence can be readily evaluated even if α cannot be reliably estimated, which is usually the case in case-control studies. This property is not shared by other modeling strategies such as those in Spinka et al. [9] and Zhao [12], unless the rare-disease assumption is made.
Equation (11) suggests that the maximum likelihood estimate of θ* can be obtained by the following EM algorithm. Given the estimate of θ* from the r th iteration, denoted by *(r), we first evaluate the posterior haplotype-pair distribution ρ
hu
(θ*) at θ* = *(r). Then we obtain the updated estimate *(r+1)by solving θ* from
where ρ
hu
= ρ
hu
(θ*(r)). The process is repeated until some convergence criterion is met. Thus, with the proposed EM algorithm, maximum likelihood estimation with unphased genotype data under the proposed model can be readily implemented by iteratively fitting a weighted multinomial logistic regression model, with the iterative-updated weights ρ
hu
(θ*) given by (12). It can be seen that the EM algorithm above can accommodate not only the unphased genotype but also the missing genotype data.
When solving (13), we apply the Newton-Raphson algorithm with the negative derivative of ∑
u
∑
h
ρ
hu
(θ*) with respective to θ* approximated by the simple positive-definite matrix
where V*(·|X) denotes the variance with respect to Pr*(D, H|X; θ*); see Appendix 3 for the derivation. Note that Cis also an approximation for the "complete-data" information matrix based on Pr*(D, H|X; θ*).
Let * be the final estimate given by the EM algorithm, provided convergence is achieved. Then * is the maximum likelihood estimate of θ*, which is asymptotically normal. The covariance matrix for the maximum likelihood estimates of all the parameters, except the intercept parameter α*, can be estimated by the corresponding submatrix of the inverse of the observed information matrix based on L
P
. Following Louis [16], the observed information matrix based on L
P
can be obtained by
evaluated at θ* = *.
The EM algorithm described above is found to be sensitive to the initial estimate of γ, the parameter associated with the haplotype-pair distribution in the control population, in that the algorithm may fail to converge when using inappropriate initial estimates of γ. To obtain an adequate initial estimate of γ, we can adopt an approach similar to the proposal in Zhao et al. [12] to obtain the initial estimate for γ based on the control data only. For example, when the haplotype distribution in the controls follows HWE and is independent of environmental covariates, we can obtain initial estimate for γ, or equivalently the haplotype frequencies π in the controls, by solving π
j
, j = 0,..., J - 1 from
where
and π(-1) denotes the solution of π in the previous iteration. Our numerical experiments reveal that, starting with assigning equal frequency to each haplotype, two iterations in (14) are sufficient to yield an adequate initial estimate of γ.
To ensure stable estimation during the EM algorithm, for a haplotype with very small estimated frequency (e.g. <e-10), we will fix its frequency to be 0 and drop the corresponding parameter from γ.
We have developed a general-purposed SAS Macro for implementing the proposed method, which is available upon request from the corresponding author.