Estimating effects of rare haplotypes on failure time using a penalized Cox proportional hazards regression model

Background: This paper describes a likelihood approach to model the relation between failure time and haplotypes in studies of unrelated individuals where haplotype phase is unknown, while dealing with the problem of unstable estimates due to rare haplotypes by considering a penalized log-likelihood.

Results: The Cox model presented here incorporates the uncertainty related to the unknown phase of multiply heterozygous individuals as weights. Estimation is performed with an EM algorithm. In the E-step the weights are estimated, and in the M-step the regression parameters are estimated by maximizing the expectation of the joint log-likelihood, and the baseline hazard function and haplotype frequencies are calculated. These steps are iterated until the estimates converge. Two penalty functions are considered, namely the ridge penalty and a difference penalty, the latter based on the assumption that similar haplotypes show similar effects. Simulations were conducted to investigate the properties of the method, and the association between IL10 haplotypes and the risk of target vessel revascularization was investigated in 2653 patients from the GENDER study.

Conclusion: Results from simulations and real data show that the penalized log-likelihood approach produces valid results, indicating that this method is of interest when studying the association between rare haplotypes and failure time in studies of unrelated individuals.


Background
In recent years there has been great interest in associating haplotypes with complex disease phenotypes, and many statistical models have been described. These models are complicated by individuals that are heterozygous at two or more of the biallelic single nucleotide polymorphisms (SNPs) forming the haplotype, because their haplotypes cannot be determined with certainty. Consider for instance two SNPs with alleles A or a, and B or b. Individuals that are heterozygous for both SNPs have genotypes Aa and Bb, and they inherited either haplotype AB from one parent and ab from the other, or they inherited haplotypes Ab and aB. Hence, it is unknown whether these individuals have haplotype pair AB/ab or haplotype pair Ab/aB. This uncertainty complicates statistical inference, and if the number of SNPs is large, or when allele frequencies of the SNPs are close to 50%, many individuals will be heterozygous at multiple loci.
Most of the current methods focus on continuous or dichotomous outcome data [1][2][3][4][5][6][7][8][9], while only few can be applied in cohort studies [10,11]. Another concern is related to the presence of rare haplotypes, which is a very common problem in genetic association studies. In the present paper we adopt the suggestion of Tanck et al. [12] to use a weighted penalized likelihood method to estimate the association between a phenotype and the set of haplotypes, which may include rare haplotypes. We consider a model for the relation between a failure time T measured in N unrelated individuals and the haplotypes of these individuals formed by m SNPs measured in a single gene. Previously, Lin [10] has described a similar method for haplotype analysis in cohort studies, but this method did not include a penalty function for dealing with the unstable estimates of rare haplotypes.
In the sequel we will first describe the kind of data that we analyze, then we will describe our statistical model and the algorithm to estimate its parameters. We will use simulated data to illustrate some characteristics of the estimators, and finally we will analyze real data from the GENDER study on cardiovascular disease [13].

Data and model
We consider a sample of i = 1,..., N unrelated individuals with failure or censoring time t_i. The indicator d_i denotes whether t_i is an event time (d_i = 1) or a censoring time (d_i = 0). Let g_i be a vector of m biallelic SNPs measured in individual i. With m biallelic SNPs there are j = 1,..., Nhap = 2^m different haplotypes possible, with population frequencies p_1,..., p_j,..., p_Nhap.
Suppose all haplotypes were observed in all patients; then these could be represented by the vector x_i of length Nhap, where x_ij equals 0, 1, or 2, depending on the number of haplotypes of type j observed in patient i. (Notice that Σ_j x_ij = 2, meaning that only Nhap - 1 contrasts are identifiable.) The conditional hazard function for failure at t_i given x_i can then be specified as

h(t_i | x_i) = h_0(t_i) exp(x_i' β),    (1)

where h_0(t) is an unspecified baseline hazard function and β a vector of regression parameters. The survival function is then

S(t_i | x_i) = exp{ -H_0(t_i) exp(x_i' β) },    (2)

where H_0(t) is the unspecified baseline cumulative hazard function.
In our case, haplotypes were not observed in all patients, and therefore we model the survival function conditional on the observed genotypes as a weighted mixture over all z_i possible haplotype pairs given genotypes g_i:

S(t_i | g_i) = Σ_{q=1}^{z_i} w_iq S_q(t_i | x_iq),    (3)

where S_q(t_i | x_iq) is the survival function specified as in equation (2), x_iq is the q-th haplotype pair that is possible given genotypes g_i, and w_iq is the probability that individual i has haplotype pair q given genotypes g_i: w_iq = Pr(X_i = x_iq | g_i).
In most circumstances the population haplotype frequencies p_1,..., p_Nhap are unknown and must be estimated from the data at hand. But suppose these are known; then under the assumption of Hardy-Weinberg equilibrium the probability that an individual carries haplotype pair q = (h, r) is proportional to p_h p_r (h, r = 1,..., Nhap). Consequently, when observing genotype vector g_i the probability that individual i has haplotype pair q = (h, r) equals

w_iq = p_h p_r / Σ_{(h',r')} d_h'r'i p_h' p_r',    (4)

where the summation takes place over all haplotype pairs that are compatible with genotypes g_i, and d_hri is an indicator function, which equals 1 when the haplotype pair (h, r) is compatible with g_i and 0 otherwise.
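As an illustration of the mixture survival function and the Hardy-Weinberg pair weights, the following Python sketch (with made-up haplotype labels, frequencies, and effect sizes; the authors' own implementation is in MATLAB and R) computes the weights for a double heterozygote and the resulting mixture survival probability:

```python
import math

def pair_weights(compatible_pairs, p):
    """Probability of each compatible haplotype pair (h, r) given the
    genotype, under Hardy-Weinberg equilibrium: proportional to p_h * p_r."""
    raw = [p[h] * p[r] for (h, r) in compatible_pairs]
    total = sum(raw)
    return [x / total for x in raw]

def mixture_survival(t, compatible_pairs, w, beta, H0):
    """Survival conditional on the genotype: a weighted mixture of Cox
    survival curves S(t | x) = exp(-H0(t) * exp(x' beta))."""
    s = 0.0
    for (h, r), wq in zip(compatible_pairs, w):
        lin = beta.get(h, 0.0) + beta.get(r, 0.0)  # linear predictor x' beta
        s += wq * math.exp(-H0 * math.exp(lin))
    return s

# Double heterozygote AaBb: pair AB/ab or Ab/aB (assumed frequencies).
p = {"AB": 0.5, "ab": 0.3, "Ab": 0.1, "aB": 0.1}
pairs = [("AB", "ab"), ("Ab", "aB")]
w = pair_weights(pairs, p)          # [0.9375, 0.0625]
beta = {"AB": 0.69}                 # hypothetical log hazard ratio of AB
S = mixture_survival(1.0, pairs, w, beta, H0=0.1)
```

Note how the common-haplotype resolution AB/ab dominates the weights, so the mixture survival curve is pulled towards that pair's Cox curve.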
Given the weights w_iq, the full likelihood of the data under the model in equation (3) equals

L(β, H_0) = Π_{i=1}^{N} Σ_{q=1}^{z_i} w_iq [h_0(t_i) exp(x_iq' β)]^{d_i} exp{ -H_0(t_i) exp(x_iq' β) },    (5)

and the regression parameters β and the baseline cumulative hazard function H_0(t) may be estimated by maximizing (5). If H_0(t) is a parametric function, maximization of (5) is uncomplicated. If H_0(t) is a nonparametric function, we follow a similar argument as Breslow [14]. If events occurred at times τ_1,..., τ_j,..., τ_k, we shift all censored observations in the interval [τ_{j-1}, τ_j) to τ_{j-1}, and assume a (piecewise) constant hazard in these intervals. In that case the logarithm of the likelihood (5) reduces to

ℓ(β, λ) = Σ_{i=1}^{N} log Σ_{q=1}^{z_i} w_iq [λ_{j(i)} exp(x_iq' β)]^{d_i} exp{ -exp(x_iq' β) Σ_{j=1}^{k} λ_j I(t_i ≥ τ_j) },    (6)

where λ_j is the jump in the cumulative baseline hazard function at time τ_j, j(i) indexes the event time at t_i, and I(t_i ≥ τ_j) is an indicator function.
Estimation equations for H_0(t) and β can be derived by equating to zero the first-order derivatives of the log-likelihood in (6). For λ_j we find

λ_j = Σ_i d_i I(t_i = τ_j) / Σ_i I(t_i ≥ τ_j) Σ_q w̃_iq exp(x_iq' β),    (7)

and for β_ℓ (ℓ = 1,..., Nhap):

Σ_i Σ_q w̃_iq x_iqℓ [ d_i - H_0(t_i) exp(x_iq' β) ] = 0,    (8)

where

w̃_iq = w_iq [λ_{j(i)} exp(x_iq' β)]^{d_i} exp{ -H_0(t_i) exp(x_iq' β) } / Σ_{r=1}^{z_i} w_ir [λ_{j(i)} exp(x_ir' β)]^{d_i} exp{ -H_0(t_i) exp(x_ir' β) }.    (9)

Notice that if z_i = 1 for all i, thus when all haplotypes were observed, then w̃_iq = 1, and equation (7) reduces to the usual Breslow estimator of H_0(t), and equation (8) to the usual Cox estimator of β.
Unfortunately, the weights w̃_iq themselves depend on both H_0(t) and β.

Estimation algorithm
Since in practice the population haplotype frequencies p must be estimated together with β and H_0(t), and because w̃_iq depends on H_0(t) and β, direct maximization of (5) or (6) is complicated. Instead we propose to use an EM algorithm. For that we consider the covariate vector x_i as missing, and we maximize the expectation of the joint log-likelihood given the observed data:

E[ℓ_c | t, d, g] = Σ_{i=1}^{N} Σ_{q=1}^{z_i} ŵ_iq { d_i (log λ_{j(i)} + x_iq' β) - exp(x_iq' β) Σ_j λ_j I(t_i ≥ τ_j) + log w_iq },    (10)

where ŵ_iq is the posterior probability that X_i = x_iq given (t_i, d_i, g_i):

ŵ_iq = w_iq [λ_{j(i)} exp(x_iq' β)]^{d_i} exp{ -H_0(t_i) exp(x_iq' β) } / Σ_{r=1}^{z_i} w_ir [λ_{j(i)} exp(x_ir' β)]^{d_i} exp{ -H_0(t_i) exp(x_ir' β) },    (11)

and

w_iq = p_h p_r / Σ_{(h',r')} d_h'r'i p_h' p_r'   for q = (h, r).    (12)

The EM algorithm then consists of iterating two steps. In the M-step of iteration a + 1, (β, H_0(t), p) are estimated by maximizing (10) given ŵ_iq evaluated using (β^(a), H_0(t)^(a), p^(a)), and in the E-step, ŵ_iq is re-estimated using (11) with (β^(a+1), H_0(t)^(a+1), p^(a+1)).
Estimation equations for β and H_0(t) are the same as (8) and (7), but with w̃_iq replaced by ŵ_iq, and these are solved iteratively. The haplotype relative frequencies p_j are estimated as

p̂_j = (1 / 2N) Σ_{i=1}^{N} Σ_{q=1}^{z_i} ŵ_iq I(j, q),    (13)

where I(j, q) is an indicator function denoting whether haplotype j is part of haplotype combination q. Standard errors of β can be derived from the information matrix of the log-likelihood in equation (6):

I(β) = Σ_i Σ_q ŵ_iq B_iq - Σ_i Σ_q ŵ_iq U_iq U_iq' + Σ_i (Σ_q ŵ_iq U_iq)(Σ_q ŵ_iq U_iq)',    (14)

where U_iq and B_iq denote, respectively, the first-order derivative and minus the second-order derivative with respect to β of the complete-data log-likelihood contribution of haplotype pair q in individual i. Since we are mainly interested in the uncertainty of β, we only used the part of the Hessian that pertains to β. Notice that the first term of (14) equals the Hessian matrix that is used in the M-step of the EM algorithm.
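The E- and M-steps can be made concrete with a minimal genotype-only sketch: with β = 0 the posterior probability (11) reduces to the Hardy-Weinberg prior weight, and the M-step reduces to averaging the expected haplotype counts over all individuals. This is an illustrative special case written in Python, not the authors' full algorithm, which also updates β and H_0(t) in every M-step:

```python
from itertools import product

def compatible_pairs(genotype):
    """All ordered haplotype pairs (h, r) consistent with a genotype vector,
    coded per SNP as 0/1/2 copies of the '1' allele."""
    options = []
    for g in genotype:
        if g == 0:
            options.append([(0, 0)])
        elif g == 2:
            options.append([(1, 1)])
        else:
            options.append([(0, 1), (1, 0)])   # heterozygous: phase unknown
    pairs = set()
    for combo in product(*options):
        h = tuple(a for a, _ in combo)
        r = tuple(b for _, b in combo)
        pairs.add((h, r))
    return sorted(pairs)

def em_frequencies(genotypes, n_iter=100):
    """EM estimate of haplotype frequencies: the E-step computes posterior
    pair weights under Hardy-Weinberg, the M-step averages the expected
    haplotype counts over 2N chromosomes."""
    haps = set()
    for g in genotypes:
        for h, r in compatible_pairs(g):
            haps.add(h)
            haps.add(r)
    p = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            pairs = compatible_pairs(g)
            raw = [p[h] * p[r] for h, r in pairs]   # unnormalized weights
            tot = sum(raw)
            for (h, r), x in zip(pairs, raw):
                counts[h] += x / tot                # expected copies of h
                counts[r] += x / tot
        p = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return p

# Three individuals typed at two SNPs: two homozygous 0/0, one homozygous 1/1.
p_hat = em_frequencies([(0, 0), (0, 0), (2, 2)])
```

In the full algorithm the raw weights in the E-step would additionally carry the survival factor of equation (11), so that individuals' observed failure times also inform the phase resolution.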

Penalized log-likelihood
Mutations in general tend to be rare, and so are the haplotypes in which they are embedded. Furthermore, when ten loci are considered there are 2^10 = 1024 possible haplotypes, many of which will have a low frequency in samples of up to thousands of individuals. If haplotypes have a low frequency, their associated hazard ratio estimates will be unstable. We used a penalized log-likelihood method to obtain more stable parameter estimates. Basically, we optimized the penalized log-likelihood ℓ_p, defined as

ℓ_p(β) = ℓ(β) - λ Pen(β),

where ℓ(β) is the log-likelihood in (6), λ ≥ 0 is the penalty parameter (not to be confused with the jumps λ_j), and Pen(β) is the penalty function.
As penalty functions we considered the well-known ridge penalty

Pen(β) = Σ_j β_j^2,

and a difference penalty

Pen(β) = Σ_{a<b} a_ab (β_a - β_b)^2,

where a_ab is a fixed and known value representing the similarity of haplotypes a and b. We quantified the similarity between haplotypes (a_ab) by counting the number of shared alleles, which, with m loci, varies between zero and m - 1.
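The two penalty functions can be sketched as follows (Python, with haplotypes coded as allele tuples; the similarity a_ab is the shared-allele count described above):

```python
def ridge_penalty(beta):
    """Ridge penalty: sum of squared log hazard ratios."""
    return sum(b * b for b in beta)

def similarity(hap_a, hap_b):
    """Number of shared alleles between two haplotypes (a_ab)."""
    return sum(a == b for a, b in zip(hap_a, hap_b))

def difference_penalty(beta, haplotypes):
    """Difference penalty: pairs of similar haplotypes are pushed
    towards similar effects, weighted by their similarity a_ab."""
    pen = 0.0
    for a in range(len(beta)):
        for b in range(a + 1, len(beta)):
            pen += similarity(haplotypes[a], haplotypes[b]) * (beta[a] - beta[b]) ** 2
    return pen
```

The ridge penalty shrinks every effect towards zero, whereas the difference penalty shrinks effects towards those of neighbouring haplotypes, which is what drives the behaviour seen in the simulations below.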
The penalty parameter λ was found by optimizing the cross-validated log-likelihood (CVL) as described by Verweij and van Houwelingen [15]:

CVL(λ) = ℓ(β_λ) - c(λ),    (15)

with the penalized log-likelihood estimate β_λ. The factor c(λ) is an approximation of the effective dimension of the model, which in our case depends on the log hazard ratios β, the baseline hazard function H_0(t), and the haplotype frequencies p. For convenience' sake, we approximated the effective dimension in the same manner as Verweij and van Houwelingen [15] as

c(λ) = trace[ H_λ(β_λ)^{-1} Σ_i U_i(β_λ) U_i(β_λ)' ],    (16)

where U_i(β_λ) is the contribution of individual i to the first-order derivative of the unpenalized log-likelihood, and H_λ(β_λ) is the matrix of second-order derivatives of the penalized log-likelihood, evaluated at β_λ, H_0(t), and p. Notice that the factor Σ_i U_i(β_λ) U_i(β_λ)' in (16) is equal to the third term of (14).
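For a model with few parameters the effective dimension c(λ) can be computed directly from the per-individual score contributions and the penalized Hessian. A two-parameter Python sketch with hypothetical inputs (an explicit 2x2 inverse keeps it dependency-free):

```python
def effective_dimension(scores, hessian_pen):
    """c(lambda) = trace(H_lambda^{-1} * sum_i U_i U_i') for a
    two-parameter model; scores is a list of per-individual score
    vectors U_i, hessian_pen the 2x2 penalized Hessian."""
    # Sum of outer products of the score contributions.
    S = [[sum(u[a] * u[b] for u in scores) for b in range(2)] for a in range(2)]
    # Explicit 2x2 inverse of the penalized Hessian.
    (a, b), (c, d) = hessian_pen
    det = a * d - b * c
    Hinv = [[d / det, -b / det], [-c / det, a / det]]
    # Trace of Hinv @ S.
    return sum(Hinv[i][k] * S[k][i] for i in range(2) for k in range(2))
```

As the penalty parameter grows, the penalized Hessian dominates and c(λ) shrinks, reflecting the reduced effective number of free parameters.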
Although the penalized likelihood estimates of β are somewhat biased, and it is therefore somewhat unclear how to interpret standard errors, we nevertheless assessed the stability of the penalized estimates by a parametric bootstrap procedure. We took 200 bootstrap samples, estimated λ in each sample by optimizing CVL, and derived standard errors from the distribution of the associated penalized estimates of β, p, and H 0 (t). The number of bootstrap samples used was based on results from simulations, in which the number of bootstrap samples varied from 10 to 1000. These data showed that SE estimates were relatively stable after 100 to 200 bootstrap samples.
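The authors' procedure is a parametric bootstrap in which λ is re-estimated by CVL in every sample; purely to illustrate the resample/re-estimate/summarize loop, here is a generic nonparametric sketch for an arbitrary scalar estimator (the `estimator` argument is a hypothetical placeholder, not the authors' fitting routine):

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=200, seed=1):
    """Bootstrap standard error of a scalar estimator: resample
    individuals with replacement, re-estimate, and take the standard
    deviation over the bootstrap replicates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)
```

In the paper the re-estimation step inside the loop is the full penalized fit, including the CVL search for λ, which is why 200 replicates already carry a substantial computational cost.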
The EM algorithm presented in this paper was programmed in MATLAB® 7.0 (The MathWorks, Natick, MA, USA) as well as in a set of R functions, and is freely available upon request from the corresponding author.

Testing
To illustrate some characteristics of our approach, simulations were carried out. In each replicate, a data set of 200 (simulations 1 and 2) or 2000 (simulation 3) individuals was created, in whom 3 loci were measured. We simulated the 8 haplotypes (x_1,..., x_8) to have frequencies p_000 = 0.62, p_001 = 0.05, p_010 = 0.02, p_011 = 0.005, p_100 = 0.02, p_101 = 0.003, p_110 = 0.002 and p_111 = 0.28. Given the haplotypes drawn for a specific individual i, the survival time S was drawn from the exponential distribution with log(intensity) equal to Σ_j β_j x_ij. A censoring time C was drawn independently from a log-normal distribution such that in about 25% of all individuals C < S, in which case the survival time was censored at C. In each replicate, the haplotype effects were estimated using three models: 1) unpenalized (similar to [10] and [11]), 2) ridge penalized and 3) difference penalized. The statistical properties were evaluated using three different measures, namely the mean bias of the parameter estimates, the mean SE, and the coverage probability, which is defined as the probability that the 95% confidence interval of the parameter estimate contains the true value of the parameter.
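The data-generating step of these simulations can be sketched in Python as follows. The log-normal censoring parameters below are ad-hoc values (the authors tuned theirs to give roughly 25% censoring), and the effect sizes correspond to simulation 1:

```python
import math
import random

def draw_individual(freqs, beta, rng):
    """Simulate one individual: two haplotypes drawn from the population
    frequencies, an exponential survival time with log(intensity) equal to
    the sum of the haplotype effects, and independent log-normal censoring."""
    haps = list(freqs)
    weights = [freqs[h] for h in haps]
    h1 = rng.choices(haps, weights)[0]
    h2 = rng.choices(haps, weights)[0]
    lin = beta.get(h1, 0.0) + beta.get(h2, 0.0)   # linear predictor
    t_event = rng.expovariate(math.exp(lin))
    t_cens = rng.lognormvariate(0.5, 1.0)         # ad-hoc censoring scale
    return min(t_event, t_cens), int(t_event <= t_cens)

rng = random.Random(42)
freqs = {"000": 0.62, "001": 0.05, "010": 0.02, "011": 0.005,
         "100": 0.02, "101": 0.003, "110": 0.002, "111": 0.28}
beta = {"010": 0.69, "011": 0.69, "110": 0.69, "111": 0.69}  # RR = 2 at locus 2
data = [draw_individual(freqs, beta, rng) for _ in range(200)]
```

Each element of `data` is an observed time together with an event indicator, i.e. the (t_i, d_i) pairs that enter the likelihood.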
Furthermore, for each haplotype the percentage of replicates which identified the haplotype as being significantly associated with the outcome (i.e., power or Type I error rate) was calculated. The significance level used to calculate the power and the Type I error rate was set to α = 0.05.
In addition, the effect of omitting rare haplotypes (011, 101 and 110) or fixing their effect to the close haplotype 111 on the effect estimates was investigated in the unpenalized models only. For simulation 1 and 2, 500 replicates were carried out, whereas 100 replicates were carried out in simulation 3.
In the first simulation, replicates contained 200 individuals and the haplotypes with a rare allele at locus 2 (010, 011, 110 and 111) were simulated to have a relative risk of 2 compared to the most frequent haplotype 000, corresponding with a regression parameter of β = 0.69. This simulation mimics a setting in which the observed effects are due to a single SNP. The observed genotype counts of a random replicate are given in Table 1. In total, 75 (37.5%) individuals were heterozygous at multiple loci, of which 63 were heterozygous at all three loci. In 80% of the replicates, all eight haplotypes were present. The mean bias, mean SE, coverage probability, and percentage of estimates with p < 0.05 for this simulation, based on the 80% 'complete' replicates, are shown in Table 2. For the less frequent haplotypes (011, 110 and 101), the introduction of the ridge or difference penalty leads to a substantial reduction in the mean SE as compared to the non-penalized model. On the other hand, the mean SE of the more frequent haplotypes increases somewhat in the penalized models. The effect on the power to detect the effects of the haplotypes 010, 011, 110 and 111, as well as on the type I error probabilities of the haplotypes 001, 100 and 101, is less clear-cut.
In the second simulation, replicates also contained 200 individuals and the haplotypes 001 and 101 were simulated to have a relative risk of 3 (β = 1.10) compared to the reference haplotype 000. This simulation mimics a setting in which a particular allele combination at loci 2 and 3 (i.e. ×01) is related to an increased risk. In 85% of the replicates, all eight haplotypes were available. The mean bias, mean SE, coverage probability, and percentage of estimates with p < 0.05 for this simulation, based on the 85% 'complete' replicates, are shown in Table 3. Similar to simulation 1, inclusion of a penalty leads to a large reduction in the mean SE of the less frequent haplotypes. The effect of haplotype 101 is estimated more precisely, but the power to find a significant effect for this haplotype is reduced from 25% in the non-penalized model to a value similar to the type I error probability of the haplotypes without an effect in both penalized models. The effects of introducing a penalty on the mean bias, mean SE, and power to detect a significant effect of the more frequent haplotype 001 are only minor.
For both simulations, removal of the rare haplotypes 011, 110 and 101 from the model leads to a small reduction in the bias of the remaining haplotypes (e.g. from 0.048 to 0.038 and from 0.019 to 0.005 for haplotypes 010 and 111, respectively). Depending on the modeled effects of the omitted haplotypes, the bias in the estimate of haplotype 111 decreases (simulation 1: from 0.019 to 0.006) or increases (simulation 2: from 0.005 to 0.011) a little.
In a third simulation, replicates contained 2000 individuals and the haplotypes 001 and 101 were simulated to have a relative risk of 3 (β = 1.10) compared to the reference haplotype 000. This simulation mimics a setting similar to simulation 2, but due to the larger number of individuals, all 8 haplotypes will be present in all individual replicates. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation are shown in Table 4.

Implementation
In the GENDER study [13], 3146 patients with cardiovascular disease who were treated with percutaneous transluminal coronary angioplasty (PTCA), with or without stents, were followed for at least twelve months for the occurrence of clinical restenosis and revascularization of the vessel that was originally treated with PTCA. Inflammatory processes are involved in such target vessel revascularization (TVR), and the level of inflammatory response is controlled by several genes, among which possibly the IL10 gene. In 2653 patients we determined the variants of four SNPs in this gene (IL10 -592C > A, IL10 -2849G > A, IL10 -1082G > A, and IL10 4251A > G), and evaluated their association with TVR risk. Overall, there were 252 TVR events, and TVR risk was 9% at nine months and 10.5% at twelve months. Rare allele frequencies were 28%, 49%, ... Of the remaining nine haplotypes, there were four major haplotypes with substantial frequencies of 27% (0000), 24% (0001), 21% (0100), and 27% (1110), while the relative frequencies of the remaining five haplotypes varied between 0.46% and 0.02%. The estimated haplotype frequencies and log hazard ratios at the optimal CVL (ridge penalty) are given in Figure 1. The hazard ratios of the four major haplotypes were not different from each other, but of the five rare haplotypes two had decreased (0010 and 0101) and three had increased (0110, 1000, and 1100) hazard ratios, although all had very wide confidence intervals and none were statistically significantly associated with TVR risk. To validate our model we calculated the survival curves of all 23 subgroups with different genotype combinations according to equation (3) and compared these with the Kaplan-Meier curves. Figure 2 shows these curves for the subgroup of 233 patients who were homozygous for the most common allele of IL10 -2849G > A and IL10 4251A > G, and heterozygous for IL10 -1082G > A and IL10 -592C > A.
These two curves are comparable, indicating that our haplotype method produces valid results.

Discussion
In the present study we present a method to model the relation between failure time and (rare) haplotypes in unrelated individuals. The simulations show that the effect estimates of especially the rare haplotypes are closer to their true values when a penalty is introduced into the model. Furthermore, the simulations show that the penalized log-likelihood approach that is used to deal with the unstable estimates of rare haplotypes can indeed shrink the estimates and their 95% confidence intervals to 'acceptable' values. Simulations also show that the cross-validated standard errors of the more common haplotypes can be increased compared to their unpenalized standard errors, due to uncertainty with respect to the penalty parameter λ.
The power to detect a true haplotype effect is, in general, reduced in the penalized models compared to the non-penalized model, the reduction being more pronounced for the less frequent haplotypes. This reduced power is due to the shrinkage of the estimates. With respect to the type I error probabilities, the effect of introducing a penalty depends on the penalty applied and on the true haplotype effects. For the ridge penalized models, the type I error probabilities are similar to those observed in the non-penalized models: the non-penalized estimates of the haplotypes without a modeled effect are already close to the true value of zero, and the nature of the ridge penalty further shrinks these effects towards zero. For the difference penalty, the type I error probability appears to be increased for some haplotypes. This deviation is related to the extent to which the assumption (similar haplotypes, similar effects) is met. In the first simulation (Table 2), the mean bias of haplotypes 001 and 100 is increased in the direction of β > 0. In this scenario, haplotypes 010, 011, 110 and 111 all had a modeled effect, and the difference penalty shrinks the estimates for haplotypes 001 and 100 towards the effects of these haplotypes. In the second simulation, the majority of the haplotypes had no modeled effect, and the effects of the rare haplotypes could be shrunk towards β = 0. In replicates containing 2000 individuals (Table 4), the reduction in SE is still present, but the gain is relatively small compared to the simulation with only 200 individuals per replicate. This conforms to expectations, since rare haplotypes are 'less rare' in larger samples, thus enabling a more precise estimation of their effects even without a penalty. Based on the characteristics of the models displayed in the simulations and the real data, the penalized log-likelihood method mostly serves the purpose of estimating the effects of rare haplotypes more accurately.
The method described in this paper is flexible, allowing for adjustment for (environmental) covariates as well as haplotype-environment interactions. Although we focus on haplotypes consisting of a certain number of biallelic SNPs, the method is also capable of handling loci with more than two alleles. Furthermore, the method can easily be extended to deal with missing genotype data, since this simply increases the number of possible haplotype pairs that are compatible with the observed genotype. The w_iq in our method are calculated under the assumption of Hardy-Weinberg equilibrium. Although we did not check the robustness of the method to violations of this assumption, Lin [10] has shown that his method, which is similar to our unpenalized method, is robust to such violations. As an alternative estimation method we considered the partial likelihood. Unfortunately, estimation of β and H_0(t) is then not separated, and therefore there is no reason to prefer the partial likelihood approach over the EM algorithm outlined in the present manuscript. Compared to the method described by Tregouet et al. [11], the present EM method assumes a piecewise constant hazard, which seems less restrictive than the assumption behind their method using partial likelihood.
We use a penalty function to increase precision of estimates of rare haplotypes. Other strategies for managing unstable estimates of rare haplotypes include excluding the rare haplotypes from the variable list, pooling the rare haplotypes into one category, or pooling the rare haplotypes with common haplotypes that are very similar. The first approach implicitly groups the rare haplotypes with the reference category and the second and third approach lead to pooled categories that are sometimes hard to interpret. Nevertheless, these last two methods seem to increase power [16]. However, the three strategies mentioned above do not result in (individual) effect estimates of rare haplotypes, whereas the penalized models do.

Conclusion
The method presented in this paper can be applied to estimate haplotype effects in cohort studies when haplotype phase is unknown. The joint estimation of haplotype effects and haplotype frequencies, together with the penalty function, provides a good way of estimating effects of rare haplotypes, which is a common problem in these studies.

Figure 1: Frequencies and log TVR hazard ratios with 95% confidence intervals of the IL10 haplotypes.