 Methodology article
 Open Access
 Published:
Estimating effects of rare haplotypes on failure time using a penalized Cox proportional hazards regression model
BMC Genetics volume 9, Article number: 9 (2008)
Abstract
Background
This paper describes a likelihood approach to model the relation between failure time and haplotypes in studies with unrelated individuals where haplotype phase is unknown, while dealing with the problem of unstable estimates due to rare haplotypes by considering a penalized loglikelihood.
Results
The Cox model presented here incorporates the uncertainty related to the unknown phase of multiple heterozygous individuals as weights. Estimation is performed with an EM algorithm. In the Estep the weights are estimated, and in the Mstep the parameter estimates are estimated by maximizing the expectation of the joint loglikelihood, and the baseline hazard function and haplotype frequencies are calculated. These steps are iterated until the parameter estimates converge. Two penalty functions are considered, namely the ridge penalty and a difference penalty, which is based on the assumption that similar haplotypes show similar effects.
Simulations were conducted to investigate properties of the method, and the association between IL10 haplotypes and risk of target vessel revascularization was investigated in 2653 patients from the GENDER study.
Conclusion
Results from simulations and real data show that the penalized loglikelihood approach produces valid results, indicating that this method is of interest when studying the association between rare haplotypes and failure time in studies of unrelated individuals.
Background
In recent years there has been a great interest in associating haplotypes with complex disease phenotypes, and many statistical models have been described. These models are complicated by individuals that are heterozygous on two or more of these SNPs, because their haplotypes cannot be determined with certainty. Consider for instance two SNPs with alleles A or a, and B or b. Individuals that are heterozygous for both SNPs, have genotypes Aa and Bb, and they inherited either haplotype AB from one parent and ab from the other, or they inherited haplotypes Ab and aB. Hence, it is unknown whether these individuals have haplotype pair AB/ab or haplotype pair Ab/aB. This uncertainty complicates statistical inference and if the number of biallelic single nucleotide polymorphisms (SNPs) is large, or when allele frequencies of the SNPs are close to 50%, many individuals will be multiple heterozygous.
Most of the current methods focus on continuous or dichotomous outcome data [1–9], while only few can be applied in cohort studies [10, 11]. Another concern is related to the presence of rare haplotypes, which is a very common problem in genetic association studies. In the present paper we adopt the suggestion of Tanck et al. [12] to use a weighted penalized likelihood method to estimate the association between a phenotype and the set of haplotypes, which may include rare haplotypes. We consider a model for the relation between a failure time T measured in N unrelated individuals and the haplotypes of these individuals formed by m SNPs measured in a single gene. Previously, Lin [10] has described a similar method for haplotype analysis in cohort studies, but this method did not include a penalty function for dealing with the unstable estimates of rare haplotypes.
In the sequel we will first describe the kind of data that we analyze, then we will described our statistical model, and the algorithm to estimate the parameters of our model. We will use simulated data to illustrate some characteristics of the estimators, and finally we will analyze real data from the GENDER study on cardiovascular disease [13].
Results
Algorithm
Data and model
We consider a sample of i = 1,..., N unrelated individuals with failure or censoredtime t_{ i }. The indicator d_{ i }is used to indicate whether t_{ i }is an eventtime (d_{ i }= 1), or a censoredtime (d_{ i }= 0). Let g_{ i }be a vector of m biallelic SNPs measured in individual i. With m biallelic SNPs there are j = 1,.., Nhap = 2^{m}different haplotypes possible with population frequencies p_{1},..., p_{ j },... p_{ Nhap }.
Suppose all haplotypes were observed in all patients, then these could be represented with the vector x_{ i }of length Nhap, where x_{ ij }equals 0, 1, or 2, depending on the number of haplotypes of type j observed in patient i. (Notice that ∑_{ j }x_{ ij }= 2, meaning that only Nhap  1 contrasts are identifiable.) The conditional hazard function for failure at t_{ i }given x_{ i }can then be specified as
ln(h(t_{ i }x_{ i })) = ln(h_{0}(t_{ i })) + β'x_{ i }, (1)
where h_{0}(t) is an unspecified baseline hazard function, and β a vector of regression parameters. The survival function S(t_{ i }x_{ i }) equals
where H_{0}(t) is the unspecified baseline cumulative hazard function.
In our case, haplotypes were not observed in all patients, and therefore we model the survival function conditional on the observed genotypes as a weighted mixture over all z_{ i }possible haplotype pairs given genotypes g_{ i }:
where S_{ q }(t_{ i }x_{ iq }) is the survival function specified as in equation (2), and x_{ iq }is the q^{th}haplotype pair that is possible given genotypes g_{ i }, and w_{ iq }is the probability that individual i has haplotype pair q given genotypes g_{ i }: w_{ iq }= Pr(X_{ i }= x_{ iq }g_{ i }).
In most circumstances the population haplotype frequencies p_{1},..., p_{ Nhap }are unknown, and must be estimated from the data at hand. But suppose these are known, then under the assumption of HardyWeinberg equilibrium the probability that an individual is carrying haplotype pair q = (h, r) is calculated as ph * pr (h, r = 1,..., Nhap). Consequently, when observing genotype vector g_{ i }the probability that individual i has haplotype pair q = (h, r) equals:
where the summation takes place over all haplotype pairs that are compatible with genotypes g_{ i }and d_{ hri }is an indicator function, which = 1 when the haplotype pair (h, r) is compatible with g_{ i }and 0 otherwise.
Given the weights w_{ iq }the full likelihood of the data given the model in equation (3) equals
and the regression parameters β, and the baseline cumulative hazard function H_{0}(t) may be estimated by maximizing (5). If H_{0}(t) is a parametric function, maximization of (5) is uncomplicated. If H_{0}(t) is a nonparametric function, we follow a similar argument as Breslow [14]. If events occurred at times τ_{1},..., τ_{ j },..., τ_{ k }, we shift all censored observations in the interval [τ_{j1}, τ_{ j }) to τ_{j1}, and assume (piecewise) constant hazard in the intervals. In that case the logarithm of the likelihood (5) reduces to
where λ_{ j }is the jump in the cumulative baseline hazard function at time τ_{ j }, and I(τ_{ i }> τ_{ j }) is an indicator function.
Estimation equations for H_{0}(t) and β can be derived by equating to zero the firstorder derivatives of the loglikelihood in (6). For λ_{ j }we find
and for β_{ℓ} (ℓ = 1,..., Nhap):
where
Notice that if z_{ i }= 1 for all i, thus when all haplotypes were observed, then ${\tilde{w}}_{iq{t}_{i}}$ = 1, and equation (7) reduces to the usual Breslow estimator of H_{0}(t), and equation (8) to the usual Cox estimator of β.
Unfortunately, the weights ${\tilde{w}}_{iq{t}_{i}}$ depend on both H_{0}(t) and β.
Estimation algorithm
Since in practice the population haplotype frequencies p must be estimated together with β and H_{0}(t), and because ${\tilde{w}}_{iq{t}_{i}}$ depends on H_{0}(t) and β, direct maximization of (5) or (6) is complicated. Instead we propose to use an EM algorithm. For that we consider the covariate vector x_{ i }as missing, and we maximize the expectation of the joint loglikelihood of L((t_{ i }, d_{ i }), x_{ i }g_{ i }) over the posterior probabilities of x_{ i }given the observed genotypes g_{ i }, and given the observed failure/censoring times (t_{ i }, d_{ i }) and given current estimates of the parameters β, H_{0}(t), and w_{ iq }:
where ${\tilde{w}}_{iq{t}_{i}}$ is the posterior probability that X_{ i }= x_{ iq }given ((t_{ i }, d_{ i }), g_{ i }, β^{(a)}, H_{0}(t)^{(a)}, ${w}_{iq}^{(a)}$:
and
The EM algorithm consists then of iterating two steps. In the Mstep of iteration a + 1, (β, H_{0}(t), w_{ iq }= p_{ j }p_{ l }/Σp_{ r }p_{ s }) are estimated by maximizing (10) given ${\tilde{w}}_{iq{t}_{i}}$ evaluated using (β^{(a)}, H_{0}(t)^{(a)}, ${w}_{iq}^{(a)}$), and in the Estep, ${\tilde{w}}_{iq{t}_{i}}$ is reestimated given using (11) with (β^{(a+1)}, H_{0}(t)^{(a+1)}, ${w}_{iq}^{(a+1)}$). Estimation equations of β and H_{0}(t) are the same as in (8) and (7), but $\tilde{w}$ replaced with $\widehat{w}$, and these are solved iteratively. The haplotype relative frequencies p_{ j }are estimated as
where I(j, q) is an indicatorfunction denoting whether haplotype j is part of haplotypecombination q. Standard errors of β can be derived from the informationmatrix of the loglikelihood in equation (6):
Since we are mainly interested in the uncertainty of β, we only used that part of the hessian that pertains to β. Notice that the first term of (14) equals the hessianmatrix that is used in the Mstep of the EM algorithm.
Penalized loglikelihood
Mutations in general tend to be rare, and so are the haplotypes in which they are encompassed. Furthermore, when ten loci are considered there are 2^{10} = 1024 different haplotypes possible, many of which will have low frequency in samples up to thousands of individuals. If haplotypes have low frequency their associated hazard ratio estimate will be unstable. We used a penalized loglikelihood method to obtain more stable parameter estimates. Basically, we optimized the penalized loglikelihood ℓ^{p}, defined as ${\ell}^{p}=ln{L}^{Breslow}\frac{\lambda}{2}Pen(\beta )$, where Pen(β) is the penalty function.
As penalty functions we considered the wellknown ridgepenalty function $(Pen(\beta )={\displaystyle {\sum}_{a}{\beta}_{a}^{2}})$, and a differencepenalty function (Pen(β) = ∑_{ a }∑_{ b }a_{ ab }(β_{ a } β_{ b })^{2} (a > b)), where a_{ ab }is a fixed and known value representing the similarity of haplotypes a and b. We quantified the similarity between haplotypes (a_{ ab }) by counting the number of shared alleles which – with m loci – varies between zero and m  1.
The penalty parameter λ was found by optimizing the crossvalidated loglikelihood (CVL) as described by Verweij and van Houwelingen [15]: CVL(λ) = lnL^{Breslow}(β^{λ})  c(λ), where lnL^{Breslow}(β^{λ}) is the loglikelihood evaluated with the penalized loglikelihood estimate β^{λ}. The factor c(λ) is an approximation of the effective dimension of the model, which in our case depends on the log hazard ratios β, the baseline hazard function H_{0}(t), and the haplotype frequencies p. For convenience sake, we approximated the effective dimension in the same manner as Verweij and van Houwelingen [15] as
where U_{ i }(β^{λ}) is the contribution of individual i to the firstorder derivative of the unpenalized loglikelihood, and H^{λ}(β^{λ}) is the matrix of secondorder derivatives of the penalized loglikelihood evaluated at β^{λ}, H_{0}(t), and p. Notice that the last term of (15) is equal to the third term of (14).
Although the penalized likelihood estimates of β are somewhat biased, and it is therefore somewhat unclear how to interpret standard errors, we nevertheless assessed the stability of the penalized estimates by a parametric bootstrap procedure. We took 200 bootstrap samples, estimated λ in each sample by optimizing CVL, and derived standard errors from the distribution of the associated penalized estimates of β, p, and H_{0}(t). The number of bootstrap samples used was based on results from simulations, in which the number of bootstrap samples varied from 10 to 1000. These data showed that SE estimates were relatively stable after 100 to 200 bootstrap samples.
The EM algorithm presented in this paper was programmed in MATLAB^{®} R 7.0 (The MathWorks, Natick, MA, USA) as well as in a set of Rfunctions and is freely available upon request from the corresponding author.
Testing
To illustrate some characteristics of our approach, simulations were carried out. In each replicate, a data set of 200 (simulation 1 and 2) or 2000 (simulation 3) individuals was created in whom 3 loci were measured. We simulated the 8 haplotypes (x_{1},..., x_{8}) to have frequencies of: p_{000} = 0.62, p_{001} = 0.05, p_{010} = 0.02, p_{011} = 0.005, p_{100} = 0.02, p_{101} = 0.003, p_{110} = 0.002 and p_{111} = 0.28. Given the haplotypes drawn for a specific individual i, the survival time S was drawn from the exponential distribution with log(intensity) equal to ∑_{ j }β_{ j }x_{ ij }. A censoring time C was independently drawn from an independent lognormal distribution such that in about 25% of all individuals C <S, in which case the survival time was censored at C. In each replicate, the haplotype effects were estimated using three models: 1) unpenalized (similar to [10] and [11]), 2) ridge penalized and 3) difference penalized. The statistical properties were evaluated using three different measures, namely the mean bias of the parameter estimates, the mean SE and the coverage probability, which is defined as the probability that the 95% confidence interval of the parameter estimate contains the true theoretical value of the parameter estimate.
Furthermore, for each haplotype the percentage of replicates which identified the haplotype as being significantly associated with the outcome (i.e., power or Type I error rate) was calculated. The significance level used to calculate the power and the Type I error rate was set to α = 0.05. In addition, the effect of omitting rare haplotypes (011, 101 and 110) or fixing their effect to the close haplotype 111 on the effect estimates was investigated in the unpenalized models only. For simulation 1 and 2, 500 replicates were carried out, whereas 100 replicates were carried out in simulation 3.
In the first simulation, replicates contained 200 individuals and the haplotypes with a rare allele at locus 2 (010, 011, 110 and 111) were simulated to have a relative risk of 2 compared to the most frequent haplotype 000, corresponding with a regression parameter of β = 0.69. This simulation mimics a setting in which the observed effects are due to a single SNP. The observed genotype counts of a random replicate are given in Table 1. In total, 75 (37.5%) individuals were heterozygous on multiple loci of which 63 were heterozygous at all three loci. In 80% of the replicates, all eight haplotypes were present. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation based on the 80% 'complete' replicates are shown in Table 2. For the less frequent haplotypes (011, 110 and 101), the introduction of the ridge or difference penalty leads to a substantial reduction in the mean SE as compared to the nonpenalized model. On the other hand, the mean SE of the more frequent haplotypes increases somewhat in the penalized models. The effect on the power to detect the effects of the haplotypes 010, 011, 110 and 111 as well as on the type I error probabilities of the haplotypes 001, 100 and 101 is less univocal. In the second simulation, replicates also contained 200 individuals and the haplotypes 001 and 101 were simulated to have a relative risk of 3 (β = 1.10) compared to the reference haplotype 000. This simulation mimics a setting in which a particular allele combination on loci 2 and 3 (i.e. ×01) is related to an increased risk. In 85% of the replicates, all eight haplotypes were available. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation based on the 85% 'complete' replicates are shown in Table 3. Similar to simulation 1, inclusion of a penalty leads to a large reduction in mean SE of the less frequent haplotypes. The effect of haplotype 101 is estimated more precisely, but the power to find a significant effect for this haplotype is reduced from 25% in the nonpenalized model to a value similar to the type I error probability of the haplotypes without an effect in both penalized models. The effect of introducing a penalty on the mean bias, mean SE and power to detect an significant effect of the more frequent haplotype 001 are only minor.
For both simulations, removal of the rare haplotypes 011, 110 and 101 from the model leads to a small reduction in the bias of the remaining haplotypes (e.g. from 0.048 to 0.038 and from 0.019 to 0.005 for haplotypes 010 and 111, respectively). Depending on the modeled effects of the omitted haplotypes, the bias in the estimate of haplotype 111 decreases (simulation 1: from 0.019 to 0.006) or increases (simulation 2: from 0.005 to 0.011) a little.
In a third simulation, replicates contained 2000 individuals and the haplotypes 001 and 101 were simulated to have a relative risk of 3 (β = 1.10) compared to the reference haplotype 000. This simulation mimics a setting similar to simulation 2, but due to the larger number of individuals, all 8 haplotypes will be present in all individual replicates. The mean bias, mean SE, coverage probability and percentage estimates with a p < 0.05 for this simulation are shown in Table 4.
Implementation
In the GENDER study [13] 3146 patients with cardiovascular disease who were treated with percutaneous transluminal coronary angioplasty (PTCA) with or without stents were followed for at least twelve months for the occurrence of clinical restenosis and revascularization of the vessel which was originally treated with PTCA. Inflammatory processes are involved in such target vessel revascularization (TVR), and the level of inflammatory response is controlled by several genes, among which possibly the IL10 gene. We determined in 2653 patients the variants of four SNPs in this gene (IL10 592G > T, IL10 2849G > A, IL10 1082G > A, and IL10 4251A > G), and evaluated their association with TVR risk. Overall, there were 252 TVR, and TVR risk was 9% at nine months, and 10.5% at twelve months. Rare allele frequencies were 28%, 49%, 27%, and 24%, of the four SNPs, respectively. All four markers were in linkage disequilibrium (P < 0.001), and HardyWeinberg equilibrium was not rejected for any of the markers (P > 0.0125; significance level (Bonferroni) corrected for multiple testing).
Univariately, in a Cox model assuming codominant effects, IL10 2849G > A, IL10 1082G > A, and IL10 4251A > G were significantly associated with TVR risk with hazard ratios (HR) 1.21 (95% CI: 1.00–1.46), 1.20 (1.01–1.42), 1.20 (0.99–1.45), respectively. HR of IL10 592C > A was 0.87 (0.70–1.07). In a Cox model with all SNPs, and all twoway interactions, we found significant interactions between SNPs IL10 2849G > A, and IL10 1082G > A (P = 0.003), and IL10 1082G > A, and IL10 4251A > G (P = 0.001). Higherorder interactions were not significant. The prognostic index of this model (Xβ) varied between 2.6 and +1.6, and had 23 different values corresponding to the 23 different genotype combinations that were observed.
With 4 biallelic SNPs, 16 different haplotypes are possible, but seven had zero frequency in the current sample. Of the remaining nine haplotypes, there were four major haplotypes with substantial frequencies 27% (0000), 24% (0001), 21% (0100), and 27% (1110), while the relative frequencies of the remaining five haplotypes varied between 0.46% and 0.02%. The estimated haplotype frequencies and log hazard ratios at the optimal CVL (ridge penalty) are given in Figure 1. The hazard ratios of the four major haplotypes were not different from each other, but of the five rare haplotypes two had decreased (0010, and 0101), and three had increased (0110, 1000, and 1100) hazard ratios, although all had very wide confidence intervals, and none were statistically significantly associated with TVR risk.
To validate our model we calculated the survival curves of all 23 subgroups with different genotype combinations according to equation (3) and compared these with the KaplanMeier curves. These curves are given in Figure 2 for the subgroup of 233 patients who were homozygous for the most common allele of IL10 2849G > A and IL10 4251A > G, and heterozygous for IL10 1082G > A and IL10 592C > A (Figure 2). These two curves are comparable, indicating that our haplotype method produces valid results.
Discussion
In the present study we present a method to model the relation between failure time and (rare) haplotypes in unrelated individuals. The simulations presented in this study show that haplotype effects of especially the rare haplotypes are closer to the true estimates when a penalty is introduced into the model.
Furthermore, the simulations show that the penalized loglikelihood approach that is used to deal with the unstable estimates of rare haplotypes can indeed shrink the estimates and their 95% confidence intervals to 'acceptable' values. Simulations also show that the crossvalidated standard errors of the more common haplotypes can be increased compared to their unpenalized standard errors due to uncertainty with respect to the penalty parameter λ. The power to detect a true haplotype effect is, in general, reduced in the penalized models compared to the nonpenalized model, the reduction being more pronounced for the less frequent haplotypes. This reduced power is due to the shrinkage of the estimates. With respect to the type I error probabilities, the effect of introducing a penalty depends on the penalty applied and the true haplotype effects. For the ridge penalized models, the type I error probabilities are similar to those observed in the nonpenalized models. The nonpenalized estimates of the haplotypes without a modeled effect are already close to the true value of zero and the nature of the ridge penalty further shrinks the effects towards zero. For the difference penalty, the type I error probability appears to be increased for some haplotypes. This deviation is related to the extent that the assumption (similar haplotypes, similar effects) is met. In the first simulation (Table 2), the mean bias of haplotypes 001 and 100 are increased in the direction of a β > 0. In this scenario, haplotypes 010, 011, 110 and 111 all had a modeled effect and the difference penalty results in estimates for the haplotypes 001 and 100 towards the effects of these haplotypes. In the second simulation, the majority of the haplotypes had no modeled effect and the effects of the rare haplotypes could be directed towards β = 0. In replicates containing 2000 individuals (Table 4), the reduction in SE is still present, but the gain is relatively small compared to simulation with only 200 individuals per replicate. This is conform expectations, since rare haplotypes are 'less rare' in larger samples, thus enabling a more precise estimation of their effect even without a penalty. Based on the characteristics of the models displayed in the simulations and the real data, the penalized loglikelihood method mostly serves the purpose of estimating the effects of rare haplotypes more accurate.
The method described in this paper is a flexible method allowing for adjustment for (environmental) covariates as well as haplotypeenvironment interactions. Although we focus on haplotypes consisting of a certain number of biallelic SNPs, the method is also capable to handle loci with more than two alleles. Furthermore, the method can be easily extended to deal with missing genotype data, since this will simply increase the number of possible haplotype pairs that are compatible with the observed genotype. The wiq in our method are calculated under the assumption of HardyWeinberg. Although we did not check robustness of the method to violations of this assumption, Lin [10] has shown that his method, which is similar to our unpenalized method, is robust to violations of the HardyWeinberg assumption.
As an alternative estimation method we considered the partial likelihood. Unfortunately, estimation of β and H_{0}(t) are not separated, and therefore there is no reason to prefer this partial likelihood approach over the EM algorithm outlined in the present manuscript. Compared to the method described by Tregouet et al [11], the present EM method assumes piecewise constant hazard, which seems less restrictive than the assumption behind the their method using partial likelihood.
We use a penalty function to increase precision of estimates of rare haplotypes. Other strategies for managing unstable estimates of rare haplotypes include excluding the rare haplotypes from the variable list, pooling the rare haplotypes into one category, or pooling the rare haplotypes with common haplotypes that are very similar. The first approach implicitly groups the rare haplotypes with the reference category and the second and third approach lead to pooled categories that are sometimes hard to interpret. Nevertheless, these last two methods seem to increase power [16]. However, the three strategies mentioned above do not result in (individual) effect estimates of rare haplotypes, whereas the penalized models do.
Conclusion
The method presented in this paper can be applied to estimate haplotype effects in cohort studies when haplotype phase is unknown. The joint estimation of haplotype effects and haplotype frequencies together with the penalty function provides a good way of estimating effects of rare haplotypes, which is a common problem in these studies.
Abbreviations
 SNP(s):

single nucleotide polymorphism(s)
 SE:

standard error
 PTCA:

percutaneous transluminal coronary angioplasty
 TVR:

target vessel revascularization
 IL10:

interleukin 10.
References
 1.
Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP: Linkage disequilibrium mapping via cladistic analysis of singlenucleotide polymorphism haplotypes. Am J Hum Genet. 2004, 75: 3543. 10.1086/422174.
 2.
Epstein MP, Satten GA: Inference on haplotype effects in casecontrol studies using unphased genotype data. Am J Hum Genet. 2003, 73: 13161329. 10.1086/380204.
 3.
Satten GA, Epstein MP: Comparison of prospective and retrospective methods for haplotype inference in casecontrol studies. Genet Epidemiol. 2004, 27: 192201. 10.1002/gepi.20020.
 4.
Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA: Score tests for association between traits and haplotypes when linkage phase is ambiguous. Am J Hum Genet. 2002, 70: 425434. 10.1086/338688.
 5.
Seltman H, Roeder K, Devlin B: Evolutionarybased association analysis using haplotype data. Genet Epidemiol. 2003, 25: 4858. 10.1002/gepi.10246.
 6.
Sham PC, Rijsdijk FV, Knight J, Makoff A, North B, Curtis D: Haplotype association analysis of discrete and continuous traits using mixture of regression models. Behav Genet. 2004, 34: 207214. 10.1023/B:BEGE.0000013734.39266.a3.
 7.
Stram DO, Leigh PC, Bretsky P, Freedman M, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, Thomas DC: Modeling and EM estimation of haplotypespecific relative risks from genotype data for a casecontrol study of unrelated individuals. Hum Hered. 2003, 55: 179190. 10.1159/000073202.
 8.
Zaykin DV, Westfall PH, Young SS, Karnoub MA, Wagner MJ, Ehm MG: Testing association of statistically inferred haplotypes with discrete and continuous traits in samples of unrelated individuals. Hum Hered. 2002, 53: 7991. 10.1159/000057986.
 9.
Zhao LP, Li SS, Khalid N: A method for the assessment of disease associations with singlenucleotide polymorphism haplotypes and environmental variables in casecontrol studies. Am J Hum Genet. 2003, 72: 12311250. 10.1086/375140.
 10.
Lin DY: Haplotypebased association analysis in cohort studies of unrelated individuals. Genet Epidemiol. 2004, 26: 255264. 10.1002/gepi.10317.
 11.
Tregouet DA, Tiret L: Cox proportional hazards survival regression in haplotypebased association analysis using the StochasticEM algorithm. Eur J Hum Genet. 2004, 12: 971974. 10.1038/sj.ejhg.5201238.
 12.
Tanck MW, Klerkx AH, Jukema JW, Knijff PD, Kastelein JJ, Zwinderman AH: Estimation of multilocus haplotype effects using weighted penalised loglikelihood: analysis of five sequence variations at the cholesteryl ester transfer protein gene locus. Ann Hum Genet. 2003, 67: 175184. 10.1046/j.14691809.2003.00021.x.
 13.
Agema WR, Monraats PS, Zwinderman AH, de Winter RJ, Tio RA, Doevendans PA, Waltenberger J, de Maat MP, Frants RR, Atsma DE, van der Laarse A, van der Wall EE, Jukema JW: Current PTCA practice and clinical outcomes in The Netherlands: the real world in the predrugeluting stent era. Eur Heart J. 2004, 25: 11631170. 10.1016/j.ehj.2004.05.006.
 14.
Augustin T: An exact corrected loglikelihood function for Cox's proportional hazards model under measurement error and some extensions. Scandinavian Journal of Statistics. 2004, 31: 4350. 10.1111/j.14679469.2004.00371.x.
 15.
Verweij PJM, van Houwelingen HC: Penalized likelihood in Cox regression. Stat Med. 1994, 13: 24272436. 10.1002/sim.4780132307.
 16.
Jannot AS, Essioux L, ClergetDarpoux F: Association in multifactorial traits: how to deal with rare observations?. Hum Hered. 2004, 58: 7381. 10.1159/000083028.
Acknowledgements
M.W.T. Tanck was financially supported by the Netherlands Heart Foundation grant no 2000.125.
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
OWS participated in programming the method, performed the statistical analysis of the data and drafted the manuscript. AHZ was responsible for the theoretical background of the method, participated in the programming of the method and the analysis of the data and helped to draft the manuscript. JWJ was responsible for the collection of the data and critically evaluated the manuscript. MWTT participated in the development of the method, performed the simulations and helped to draft the manuscript. All authors read and approved the final manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Souverein, O.W., Zwinderman, A.H., Jukema, J.W. et al. Estimating effects of rare haplotypes on failure time using a penalized Cox proportional hazards regression model. BMC Genet 9, 9 (2008). https://doi.org/10.1186/1471215699
Received:
Accepted:
Published:
Keywords
 Penalty Function
 Coverage Probability
 Target Vessel Revascularization
 Partial Likelihood
 Baseline Hazard Function