Statistically efficient association analysis of quantitative traits with haplotypes and untyped SNPs in family studies

Background Associations between haplotypes and quantitative traits provide valuable information about the genetic basis of complex human diseases. Haplotypes also provide an effective way to deal with untyped SNPs. Two major challenges arise in haplotype-based association analysis of family data. First, haplotypes may not be inferred with certainty from genotype data. Second, the trait values within a family tend to be correlated because of common genetic and environmental factors. Results To address these challenges, we present an efficient likelihood-based approach to analyzing associations of quantitative traits with haplotypes or untyped SNPs. This approach properly accounts for within-family trait correlations and can handle general pedigrees with arbitrary patterns of missing genotypes. We characterize the genetic effects on the quantitative trait by a linear regression model with random effects and develop efficient likelihood-based inference procedures. Extensive simulation studies are conducted to examine the performance of the proposed methods. An application to family data from the Childhood Asthma Management Program Ancillary Genetic Study is provided. A computer program is freely available. Conclusions Results from extensive simulation studies show that the proposed methods for testing the haplotype effects on quantitative traits have correct type I error rates and are more powerful than some existing methods.


Background
With the advances in high-throughput genotyping technologies and the availability of dense SNP maps across the human genome [1], haplotype-based association analysis plays an increasingly important role in mapping genes that influence complex human diseases. Haplotypes, which are specific combinations of alleles at several tightly linked SNPs on a chromosome, incorporate the linkage disequilibrium information and pertain to the functional properties of proteins through the amino acids sequences.
Family studies are more attractive than populationbased studies because family data reduce the ambiguity of haplotypes and are less prone to spurious associations caused by population admixture and stratification. Several methods have been developed to estimate haplotype frequencies or infer individual haplotypes from unphased genotype data for general pedigrees, including HAPLORE [23], GENEHUNTER [24], PedPhase [25], and MERLIN [26]. Zhang and Zhao [27] demonstrated through simulation studies that HAPLORE and MERLIN had comparable performance and outperformed the other two methods.
Several methods have been developed for the haplotype association analysis in family studies. Horvath et al. [28] extended the method of Rabinowitz and Laird [29] to multiple markers and proposed a haplotype version of family-based association tests (FBAT). The haplotype FBAT estimates the haplotype frequencies by the expectation-maximization (EM) algorithm [30] under Hardy-Weinberg equilibrium. A score statistic is then constructed in the same manner as the original FBAT except that the genotype score is coded as a weighted sum of haplotype scores, the weight being the conditional probability of a particular haplotype configuration given that it is compatible with the unphased genotype. The haplotype FBAT is computationally simple and can provide either haplotype-specific tests or multi-haplotype tests. However, this method is limited to nuclear families without covariates, does not account for within-family trait correlations, and does not estimate genetic effects. Furthermore, it discards the parental phenotype information and thus may cause substantial loss of power. Dudbridge [31] proposed a retrospective likelihood approach for the association analysis for nuclear families and unrelated subjects with missing genotype data. The retrospective likelihood is based on the probability of observing the parental and offspring genotypes, given the trait values of the all the children in a nuclear family. Other related work includes the family-based association test for dichotomous traits [32], the extension of haplotype FBAT to multiple phenotypes [33], and a Bayesian regression method [34].
Missing genotype data are inevitable in genetic association studies. For example, some study subjects may have missing genotypes at certain SNP loci due to assay failures. Another form of missing data arises when the investigators are interested in untyped SNPs, i.e., the SNPs that are not on the genotyping platform used in the study and thus missing on all study subjects. Haplotypes provide an effective way of inferring the missing genotypes at a particular SNP from the observed genotypes of neighboring SNPs.
Lin et al. [35] developed efficient likelihood-based methods to deal with missing genotype data in case-control studies. For family studies, Burdick et al. [36] and Chen and Abecasis [37] imputed the missing genotype values by their expected values via the Elston-Steward or Lander-Green algorithm. Both methods require that at least some members of a family have non-missing genotype data so that they can be used to estimate the conditional distribution of the missing genotypes for other members of the family. These methods cannot be used when the genotype data of the entire family are missing and thus cannot handle untyped SNPs.
In this paper, we present an efficient likelihood-based approach to studying the associations between haplotypes and quantitative traits. This approach estimates the haplotype frequencies and the haplotype effects on the quantitative trait simultaneously. It is very efficient in dealing with missing genotype data. In addition, it allows departures from Hardy-Weinberg equilibrium, accounts for within-family trait correlations, and accommodates general pedigrees with arbitrary patterns of missing data. We characterize the effects of haplotypes on the quantitative trait by a linear regression model with random effects and derive the corresponding likelihood function. We develop efficient likelihood-based estimation and testing procedures. Extensive simulation studies show that the new methods perform well in realistic scenarios. An application to family data from the Childhood Asthma Management Program (CAMP) Ancillary Genetic Study [38] is provided.

Simulation studies
We conducted extensive simulation studies to assess the performance of the new methods in realistic settings. We simulated SNP genotypes according to the haplotype distribution observed in the CEU sample of the HapMap. The inbreeding coefficient ρ was set to 0.02. We generated the quantitative trait values from model (2) in the Methods section with a potentially causal haplotype or SNP. For each scenario, we generated 10,000 data sets, each of which contains 100 nuclear families with two parents and two children.
In the first set of simulation studies, we evaluated the performance of the new method for haplotype association analysis. We were particularly interested in SNPs 20-24 on chromosome 18 of the CEU sample in the HapMap genomewide data. This set of SNPs was previously considered by Lin et al. [35]. The LD among the 5 SNPs is not particularly strong. The five most common haplotypes are 00000, 00011, 00100, 01000, and 01011, with frequencies 0.1431, 0.1312, 0.1941, 0.1756, and 0.1579, respectively.
We generated trait values from the additive model where the target haplotype h * is 00100 and the environmental variable X ij is a Bernoulli random variable with 0.3 success probability. The parameters β 1 , β 2 , and β 3 correspond to the effect of the target haplotype, the effect of the environmental variable, and the haplotype-environment interaction, respectively. We set α, σ 2 g , and σ 2 e to 1, 0.5, and 1.0, respectively. For making inference on β 1 , we set β 3 = 0.4 and varied β 1 from 0 to 0.4; for making inference on β 3 , we set β 1 = 0.4 and varied β 3 from 0 to 0.4. For each setting, we considered both the situation of no missing genotypes and the situation with 10% randomly missing genotypes.
Tables 1 and 2 summarize the results for estimating the haplotype effect and haplotype-environment interaction, respectively, while Table 3 presents the results for estimating the haplotype frequencies under β 1 = β 3 = 0.4. The estimators of the haplotype effect and haplotypeenvironment interaction are virtually unbiased, and so are the estimators of the haplotype frequencies. The variance estimators accurately reflect the true variations of the parameter estimators, and the confidence intervals have correct coverage probabilities. The results of the haplotype frequencies estimates are similar to those obtained from HAPLORE. However, our main objective is to conduct haplotype association analysis and the proposed full information maximum likelihood approach typically yields statistically efficient parameter estimators by the parametric likelihood theory.
We also compared the new method to the Haplotype FBAT. Since the latter cannot handle covariates, we set   (1). Figures 1 and 2 display the type I error and power of the association tests for the haplotype effect at the nominal significance level of 0.01 without missing data and with 10% missing data, respectively. The new method has the correct type I error and is more powerful than the Haplotype FBAT. The power differences are particularly strong when parental phenotype data are available. The power gain of the new method over the Haplotype FBAT is expected to be even more substantial in the presence of covariate effects. Without missing data, the new method has almost the same power as the ideal case of known haplotypes. The loss of power for the new method caused by missing genotypes is rather moderate, even when there is substantial missingness. These results suggest that the new method can effectively infer the haplotype configuration and is efficient in dealing with missing genotype data. We next studied the problem of missing genotype data. We considered the same model as before but set SNP 20 to be causal with an additive effect. We let the genotypes of the 5 SNPs be missing independently with a 10% missing  Figure 3 displays the type I error of the association tests at SNP 21, which is null, and the power of the association tests at SNP 20, which is causal. The new method provides accurate control of the type I error. The improvement of the new method over the complete-case analysis is substantial. Compared to the full-data analysis, the new method has little loss of power. We also performed single-SNP analysis by including only the causal SNP in the model and compared the new method to the imputation method by Chen and Abecasis [37]. Figures 4  and 5 show the size/power curves of the association tests at SNP 20 with missing genotype rates of 10% and 20%, respectively. The new method is substantially more powerful than the imputation approach, especially when the missing rate is high. We finally studied the problem of untyped SNPs. We considered the same model as in the above simulation studies with missing genotype data. We set the causal SNP 20 to be untyped and performed single-SNP analysis on SNP 20. In addition to the study sample, we generated a reference panel with 30 or 60 nuclear families. As shown in Fig. 6, the new method has proper type I error and reasonable power compared to the ideal full-data analysis.
The reference panel of 30 families is almost as informative as that of 60 families.

CAMP study
We used the new method to study associations between asthma phenotypes and eight SNPs in the Beta2-Adrenergic Receptor (β 2 AR) with data from the CAMP Ancillary Genetic Study and compared the results to those of the haplotype FBAT. The CAMP study was a clinical trial of asthmatic children (mild to moderate asthma) who were randomized to three different treatments. The CAMP data set consists of 2,011 individuals in 652 nuclear families. Four percent of the genotypes at eight β 2 AR SNPs were missing. Polymorphisms in β 2 AR were found to be associated with several asthma phenotypes in previous studies [39,40]. In this paper, we considered the standardized mean asthma symptom score. Only 573 individuals had non-missing asthma symptom score data, but genotype data for all individuals were used in the model to infer the haplotype configurations. The same data set was previously analyzed in [28].
As shown in Table 4, the new method and the haplotype FBAT identified the same eight haplotypes with frequencies greater than 0.01. The inbreeding coefficient Fig. 3 Type I error and power of association tests at SNP 20, which has a causal additive effect on the phenotype, and SNP 21, which is null, at the 1% nominal significance level when there are 10% missing genotype data. For complete-data analysis, all subjects with missing data are removed. For full-data analysis, the missing genotypes are replaced by their true values Fig. 4 Type I error and power of single-SNP association tests at SNP 20, which has a causal additive effect on the phenotype, at the 1% nominal significance level when there are 10% missing genotype data. For complete-data analysis, all subjects with missing data are removed. For full-data analysis, the missing genotypes are replaced by their true values was estimated at 0.04, with a p-value 0.005. The Haplotype FBAT did not detect any significant associations, whereas the new method detected significant associations of haplotypes "12211211" and "11211211" with the mean asthma symptom score, the p-values being 0.017 and 0.023, respectively. Note that the Haplotype FBAT does not estimate haplotype effects.
The results presented in Table 4 pertain to the comparison of a target haplotype with all other haplotypes. We also performed an overall test by comparing six most frequent haplotypes to all other haplotypes within the same model. The resultant LR was 11.66 with 6 degrees of freedom, the corresponding p-value being 0.07. The chisquare test statistic from the haplotype FBAT was 6.98, with a p-value of 0.32.

Conclusion
Haplotype-based association analysis of quantitative traits in family studies is an important tool to identify genes that influence complex human diseases. The existing methods have severe limitations. In this article, we provide an efficient likelihood-based approach to investigating the associations between haplotypes and quantitative traits. Our approach acknowledges the ambiguities of haplotypes in the association analysis by integrating the construction of haplotypes and the estimation of haplotype effects into a single likelihood framework. In addition, our approach accommodates environmental factors, properly accounts for familiar correlations of trait values, and allows departures from Hardy-Weinberg equilibrium.
The proposed method appears to be more powerful than Haplotype FBAT, which is a conditional test. The haplotype FBAT, however, would be more robust to population stratification than the proposed method. To control for spurious association due to population stratification, one may partition the haplotype effect into between-and within-family components, as in [41,42]. The betweenfamily component accounts for all the spurious association and the within-family component provides a direct measure of the haplotype effect.
In this paper, we use the algorithm described in Zhang et al. [23] to identify the set of all possible haplotype configurations compatible with the observed genotype data. This set can be large for large pedigrees or when the number of SNPs under consideration is large. Although our theory applies to arbitrarily large pedigrees and large number of SNPs, the proposed method is computationally fast and numerically stable when both the pedigrees and the number of SNPs are small. To overcome the computational limitation, one can use the alternative haplotype    [43] proposed a likelihood-based approach to single-marker association analysis of binary traits using data from triads and unrelated subjects. For the haplotype-based association analysis of quantitative traits, our approach is applicable to arbitrary pedigrees and therefore can combine information from families and unrelated individuals in a single combined analysis. In fact, we can treat unrelated individuals as unrelated families but with just one individual in each family. This nice feature allows us to extract all available information and further improve the power of association tests.
Under model (2), the quantitative traits within a family follow a multivariate normal distribution. This model may not be appropriate for non-normal traits or traits with outliers. We are currently developing robust semiparametric variance-components models [44] for studying associations between haplotypes and non-normally distributed quantitative traits.
It is desirable to adjust for the effects of multiple testing when considering several haplotype configurations in the same study, especially in genomewide studies. The Bonferroni correction would be overly conservative and permutation would be computationally intensive. Huang et al. [45] proposed an efficient Monte-Carlo approach to adjusting for multiple testing for the haplotype analysis in case-control studies. It would be worthwhile to extend their approach to family studies.
In a haplotype association analysis, the set of all possible haplotype configurations can be large, even for a moderately large number of SNPs. Consequently, the number of parameters included in the model can be huge if we include all possible haplotype configurations in the phenotype model, and numerical computations may not be stable. We suggest specifying one or a few haplotypes of interest in the association test. We may consider a twostep procedure to identify a set of "risk" haplotypes. In the first step, for each possible haplotype configuration with the frequency above a certain threshold (e.g., 0.01), we fit the proposed model and estimate the haplotype effect. In the second step, we include those haplotypes with significant effects (e.g., with p-values <0.05) in the phenotype model. This procedure is similar to some methods in the variable selection literature. It would be interesting to investigate the properties of such a procedure in the future.
In the presence of missing genotype data, one can carry out the single-SNP analysis by using the imputation method, in particular, the multiple imputation procedure [46]. In contrast, the proposed method is based on the full information maximum likelihood, which tends to have more efficient parameter estimators than the multiple imputation method when the model assumptions are satisfied. Additionally, multiple imputation method requires repeated runs of the model, and special care is needed to estimate the standard errors of the parameter estimators. On the other hand, the multiple imputation method is more flexible than the full information maximum likelihood approach.
One limitation of the proposed method is that it is developed for common variant analysis. With the availability of high-throughput sequencing data, it would be interesting to develop rare variance analysis methods for family studies with missing data. This is a topic for future research.

Methods
Suppose that the study contains n families or general pedigrees, with n i individuals in the ith pedigree. Let Y ij be the quantitative trait of interest and x ij be a set of environmental variables for the jth member of the ith pedigree. Assume that each individual is genotyped at M tightly linked diallelic SNPs. At each SNP locus, the two possible alleles are denoted by 0 and 1. The total number of possible haplotypes is K = 2 M . For example, the possible haplotypes for three SNPs are 000, 001, 010, 011, 100, 101, 110, and 111. For k = 1, · · · , K, let h k denote the kth possible haplotype. The distribution of the diplotype (i.e., the pair of haplotypes on the two homologous chromosomes) is often assumed to satisfy Hardy-Weinberg equilibrium such that π kl = π k π l , k, l = 1, · · · , K, where π kl is the probability that the diplotype H consists of h k and h l , and π k is the population frequency of haplotype h k . We consider the following extension of Hardy-Weinberg equilibrium: where 0 ≤ π k ≤ 1, K k=1 π k = 1, and ρ is the inbreeding coefficient [47, p. 93]. Excessive homozygosity and excessive heterozygosity arise under ρ > 0 and ρ < 0, respectively. The special case of ρ = 0 corresponds to Hardy-Weinberg equilibrium.
For i = 1, . . . , n and j = 1, . . . , n i , let H ij ≡ (H ij1 , H ij2 ) denote the diplotype of the jth member of the ith family, and G ij denote the corresponding multi-locus genotype. Note that G ij codes the number of the "1" allele at each locus such that G ij = H ij1 + H ij2 . We cannot determine H ij from G ij with certainty if the individual is heterozygous at more than one SNP site or if any SNP genotype is missing.
In association studies, we are interested in estimating the effects of H ij and x ij and possibly their interactions on Y ij . However, we observe G ij instead of H ij . We denote the probability distribution of H ij by P(H ij ; γ ), where γ consists of π k (k = 1, · · · , K) and ρ. By using a haplotype reconstruction program, such as HAPLORE [23], we can identify the set of all possible haplotype configurations, denoted by S(G i ), which are compatible with (possibly missing) genotype G i ≡ (G i1 , · · · , G in i ).
We specify the following linear regression model with random effects where α is the intercept, Z(H ij1 , H ij2 , x ij ) is a vector function of (H ij1 , H ij2 ) and x ij , β is the corresponding set of regression parameters, g ij is a random effect due to genes at unlinked loci, and e ij is an individual-specific residual environmental effect. Note that g ij is used to capture the correlations of the quantitative trait values within the family. The random variables g ij and e ij are assumed to be independent zero-mean normal with variances σ 2 g and σ 2 e , respectively. The phenotypic covariance matrix of Y i ≡ (Y i1 , · · · , Y in i ) T can be expressed as where gi is the matrix of kinship coefficients, and I i is an identity matrix.
We define Z(H ij1 , H ij2 , x ij ) according to the genetic mode of inheritance. For example, the choice of corresponds to an additive model for the haplotype effect, environmental effects and haplotype-environment interactions, where h * is the target haplotype of interest, and I(·) is the indicator function. If we are interested in the recessive or dominant effect of h * , then we set the genotype score in Z(H ij1 , H ij2 , x ij ) to I(H ij1 = H ij2 = h * ) or I(H ij1 = h * or H ij2 = h * ), respectively. We may include additional terms in Z so as to assess the effects of several haplotype configurations and to test for multi-haplotype effects. Write The complete-data likelihood function for parameters θ and γ given ( and the likelihood function based on the observed data It is worth noting, although we use the algorithm in HAPLORE to identify the set S(G i ), we maximize the likelihood in (3) to obtain the estimators of all unknown parameters simultaneously, including haplotype frequencies, regression coefficients, and variance parameters.
Treating the H ij 's as missing data, we maximize (3) through the EM algorithm. In the (t + 1)th E-step, we calculate the conditional expectation of the logarithm of (3) given the observed data and current parameter estimates θ (t) and γ (t) as follows: where ω (t) , which is the conditional probability that the haplotype configuration for the ith family is H i given the observed data and current parameter estimates. In the (t + 1)th M-step, we maximize (4) to update parameter estimates. We iterate the E-step and the M-step until convergence. One can also maximize the observed-data likelihood function (3) directly by using an optimization algorithm [48]. The resultant maximum likelihood estimator (MLE) is denoted by ( θ, γ ). By extending the arguments of Lin and Zeng [21] and Diao and Lin [44], we can show that ( θ , γ ) is consistent, asymptotically normal and asymptotically efficient. In addition, the covariance matrix of ( θ, γ ) can be estimated by the inverse of the observed information matrix. To test the haplotype effects or haplotypeenvironment interactions, we calculate the likelihood ratio test statistic where ( θ , γ ) is the restricted MLE under the null hypothesis. The null distribution of LR is asymptotically χ 2 with the degrees of freedom equal to the number of the regression parameters in the null hypothesis. (Note that we are testing the regression effects, not the variance components.) The Wald statistic can also be used to perform hypothesis testing and construct confidence intervals. Analysis of single SNPs with missing genotypes can be treated as a special case of the proposed haplotype analysis. If we are interested in the additive effect of a particular SNP, then we set the genotype score in Z(H ij1 , H ij2 , x ij ) in model (2) to be the value of (H ij1 + H ij2 ) at that SNP position; recessive and dominant effects can be similarly modeled. We can also define Z(H ij1 , H ij2 , x ij ) to formulate the joint effects of all M SNPs or any subset of them.
When one of the M SNPs is untyped, there is no information in the study data to estimate the joint distribution of the M SNPs. We can infer the joint distribution from an external reference database, such as the HapMap. Naturally, the family study and the reference panel are assumed to come from a common population. Let L R (γ ) denote the likelihood for γ based on the reference database. Then the likelihood for (θ, γ ) that combines the study data and reference database is L C (θ , γ ) = L(θ, γ )L R (γ ).
The EM algorithm described earlier can be used to maximize the likelihood L C (θ , γ ). The resultant MLE of (θ, γ ) preserves the desired asymptotic properties.
We have developed a stand-alone computer program that implements the new methods. The program is reasonably efficient in terms of computation. It takes about 0.8 seconds to analyze one data set in the simulation studies presented in the next section on an iMac with a 3 GHz Intel Core i5 processor. This program is freely available on the website: https://sites.google.com/view/guoqingdiaohomepage.