Longitudinal data methods for evaluating genome-by-epigenome interactions in families

Background Longitudinal measurement is commonly employed in health research and provides numerous benefits for understanding disease and trait progression over time. More broadly, it allows for proper treatment of correlated responses within clusters. We evaluated 3 methods for analyzing genome-by-epigenome interactions with longitudinal outcomes from family data. Results Linear mixed-effect models, generalized estimating equations, and quadratic inference functions were used to test a pharmacoepigenetic effect in 200 simulated posttreatment replicates. Adjustment for baseline outcome provided greater power and more accurate control of Type I error rates than computation of a pre-to-post change score. Conclusions Comparison of all modeling approaches indicated a need for bias correction in marginal models and similar power for each method, with quadratic inference functions providing a minor decrement in power compared to generalized estimating equations and linear mixed-effects models.


Background
Longitudinal measurement provides numerous benefits for evaluating the association between genetic variation and human health. For example, the inclusion of multiple time points allows for the prospective study of time-varying covariates and an improved understanding of disease and trait progression over time. However, the promise of applying longitudinal data to genetic-based problems is tempered by the analytic challenges inherent to the approach. Particularly when incorporating related subjects, statistical methods should account for both the within-family correlation structure and the correlation among repeated measurements. The added complexity of large genetic data sets and influence of family structure makes computational limitations an additional barrier to consider. When global DNA methylation data are added to traditional candidate single-nucleotide polymorphism (SNP) genotypes these analytic and computational concerns are exacerbated.
Linear mixed effects (LME) models and generalized estimating equations (GEEs) are two broad and commonly used strategies for analyzing clustered/longitudinal data. In LME models, beyond parameterizing the correlation structure of repeated measurements, the correlations induced by familial relationships can also be explicitly modeled. This is often done using a kinship matrix to parameterize the degree of within-family relatedness. GEEs can also be used for analyzing individual-level outcomes within any given family (or cluster) assumed to be correlated [1]. When the mean structure is correctly specified, consistent regression parameter estimates are obtained, even if the working correlation structure is misspecified. However, accurately modeling the working correlation structure can be crucial in terms of estimation efficiency [2]. A method that has the potential to improve efficiency compared to GEEs is the use of quadratic inference functions (QIFs) [3]. Relative to the performance of GEEs, the QIF approach is more efficient when the working correlation structure is incorrect, whereas the 2 approaches are equally efficient once the structure is correctly specified [4,5]. When cluster sizes vary and the number of clusters is not large, as is the case in our simulations, QIFs might produce estimates with larger variability [6].
Although longitudinal and other multivariate methods have been considered and compared when evaluating SNP genotypes in previous Genetic Analysis Workshops (GAWs) [7][8][9], the relative power and Type I error rates of these methods have not been extensively studied for methylation outcomes with family-based studies. Furthermore, researchers have yet to compare GEE and QIF methods for analyzing genetic data. Here, we evaluate 3 methods (LME, GEEs, and QIFs) that account for clustered measurements and test epigenetic-by-genetic interactions. We consider a conditional (baseline-adjusted) model given the nature of the data simulation process in which pretreatment values were anchored to posttreatment mean outcomes. We also consider a change score setting as a comparison method. These simplified models mean that the temporal nature of the study was not included in the design explicitly and we instead focus on a more general multivariate condition that includes repeated measurements.

Methods
We used pharmacoepigenetic data from 200 simulated posttreatment replicates from the GAW20 data set based on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study. Simulation solutions were known prior to analysis. Details regarding the GOLDN study from which these data were simulated are provided elsewhere [10]. We examined interactions between 10 SNPs and the corresponding nearby cytosine-phosphate-guanine (CpG) markers. Five SNP sites were simulated to exhibit a large effect on triglyceride (TG) response [11]. Each site was simulated with varying expected heritability (h g 2 = 0.025 to 0.125). The causal SNP effect was conditionally related to methylation at the nearby CpG site, with complete expression when that CpG site was fully unmethylated and complete suppression when fully methylated. The other 5 SNPs tested were noncausal and their corresponding CpG sites shared a simulated random variability with the 5 simulated causative CpG markers. This analysis included 680 individuals from 164 families with complete data at each site tested.
Separate models tested the interaction of CpG-by-SNP site (interaction test = CpG × SNP). The methylation site analyzed was posttreatment CpG located closest to the candidate SNPs. Both SNP and CpG sites were standardized prior to analysis. All analyses also included the covariates age, sex, and study center. Therefore, the full model included the CpG × SNP interaction, its constituent single-order terms, and the other covariates (age, sex, and study site). We address two different outcome settings: (a) posttreatment TGs, using pretreatment TGs as a model covariate (henceforth referred to as "Pre/Post") and (b) a post/pretreatment TG score (henceforth referred to as "Change"). These settings mean that the temporal nature of the study was not explicitly included in the design and was evaluated as a more general multivariate rather than longitudinal method. A log transformation was applied to TG values prior to all calculations.

LME approach
First, we used LME modeling to evaluate TG treatment response. The LME model assumed CpG, SNP, and other covariates as fixed effects and a random effect of family parameterized by the kinship correlation matrix obtained using the make kinship function of the "kinship2" package in R [12]. The lmekin function of the "coxme" package in R incorporated this kinship matrix for use in the LME model [13]. In contrast to the later GEE approaches that assume a single random effect for family, this more precisely models relationships within families.

GEE approach
Next, GEE was used to examine TG treatment response. The observed outcome vector for the ith family with n i members is denoted by TG i ¼ ½TG i1 ; …; TG in i T ; i ¼ 1; …; N (total number of families), which has a marginal mean given by E(TG i | X i ) = μ i linked to covariates though the identify func- where p is the number of model covariates. The working covariance matrix for TG i is represented where A i is a diagonal matrix denoting the working marginal variances, and R i is a symmetric and positive definite working correlation matrix with 1 along the diagonal elements. Let D i = ∂μ i /∂β T and to obtain the regression parameter estimates,β, using the GEE approach [1] for marginal analysis, we iteratively solve An exchangeable working correlation structure, rather than an independence structure [8], was employed. Notably, the empirical covariance might be inflated compared to its theoretical value despite both covariance matrices being asymptotically equivalent. As a result, Wald statistics are inflated and confidence intervals are overly narrow [14]. We incorporate the bias-corrected method of Mancl and DeRouen [15] to account for the bias for small sample size from ðTG i −μ i ÞðTG i −μ i Þ T orê iêi T ; i ¼ 1; …; N , in which the estimated residuals,ê i , are relatively small on average [15].

QIF approach
The QIF approach is based on GEE [1] and the generalized method of moments (GMM) approaches [16]. It re- (1), where M ri , r = 1, …, m, i = 1, …, N, are known basis matrices and α ri , r = 1, …, m, i = 1, …, N, are functions of correlation parameters [3]. This approach treats GEE as a linear combination of m sets of unbiased estimating equations, and uses the 2-step GMM approach for extended score equations to obtain regression parameter estimates. For exchangeable covariance, M 1i is an identity matrix and M 2i is a matrix with 0 on the diagonal and 1 elsewhere. Regression parameter and standard error estimates are consistent even when the working correlation structure is incorrect. We note that bias-corrected methods, such as the ones discussed for the GEE approach, can also be applied to the QIF approach [15,17]. The R code to analyze the marginal models with the GEE and QIF approaches may be found in the supporting information of previous research [18].

Power
Each of the 5 SNP-CpG interactions from the simulation that were simulated causal for TG response were tested using a nominal 5% significance threshold. A Bonferroni correction was also calculated based on the genome-wide methylation data to increase generalizability to the typical experimental setting. This resulted in a significance threshold of 1.08 × 10 − 7 (0.05/461281 CpG sites) [10]. Empirical power was calculated as the number of times the causal site was significant out of the total number of simulations.

Type I error rate
To evaluate false-positive rates, we examined the 5 simulated noncausal SNP-CpG interactions. A nominal 5% significance threshold was first used. The same Bonferroni correction was then applied for genome-wide correction. The Type I error rate was computed by recording the number of times the noncausal loci were significant across all tested sites and simulations. Table 1 contains the power and Type I error rates for each approach and outcome setting (ie, Pre/Post vs Change). At all causal sites, the Pre/Post LME model conferred greater power than the Change setting. Power was, as expected, systematically related to simulated h g 2 , with > 90% power observed at the chromosome 1 site (h g 2 = 0.125) but only 29% power observed at the chromosome 10 site (h g 2 = 0.025). Both LME settings adequately controlled for Type I error rates, with the Change setting proving a more conservative test (α = 0.038) than the Pre/Post test (α = 0.048).

Results
A similar pattern of results was observed when testing the GEE and QIF models. Again, in all instances, the Pre/Post setting provided improved power compared to the Change setting. Consistent improvements in power with increases in simulated h g 2 were also observed. Both GEE and QIF at the two outcome settings adequately controlled for Type I error rates after applying the bias-correction methods. Removal of the bias-correction methods improved power, particularly for the QIF models, but also resulted in inflated Type I error rates.
Few causal sites were identified as significant after applying the genome-wide correction across all models and settings (power ≤ 5% across all cases).  Run times were estimated as 0.5, 1.25, and 0.4 s/site for the LME, GEE, and QIF models, respectively. Consequently, epigenome-wide testing (ie, 461,281 CpG sites) would take approximately 64, 160, and 52 h for the LME, GEE, and QIF models, respectively.

Discussion
The primary purpose of the present analysis was to evaluate 3 methods (LME, GEE, and QIF) for analyzing clustered, family-based data, including genomic and epigenomic measures. Acceptable power was generally observed across the 3 methods and 2 outcome settings when the effect sizes were large (ie, simulated h g 2 was high) and a nominal significance threshold was applied. Covarying baseline performance provided greater power and more accurate control of Type I error rates than computation of a pre-to-post change score in all instances. This outcome suggests that settings similar to those typically used for more complex longitudinal designs involving multiple time points will provide comparable or greater power as change score computation. Note, however, that this is also a function of the simulation structure and needs to be evaluated with other models of data generation. Comparisons between baseline adjustments and change scores for commonly employed longitudinal methods have been examined extensively in other settings [19].
After genome-wide correction, low power was observed at all sites, settings, and models tested. This outcome was not surprising given the relatively small sample analyzed, and reinforces the need for large samples and the development of more powerful methods for analyzing genome-wide genomic and epigenomic outcomes in families. Also, the limited resolution resulting from the number replications diminished our ability to appropriately test genome-wide Type I error.
Here, we evaluate for the first time QIF models for use with genetic data. Comparison of all models and settings suggested comparable power across methods, with QIF providing a slight decrement in some cases compared to GEE and LME. These findings are consistent with previous studies evaluating LME and GEE techniques for analyzing genetic data [20] and extend them to the QIF method. We also incorporated bias-correction [14,15,17] and found notable improvements in Type I error rates for both GEE and QIF methods. Although QIF methods gain advantage in estimation inference under moderately sized samples when employing the incorrect covariance structure, the slight decrease in power observed here may be a result of varying cluster size (ie, family size) and consequent reductions in precision [6].
Some limitations must be considered. Family structure was not explicitly included in the simulation procedure; consequently, the outcomes were based only on the familial relationships present in the parent study. We also did not incorporate the kinship matrix into the GEE and QIF methods as presented. Instead, we made a simplifying assumption of family as the cluster unit and an exchangeable correlation structure. This limitation makes comparisons between LME and GEE/QIF difficult. Important to note, however, is that LME models conducted without the kinship matrix resulted in similar power and Type I error rates as reported above (data not shown). In addition, estimates of heritability are not generated that would have allowed for evaluation of bias. Finally, we did not take full advantage of the correlation structure within subjects and focused instead on a more general multivariate method with repeated measurements. Future research should evaluate longitudinal data with more measurements to determine if these results extend to those settings.

Conclusions
These limitations outstanding, the present analysis represents one of the first evaluations of QIF models for analyzing genetic data. Our findings suggest that such methods may provide comparable power and adequate control of Type I error rates. Although computational estimates did not include the time necessary for manipulating and cleaning the large data sets incumbent to genetic analysis, these estimates suggest that each model could be scaled up to the epigenomic level. Future studies can further evaluate these models to identify ways to improve computational efficiency for application to large-scale genetic and epigenetic data. Funding Publication of this article was supported by NIH R01 GM031575. This work was partially supported by the National Institute on Aging (DWF: K25AG043546) and National Science Foundation (JCS: 1247392). The GAW is supported by the National Institute of General Medical Sciences grant R01GM031575. The GAW20 phenotype and sequence data were provided by the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study, which is supported by the National Heart, Lung, and Blood Institute grants R01HL104135 and U01HL72524.

Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW), but restrictions apply to the availability of these data, which were used under license for the current study. Qualified researchers may request these data directly from GAW.

About this supplement
This article has been published as part of BMC Genetics Volume 19 Supplement 1, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcgenet.biomed central.com/articles/supplements/volume-19-supplement-1.