We used pharmacoepigenetic data from 200 simulated posttreatment replicates from the GAW20 data set based on the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study. Simulation solutions were known prior to analysis. Details regarding the GOLDN study from which these data were simulated are provided elsewhere [10]. We examined interactions between 10 SNPs and the corresponding nearby cytosine-phosphate-guanine (CpG) markers. Five SNP sites were simulated to exhibit a large effect on triglyceride (TG) response [11]. Each site was simulated with varying expected heritability (hg2 = 0.025 to 0.125). The causal SNP effect was conditionally related to methylation at the nearby CpG site, with complete expression when that CpG site was fully unmethylated and complete suppression when fully methylated. The other 5 SNPs tested were noncausal and their corresponding CpG sites shared a simulated random variability with the 5 simulated causative CpG markers. This analysis included 680 individuals from 164 families with complete data at each site tested.
Separate models tested the interaction of CpG-by-SNP site (interaction test = CpG × SNP). The methylation site analyzed was posttreatment CpG located closest to the candidate SNPs. Both SNP and CpG sites were standardized prior to analysis. All analyses also included the covariates age, sex, and study center. Therefore, the full model included the CpG × SNP interaction, its constituent single-order terms, and the other covariates (age, sex, and study site). We address two different outcome settings: (a) posttreatment TGs, using pretreatment TGs as a model covariate (henceforth referred to as “Pre/Post”) and (b) a post/pretreatment TG score (henceforth referred to as “Change”). These settings mean that the temporal nature of the study was not explicitly included in the design and was evaluated as a more general multivariate rather than longitudinal method. A log transformation was applied to TG values prior to all calculations.
LME approach
First, we used LME modeling to evaluate TG treatment response. The LME model assumed CpG, SNP, and other covariates as fixed effects and a random effect of family parameterized by the kinship correlation matrix obtained using the make kinship function of the “kinship2” package in R [12]. The lmekin function of the “coxme” package in R incorporated this kinship matrix for use in the LME model [13]. In contrast to the later GEE approaches that assume a single random effect for family, this more precisely models relationships within families.
GEE approach
Next, GEE was used to examine TG treatment response. The observed outcome vector for the ith family with ni members is denoted by \( {\boldsymbol{TG}}_i={\left[{TG}_{i1},\dots, {TG}_{i{n}_i}\right]}^T,i=1,\dots, N \) (total number of families), which has a marginal mean given by E(TGi| Xi) = μi linked to covariates though the identify function, \( f\left({\mu}_{ij}\right)={\mathbf{x}}_{ij}^T\beta, j=1,\dots, {n}_i \), for xij = [1, x1ij, …, x(p − 1)ij]T and β = [β0, β1, …, βp − 1]T, where p is the number of model covariates. The working covariance matrix for TGi is represented as \( {\boldsymbol{V}}_i={\boldsymbol{A}}_i^{1/2}{\boldsymbol{R}}_i{\boldsymbol{A}}_i^{1/2} \) where Ai is a diagonal matrix denoting the working marginal variances, and Ri is a symmetric and positive definite working correlation matrix with 1 along the diagonal elements. Let Di = ∂μi/∂βT and to obtain the regression parameter estimates, \( \widehat{\beta} \), using the GEE approach [1] for marginal analysis, we iteratively solve
$$ \sum \limits_{i=1}^N{\boldsymbol{D}}_i^T{\boldsymbol{A}}_i^{-\frac{1}{2}}{\boldsymbol{R}}_i^{-1}{\boldsymbol{A}}_i^{-\frac{1}{2}}\left({\boldsymbol{Y}}_i-{\boldsymbol{\mu}}_i\right)=0 $$
(1)
An exchangeable working correlation structure, rather than an independence structure [8], was employed. Notably, the empirical covariance might be inflated compared to its theoretical value despite both covariance matrices being asymptotically equivalent. As a result, Wald statistics are inflated and confidence intervals are overly narrow [14]. We incorporate the bias-corrected method of Mancl and DeRouen [15] to account for the bias for small sample size from \( \left({\boldsymbol{TG}}_i-{\widehat{\boldsymbol{\mu}}}_i\right){\left({\boldsymbol{TG}}_i-{\widehat{\boldsymbol{\mu}}}_i\right)}^T \) or \( {\widehat{\boldsymbol{e}}}_i{{\widehat{\boldsymbol{e}}}_{\boldsymbol{i}}}^T,i=1,\dots, N \), in which the estimated residuals, \( {\widehat{\boldsymbol{e}}}_i \), are relatively small on average [15].
QIF approach
The QIF approach is based on GEE [1] and the generalized method of moments (GMM) approaches [16]. It rewrites \( {\boldsymbol{R}}_i^{-1}={\sum}_{r=1}^m{\alpha}_{ri}{\boldsymbol{M}}_{ri} \) in eq. (1), where Mri, r = 1, …, m, i = 1, …, N, are known basis matrices and αri, r = 1, …, m, i = 1, …, N, are functions of correlation parameters [3]. This approach treats GEE as a linear combination of m sets of unbiased estimating equations, and uses the 2-step GMM approach for extended score equations to obtain regression parameter estimates. For exchangeable covariance, M1i is an identity matrix and M2i is a matrix with 0 on the diagonal and 1 elsewhere. Regression parameter and standard error estimates are consistent even when the working correlation structure is incorrect. We note that bias-corrected methods, such as the ones discussed for the GEE approach, can also be applied to the QIF approach [15, 17]. The R code to analyze the marginal models with the GEE and QIF approaches may be found in the supporting information of previous research [18].
Power
Each of the 5 SNP–CpG interactions from the simulation that were simulated causal for TG response were tested using a nominal 5% significance threshold. A Bonferroni correction was also calculated based on the genome-wide methylation data to increase generalizability to the typical experimental setting. This resulted in a significance threshold of 1.08 × 10− 7 (0.05/461281 CpG sites) [10]. Empirical power was calculated as the number of times the causal site was significant out of the total number of simulations.
Type I error rate
To evaluate false-positive rates, we examined the 5 simulated noncausal SNP–CpG interactions. A nominal 5% significance threshold was first used. The same Bonferroni correction was then applied for genome-wide correction. The Type I error rate was computed by recording the number of times the noncausal loci were significant across all tested sites and simulations.