GAW20 data
GAW20 data included a real and simulated data set on 200 replicates of the real data phenotypes and CpG sites for 2 time points, before and after 3 weeks of daily treatment with a lipid-lowering drug (fenofibrate). All 7 contributions summarized here used the GAW20 real data set. The real data set was provided by the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study and included EWAS and genome-wide association genotypes from 188 extended families from Minnesota and Utah [15]. DNA methylation in CD4+ cells was measured using the 450 K Infinium array for 463,995 CpG sites and was available for 995 and 530 individuals pretreatment and posttreatment, respectively. Phenotype information included sex, age, recruitment center, smoking status, and blood lipid levels, and was available both pretreatment and posttreatment for 818 and 861 individuals for TG and high-density lipoprotein (HDL), respectively.
EWAS quality control
An important consideration for EWAS studies are quality control (QC) and normalization of the CpG sites. Proper QC helps detect bias and potential technical confounders, and is essential in both cross-sectional and longitudinal studies, making sure phenotypic data between time points are comparable.
The Illumina Human Methylation 450 K array (Illumina, San Diego, CA USA), uses 2 different chemistries to detect DNA methylation. As these 2 chemistries differ in dynamic range, sensitivity, and biological annotation, the observed methylation patterns follow 2 different distributions. This is further complicated by the mixture distributions generally observed in methylation data. Multiple methods have been developed to address the issues caused by the 2 chemistries [16,17,18,19]. Two of the GAW20 contributions from this working group focused solely on QC of the EWAS data [20, 21].
Canty and Paterson [20] focused mainly on batch effects and QC using independent observations, while LeBlanc et al. [21] focused on using family structure in their QC steps. Inspection of the provided GAW20 data by both studies revealed insufficient probe-normalization, which was addressed in multiple papers submitted to GAW20, including 5 contributions from this group [20,21,22,23,24]. Three of these studies [21, 23, 24] used beta-mixture quantile normalization (BMIQ) [19] to normalize out probe-type effects, whereas Canty and Paterson [20] analyzed the CpG probe types in 2 separate strata. Almeida et al. [22] used inverse-normalization to convert the DNA methylation beta values to have range (−∞,∞). For the same reason, Canty and Paterson [20], Nustad et al. [24], and LeBlanc et al. [21] used the inverse logit transformation of beta values (M values) for their analysis. Two contributions from this group [25, 26] did not use any normalizations on the epigenetic data. However, because Fernández-Rhodes [25] only used Type II probes, the observed probe bias should not have affected their results.
Further inspection of the epigenetic data indicated strong batch effects for pretreatment and posttreatment, as well as evidence for sample swaps. This was clearly outlined in both the QC contributions by 2 of these GAW20 contributions [20, 21]. Batch effects in genomic studies are sometimes adjusted for by adding the principal components (PCs) in the analysis. In our GAW20 group, 4 groups [20, 22, 23, 25] adjusted for DNA methylation-derived PCs in their analysis.
The interpretation of DNA methylation-derived PCs is still unclear in EWAS, but is often taken to represent either batch effects or reflect the sample-specific cell-type composition. In Irving et al. [27], the PCs were interpreted as impurities in the CD4+ T-cell population.
GAW20 approaches
Heritability
Three GAW20 contributions in this group estimated h2 based on the reported family relationships for either phenotypes or DNA methylation. Narrow-sense h2 was estimated for blood lipids [22, 24], metabolic syndrome [25], treatment effect [24], and DNA methylation [22, 24, 25]. All 3 of these contributions used a linear mixed effect (LME) model (variance component model) approach. Frequentist models [22, 25] were implemented in SOLAR (sequential oligogenic linkage analysis routines) [9], whereas a Bayesian model [24] was implemented in INLA (integrated nested Laplace approximation) for h2 estimates [28].
All 3 contributions [22, 24, 25] estimated h2 with some clinical covariates accounted for in their LME models. Two of these contributions investigated pretreatment and posttreatment h2 estimates epigenome-wide [22, 24], while Fernández-Rhodes [25] focused on metabolic syndrome-associated CpG sites. Almeida et al. [22] used LME to estimate h2 of inverse-normalized CpG sites epigenome-wide. These researchers also investigated HDL h2 for both pretreatment and posttreatment with and without the first 20 DNA methylation-derived PCs as covariates in their LME [22]. In addition, they calculated covariance matrices between samples based on gene-specific methylation sites. These matrices were used as an additional component in a LME, where they investigated if some of these matrices could explain a significant proportion of the HDL phenotypic variance.
Fernández-Rhodes et al. [25] estimated the h2 of 4 CpG sites (cg00574958, cg17058475, cg18181703, and cg06500161) previously associated with metabolic syndrome in GOLDN and other studies, focusing on building LMEs, as implemented in SOLAR [9] to account for shared genetic and environmental factors. They preprocessed the pretreatment methylation data by adjusting for the top 4 methylation PCs, as previously described to account for T-cell purity or residual batch effects [27]. Using a variance-component LME, they estimated h2 for metabolic syndrome and the 4 specific metabolic syndrome CpG sites in models with (a) no covariates, (b) with individual-level covariates that incorporated age, sex, and their interactions, (c) sequentially adding environmental covariates (study center, smoking status) to the (b) model. Only covariates with p < 0.1 were kept in their reduced model. Finally, to this reduced model, they separately added a household variance component for siblings representing “early life shared environment” and one for parents representing “later life shared environment,” and then screened for nominally significant cis-acting and trans-acting single-nucleotide polymorphisms (SNPs).
Nustad et al. [24] applied a Bayesian LME implemented in R-INLA [28] to estimate the h2 of TG and HDL, treatment response (change in TG and HDL from pretreatment to posttreatment) and epigenome-wide CpG sites. They performed model selection using the deviance information criterion to identify CpG sites having strong evidence of nonzero h2. They used BMIQ-normalized methylation on the M-scale in their analyses, and excluded SNP-associated CpG sites while accounting for age and sex.
Drug treatment response
Two contributions [23, 26] in this group examined the treatment effect on DNA methylation. Yu et al. [26] used a generalized estimating equation to estimate the association between log-transformed TG levels and the methylation proportion separately for the pretreatment and posttreatment of 349,755 CpG sites that were uniquely mapped to a gene. They adjusted for age, sex, study center, and smoking status in their analysis. Furthermore, they examined whether the effect of adding log-transformed HDL to the model changed the evidence for association. Using the subset of 421 individuals with methylation and lipids at both time points they also conducted a longitudinal analysis, adding a covariate fenofibrate treatment (time) and the interaction between treatment (time) and methylation proportion. Except for the added indicator covariate for drug treatment, they used the same covariates in the longitudinal modeling and conducted with and without adjustment for HDL. Lim et al. [23] restricted their analysis to 14,850 CpG sites that were nominally significant (p < 0.05) with log-transformed TG level at baseline methylation. For each of these sites, residuals were found from a LME accounting for family structure and covariates, such as age, sex, study center, smoking status, and 10 PCs separately pretreatment and posttreatment DNA methylation. These residuals were used to construct networks using weighted gene coexpression network analysis (WGCNA) to find modules of highly interconnected CpG sites [29]. They tested whether pretreatment modules changed more than by chance in the posttreatment modules using both the WGCNA module preservation method and generalized hamming distance [30].
Targeted versus epigenome-wide data
In EWAS, a proportion of DNA methylation covering the epigenome is investigated, with little regard for prior biological knowledge or reasoning. Although this is more computationally intensive, it has the ability to detect previously unknown epigenetic associations with a phenotype and generate new hypotheses about the underlying biology of complex disease.
One contribution from this GAW20 group preprocessed the epigenome-wide methylation data for cell purity or residual batch effects, but then performed a targeted analysis [25] that focused on 3 genes (CPT1A, SOCS3 and ABCG1) previously reported to be associated with metabolic syndrome [31,32,33,34,35]. All other GAW20 contributions from this group applied an epigenome-wide hypothesis-generating approach where the entire epigenome was interrogated. However, 2 of these contributions implemented data-driven approaches to reduce the final number of analytic tests conducted. In their network analysis, Lim et al. [23] used a reduced data set consisting of 14,850 CpG sites that showed a nominal association of log-transformed TG with pretreatment with fenofibrate DNA methylation. Almeida et al. [22] reduced the methylation data to the gene-specific CpG sites in their search for gene-specific methylation that could explain a significant proportion of the observed HDL h2.
Family versus unrelated
Six of 7 scientific contributions in our GAW20 group used the known pairwise family relationships in their analyses [21,22,23,24,25,26]. These contributions included pedigree information that described the expected proportion of shared genetic information between extended family members. The family information was used either to estimate h2 or genetic values for various traits [21, 22, 24, 25] or to model the dependency between individuals in drug treatment response models [23, 26].
A single GAW20 contribution from this working group used a customized unrelated data set [20]. These authors randomly selected 1 individual from each pedigree and looked for the differences in both the mean and variability of DNA methylation between pretreatment and posttreatment.