Theme 1: Family versus unrelated individuals
Six contributions in our group specifically explored the family data to perform association analyses (Fuady et al..., Lent et al., Sarnowski et al., Wang et al., Wu et al., Zhao et al). In the contribution of Fuady et al., family data was used to investigate the association between lipid changes and methylation changes after treatment using various family-based ascertainment correction approaches. Zhao et al. employed family structure to identify gene-based association between chromosomes 11 and 19 DNA methylation sites and change in lipid levels. Parent-of-origin effect association analyses were performed by Sarnowski et al., using the family data. Wang et al. explored 2 independent methods to investigate the association between changes in methylation and drug response in families. Wu et al. used family data to assess the relationship between baseline methylation and TG levels. Lent et al. compared various methods to analyze the association between DMRs and TG level using family data. In all contributions, empirical kinship was used to adjust for family structure.
Ascertainment correction
Ascertainment correction plays a role not only for families but also for unrelated individuals who were selected from the pedigree. The available data set in GAW20 is based on the GOLDN study, which recruited families with at least 2 coronary heart disease events and a family risk score of 0.5 or higher [2, 27]. The multiple-case family design is enriched for the outcome variable (primary phenotype). Some work has been done when modeling the primary phenotype to address the ascertainment bias [28]. When modeling the secondary phenotype, it is also necessary to correct for the ascertainment, especially when the primary and the secondary phenotype are correlated [29].
Fuady et al. explored 2 existing statistical methods to take ascertainment into account for multiple-case families. The first method is the secondary phenotype approach [30]. In this approach, the ascertainment process is corrected using the retrospective likelihood and joint model between the primary and secondary phenotype. They choose metabolic syndrome (MetS) as the primary phenotype and methylation as the secondary phenotype. The second approach is an ascertainment correction based on proband as implemented in SOLAR (Sequential Oligogenic Linkage Analysis Routines) [31]. The correction was performed by conditioning the likelihood of each pedigree with the trait value of the proband [32].
In the contribution of Zhao et al., the ascertainment correction was implemented by constructing principal components (PCs) of genome-wide methylation levels using 2000 probes randomly sampled from all chromosome. PCs were used to adjust for known and unknown confounders. Moreover, it is also useful to improve the validity of the VC-score test. The top 4 methylation-derived PCs were used as an additional covariate in the model. An extended VC-score test [26] was used to incorporate the family structure in the data set. This approach decomposes the total variation of the phenotype into variation explained by region-methylation profiles and residual variation. In particular, it assumes that the phenotypic similarity between individuals is totally captured by the region-methylation similarity, after the ascertainment adjustment via the PCs.
To compare the ascertainment correction approach in the multiple-case family, Fuady et al. computed the average distance of the effect of the change in TG on the change in methylation from a combination of 3 approaches. The average distance between the secondary phenotype and the naïve approach is smaller when the CpGs are not associated with the primary phenotype MetS (0.169 vs 0.418). This suggests that using MetS instead of coronary heart disease as the primary phenotype captures a part of selection mechanism. For CpG sites associated with the primary phenotype MetS, it is necessary to use the more complex secondary phenotype approach. For the other CpG sites, both approaches should provide the same results. Zhao et al. found that inclusion of the top 4 methylation-derived PCs was important in controlling for unknown confounding. Without this adjustment, the distribution of p values was very biased from what would be expected under the null. In this analysis, the logit transformation did not seem to provide any improvements. Moreover, as a consequence of the increase in sample size, the family-data analysis identified more significant genes than the analysis using unrelated individuals.
Modeling the family structure
Four contributors used the linear mixed effect model to account for familial relatedness. Fuady et al. and Wang et al. used this model to assess the relationship between change in methylation and change in lipid levels in the naïve approach and MMLT, respectively. Specifically, they used lmekin function provided by coxme package in R [2] to conduct the analysis. In MMLT [15], the outcome was the lipid change while the change in median methylation was treated as an independent variable. In the naïve approach, methylation data was treated as dependent variable and the lipid changes as an independent variable. Sarnowski et al. used the linear mixed effect model provided by the QTDT software [12] to perform parent-of-origin effect association analyses with TG levels at baseline, follow-up, or on the difference between baseline and follow-up. They modeled the phenotype of the offspring as a function of covariates and genotype by taking parental original of each allele into consideration. In the contribution of Zhao et al., a linear mixed effect model was used in the family-based VC-score test [26].
In the parent-of-origin association analysis, Sarnowski et al. found a paternal effect of rs301621-G on the pretreatment and posttreatment of TG difference. They also found a maternal effect of this SNP on cg10206250 methylation levels. The observed paternal effect of this SNP is induced by treatment. However, this effect is not mediated by DNA methylation at cg10205250.
Analyzing unrelated individuals
To analyze unrelated individuals, Zhao et al. used PCEV [25], which seeks to identify the linear combination of outcomes that maximizes the proportion of the variance being explained by the covariate. They contrasted the method using the VC-score test [26], which reduces significantly the model degrees of freedom compared to standard multivariate regression models. They restricted the analysis to genes on chromosome 11. The VC-score approach identified 1 gene, SPTY2D1, which was significantly associated with HDL-C changes. Using the PCEV approach, 1 gene, NAV2, was significantly associated with TG changes.
Theme 2: Multimarker versus single-marker tests
Single-marker EWAS, which model the association between a phenotype and each CpG site individually, are widely used but may be underpowered owing to the small effect sizes often seen in epigenetic studies [33] and the high multiple testing burden of DNA methylation arrays [34]. Six papers in our group explored multimarker methods to test the association between a phenotype and set of CpG sites (Lent et al..., Huisman et al., Wang et al., Zhao et al., Xu et al., and Wu et al).
These methods can be categorized into gene score approaches, collapsing methods, and combination of EWAS summary statistics. The authors used real and simulated data to compare these multimarker association tests to the standard single-marker EWAS and evaluated consistency of results from different multimarker approaches.
Gene score approaches
One way to reduce the multiple testing burden with methylation array data is to create gene scores and perform 1 test of association per gene rather than 1 test per CpG site. Wang et al. implemented 1 gene score method, MMLT [15], using the real data set, whereas Huisman et al. tested several gene score methods using the simulated data set. Wang et al. did not find any genes where the change in median methylation before and after treatment was associated with the change in either HDL-C or TG. Huisman et al. found that gene scores derived from single-marker EWAS p values—minimum gene p value, sum of log p values, and sum of squared log p values—had more power to detect simulated associations than single-marker EWAS. Gene scores derived from single-marker EWAS estimates of effect—maximum absolute value, median absolute value, sum, and sum of squared estimates of effect—had lower power than single-marker EWAS and higher Type I error.
Collapsing methods
Like single-marker EWAS, gene score methods model the association between a phenotype and a single summary measure of gene methylation. An alternative way to perform gene-based tests is to model the association between a phenotype and all CpG sites in a gene. We refer to these approaches, jointly modeling the association between a phenotype and methylation at a set of CpG sites, as collapsing methods. Four papers in our group used collapsing method approaches to directly model the association between a phenotype and set of CpG sites in the real data.
Xu et al. performed a score test to evaluate the association between simulated posttreatment methylation and TG in candidate genes [6]. All other collapsing method groups employed a VC-based approach. Wang et al., Wu et al., and Zhao et al. all applied a SKAT [16] or adjusted SKAT (ASKAT) [26] test to the real data, and Wu et al. proposed an extension to a VC-based approach, aSPU [18].
Zhao et al. restricted their analysis to genes on chromosomes 11 and 19 and employed 2 multimarker approaches, SKAT and PCEV [25], described in Theme 1. There was some overlap in the top results of the SKAT and PCEV approaches: OR8H3 was in the top 5 genes associated with change in HDL-C for both methods, and P2RX3 was in the top 5 genes associated with change in TG for both methods. Wang et al. also performed a SKAT test of association across the whole epigenome and identified 2 genes, GZF1 and C18orf19, where change in methylation before and after treatment was associated with change in HDL-C after correcting for multiple testing with a false discovery rate.
Wu et al. first performed an EWAS to quantify the association between methylation and pretreatment TG at each CpG site. They then performed 2 aSPU tests per gene, once using equal weights and once using inverse CpG sites variance weights, and compared to the EWAS for pretreatment TG. Additionally, 1 aSPU test per gene using the inverse variance weights was performed with change in methylation before and after treatment as the outcome. Using the inverse of the CpG site variance as the CpG site weights, Wu et al. identified 1 gene, CPT1A, associated with pretreatment TG after Bonferroni correction. This gene was also identified in the single-marker EWAS, but was not identified by aSPU when using equal CpG site weights.
Combination of EWAS summary statistics
Lent et al. compared 1 novel method of combining EWAS summary statistics to perform regional tests, GlobalP, to 3 existing methods: Bumphunter [8], comb-p [9] and DMRcate [10]. For GlobalP, 178,015 regions were defined from gene and CpG island annotations. A test statistic for each region using the EWAS z-statistics was calculated, taking into account partial between CpG site M-values, and region p values were corrected for multiple testing using a false discovery rate. There was no overlap in regions identified by Bumphunter and other methods. GlobalP and comb-p identified regions in CPT1A associated with TG both before and after treatment. The single-marker EWAS also identified an association in CPT1A.
Theme 3: Combining SNPs and DNA methylation probes
Three papers in our group combined both SNPs and DNA methylation probes (Huisman et al......., Sarnowski et al., Xu et al) to perform association analyses. One paper was an application of an existing method, the QTDT [12], in the real GAW20 data set (Sarnowski et al) and 2 papers proposed new methods (Huisman et al., Xu et al). In this section, we address 3 different questions: (a) Why use both SNPs and DNA methylation probes? (b) How to combine both SNPs and DNA methylation probes? and (c) What lessons did we learn from using both SNPs and DNA methylation probes?
Why use both SNPs and DNA methylation probes?
Sarnowski et al. jointly analyzed SNPs and DNA methylation probes to better understand the biological mechanisms underlying parent-of-origin effects. Huisman et al. compared the performances of novel multimarker methods versus single-marker test using statistics based on SNPs or DNA methylation probes. Xu et al. developed 3 strategies to search for both genetics variants and CpG site variants associated with a quantitative trait (TG or HDL-C) in multiple genes: an iterative regression (single-SNP or single-CpG-based method), an extreme values approach (gene-based method), and a hybrid approach.
How to combine both SNPs and DNA methylation probes?
We identified 2 common steps in the 3 papers: (a) a filtering of the number of SNPs and probes (CpG sites) to be tested and (b) a joint test of SNPs and DNA methylation probes.
Filtering of the number of SNPs and DNA methylation probes
One statistical reason for the filtering of the number of SNPs and DNA methylation probes is to reduce the number of SNP-CpG pairs tested and address the multiple testing issue. Another biological reason is to study the effect and interaction of SNPs and probes that are located in a same region (cis-effects). This first step was done based on either (a) GWAS and EWAS results (Sarnowski et al...) or (b) candidate genes (Huisman et al., Xu et al).
Sarnowski et al. tested the association of SNPs with TG under parent-of-origin effects using QTDT and selected suggestive associations based on an agnostic approach or a candidate approach (GWAS regions reported associated with TG). Then they selected CpG sites located nearby the suggestive SNPs (±50 kb) and tested the association with TG. Finally, they selected nominally associated CpG sites and performed a causal inference test [13] for each SNP-CpG pair. Huisman et al. selected candidate genes based on GAW20 simulation data. They defined 3 distinct groups of genes: 5 major effect genes with a causal SNP with high heritability, 34 minor effect genes with a causal SNP with modest heritability, and 39 randomly selected noncausal genes with no causal SNPs. For a genetic variant that was causal according to the GAW20 simulation [2], they selected the causal DNA methylation probe provided by the solutions, whereas for a noncausal genetic variant they selected the closest DNA methylation probe. Xu et al. used 2 different filtering strategies. In their iterative regression approach, few candidate SNPs and/or CpG sites highly correlated with trait values were tested first and a best variant was selected. The regression was repeated against the residual to select additional SNPs and/or CpG sites. In their extreme values strategy, the individuals with the top 5% value of the quantitative trait were used to select candidate genes with at least 1 CpG site.
Joint test of SNPs and DNA methylation probes
The second step was a joint test of SNPs and DNA methylation probes. Joint tests that combine both SNP and methylation data were used by Sarnowski et al. and Huisman et al.:
$$ CpG=\beta \mathrm{SNP}\ast SNP+\beta \mathrm{TG}\ast \Delta TG $$
(1)
$$ \Delta TG=\beta \mathrm{S}\mathrm{NP}\ast \mathrm{S} NP+\beta \mathrm{CpG}\ast CpG $$
(2)
Sarnowski et al. used eqs. (1) and (2) in the causal inference test to determine if a CpG probe mediates the association between a SNP and TG under a parent-of-origin effect model. Huisman et al. used test (2) as a data preprocessing step to get beta coefficients statistics or p values (for SNP and CpG sites) to aggregate in their multimarker methods. For each gene, they created 4 scores by aggregating beta coefficients or p values for SNPs or DNA methylation probes (sum of absolute values, sum of squares, maximum of absolute values, and median of absolute values in the spirit of other papers exploring aggregation methods [35,36,37]). Xu et al. used a multimarker approach to test the combined SNPs/CpG sites or candidate genes from the iterative regression and the extreme values strategy (a score test developed by Chapman et al. [6]). They compared the performances of the different approaches (iterative regression, extreme values strategy, or a hybrid approach) with the correlated method [23].
What lessons have we learned from the GAW20 data set when using both SNPs and DNA methylation probes?
Sarnowski et al. identified 22 SNPs with suggestive parent-of-origin effects on TG (P ≤ 10− 5) and 18 DNA methylation probes located nearby these SNPs were found associated with TG (P ≤ 0.05). One SNP-probe pair presented evidence of parent-of-origin effect: the SNP rs301621 was associated with the difference between pretreatment and posttreatment TG when transmitted by the father (P = 1.2 × 10− 5). This same SNP was associated with the methylation levels of cg10206250 when transmitted by the mother (P = 0.01). Using a causal inference test, the authors showed that the observed parent-of-origin effect of rs301621 was not mediated by DNA methylation at cg10206250. Huisman et al. performed gene-based tests based on SNP or CpG estimates from the joint test of SNP and CpG on TG. They found an average power of 0.48 and a Type I error of 0.04 when using the SNP statistics and a power of 0.06 and a Type I error of 0.04 when using the CpG statistics. Their results also suggested that methods run on major effect genes (with causal SNPs) were detecting more “signal” than on minor effect genes (with no causal SNPs). Finally, they found that single-marker tests outperformed gene-based tests in general. Xu et al. found that the correlated method and the hybrid approach had correct Type I errors for TG association analyses, but were conservative for HDL-C. The other methods were more conservative for both traits. The patterns of power comparison for TG and HDL-C were consistent. From the most powerful to the least powerful, the methods were the hybrid approach, the iterative regression strategy, the extreme values strategy, and the correlated method.