Sample
GOLDN study participants were recruited from the Family Heart Study in Minneapolis, MN, and Salt Lake City, UT [7]. All participants were self-reported to be white. Triglycerides (TGs) were measured at 4 visits, and all participants were treated with 160 mg/day of fenofibrate after 3 weeks, between visits 2 and 3 [8].
DNA methylation in CD4+ T cells was measured using the Illumina Infinium Human Methylation 450 K BeadChip (Illumina 450 K) array before treatment at the second visit and after treatment at the fourth visit. The Illumina 450 K array measures DNA methylation at approximately 480,000 cytosine-phosphate-guanine (CpG) sites [9]. From this array, we obtain approximately 480,000 beta values between 0 and 1 for every participant, which represent the percentage of DNA strands that are methylated at each of these CpG sites, or probes. Beta values are easily interpretable, but often have severe heteroskedasticity. Therefore, some methods use methylation M values instead, defined as \( M={\mathit{\log}}_2\left(\frac{\beta }{1-\beta}\right) \) [10]. There were 997 participants from 182 families with methylation data at 1 or 2 exams.
Methylation preprocessing
Analysis of Illumina 450 K array data is susceptible to several technical challenges. The array uses 2 different technologies, Infinium I and Infinium II probes, to measure DNA methylation. Infinium II probes are not as accurate as Infinium I probes for beta values close to 0 or 1 [9]. Differences between the 2 probe types must be corrected to avoid spurious enrichment of associations in Infinium I probes [11].
Data made available were adjusted for probe type by fitting a second-order polynomial to all Infinium I and Infinium II probe pairs within 50 base pairs (bp) of one another. Because there were remaining systemic differences in the range of Infinium I and Infinium II probe beta values, we further adjusted for probe type using beta-mixture quantile dilation (BMIQ) to avoid Infinium I probe enrichment bias [11]. Finally, we removed probes with a polymorphic C, G, or single-base extension with minor allele frequency greater than 0.05 and cross-reactive probes, defined as any probe with at least 46 of 50 bp in common with a sequence elsewhere in the genome [12]. This left 430,298 probes for analysis.
DMR methods
For each DMR method, we analyze the relationship between the natural log of TGs and methylation at visits 2 and 4, adjusting for age, sex, study center, and smoking status (current, former, or never smoker).
Bumphunter creates regions by combining all probes within a user-defined pairwise distance. Bumphunter uses a linear model for each probe to predict methylation M values from TG, adjusting for covariates, and smooths the effect size estimates of all probes within a region, ordered by chromosomal position. Potential bumps, or DMRs, are defined as the collection of smoothed effect sizes for any region with an effect size above a user-defined threshold. The statistical significance of the average height and area of each potential bump is calculated using a bootstrap approach and adjusted for multiple testing [5]. We perform 100 bootstrap replications using a maximum pairwise distance of 600 bp and use the 99th percentile of calculated effect size estimates as the threshold. Because Bumphunter does not account for family structure, we limit our analyses to an unrelated subset of 176 individuals in the pre- and posttreatment analyses for this method.
DMRcate, comb-p, and GlobalP use precomputed summary statistics as input. We perform an EWAS using a linear mixed-effects model, implemented with the lmekin function from the coxme R package, accounting for relatedness using a kinship matrix computed from known pedigree relationships. There are 993 individuals included in the pretreatment association analysis and 499 in the posttreatment analysis.
DMRcate applies Gaussian kernel weights to smooth EWAS z-statistics of all probes in a seed region and computes a region p value for the sum of weighted squared t-statistics using a Satterthwaite approximation [4]. We use the recommended bandwidth for this analysis, which collapses any 2 probes or regions within 1000 bp of one another into 1 region [4]. We also use the recommended definition of a seed region (at least 1 probe with a false discovery rate [FDR] < 0.05) and scaling factor, C = 2, to calculate the Gaussian kernel.
Using p values from any test of association between methylation and a phenotype, comb-p calculates autocorrelation between probes and uses this autocorrelation and neighboring p values to calculate Stouffer-Liptak-Kechris (SLK)-corrected p values for each probe. A peak-finding algorithm is used to identify regions with enriched SLK-corrected p values. The significance of each region is then determined by applying a Stouffer-Liptak correction to the original p values of all probes in the region. To correct for multiple testing, a Sidak correction, based on the number of possible regions of the same size, is applied to the Stouffer-Liptak p values [6]. Following the authors’ recommendation, we define regions in this analysis as all probes within 200 bp of another probe and only test for significant DMRs in regions with at least 1 unadjusted probe p value < 10− 3.
Finally, we develop our own algorithm, GlobalP. We calculate a z-score for each probe, \( z=\frac{\beta }{SE\left(\beta \right)} \), from the lmekin analysis. It can be shown that under the null hypothesis of no association between methylation and TG and assuming that there are no covariates in the model, the vector of probe z-scores in a region with m probes, zm, follows a multivariate normal distribution with mean 0 and covariance Σ, where Σ is the m × m correlation matrix between probes in the study sample [13]. It follows from the properties of the multivariate normal distribution that \( {z}_m^T{\Sigma}^{-1}{z}_m\sim {\chi}_m^2 \). In the presence of covariates, Σ is the partial correlation matrix between probes [13].
Rather than grouping probes by a set distance, we group probes by annotation. We map probes to subcategories of genes and CpG islands, which are areas in the epigenome with more CpG sites than expected by chance [14]. We map probes using the UCSC definitions of gene annotation boundaries and CpG islands provided in the IlluminaHumanMethylation450kanno.ilmn12.hg19 R package. There are 5 possible annotation categories within each CpG island: south shelf, south shore, island, north shelf, and north shore. CpG island shores are regions within 2 kb of a CpG island, and CpG island shelves are regions within 2 kb of a CpG island shore. For each gene, there are 6 “functional” annotations: the gene body; first exon; 5′ untranslated region (UTR); 3′ UTR; the region from the transcription start site (TSS) to 200 bp upstream; and the region from 200 bp to 1500 bp upstream of the TSS [9]. Of the probes we analyzed, 375,805 (87.3%) fall within at least one of these 178,015 annotations. For each annotation within a gene or CpG island, we calculate a χ2 statistic using the z-scores and partial correlation between probes. Because these annotations are not mutually exclusive, a Bonferroni correction is too conservative. We use an FDR cutoff of 0.05.