We analyzed the real GAW data set, comprising 407 individuals with complete TG, genotype, methylation, and covariate data. The sample of 679 individuals with TG, genotype, and covariate data was used for preliminary screening of SNPs for analysis. In the following, we present the details for an exposure A (SNP genotype alternate allele count), a continuous mediator M (difference in methylation posttreatment minus pretreatment), and a continuous outcome Y (difference in log TG posttreatment minus pretreatment). Relevant covariates C include age, sex, study center, and smoking status.
Mediation hypothesis
The counterfactual approach to mediation analysis provides methods to quantify these relationships [5, 6]. This approach is based on the potential outcomes of each subject, conditional on the levels of exposure and mediator. Only one of these potential outcomes is observed for each individual, but under certain assumptions, the others may be estimated from the data. Here, \( {Y}_{am} \) represents the potential outcome for exposure level A = a and mediator level M = m, and M(a) represents the level of the mediator that would be observed for a given subject with exposure level a. The total contribution of mediation through M to the effect of A on Y is given by the natural indirect effect (NIE): \( NIE={Y}_{aM(a)}-{Y}_{aM\left({a}^{\ast}\right)} \), which, for individuals with exposure level a, is the difference between the potential outcome at the observed mediator level M(a) and the potential outcome at the counterfactual mediator level M(a*) that they would have had if their exposure level had been a*. For notational simplicity, we take a = 1 and a* = 0, so the contrast is defined in terms of 1 additional alternate allele for the SNP under consideration. Note that this quantity will be zero if there is no effect of the exposure on the mediator [so that M(a) = M(a*)] or no effect of the mediator on the outcome (so that \( {Y}_{a{m}_1}={Y}_{a{m}_2} \) for any values m1, m2 of the mediator). The NIE can be estimated from the simultaneous regression models as follows:
$$ E\left(M|A=a,\boldsymbol{C}=\boldsymbol{c}\right)={\beta}_0+{\beta}_1a+{\beta_2}^{\prime}\boldsymbol{c} $$
(1)
$$ E\left(Y|A=a,M=m,\boldsymbol{C}=\boldsymbol{c}\right)={\theta}_0+{\theta}_1a+{\theta}_2m+{\theta}_3am+{\theta}_4^{\prime}\boldsymbol{c} $$
(2)
Under the assumptions described below, NIE = β1(θ2 + θ3) for a = 1 and a* = 0. The SE of this estimate via the delta method is \( \sqrt{\Gamma \Sigma {\Gamma}^{\prime }} \), where Γ = (0, θ2 + θ3, 0′, 0, 0, β1, β1, 0′) is the gradient of the NIE with respect to the parameters ordered as (β0, β1, β2′, θ0, θ1, θ2, θ3, θ4′), and Σ is the block-diagonal covariance matrix of the estimators from regression models (1) and (2).
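As an illustration, the point estimate and delta-method SE can be computed from fitted coefficients as in the following minimal sketch. It is not the analysis code used for this study. The function name `nie_delta`, the argument names, and the coefficient ordering (intercept first, then the terms in the order they appear in models (1) and (2)) are assumptions made for the example.

```python
import math

def nie_delta(beta, cov_beta, theta, cov_theta):
    """NIE = beta1 * (theta2 + theta3) for a = 1, a* = 0, with delta-method SE.

    beta      : coefficients of model (1), ordered (beta0, beta1, beta2...)
    theta     : coefficients of model (2), ordered
                (theta0, theta1, theta2, theta3, theta4...)
    cov_beta  : covariance matrix of beta (list of lists)
    cov_theta : covariance matrix of theta (list of lists)
    The ordering is an assumption of this sketch, not a library convention.
    """
    p, q = len(beta), len(theta)
    nie = beta[1] * (theta[2] + theta[3])

    # Gradient of the NIE with respect to (beta, theta).
    gamma = [0.0] * (p + q)
    gamma[1] = theta[2] + theta[3]   # d NIE / d beta1
    gamma[p + 2] = beta[1]           # d NIE / d theta2
    gamma[p + 3] = beta[1]           # d NIE / d theta3

    # Block-diagonal covariance: the two models are treated as independent.
    sigma = [[0.0] * (p + q) for _ in range(p + q)]
    for i in range(p):
        for j in range(p):
            sigma[i][j] = cov_beta[i][j]
    for i in range(q):
        for j in range(q):
            sigma[p + i][p + j] = cov_theta[i][j]

    var = sum(gamma[i] * sigma[i][j] * gamma[j]
              for i in range(p + q) for j in range(p + q))
    return nie, math.sqrt(var)
```

In practice the coefficient vectors and covariance matrices would come from whatever software fitted models (1) and (2); only the β1, θ2, and θ3 entries and their (co)variances contribute to the result.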
This NIE estimator has a valid causal interpretation if models (1) and (2) are correctly specified and the following assumptions hold:
1. No unmeasured confounding for the exposure–outcome relationship.
2. No unmeasured confounding for the mediator–outcome relationship.
3. No unmeasured confounding for the exposure–mediator relationship.
4. No mediator–outcome confounder is affected by the exposure.
Similar assumptions are required for causal interpretation of any regression analysis.
Because the statistical power to detect indirect effects is low in studies with a small to moderate sample size, and because statistical hypothesis testing is not a valid method for qualitative assessment of confounding between the exposure and mediator, VanderWeele recommends comparing the magnitude of the total effect of the exposure on the outcome, estimated from a model that excludes the mediator, with the magnitude of the direct effect of the exposure, adjusted for the mediator and the exposure–mediator interaction [6].
Interaction hypothesis
For the purpose of assessing mediation, the interaction term in model (2) is useful primarily to allow valid estimates in the presence of non-additive contributions of the genetic and methylation effects. However, we are also interested in the interaction coefficient θ3 in its own right. The null hypothesis of no interaction, θ3 = 0, may be interpreted as follows: the effect of M on Y is the same at all levels of A. If this null hypothesis does not hold, we may identify genotypic subgroups with different methylation effects.
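To make this interpretation concrete, differentiating model (2) with respect to m gives the methylation effect at genotype a. This worked step is added for illustration and uses only the symbols of model (2):

$$ \frac{\partial }{\partial m}E\left(Y|A=a,M=m,\boldsymbol{C}=\boldsymbol{c}\right)={\theta}_2+{\theta}_3a $$

The methylation slope is therefore θ2 for 0 alternate alleles, θ2 + θ3 for 1, and θ2 + 2θ3 for 2, and the three slopes coincide exactly when θ3 = 0.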
Implementation
The GAW20 real data set is drawn from a single-arm clinical trial of fenofibrate treatment in the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study family-based cohort. We selected SNP-CpG site pairs by first running marginal association models with the phenotype:
$$ E\left(Y|A=a,\boldsymbol{C}=\boldsymbol{c}\right)={\gamma}_0+{\gamma}_1a+{\gamma}_2^{\prime}\boldsymbol{c} $$
(3)
$$ E\left(Y|M=m,\boldsymbol{C}=\boldsymbol{c}\right)={\eta}_0+{\eta}_1m+{\eta}_2^{\prime}\boldsymbol{c} $$
(4)
We then selected SNP-CpG site pairs that met all 3 of the following criteria:
1. SNP p value < 0.001
2. Methylation epigenome-wide association study p value < 0.05
3. Distance between SNP and CpG site < 50 kilobase pairs (kb)
These criteria were chosen to balance the considerations of low statistical power resulting from multiple testing corrections against the possibility of failing to detect significant interactions when the marginal effects are negligible.
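The three screening criteria amount to a simple filter over the marginal association results, sketched below. This is an illustrative sketch only: the function name `select_pairs`, the record fields (`id`, `chrom`, `pos`, `p`), and the requirement that a pair lie on the same chromosome with distance measured in base-pair positions are assumptions, not details taken from the analysis pipeline.

```python
def select_pairs(snps, cpgs, snp_p_max=1e-3, cpg_p_max=0.05,
                 max_dist_bp=50_000):
    """Return (snp_id, cpg_id) pairs meeting all three screening criteria.

    snps, cpgs : lists of dicts with hypothetical keys
                 'id', 'chrom', 'pos' (base pairs), and 'p' (marginal p value).
    """
    # Criteria 1 and 2: marginal p-value thresholds.
    snp_hits = [s for s in snps if s["p"] < snp_p_max]
    cpg_hits = [c for c in cpgs if c["p"] < cpg_p_max]

    # Criterion 3: physical proximity (assumed: same chromosome).
    pairs = []
    for s in snp_hits:
        for c in cpg_hits:
            same_chrom = s["chrom"] == c["chrom"]
            if same_chrom and abs(s["pos"] - c["pos"]) < max_dist_bp:
                pairs.append((s["id"], c["id"]))
    return pairs
```

The nested loop is adequate when few sites pass the p-value thresholds; a genome-wide implementation would more likely sort by position or use an interval index.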
The mediation–interaction model described above was then estimated for these SNP-CpG site pairs. The total effect refers to the coefficient γ1 in regression model (3). Models (3) and (4) were estimated genome-wide using EPACTS, and models (1) and (2) were estimated only at selected SNP-CpG pairs using the kinship and coxme packages in R.
Because of missing data in the posttreatment methylation data set, the sample for mediation analysis was a subset of the GWAS screening sample.
Power calculations
We used simulation to investigate the statistical power to detect mediation between genotype and change in methylation. Based on the SNP allele frequency and distribution of change in methylation at the SNP-CpG site pair with strongest evidence of nonzero NIE, we simulated genotypes, change in methylation, and outcome measures, varying the sample size, the effect of the SNP on change in methylation (β1), the effect of methylation on the outcome (θ2), and the interaction effect (θ3), while holding all other model parameters constant at their observed point estimates. The simulated samples comprised unrelated individuals, so the parameters in models (1) and (2) were estimated by multiple linear regression rather than linear mixed models. All power calculations used a significance level of α = 0.05, with 500 replicates.
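A simulation of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used for the reported power calculations: genotypes are drawn as two independent alleles at the given frequency, covariates are omitted, intercepts and the direct effect are set to zero, residuals are normal, and significance is judged by a two-sided Wald test of the NIE using the delta-method SE (the test statistic is an assumption of this sketch). All function and parameter names are hypothetical.

```python
import math
import random

def invert(mat):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    n = len(mat)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]  # zero here means a singular design matrix
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols(X, y):
    """Ordinary least squares: return coefficients and their covariance."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    coef = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(c * x for c, x in zip(coef, r))) ** 2
              for r, yi in zip(X, y))
    s2 = rss / (n - k)
    return coef, [[s2 * v for v in row] for row in inv]

def nie_power(n, maf, beta1, theta2, theta3, sd_m=1.0, sd_y=1.0,
              reps=500, z_crit=1.96, seed=1):
    """Fraction of replicates in which |NIE / SE| exceeds z_crit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Alternate allele count: sum of two independent allele draws.
        a = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        m = [beta1 * ai + rng.gauss(0, sd_m) for ai in a]
        y = [theta2 * mi + theta3 * ai * mi + rng.gauss(0, sd_y)
             for ai, mi in zip(a, m)]
        b, cov_b = ols([[1.0, ai] for ai in a], m)                  # model (1)
        t, cov_t = ols([[1.0, ai, mi, ai * mi]
                        for ai, mi in zip(a, m)], y)                # model (2)
        nie = b[1] * (t[2] + t[3])
        # Delta-method variance with block-diagonal covariance.
        var = ((t[2] + t[3]) ** 2 * cov_b[1][1]
               + b[1] ** 2 * (cov_t[2][2] + 2 * cov_t[2][3] + cov_t[3][3]))
        hits += abs(nie) > z_crit * math.sqrt(var)
    return hits / reps
```

A call such as `nie_power(n=400, maf=0.3, beta1=0.2, theta2=0.3, theta3=0.0)` returns the estimated power for one hypothetical parameter setting; repeating the call over a grid of sample sizes and effect sizes yields a power curve. The sketch does not guard against replicates in which every simulated genotype is identical, which would make the design matrix singular.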