Motivation
The contributions from the Causal Modeling Working Group reflected considerable conceptual diversity. Broadly speaking, the teams aimed to strengthen the inferences that arise from observational studies reporting associations between DNA sequence or methylation variation and lipid phenotypes. In the case of Auerbach and associates, who interrogated only the genetic contribution to the phenotype, the team aimed to develop a novel R² measure that would be robust to outliers. The remaining 5 studies also considered effects of epigenomic variation, which was the primary focus of the 2 MR studies [10, 12] that used genotype as the instrumental variable for methylation, phenotype, or both. In contrast to using genotype as a mere instrument, the studies by Li and colleagues [11] and Justice and associates [13] focused on sequence variation as the exposure, testing whether the total effect of the SNP on the phenotype also includes indirect effects mediated by neighboring CpG methylation or correlated lipid phenotypes. Finally, Howey and associates [9] sought to identify possible causal relationships between and within both omic layers and the phenotypes with the use of Bayesian networks. Overall, the GAW20 experience highlighted the utility of integrating across types of omic data to (a) aid causal inference and (b) paint a more complete and accurate picture of human lipid variation.
Defining causality in GAW20
Historically, causality has been defined under 1 of 2 main frameworks, commonly referred to by their most distinctive features: potential outcomes [14] and directed graphs [15]. Both paradigms were represented among the 6 GAW20 research teams. The potential outcomes framework treats randomized controlled experiments as the gold standard for estimating causal relationships. Randomization avoids complications that occur when the manner in which subjects are assigned a treatment (or subjected to an exposure) accounts for differences in outcomes in addition to the treatment (or exposure) itself. That is not to say randomization is a statistical panacea; Auerbach and associates showed how causal estimates may be sensitive to selection effects even when treatments are randomized. Nevertheless, randomization eliminates many of the sources of confounding that could create spurious relationships and biased effect estimates. The MR approaches implemented by Jiang and colleagues [10] and Sayols-Baixeras and associates [12] represent an extension of this framework to quasi-experimental design via instrumental variable analysis [16].
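The logic of instrumental variable analysis can be illustrated with a small simulation: a naive regression of outcome on exposure is biased by an unmeasured confounder, while a two-stage least-squares estimator that uses genotype as the instrument recovers the causal effect. This is a minimal sketch with hypothetical variable names and effect sizes, not the estimators used by the GAW20 teams.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated toy data: G is a genetic instrument, U an unmeasured
# confounder, X the exposure (e.g., methylation), Y the outcome (e.g., a lipid).
G = rng.binomial(2, 0.3, n).astype(float)   # genotype coded 0/1/2
U = rng.normal(size=n)                      # unmeasured confounder
X = 0.5 * G + U + rng.normal(size=n)        # exposure; true G -> X effect is 0.5
Y = 0.4 * X + U + rng.normal(size=n)        # outcome; true causal effect is 0.4

def ols_slope(x, y):
    """Slope from simple least squares of y on x (with intercept)."""
    xc = x - x.mean()
    return (xc @ (y - y.mean())) / (xc @ xc)

# Naive regression of Y on X is confounded by U and overestimates the effect.
naive = ols_slope(X, Y)

# Two-stage least squares: stage 1 predicts X from G; stage 2 regresses Y on
# the predicted exposure. For a single instrument this equals the Wald ratio.
x_hat = ols_slope(G, X) * (G - G.mean()) + X.mean()
iv = ols_slope(x_hat, Y)

print(f"naive estimate: {naive:.2f}, IV estimate: {iv:.2f}")
```

The instrument-based estimate should land much closer to the true effect of 0.4 than the confounded naive estimate, at the cost of a larger standard error.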
In contrast, the directed graphs framework relies on deterministic laws of science for describing causal relationships. It follows that complete knowledge of the underlying mechanism of a phenomenon reveals any cause and effect relationships. In practice, it is often impossible to account for every possible relationship that might exist between a set of variables. Directed graphs take advantage of probability distributions and sequential events in time or space to simplify characterization of the data generating process. Li and associates [11] used mechanistic modeling to explain how genetic factors influence phenotype, taking advantage of the fact that genetic factors precede the phenotype. Use of structural equations to describe an outcome, represented in GAW20 by Justice and associates [13], has its origins in path diagrams [17]. The power of the directed graph framework lies in its ability to depict complicated relationships. In GAW20, Howey and associates [9] represented this framework with a Bayesian network, which consists of a directed acyclic graph together with a set of parameters specifying the conditional probability distribution of each node given its parents. Although this procedure is computationally expensive, it allows simultaneous analytic consideration of a large number of possible mechanisms.
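The DAG-plus-conditional-distributions structure of a Bayesian network can be sketched on a toy three-node chain (SNP ➔ methylation ➔ phenotype); all probabilities below are illustrative and bear no relation to the networks fit by Howey and associates:

```python
import itertools

# A Bayesian network is a directed acyclic graph plus one conditional
# probability distribution per node given its parents. Toy chain:
# genotype G -> binarized methylation M -> binarized lipid phenotype Y.

# P(G): genotype coded 0/1/2 under Hardy-Weinberg with allele frequency 0.3
q = 0.3
p_snp = {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}

p_meth_given_snp = {0: 0.2, 1: 0.5, 2: 0.8}   # P(M=1 | G=g), hypothetical
p_lipid_given_meth = {0: 0.3, 1: 0.7}         # P(Y=1 | M=m), hypothetical

def joint(g, m, y):
    """Joint probability factorizes along the DAG: P(G) * P(M|G) * P(Y|M)."""
    pm = p_meth_given_snp[g] if m == 1 else 1 - p_meth_given_snp[g]
    py = p_lipid_given_meth[m] if y == 1 else 1 - p_lipid_given_meth[m]
    return p_snp[g] * pm * py

# Sanity check: the factorized joint sums to 1 over all configurations.
total = sum(joint(g, m, y)
            for g, m, y in itertools.product([0, 1, 2], [0, 1], [0, 1]))
print(round(total, 10))
```

The factorization is what makes large networks tractable: the joint distribution is specified by a handful of local conditional tables rather than one table over all variables at once.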
Even though the interpretations of causality varied within the Causal Modeling Working Group, the teams came to the consensus definition of causal inference as the process that evaluates (and potentially rules out) competing explanations for observed associations between exposures (eg, genomic variation) and outcomes (eg, metabolic phenotypes). All analyses took place in the multi-omic setting of GAW20 data, which admitted several causal possibilities, summarized in Fig. 1, including confounding and reverse causation scenarios. Additionally, directed acyclic graphs (Fig. 1) can be expanded to accommodate pleiotropic effects considered by multiple GAW20 analyses [10, 12, 13]. Similarly, these graphs (Fig. 1a-h) can be modified to include repeated measurements of both methylation and phenotypic data, adding fenofibrate treatment and/or baseline lipid concentrations as potential nodes. However, even though several teams used multiple lipid measurements in their analyses, longitudinal dynamics were not a major focus of the Causal Modeling Working Group. For example, no Causal Modeling Working Group team interrogated changes in epigenetic patterns over the treatment period, likely because of the inextricable confounding between fenofibrate and batch effects on methylation measurements, described in detail elsewhere [7]. Beyond GAW20, the question of temporal variation in epigenetic effects remains similarly unexplored, but an increasing number of large-scale cohorts are currently in the process of obtaining serial methylation data, promising future opportunities for adapting current causal inference methods to longitudinal epigenetics.
Theoretical and practical challenges
Data
The first set of challenges for the Causal Modeling Working Group was presented by the structure of the GAW20 data set. The moderate sample size (N = 1105), particularly by the standards of MR analysis, hampered detection of statistically significant effects. An important step that all teams performed, but that can often be overlooked, was ensuring that the data were suitable for analysis through formatting and cleaning, such as checking Hardy-Weinberg equilibrium and/or minor allele frequencies and handling missing data. All teams adjusted for covariates (eg, age and sex) in their analyses to address bias resulting from confounding or the potential mediating effects of such variables. A special case of covariate adjustment necessary in GOLDN/GAW20 data is accounting for family relatedness, which is essential for producing valid estimates of effect in genetic studies. This was accomplished by implementing existing methods accounting for family structure [10,11,12,13], extending such methods, or not accounting for family structure while acknowledging this limitation [9]. Potential technical artifacts in the DNA methylation data were addressed by including principal components in MWAS analyses [9, 10, 13]. Finally, as outlying observations can threaten the accuracy of estimating average effects, Auerbach and associates derived a weighted R² measure that was resistant to such influences and could be used to strengthen inference from traditional statistics.
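The routine quality-control steps mentioned above (minor allele frequency and Hardy-Weinberg equilibrium checks) can be sketched as follows; the genotype coding and the toy marker are illustrative, not taken from the GAW20 data:

```python
from collections import Counter

def maf_and_hwe(genotypes):
    """Minor allele frequency and a chi-square Hardy-Weinberg test statistic
    for genotypes coded 0/1/2 (count of minor alleles); None marks missing."""
    g = [x for x in genotypes if x is not None]   # drop missing genotypes
    counts = Counter(g)
    n = len(g)
    n0, n1, n2 = counts.get(0, 0), counts.get(1, 0), counts.get(2, 0)
    p = (n1 + 2 * n2) / (2 * n)                   # frequency of the counted allele
    maf = min(p, 1 - p)
    # Expected genotype counts under Hardy-Weinberg equilibrium
    expected = [n * (1 - p) ** 2, n * 2 * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip((n0, n1, n2), expected) if e > 0)
    return maf, chi2

# A marker in exact HWE proportions yields a chi-square statistic of 0.
genos = [0] * 49 + [1] * 42 + [2] * 9   # allele frequency 0.3, HWE proportions
maf, chi2 = maf_and_hwe(genos)
print(maf, round(chi2, 6))
```

Markers with low MAF or a large HWE statistic (compared against a chi-square distribution with 1 degree of freedom) would typically be flagged or excluded before downstream causal analyses.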
Analytic assumptions
The validity of causal effects estimated by all statistical methods hinges on satisfying the underlying assumptions, which are not always empirically testable. For example, MR estimators must meet the general assumptions for any instrumental variable, which include a robust association with the risk factor (testable), no common causes between the genotype and the phenotype of interest (not testable, but usually satisfied by random assortment of alleles—with the exception of population stratification), and no pleiotropic effects (ie, the genetic instrumental variable must only be associated with the phenotype of interest through the intermediate phenotype that it is meant to represent; not directly testable). To address the third assumption, Sayols-Baixeras and colleagues [12] used the widely accepted MR-Egger method [18] to rule out pleiotropy. In contrast, Jiang and associates [10] developed a novel method (constrained instrumental variables) that adaptively selects the optimal subset of instrumental variables that maximizes associations with the intermediate phenotype of interest while accounting for potential pleiotropic effects. The constrained instrumental variables findings were comparable, albeit not identical, to MR-Egger and two-stage least-squares MR, identifying 2 additional causal associations as well as the 1 association detected by established methods. Meanwhile, Justice and associates [13] substantiated the concern about pleiotropic effects in the GAW20 data, reporting independent direct effects of rs964184 on both triglycerides and high-density lipoprotein cholesterol, and thus indicating existence of true pleiotropy.
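The idea behind MR-Egger can be sketched on simulated summary statistics: per-SNP outcome associations are regressed on per-SNP exposure associations with an intercept, where a nonzero intercept indicates directional pleiotropy and the slope estimates the causal effect. This unweighted sketch omits the inverse-variance weighting used in practice, and all effect sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 30                                   # number of genetic instruments

# Simulated summary statistics across k SNPs (illustrative values only)
beta_x = rng.uniform(0.1, 0.5, k)        # per-SNP exposure associations
pleiotropy = 0.05                        # constant direct (pleiotropic) effect
true_effect = 0.4                        # true causal effect of exposure
beta_y = true_effect * beta_x + pleiotropy + rng.normal(0, 0.01, k)

# MR-Egger: ordinary least squares of beta_y on beta_x WITH an intercept.
# Design matrix columns: [intercept, slope]
X = np.column_stack([np.ones(k), beta_x])
intercept, slope = np.linalg.lstsq(X, beta_y, rcond=None)[0]
print(f"Egger intercept (pleiotropy): {intercept:.3f}, slope (causal): {slope:.3f}")
```

In this simulation the intercept recovers the injected pleiotropic effect of 0.05 while the slope recovers the causal effect of 0.4; a standard inverse-variance-weighted MR estimate, which forces the intercept through zero, would instead absorb the pleiotropy into a biased causal estimate.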
Similar problems exist in Bayesian network analyses [9], requiring that suitable data are included in the analysis to anchor the correct direction of the causal relationships between variables of interest. Other assumptions behind Bayesian networks include acyclic relationships between variables; multinomial distribution of discrete variables and normal distribution of continuous variables; independence between variables conditional on their parents; and no parents for SNP variables. Of those, the normality assumption is the most problematic: genotypes are coded as 0/1/2 yet modeled continuously to avoid problems posed by low minor allele frequencies.
The SEM analysis by Justice and colleagues [13] assumes that all data are missing at random, which is especially unlikely in longitudinal data. Justice and associates [13] found no association between missingness and any informative variable in the data set (eg, sex, age, metabolic syndrome status). However, the GAW20 data set does not contain all potentially relevant confounders that may be predictive of missingness; consequently, the missing at random assumption may not be valid.
The likelihood inference proposal for indirect estimation (LIPID) developed by Li and colleagues [11] focuses on the CpG sites that are regulated by neighboring DNA sequence variants (methylation quantitative trait loci), and also have a causal effect on the phenotype. However, current estimates indicate methylation quantitative trait loci regulation at < 40% of CpG sites [19], limiting the applicability of LIPID in studies of DNA methylation. Additionally, prior MR studies of lipids [20] demonstrated effects of the phenotype on CpG methylation rather than vice versa (ie, in the direction assumed by LIPID). Although the GAW20 findings are consistent with either direction of effect (methylation ➔ lipids and lipids ➔ methylation) [9, 12], only one direction satisfies the analytic assumption of the LIPID method.
Furthermore, both Li and colleagues [11] and Auerbach and colleagues adjusted for familial relatedness in the GAW20 data using the theoretical kinship matrix, which is assumed to be correctly specified and which treats the founder populations as completely unrelated (an unlikely assumption given human population history [21], particularly in the close-knit communities of Utah and Minnesota that served as the study base for GOLDN/GAW20). These issues could be obviated by estimating kinship based on SNP data rather than self-reported pedigree information [21]. The methods implemented by Li and associates [11] and Auerbach and associates also assume independence of study participants conditional on their genotype. Because environmental variables within a household are likely to be correlated, this assumption likely does not hold and merits further investigation with a fuller data set that includes such factors as diet, lifestyle, and other potential nongenetic effects.
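Estimating kinship from SNP data rather than pedigrees can be sketched with a standard genomic relationship matrix (a VanRaden-style estimator built from standardized genotypes); the exact estimators used in published analyses may differ:

```python
import numpy as np

def genomic_kinship(G):
    """Genomic relationship matrix from a genotype matrix G
    (n individuals x m SNPs, coded 0/1/2): standardize each SNP by its
    sample allele frequency, then average cross-products across markers."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                 # per-SNP sample allele frequencies
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
    Z = (G[:, keep] - 2 * p[keep]) / np.sqrt(2 * p[keep] * (1 - p[keep]))
    return Z @ Z.T / keep.sum()

# Toy data: 10 unrelated individuals genotyped at 5000 independent SNPs.
rng = np.random.default_rng(2)
freqs = rng.uniform(0.1, 0.9, 5000)
G = rng.binomial(2, freqs, size=(10, 5000))
K = genomic_kinship(G)
print(K.shape)
```

Because genotypes are simulated independently, off-diagonal entries hover near zero, whereas true relatives would show elevated values; substituting such a matrix for the pedigree-based kinship matrix captures cryptic relatedness among nominal founders.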
Finally, as all teams used linear regression models, all methods used in the Causal Modeling Working Group are based on the standard assumptions of error independence, homoscedasticity, and multivariate normality, as well as a linear relationship between the genetic/epigenetic variants and phenotypes that is unlikely to completely capture the underlying biologic complexity.
Subjective choices
Related to the issue of methodologic assumptions, many of the analyses used by teams required some form of subjective choice, such as the weighted covariance matrix, the size of methylation probe sets [10] or SNP windows [10, 12], the variables included in Bayesian networks [9], and imputation parameters [10]. Future studies are warranted to examine the sensitivity of the proposed methods to such arbitrary initial conditions.
Computation
All analyses performed by the Causal Modeling Working Group faced a number of computational challenges, including but not limited to bootstrapping [9, 10], imputation [10], optimization algorithms [9], and parallelization of analyses (all teams).