Indirect effect inference and application to GAW20 data

Background Association studies using a single type of omics data have been successful in identifying disease-associated genetic markers, but the underlying mechanisms are unaddressed. To provide a possible explanation of how these genetic factors affect the disease phenotype, integration of multiple omics data is needed. Results We propose a novel method, LIPID (likelihood inference proposal for indirect estimation), that uses both single nucleotide polymorphism (SNP) and DNA methylation data jointly to analyze the association between a trait and SNPs. The total effect of SNPs is decomposed into direct and indirect effects, where the indirect effects are the focus of our investigation. Simulation studies show that LIPID performs better in various scenarios than existing methods. Application to the GAW20 data also leads to encouraging results, as the genes identified appear to be biologically relevant to the phenotype studied. Conclusions The proposed LIPID method is shown to be meritorious in extensive simulations and in real-data analyses.


Background
In complex disease studies, genome-wide association studies (GWAS) [1] and epigenome-wide association studies [2] have been successful in identifying disease-associated single-nucleotide polymorphisms (SNPs) and DNA methylation loci. However, the mechanism of how these genetic loci affect the disease status remains unknown. To provide a possible explanation of the causal mechanisms of these genetic factors, integrative analyses using both types of data are important. Even though integration of multiple types of data sets is a promising method as it is generally more powerful than ordinary association studies [3], the method of integration itself is challenging.
Most existing methods use additional information to filter out nonsignificant loci and reduce the total number of tests, which, in return, improve power [4]. On the other hand, mediation analyses usually consider only a single mediator and require multiple testing correction [5]. An example is Zhao et al. [6] who proposed an integrative test, denoted as o-eSNP that was shown to be more powerful than traditional GWAS. Motivated by the data provided by GAW20, in which SNP and DNA methylation data for integrative analysis are available, we aim to characterize the effects of SNPs into direct and indirect effects. As the direct effect of SNPs is simple and straightforward, in this contribution we focus on the indirect effect of SNPs. With the prior knowledge that DNA methylation can be modulated by SNPs [7], we assume that some SNPs exert their effects by regulating the DNA methylation level. Hence the indirect effect of SNPs on a phenotype of interest here is taken as the combined effects of SNPs on DNA methylation and DNA methylation on the phenotype.
In this paper, we propose a novel method, LIPID (likelihood inference proposal for indirect estimation), to use both SNP and DNA methylation data to test whether there is an indirect effect of SNPs on a phenotype. The indirect effect and its variance-covariance matrix are derived, and a Wald test is conducted. An extensive simulation study was done to evaluate the properties and the performance of LIPID, which was also applied to analyze the GAW20 real data.

Methods
Suppose there are n independent subjects, and for each subject, its SNP, DNA methylation, covariates, and phenotype are measured. Specifically, let Y = (Y 1 ,…,Y n ) T be the vector of observed phenotypes; X = (X 1 ,…,X k ) be the n × k matrix denoting the observed values of k (nongenetic) covariates, including intercept; S = (S 1 ,..., S r ) and M = (M 1 ,..., M p ) be the n × r and n × p matrices regarding the genotypes of r SNPs and methylation levels of p cytosine-phosphate-guanine (CpG) sites, respectively. Assuming that phenotype Y is a continuous variable, we can use a set of linear models to capture the relationship among Y, X, S, and M as follows: is the vectorization operation; ⊗ is the Kronecker product; and ε Y and ε M are independent. Note that here β S and β X are r × p and k × p matrices respectively, and Σ M is a p × p positive definite matrix. Model (1) characterizes the relationship between phenotype and SNPs, DNA methylation, and covariates, while model (2) depicts the relationship between DNA methylation (as a response variable) and SNPs and covariates. It is concluded from models (1) and (2) that the direct effect of SNPs on the phenotype is α S and the indirect effect is γ = β S α M .

Estimation and inference
For linear models (1) and (2), we have the following maximum likelihood estimateŝ The variance-covariance matrices for α and vec ðβÞ are, respectively: Their corresponding block matrices are the variancecovariance matrices forα M and vec ð b β S Þ, respectively. According to the law of total variation, the variancecovariance matrix forγ is where (·) 11 represents the first r × r diagonal submatrix. As α M and β S are unavailable, their estimatesα M andβ S are used. After several lines of algebra we have So an unbiased estimate for the variance-covariance matrix ofγ iŝ whereΣ M andĈovðα M Þ are corresponding estimates of Σ M and Covðα M Þ. This estimate is different from o-eSNP [6] in the last component. The Wald statistic to test if indirect effect exists is which asymptotically follows a chi-squared distribution with r degrees of freedom.

Adaptation to correlated subjects
For subjects with correlation between each other, linear model with mixed effect is utilized. So model (1) is changed to where b is the random effect, with mean 0 and variance-covariance matrix (2) is not changed. The estimate α M and its variancecovariance matrix are derived similarly, and the LIPID statistic has the same form.

Simulation study
To evaluate the performance of LIPID, simulation under various scenarios are conducted. For simplicity, we assume there are no covariates, and that S, M, and Y are all univariate. In addition, the direct effect that we are not interested in does not exist in simulation. The simulated data are generated as follows. First, SNP S is generated with a minor allele frequency (MAF) under Hardy-Weinberg equilibrium. Then, DNA methylation M is generated from a normal distribution with mean Sβ S and variance σ 2 M . Finally, phenotypic value Y is generated from a normal distribution with mean Mα M and variance σ 2 Y . The number of individuals is set to be 100 and the number of replications is 10,000. As Table 1 shows, there are 5 scenarios of parameters designed to gauge the Type I error rates of LIPID, and 5 scenarios for the evaluation of power. The variance of Y and variance of M are fixed to 1 for simplicity. In scenario 1, the coefficients β S and α M are both 0; in scenarios 2 and 3, β S is nonzero while α M is 0; in scenarios 4 and 5, α M is nonzero with β S equal to 0. In these scenarios, the indirect effect is nonexistent, so we change the coefficient of one parameter to measure the Type I error rates under different situations. Under H a , the indirect effect is β S α M ≠ 0. The coefficients are chosen so that different methods have moderate powers. From scenario 1 to scenario 5, the β S and α M are increased. The MAF ranges from 0.1 to 0.4. Table 2 shows the Type I error rates in the 5 scenarios, from which we can see that the Type I error rates are more or less conservative in scenarios 1 to 5 for o-eSNP and LIPID, while regressing on SNPs only (denoted as SNP in Table 2) controls the Type I error rate well. For scenario 1, where both coefficients β S and α M are 0, the Type I error rate is conservative; for scenarios with 1 coefficient that is not 0 but relatively small, the Type I error rates are still conservative; for scenarios with 1 large coefficient, the Type I error rates are better controlled. Compared to o-eSNP [6], the Type I error rates of LIPID are favorable, as o-eSNP is more conservative in all scenarios. Figure 1 shows the powers of 3 methods in 5 scenarios. We can see that LIPID is the most powerful in all scenarios, while SNP-only method has the least power. From scenario 1 to scenario 5, as the indirect effect increases, the performance of LIPID and o-eSNP are very close to each other.

Real data analysis
The GAW20 real data package contains genomic, DNA methylation, and phenotypic data for more than 1000 individuals from 188 pedigrees. The phenotypic data include metabolic indices, lipoproteins, and triglyceride. Dense genome-wide SNP markers make up the genomic data. DNA methylation levels are also available on CpG sites genome-wide before and after individuals are treated with fenofibrate. The level of triglycerides and the methylation level before treatment are made use of. The covariates include gender, age, smoking status, high-density lipoprotein, metabolic disorder, and center. We use SOLAR (Sequential Oligogenic Linkage Analysis Routines) [8] to obtain the heritability for triglyceride level, using 1108 subjects with phenotypic data. The total number of subjects with SNP data and DNA methylation is 716. As LIPID considers a region of multiple genetic markers, SNPs and DNA methylation loci within the range of each gene are analyzed; we analyze a total of 13,968 genes. Genes that pass false discovery rate (FDR) correction are FAT1 (p value 9.4E-7) and DCTN6 (p value 1.3E-6), while o-eSNP fails to find any significant genes (FAT1 p value 2.5E-5; DCTN6 p value 3.7E-6). We further use the BIOS QTL (quantitative trait locus) browser [9] to validate our findings. We found that rs458021 on gene FAT1 is a cis-meQTL (methylation quantitative trait locus) with a p value of 3.8E-07, but we did not find any meQTL on gene DCTN6. FAT1 is associated with cholesterol in DAVID (Database for Annotation, Visualization, and Integrated Discovery), whereas DCTN6 is involved in lipid metabolism [10]. Because the eQTM (expression quantitative trait methylation) database are not widely available, we cannot further validate if these CpG sites on these genes can modulate the expression, and further influence the phenotype.

Discussion
Compared to o-eSNP [6], LIPID controls Type I error rates better in all scenarios, and the power is higher in all scenarios. The estimate itself is the same, no matter if we regress M or Mα M on S, but the variance-covariance estimates are different, and LIPID has a less-biased variance-covariance estimate, which leads to the improvement of performance of LIPID. Furthermore, we also adapt o-eSNP and LIPID to correlated subjects. Application of LIPID to the GAW20 real data indicates that LIPID is capable of detecting genes with indirect effect. The computation is efficient, and the process takes 30 min to analyze the whole GAW20 data set on a personal computer with an Intel Core i3-4150 CPU. The genes identified appear to be functionally relevant to the trait being considered, thereby substantiating the importance of these findings and leading to confidence of genes found being true discoveries.

Conclusions
For complex diseases, we propose a novel method to detect indirect effect of SNPs on a phenotype via methylation, and we demonstrate its superiority compared to 2 existing methods. LIPID is single-step and does not require multiple tests, compared to traditional mediation analysis; at the same time, multiple genetic loci can be used simultaneously to test indirect effect.

Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW) but restrictions apply to the availability of these data, which were used under links for the current study. Qualified researchers may request these data directly from GAW.

About this supplement
This article has been published as part of BMC Genetics Volume 19 Supplement 1, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available