 Research
 Open access
 Published:
Indirect effect inference and application to GAW20 data
BMC Genetics volumeÂ 19, ArticleÂ number:Â 67 (2018)
Abstract
Background
Association studies using a single type of omics data have been successful in identifying diseaseassociated genetic markers, but the underlying mechanisms are unaddressed. To provide a possible explanation of how these genetic factors affect the disease phenotype, integration of multiple omics data is needed.
Results
We propose a novel method, LIPID (likelihood inference proposal for indirect estimation), that uses both single nucleotide polymorphism (SNP) and DNA methylation data jointly to analyze the association between a trait and SNPs. The total effect of SNPs is decomposed into direct and indirect effects, where the indirect effects are the focus of our investigation. Simulation studies show that LIPID performs better in various scenarios than existing methods. Application to the GAW20 data also leads to encouraging results, as the genes identified appear to be biologically relevant to the phenotype studied.
Conclusions
The proposed LIPID method is shown to be meritorious in extensive simulations and in realdata analyses.
Background
In complex disease studies, genomewide association studies (GWAS) [1] and epigenomewide association studies [2] have been successful in identifying diseaseassociated singlenucleotide polymorphisms (SNPs) and DNA methylation loci. However, the mechanism of how these genetic loci affect the disease status remains unknown. To provide a possible explanation of the causal mechanisms of these genetic factors, integrative analyses using both types of data are important. Even though integration of multiple types of data sets is a promising method as it is generally more powerful than ordinary association studies [3], the method of integration itself is challenging.
Most existing methods use additional information to filter out nonsignificant loci and reduce the total number of tests, which, in return, improve power [4]. On the other hand, mediation analyses usually consider only a single mediator and require multiple testing correction [5]. An example is Zhao et al. [6] who proposed an integrative test, denoted as oeSNP that was shown to be more powerful than traditional GWAS. Motivated by the data provided by GAW20, in which SNP and DNA methylation data for integrative analysis are available, we aim to characterize the effects of SNPs into direct and indirect effects. As the direct effect of SNPs is simple and straightforward, in this contribution we focus on the indirect effect of SNPs. With the prior knowledge that DNA methylation can be modulated by SNPs [7], we assume that some SNPs exert their effects by regulating the DNA methylation level. Hence the indirect effect of SNPs on a phenotype of interest here is taken as the combined effects of SNPs on DNA methylation and DNA methylation on the phenotype.
In this paper, we propose a novel method, LIPID (likelihood inference proposal for indirect estimation), to use both SNP and DNA methylation data to test whether there is an indirect effect of SNPs on a phenotype. The indirect effect and its varianceâ€“covariance matrix are derived, and a Wald test is conducted. An extensive simulation study was done to evaluate the properties and the performance of LIPID, which was also applied to analyze the GAW20 real data.
Methods
Suppose there are n independent subjects, and for each subject, its SNP, DNA methylation, covariates, and phenotype are measured. Specifically, let Yâ€‰=â€‰(Y_{1},â€¦,Y_{n})^{T} be the vector of observed phenotypes; Xâ€‰=â€‰(X_{1},â€¦,X_{k}) be the nâ€‰Ã—â€‰k matrix denoting the observed values of k (nongenetic) covariates, including intercept; Sâ€‰=â€‰(S_{1},..., S_{r}) and Mâ€‰=â€‰(M_{1},..., M_{p}) be the nâ€‰Ã—â€‰r and nâ€‰Ã—â€‰p matrices regarding the genotypes of r SNPs and methylation levels of p cytosinephosphateguanine (CpG) sites, respectively. Assuming that phenotype Y is a continuous variable, we can use a set of linear models to capture the relationship among Y, X, S, and M as follows:
where Îµ_{Y} âˆ¼â€‰N (0,\( \kern0.3em {\sigma}_Y^2{I}_n \)), vec(Îµ_{M})â€‰âˆ¼â€‰N (Î£_{M} âŠ—I_{n}), vec(Â·) is the vectorization operation; âŠ— is the Kronecker product; and Îµ_{Y} and Îµ_{M} are independent. Note that here Î²_{S} and Î²_{X} are râ€‰Ã—â€‰p and kâ€‰Ã—â€‰p matrices respectively, and Î£_{M} is a pâ€‰Ã—â€‰p positive definite matrix. Model (1) characterizes the relationship between phenotype and SNPs, DNA methylation, and covariates, while model (2) depicts the relationship between DNA methylation (as a response variable) and SNPs and covariates. It is concluded from models (1) and (2) that the direct effect of SNPs on the phenotype is Î±_{S} and the indirect effect is Î³â€‰=â€‰Î²_{S}Î±_{M}.
Estimation and inference
For linear models (1) and (2), we have the following maximum likelihood estimates
where \( {G}_1=\left(S,M,X\right),{G}_2=\left(S,X\right),\kern0.5em \alpha ={\left({\alpha}_S^T,{\alpha}_M^T,{\alpha}_X^T\right)}^T,\mathrm{and}\ \beta ={\left({\beta}_S^T,{\beta}_X^T\right)}^T. \) The varianceâ€“covariance matrices for \( \widehat{\alpha} \)and vec \( \left(\widehat{\beta}\right) \) are, respectively:
Their corresponding block matrices are the varianceâ€“covariance matrices for \( {\widehat{\alpha}}_M \) and vec \( \left(\widehat{\beta_S}\right) \), respectively. According to the law of total variation, the varianceâ€“covariance matrix for \( \widehat{\upgamma} \) is
where (Â·)_{11} represents the first râ€‰Ã—â€‰r diagonal submatrix. As Î±_{M} and Î²_{S} are unavailable, their estimates \( {\widehat{\alpha}}_M \) and \( {\widehat{\beta}}_S \) are used. After several lines of algebra we have
So an unbiased estimate for the varianceâ€“covariance matrix of \( \kern0.1em \widehat{\upgamma} \)is
where \( {\widehat{\varSigma}}_M\ and\ \widehat{C} ov\left({\widehat{\alpha}}_M\right) \) are corresponding estimates of Î£_{M} and Cov\( \left({\widehat{\alpha}}_M\right) \). This estimate is different from oeSNP [6] in the last component. The Wald statistic to test if indirect effect exists is
which asymptotically follows a chisquared distribution with r degrees of freedom.
Adaptation to correlated subjects
For subjects with correlation between each other, linear model with mixed effect is utilized. So model (1) is changed to
where b is the random effect, with mean 0 and varianceâ€“covariance matrix \( 2{\sigma}_b^2\Phi \), where Î¦ is the kinship coefficient matrix, and bâ€‰+â€‰Îµ_{Y} âˆ¼ \( N\left(0,{\sigma}_Y^2{I}_n+2{\sigma}_b^2\varPhi \right). \) Model (2) is not changed. The estimate Î±_{M} and its varianceâ€“covariance matrix are derived similarly, and the LIPID statistic has the same form.
Simulation study
To evaluate the performance of LIPID, simulation under various scenarios are conducted. For simplicity, we assume there are no covariates, and that S, M, and Y are all univariate. In addition, the direct effect that we are not interested in does not exist in simulation. The simulated data are generated as follows. First, SNP S is generated with a minor allele frequency (MAF) under HardyWeinberg equilibrium. Then, DNA methylation M is generated from a normal distribution with mean SÎ²_{S} and variance \( {\sigma}_M^2 \). Finally, phenotypic value Y is generated from a normal distribution with mean MÎ±_{M} and variance \( {\sigma}_Y^2 \). The number of individuals is set to be 100 and the number of replications is 10,000. As Table 1 shows, there are 5 scenarios of parameters designed to gauge the Type I error rates of LIPID, and 5 scenarios for the evaluation of power. The variance of Y and variance of M are fixed to 1 for simplicity. In scenario 1, the coefficients Î²_{S} and Î±_{M} are both 0; in scenarios 2 and 3, Î²_{S} is nonzero while Î±_{M} is 0; in scenarios 4 and 5, Î±_{M} is nonzero with Î²_{S} equal to 0. In these scenarios, the indirect effect is nonexistent, so we change the coefficient of one parameter to measure the Type I error rates under different situations. Under H_{a}, the indirect effect is Î²_{S} Î±_{M} â‰ 0. The coefficients are chosen so that different methods have moderate powers. From scenario 1 to scenario 5, the Î²_{S} and Î±_{M} are increased. The MAF ranges from 0.1 to 0.4.
Results
Table 2 shows the Type I error rates in the 5 scenarios, from which we can see that the Type I error rates are more or less conservative in scenarios 1 to 5 for oeSNP and LIPID, while regressing on SNPs only (denoted as SNP in Table 2) controls the Type I error rate well. For scenario 1, where both coefficients Î²_{S} and Î±_{M} are 0, the Type I error rate is conservative; for scenarios with 1 coefficient that is not 0 but relatively small, the Type I error rates are still conservative; for scenarios with 1 large coefficient, the Type I error rates are better controlled. Compared to oeSNP [6], the Type I error rates of LIPID are favorable, as oeSNP is more conservative in all scenarios. Figure 1 shows the powers of 3 methods in 5 scenarios. We can see that LIPID is the most powerful in all scenarios, while SNPonly method has the least power. From scenario 1 to scenario 5, as the indirect effect increases, the performance of LIPID and oeSNP are very close to each other.
Real data analysis
The GAW20 real data package contains genomic, DNA methylation, and phenotypic data for more than 1000 individuals from 188 pedigrees. The phenotypic data include metabolic indices, lipoproteins, and triglyceride. Dense genomewide SNP markers make up the genomic data. DNA methylation levels are also available on CpG sites genomewide before and after individuals are treated with fenofibrate. The level of triglycerides and the methylation level before treatment are made use of. The covariates include gender, age, smoking status, highdensity lipoprotein, metabolic disorder, and center. We use SOLAR (Sequential Oligogenic Linkage Analysis Routines) [8] to obtain the heritability for triglyceride level, using 1108 subjects with phenotypic data. The total number of subjects with SNP data and DNA methylation is 716. As LIPID considers a region of multiple genetic markers, SNPs and DNA methylation loci within the range of each gene are analyzed; we analyze a total of 13,968 genes. Genes that pass false discovery rate (FDR) correction are FAT1 (p value 9.4E7) and DCTN6 (p value 1.3E6), while oeSNP fails to find any significant genes (FAT1 p value 2.5E5; DCTN6 p value 3.7E6). We further use the BIOS QTL (quantitative trait locus) browser [9] to validate our findings. We found that rs458021 on gene FAT1 is a cismeQTL (methylation quantitative trait locus) with a p value of 3.8E07, but we did not find any meQTL on gene DCTN6. FAT1 is associated with cholesterol in DAVID (Database for Annotation, Visualization, and Integrated Discovery), whereas DCTN6 is involved in lipid metabolism [10]. Because the eQTM (expression quantitative trait methylation) database are not widely available, we cannot further validate if these CpG sites on these genes can modulate the expression, and further influence the phenotype.
Discussion
Compared to oeSNP [6], LIPID controls Type I error rates better in all scenarios, and the power is higher in all scenarios. The estimate itself is the same, no matter if we regress M or MÎ±_{M} on S, but the varianceâ€“covariance estimates are different, and LIPID has a lessbiased varianceâ€“covariance estimate, which leads to the improvement of performance of LIPID. Furthermore, we also adapt oeSNP and LIPID to correlated subjects.
Application of LIPID to the GAW20 real data indicates that LIPID is capable of detecting genes with indirect effect. The computation is efficient, and the process takes 30 min to analyze the whole GAW20 data set on a personal computer with an Intel Core i3â€“4150 CPU. The genes identified appear to be functionally relevant to the trait being considered, thereby substantiating the importance of these findings and leading to confidence of genes found being true discoveries.
Conclusions
For complex diseases, we propose a novel method to detect indirect effect of SNPs on a phenotype via methylation, and we demonstrate its superiority compared to 2 existing methods. LIPID is singlestep and does not require multiple tests, compared to traditional mediation analysis; at the same time, multiple genetic loci can be used simultaneously to test indirect effect.
Abbreviations
 DAVID:

Database for annotation, visualization, and integrated discovery
 eQTM:

Expression quantitative trait methylation
 FDR:

False discovery rate
 GAW:

Genetic analysis workshop
 LIPID:

Likelihood inference proposal for indirect estimation
 MAF:

Minor allele frequency
 meQTL:

Methylation quantitative trait locus
 QTL:

Quantitative trait locus
 SOLAR:

Sequential oligogenic linkage analysis routines
References
Klein RGAWJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, Henning AK, SanGiovanni JP, Mane SM, Mayne ST, et al. Complement factor H polymorphism in agerelated macular degeneration. Science. 2005;308(5720):385â€“9.
Rakyan VK, Down TA, Balding DJ, Beck S. Epigenomewide association studies for common human diseases. Nat Rev Genet. 2011;12(8):529â€“41.
Chen Y, Zhu J, Lum PY, Yang X, Pinto S, MacNeil DJ, Zhang C, Lamb J, Edwards S, Sieberts SK, et al. Variations in DNA elucidate molecular networks that cause disease. Nature. 2008;452(7186):429â€“35.
Ware JS, Petretto E, Cook SA. Integrative genomics in cardiovascular medicine. Cardiovasc Res. 2013;97(4):623â€“30.
Baron RM, Kenny DA. The moderatorâ€“mediator variable distinction in social psychological research: conceptual strategic and statistical considerations. J Pers Soc Psychol. 1986;51(6):1173â€“82.
Zhao SD, Cai TT, Li H. More powerful genetic association testing via a new statistical framework for integrative genomics. Biometrics. 2014;70(4):881â€“90.
Kerkel K, Spadola A, Yuan E, Kosek J, Jiang L, Hod E, Li K, Murty VV, Schupf N, Vilain E, et al. Genomic surveys by methylationsensitive SNP analysis identify sequencedependent allelespecific DNA methylation. Nat Genet. 2008;40(7):904â€“8.
Almasy L, Blangero J. Multipoint quantitativetrait linkage analysis in general pedigrees. Am J Hum Genet. 1998;62(5):1198â€“211.
Zhernakova DV, Deelen P, Vermaat M, van Iterson M, van Galen M, Arindrarto W, vanâ€™t Hof P, Mei H, van Dijk F, Westra HJ, et al. Identification of contextdependent expression quantitative trait loci in whole blood. Nat Genet. 2017;49(1):139â€“45.
Yang S, Chen C, Wang H, Rao X, Wang F, Duan Q, Chen F, Long G, Gong W, Zou MH, et al. Protective effects of acylCoA thioesterase 1 on diabetic heart via PPARÎ±/PGC1Î± signaling. PLoS One. 2012;7(11):50376.
Funding
Publication of this article was supported by NIH R01 GM031575. This work was supported in part by National Natural Science Foundation of China (11571082, 11171075), Key Research Project of the Ministry of Science and Technology of China (2016YFC0904400), the Scientific Research Foundation of Fudan University.
Availability of data and materials
The data that support the findings of this study are available from the Genetic Analysis Workshop (GAW) but restrictions apply to the availability of these data, which were used under links for the current study. Qualified researchers may request these data directly from GAW.
About this supplement
This article has been published as part of BMC Genetics Volume 19 Supplement 1, 2018: Genetic Analysis Workshop 20: envisioning the future of statistical genetics by exploring methods for epigenetic and pharmacogenomic data. The full contents of the supplement are available online at https://bmcgenet.biomedcentral.com/articles/supplements/volume19supplement1.
Author information
Authors and Affiliations
Contributions
LL, CW, and TL conceived the study and developed the statistical methods; LL and YQH conducted the data analysis; LL and YQH wrote the paper; SL and YQH reviewed the paper. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisherâ€™s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Li, L., Wang, C., Lu, T. et al. Indirect effect inference and application to GAW20 data. BMC Genet 19 (Suppl 1), 67 (2018). https://doi.org/10.1186/s1286301806383
Published:
DOI: https://doi.org/10.1186/s1286301806383