 Methodology article
 Open Access
 Published:
An integrative U method for joint analysis of multilevel omic data
BMC Genetics volume 20, Article number: 40 (2019)
Abstract
Background
The advance of highthroughput technologies has made it costeffective to collect diverse types of omic data in largescale clinical and biological studies. While the collection of the vast amounts of multilevel omic data from these studies provides a great opportunity for genetic research, the high dimensionality of omic data and complex relationships among multilevel omic data bring tremendous analytic challenges.
Results
To address these challenges, we develop an integrative U (IU) method for the design and analysis of multilevel omic data. While nonparametric methods make less model assumptions and are flexible for analyzing different types of phenotypes and omic data, they have been less developed for association analysis of omic data. The IU method is a nonparametric method that can accommodate various types of omic and phenotype data, and consider interactive relationship among different levels of omic data. Through simulations and a real data application, we compare the IU test with commonly used variance component tests.
Conclusions
Results show that the proposed test attains more robust type I error performance and higher empirical power than variance component tests under various types of phenotypes and different underlying interaction effects.
Background
With rapidly evolving highthroughput technologies and everdecreasing costs, it has become feasible to systematically study diverse types of omic data in biological and clinical studies [1, 2]. The collection of multilevel omic data from these studies provides us a great opportunity to integrate information from different levels of omic data into association analysis [3–6]. Although omicbased association analysis holds great promise for discovering novel diseaseassociated biomarkers, the discovery process is hampered by the lack of appropriate statistical tools to consolidate and analyze multilevel omic data. The development of advanced statistical methods to address the analytical challenges faced by ongoing omic data analysis can enhance our ability to identify new diseaseassociated biomarkers.
Comprehensive reviews of integrative analysis on multilevel omic data are summarized in [3, 5, 7] and the references therein. Most of the existing methods for integrative analysis are developed based on scoretype tests or variance component tests. For instance, in the integrative analysis of singlenucleotide variants (SNVs) and transcript expression data, [6] used the estimating equations to estimate parameters of interest, and then proposed a Wald test to evaluate the association between the outcome and a set of genetic variants, considering possible interactions. In order to efficiently test the joint effects of SNVs and gene expression with a binary phenotype, [8] developed a combined variance component test in the mixed model framework. Based on this work, [9] further investigated a variance component score test for modeling multiple genomic data including SNVs, gene expression, and methylation data, each of which can come from different samples or studies. While those methods have attractive properties under various scenarios, most of these methods are parametricbased or semiparametricbased, which often rely on a distribution assumption (e.g., a normal distribution assumption). When this assumption is violated, these methods are subject to false positive results and/or power loss [10]. The diagnostic assessments of human diseases can often be of different types (e.g., binary, ordinal and continuous) and follow known or unknown distributions. This issue is, however, paid less attention by the existing methods.
Moreover, the molecular complexity of human diseases manifests itself at the genomic, transcriptomic, epigenomic and proteomic levels [11, 12]. Different levels of omic data can interact in the disease process. By considering interactions between different levels of omic data, the power of detecting diseaseassociated biomarkers can be potentially enhanced. While some of existing methods consider interactions between omic data [6, 8], they commonly assume a particular interaction model (e.g., a multiplicative model), and are subject to suboptimal performance if the underlying model has different forms (e.g., a threshold model).
To address these limitations, we propose a nonparametric framework for association analysis using multilevel omic data. The IU test is a Ustatisticbased test, which is constructed using the pairwise omic and phenotype similarities of subjects. It has several remarkable features worthy of attention: 1) it makes no distribution assumptions, and therefore provides a robust and powerful performance when analyzing phenotypes and omic data with unknown distributions; 2) it provides a unified framework for analyzing various types of phenotypes and omic data (binary, ordinal and continuous); and 3) it considers interactions among different levels of omic data without posing specific model assumptions.
The remaining of the paper is organized as follows. We begin with a detailed description of the proposed integrative U method in “Methods” section, and then present the simulation results of the IU method under different types of phenotypes and various genetic or interaction effects in “Simulation” section. Using the proposed method, we performed an integrative analysis of the DNA sequencing and gene expression data from a hypertension study in “An integrative analysis of gene and gene expression data of hypertension” section. “Conclusion” section summarizes the advantages and limitations of the IU test. Details of the proof of the main results can be found in the Additional file 1.
Methods
Suppose that we are interested in evaluating the joint association of M levels of omic data with a disease phenotype of interest. Without loss of generality, we illustrate the method with two levels of omic data (i.e., SNVs and gene expression data). The extension to more than 2 levels of omic data will be discussed later in “Conclusion” section. Let Y_{i} be a continuous or discrete disease phenotype, S_{i} be a scalar gene expression variable, and G_{i}=(G_{i}(t_{1}),G_{i}(t_{2}),...,G_{i}(t_{p})) be the genotypes of p SNVs (e.g., coding variants in a gene) for the ith individual (i=1,......,n), where t_{j} is the SNV location and G_{i}(t_{j})=0,1,2 is coded as the number of minor alleles.
Genetic smoothing
In recent literature, functional data analysis has been often applied to handle the genetic data. For instance, [13] proposed a functional linear model for quantitative traits using Bspline basis functions to expand the genotype functions. Vsevolozhskaya et al. [14] proposed a functional analysis of variance method to test the association of sequence variants in a genomic region with a qualitative trait. Functional data analysis has also been developed for different types of traits and study purposes in genetic research. For instance, [15] developed a Cox proportional hazard model with functional regression for genebased association analysis of survival traits. Moreover, [16] proposed a generalized functional linear model to perform metaanalysis of multiple studies to evaluate the association of genetic variants with dichotomous traits.
Here we adopt the functional data analysis to handle the SNVs. Rather than assuming G_{i}(t_{j}) as a random variable, we assume that G_{i}(t_{j}) is a discrete realization of a function G_{i}(t) generated from a stochastic process with mean function η(t) and covariance function Γ(s,t). The Bspine smoothing technique is then used to model the underlying function curve G_{i}(t). In other words, G_{i}(t) can be written as a linear sum of specified basis functions:
where {β_{k}(t),k=1,...,K} is the polynomial basis functions in L_{2} Hilbert space. The fitted smoothing curves are demonstrated in Fig. 1. Similar to other functionalbased methods [14], we implement the smoothing by scaling the locations to the interval [0,1], and use the penalization technique to determine the appropriate number of knots (i.e., smoothness).
Test statistic
With the assumptions of Y, G(t) and S mentioned above, we aim to test the hypotheses:
H_{0}: Y is independent of G(t) and S;
H_{a}: Y is associated with G(t) or S.
Since we do not assume any regression form of the association between Y and the genetic variables (G and S), to perform the hypothesis testing, we propose a nonparametric integrative U statistic defined as
where K_{1}(·,·) and K_{2}(·,·) are symmetric kernel matrices that measure the similarities of two individuals’ phenotypes and gene expression values, respectively. For simplicity, we use the cross product for both kernel matrices
In addition to the cross product kernel, other kernels, such as those proposed in [10] and [17] can also be used.
From the above equation, the proposed test statistic is a U statistic defined on all possible pairs of subjects (i,j), where the genetic similarity of subjects i and j is defined as the inner product of the smooth curves of the stochastic process, i.e., \(\int _{0}^{1} G_{i} (t) G_{j} (t) dt\). The phenotype similarity and gene expression similarity between the subjects i and j are simply products of two subjects’ phenotype and gene expression values, respectively.
Asymptotic property
Under the null hypothesis and the assumption G_{i}(t)∼SP(η(t),Γ(s,t)), we obtain
where Z_{i}=(Y_{i},S_{i},G_{i}), μ_{Y} and μ_{S} are the population means of Y and S, respectively. The asymptotic result of nondegenerated U statistics in [18] implies that
where \(\sigma _{1}^{2} = Var(r_{1})\). Moreover, by applying the result of stochastic processes in Section 4.2 of [19], we can further obtain that
where m is the number of eigenvalues of Γ, (λ_{k},ϕ_{k}(t)) are eigenvalues and eigenfunctions of the covariance function Γ, \(\delta _{k} = \int \phi _{k}(t) \eta (t) dt\), \(\sigma _{Y}^{2}\) and \(\sigma _{S}^{2}\) are the population variances of Y and S.
Because μ_{0} is unobservable, we propose to estimate it by substituting the population means of Y, S and G(t) by their corresponding sample means, i.e.,
Let \(s_{Y}^{2}\) and \(s_{S}^{2}\) be the sample variances of Y and S, \(\hat \lambda _{k}\) and \(\hat \phi _{k}\) be the eigenvalues and eigenfunctions of \(\hat \Gamma (s,t) = \frac {1}{n}\sum _{i=1}^{n} \left (G_{i}(t)\bar {G}(t)\right)\left (G_{i}(s)  \bar {G}(s)\right)\). By letting \(\hat \delta _{k} = \int _{0}^{1} \hat \phi _{k}(t)\bar G(t)dt\), we can obtain the asymptotic distribution of the test statistic under H_{0}:
Theorem 1.
Assume that both Y and S have finite means and variances and G(t)∼SP(η(t),Γ(s,t)). Moreover, μ_{Y}μ_{S}≠0and S is independent of G(t). Then, under H_{0},
where\(\hat \mu _{0}\)is defined in (1) and\(\hat \sigma ^{2} = 4\sum \hat \lambda _{k} \hat \delta _{k}^{2} \left \{\bar Y^{2} \bar S^{2} \left (\bar Y^{2} s_{S}^{2} + \bar S^{2} s_{Y}^{2} + s_{Y}^{2} s_{S}^{2}\right)\right \} + 4\bar Y^{2} \bar S^{2} \bar G^{4} s_{Y}^{2} s_{S}^{2}.\)
We would reject H_{0} if \({\sqrt {n}(U_{n}  \hat \mu _{0})}/{\hat \sigma }>z_{\alpha /2}\), where z_{α/2} is the upper α/2 quantile of the standard normal distribution.
Remark 1.
The assumption of the underlying stochastic process is very general. We do not need a specific condition on the pointwise distributions such as Gaussian, which is the required assumption in [14].
Remark 2.
The proposed test inherits the robustness property from U statistics, and is capable of handling both discrete and continuous phenotypes with various underlying distributions. Moreover, the proposed test does not need to specify any form of the regression function μ=E(YS,G), hence the test procedure is free of model assumptions.
The method can also be used for different study purposes. For instance, to only test the effect of SNVs (e.g., in a genetic association study), the corresponding integrative U test statistic can be simplified as \(U_{G} = \frac {1}{n(n1)}\sum \limits _{i\neq j} Y_{i} Y_{j} \int _{0}^{1} G_{i} (t) G_{j} (t) dt\) with \(\hat \mu _{G} = \bar Y^{2} \bar {G}(t)^{2}\) and variance estimator \(\hat \sigma _{G}^{2} = 4\hat \mu _{Y}^{2} s^{2}_{Y} \sum \hat \lambda _{k} \hat \delta _{k}^{2} \).
Power and sample size
While omicbased studies become increasingly popular in human genetic research, few statistical tools are available for power and sample size calculation. In this section, we investigate the power of the proposed method under certain alternative hypotheses and provide a convenient way for power/sample size calculation.
We have studied the IU test under the null hypothesis, \(E\left \{Y_{i} Y_{j} S_{i} S_{j} \int _{0}^{1} G_{i}(t) G_{j}(t)dt\right \} = \mu _{0} = \mu _{Y}^{2} \mu _{S}^{2} \eta ^{2}\). Under the alternative hypothesis, we assume that \(E\left \{Y_{i} Y_{j} S_{i} S_{j} \int _{0}^{1} G_{i}(t) G_{j}(t)dt\right \}\) =μ_{1}≠μ_{0} and Var(Y_{i}Y_{j}S_{i}S_{j}\( \left. \int _{0}^{1} G_{i}(t) G_{j}(t)dtY_{i}, S_{i}, G_{i}\right) = \tau _{1}^{2}\). Without loss of generality, we assume that μ_{1}>μ_{0}. Applying the Hoeffding projection in [18], we obtain that \(\frac {\sqrt {n}(U_{n}  \mu _{1})}{2\tau _{1}}\to N(0,1)\). Hence, under the alternative hypothesis, U_{n} can be written as
Since both \(\hat \mu _{0}\) and \(\hat \sigma \) are consistent estimators of μ_{0} and σ, for a sufficiently large n, we have
Therefore, the power of the proposed test can be calculated by
For desired power β, one can derive the minimal required sample size by setting the inequality \(\Phi \left (\frac {\sigma }{2\tau _{1}}\left (\frac {\sqrt {n}(\mu _{1}  \mu _{0})}{\sigma } z_{\alpha /2}\right)\right)\ge \beta \), therefore the required sample size can be calculated by
Results
Simulation
Through simulations, we compared the type I error and empirical power of the proposed test with those of two variance component methods: the adjusted kernel sequencing association test and variance component test proposed by [8]. Since the original kernel sequencing test developed by [17] is proposed for only sequencing SNVs, we slightly modify the method to incorporate gene expression data in the test. Recall that the sequence kernel association test (SKAT) proposed by [17] has the form
where Y=(Y_{1},...,Y_{n})^{T} and \(\mathbf K(G_{i},G_{j}) = \sum _{k=1}^{p} w_{k} G_{ik} G_{jk}\). To make the methods comparable, the adjusted SKAT (AdjSKAT) is modified as:
where the elements of \(\tilde {\mathbf {K}}_{S}\) are defined as \( \tilde {\mathbf K}_{S}(G_{i},G_{j}) = \sum _{k=1}^{p} w_{k} S_{i} S_{j} G_{ik} G_{jk}\).
Interestingly, the proposed U_{n} has a similar form of Q. If we define
and further define metric K_{S}=(K_{S}(G_{i},G_{j}))_{n×n} with zeros as diagonal elements, then the proposed U statistic U_{n} can be written as a similar form to AdjSKAT:
In addition to AdjSKAT, we also compared IU test with the component variance test (VCT) developed under the generalized linear mixed model framework by [8]. VCT is proposed to test the joint effects of SNVs and gene expression, and is defined as
where (a_{1},a_{2},a_{3}) are weight parameters and C is the product of SNVs and gene expression. In the simulation, we used the recommended weight, the inverse of the square root of the variance, as suggested by [8].
The genetic data was simulated from the 1000 Genome Project [20]. Specifically, we used a 1Mb region of the genome (Chromosome 17: 73443288344327) from 1092 individuals in 1000 Genome Project. In each simulation replicate, SNVs were generated by randomly choosing a segment with p=100 consecutive SNVs from the genome. Then the stochastic smoothing function curves were constructed by applying the functional data analysis to the SNV sequences. Gene expression data was generated from a normal distribution, N(1,1.2^{2}). The natural cubic spline smoothing with penalty parameter introduced in [14] was applied. All the results of type I error and empirical power were calculated based on 1000 simulated replicates.
Type I error performance
The phenotype assessments can often be of different types (e.g., binary and continuous phenotypes), with unknown underlying distributions. In this simulation, we evaluated the robustness of three methods against different phenotype distributions. Totally, four types of distributions: Bernoulli (binary), Gaussian, T and Doubleexponential (DE), were considered in this simulation. Both T and DE have heavier tails than Gaussian. The original VCT test is developed for binary phenotype, but can be extended for other types of phenotypes by using different link functions. For instance, by using the identity function, VCT can be used to analyze continuous phenotypes. Under H_{0}, Y is independently generated from Bernoulli(1/(1+e^{0.2})), N(1,1), T_{2}, T_{4} and DE(1), respectively. Table 1 summarizes the type I error performance of three methods. From Table 1, we find that type I error rates of both VCT and AdjSKAT are well controlled for Bernoulli and Gaussiantype of phenotypes, but are inflated under the heavy tailed distributions(T and DE). As we expect from the U statistic property, the IU test attains robust performance under all phenotype distributions.
We further investigated the empirical levels of the proposed test for different sample sizes. For this simulation, we considered both continuous and binary phenotypes and varied the study sample size from 100 to 500. A normal distribution was used to simulate the continuous phenotype, while balanced samples were generated for the binary phenotype. 1000 independent simulation replicates were used to obtain the empirical sizes. The empirical levels of the test for nominal sizes 0.05 and 0.01 are summarized in Table 2. As observed in Table 2, the type I error rates of the proposed test are well controlled for both types of phenotypes and different sample sizes.
Power performance
For the power comparison, we considered the scenarios with or without an interaction between SNVs and gene expression. For the scenarios with an interaction, we studied the performance of the three methods under various interaction models. Similar to the type I error simulation, the genetic data was obtained from the 1000 Genome Project and gene expression S_{i} was sampled from N(1,1.2^{2}). The binary response Y_{i} was then generated from a logistic regression model. In each simulation, we randomly chose 100 cases and 100 controls to form a balanced casecontrol sample. For continuous phenotypes, we simulated both Gaussiandistributed and Tdistributed phenotypes.
Case 1: No interaction effect
We first evaluated the power of three methods under the scenario when there is no interaction between SNVs and gene expression. In the binary case, the underlying model is similar to the one assumed in [8]
where β_{S} is a scalar parameter, β_{G} and γ are p×1 vector parameters. In the continuous case, a linear mixed model was used,
where β_{G} and β_{S} were defined as in (2), ε_{i}∼N(0,1) and e_{i}∼T(2), a Tdistribution with 2 degrees of freedom.
Similar to [8], we assume that β_{G} and γ are randomly generated from probability distributions with mean 0 and variances \(\sigma _{G}^{2}\) and \(\sigma _{\gamma }^{2}\). In this simulation, the genetic effects measured by β_{G} were generated from a normal distribution, \(N(0,\sigma _{G}^{2})\), while the interaction effects measured by γ were all set to be zero (\(\sigma _{\gamma }^{2} = 0\)) in order to study the marginal effects of genetic variables.
The power performance for the binary, Gaussiandistributed, and Tdistributed phenotypes under the Model 1 is summarized in Fig. 2. The figure shows the power performance of three methods when the effect of gene expression, β_{s}, increases and the effects of SNVs remain the same (σ_{G}=0.1). As shown in the figure, both IU and VCT have higher power than AdjSKAT as the effect of gene expression increases for binary phenotype. As expected, VCT achieves highest power for Gaussiandistributed phenotype among the three methods. For the Tdistributed phenotype, IU outperforms VCT and AdjSKAT while AdjSKAT has little power for Tdistributed phenotype. Overall, IU test is more robust than the other two methods to the phenotype distributions.
Case 2: Interaction effect
We then compared the performance of three methods under a more complex scenario when there is an interaction between SNVs and gene expression. In this simulation, we considered three types of interaction effects: multiplicative, threshold, and random interaction effects. Similar to the simulation with no interaction, we evaluated the methods under three different kinds of phenotypes.
A multiplicative interaction effect is simulated for binary, Gaussiandistributed and Tdistributed phenotypes based on the model below,
where c is a scale parameter, ε_{i}∼N(0,1), and e_{i}∼T(2). With W being a p×1 vector, each element of W equals to 1 if the corresponding genetic variant is a causal SNV, and 0 if the genetic variant is noncausal. Under the null hypothesis of no association, all elements in W are 0. Let m be the number of causal SNVs, then a larger m indicates increasing interaction effects on the phenotype. Similarly, a large c corresponds to a strong interaction effect. The upper panel of Fig. 3 shows the power comparison of three methods as m increases from 0 to 8 with scale parameter c=1, and the lower panel of Fig. 3 shows power performance as the scale parameter c increases from 0 to 1.5 with m=4. As we observe from the figure, IU test attains higher power than VCT and AdjSKAT for binary and Tdistributed phenotypes with the increasing interaction effect. For Gaussiandistributed phenotype, AdjSKAT has highest power among the three methods because the cross product kernel used in AdjSKAT perfectly captures the true underlying interaction model.
Next we considered a threshold interaction model, which has the following form:
where d is a threshold parameter, ε_{i} and e_{i} are the same as those in Model 2. A large d corresponds to a weak interaction effect. Under this model, the effect substantially increases when the product of SNV and gene expression exceeds the threshold d. Figure 4 shows that all the three tests have decreased power as the threshold parameter d increases from 0 to 8. Moreover, the IU test achieves much higher power than the other two methods for the binary and Tdistributed phenotypes while it has a similar performance as the other two methods for the Gaussiandistributed phenotype.
Finally, we considered a random effect interaction model based on the model 1 described in (2) with elements of γ generated from a uniform distribution U(−a,a),a>0. With no marginal effects (i.e., σ_{G}=0, β_{S}=0), Fig. 5 shows that IU test has a similar power performance as VCT for both binary and Gaussiandistributed phenotype. We also find that for the heavytailed phenotype (i.e., the Tdistributed phenotype), IU attains more robust type I error and higher power than VCT and AdjSKAT. With both marginal (σ_{G}=0.1, β_{S}=0,0.5,1,1.5,2) and interaction effects (a=0.1), the power performance shown in Fig. 6 is similar to that of Case 1 shown in Fig. 2.
In summary, the proposed IU test obtains higher power as the marginal or interaction effects increase. Unlike VCT or AdjSKAT, which show higher power only under some specific models (e.g., the random effect or crossproduct interaction models), the IU test showed more robust and stable performance for different phenotypes and various underlying models. These features make IU more appropriate to use when we have limited knowledge on the actual underlying model.
An integrative analysis of gene and gene expression data of hypertension
Hypertension is one of the most common chronic diseases, which affects a large proportion of human population worldwide. Despite decades of research in hypertension, the genetic etiology of hypertension remains largely unknown. The successful identification of genetic variants predisposing to hypertension holds promise for providing better understanding of genetic etiology of hypertension and promoting new therapeutic targets. In this application, we performed an integrative analysis of DNA sequencing and gene expression data from the San Antonio Family Heart Study (SAFHS) and the San Antonio Family Diabetes/Gallbladder Study (SAFDGS). SAFHS and SAFDGS include standardized diagnostic assessments of hypertension (i.e., Case vs. Control). Wholegenome sequencing (WGS) data were available on the odd numbered autosomes. In addition, gene expression was measured using peripheral blood mononuclear cells collected at the first examination. In total, there are 260 subjects with WGS data, gene expression data, and the binary hypertension (HTN) phenotype measured.
Prior to the integrative analysis, we performed a quality control and data preparation process. In this process, we assembled multiple SNVs into genes based on the Genome Reference Consortium release version 38 (GRCh38) and excluded genes without gene expression data. To deal with missing values in the genetic data, we imputed the genotype values from multinomial distribution using the sample proportions as the generating probabilities. After data processing step, 2389 genes and the corresponding gene expression remained for the integrative analysis. We then applied a generalized mixed model to the binary HTN phenotype with covariates AGE, MEDS, SMOKE, SEX and the kinship matrix to remove potential confounding effects and the familial correlations. The residuals were used as the responses in this integrative analysis. Eventually, the proposed IU test is applied to detect the joint effect of genes and gene expression data.
From the QQ plot of HTN in Fig. 7, we find no evidence of systematical inflation of the association result. While there is no significant findings after adjusting for multiple testing, there are a few genes reached marginal significance (e.g., UBAC1). Among the top findings, some genes may have biological plausibility related to hypertension. For instance, MFGE8 has been previously reported to upregulate the intake of Dietary Fats [21], and the Dietary Fats regulates blood pressure via Central Leptin mediated pathways [22]. Therefore, MFGE8 could be a potential risk factor for hypertension [23]. The expression of IFI44L is demonstrably increased upon contact of Nickel [24] or Nickel Chloride [25], which is associated with elevated prevalence of hypertension [26]. Although previous studies suggested that some genes in Table 3 may play a role in hypertension, further studies and biological experiments are needed to confirm the association and to further investigate the potential role of these genes in hypertension.
Conclusion
To facilitate the integrative analysis of omic data, we proposed a unified nonparametric method to detect the joint association of multilevel omic data with various types of phenotypes. There are three main contributions of the proposed IU method. First, it provides robust performance for various types of phenotypes, including binary, Gaussian and heavytailed distributions, due to the robustness of U statistics. Second, the proposed integrative U test achieves higher or comparable power compared to existing methods (e.g., VCT) under different types of interaction models. Finally, we also provide a simple sample size/power calculation to facilitate the design of multilevel omic studies.
The connection between the proposed method and variance component tests is that all test statistics are in the form of kernel quadratic framework as seen in “Simulation” section. It also connects to several other Ustatisticbased methods [10, 27]. As a similaritybased test, the IU method is proposed as a nondegenerated U statistic, which follows a normal distribution. One advantage of using a nondegenerated U statistic is the computational accuracy with no distribution approximation. If we centralize the phenotype, it becomes a degenerated U test, which follows a mixture chisquare distribution.
The IU test can be extended to handle more than 2 levels of omic data. For instance, a modified IU test can be applied for 3 levels of omic data. Besides the SNVs and gene expression, we further introduce R_{i} as the DNA methylation for subject i, and use kernel matrix K_{3}(·,·) to measure the DNA methylation similarities. The IU test statistic can then be defined as
The choice of K_{3} is similar to K_{1} and K_{2} as discussed in “Methods” section. Following the same argument for Theorem 1, we can show that this modified IU test also follows an asymptotically normal distribution. In addition, with multiple genes (e.g., genes in a biological pathway) and the corresponding gene expression levels, the gene expression level S can also be modeled as a function. For such purpose, the similarity measure K_{2}(S_{i},S_{j}) can be modified as \(\tilde K_{2}(S_{i}(t),S_{j}(t))\) where \(\tilde K_{2}(\cdot,\cdot)\) measures the similarity between two functions. The asymptotic property can be derived based on the same argument for Theorem 1.
One potential limitation of this study is that gene expression is assumed to be independent of SNVs. One technical reason of making such assumption is that, under the stochastic process setup, it is hard to model the association between the gene expression variable S and the underlying stochastic process SP(η(t),Γ(s,t)). Finding an appropriate way to model correlations among omic data is a challenging topic that is worth of further investigation. Nevertheless, if real data indicates correlations among different levels of omic data, one way to overcome this issue is to adopt methods introduced by [27] and [28].
Abbreviations
 AdjSKAT:

Adjusted sequence kernel association test
 DE:

Double exponential
 HTN:

Hypertension
 IU:

Integrative U test
 SAFDGS:

San Antonio Family Diabetes/Gallbladder Study
 SAFHS:

San Antonio Family Heart Study
 SKAT:

Sequence kernel association test
 VCT:

Variance component test
 WGS:

Wholegenome sequencing
References
Collins FS, Varmus H. A new initiative on precision medicine. New Eng J Med. 2015; 372(9):793–5.
Lappalainen T, Sammeth M, Friedlander MR, ‘t Hoen PA, Monlong J, Rivas MA, GonzalezPorta M, et al.Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013; 501(7468):506–11.
Kristensen VN, Lingjærde OC, Russnes HG, Vollan HK, Frigessi A, BørresenDale A. Principles and methods of integrative genomic analyses in cancer. Nat Rev Cancer. 2014; 14(5):299–313.
Lin W, Feng R, Li H. Regularization Methods for HighDimensional Instrumental Variables Regression With an Application to Genetical Genomics. J Am Stat Assoc. 2015; 110(509):270–88.
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotypephenotype interactions. Nature Reviews Genetics. 2015; 16(2):85–97.
Zhao SD, Cai TT, Li H. More powerful genetic association testing via a new statistical framework for integrative genomics. Biometrics. 2014; 70(4):881–90.
Ainsworth HF, Shin S, Cordell HJ. A comparison of methods for inferring causal relationships between genotype and phenotype using additional biological measurements. Genet Epidemiol. 2017; 41(7):577–86.
Huang YT, Vanderweele TJ, Lin X. Joint analysis of SNP and gene expression data in genetic association studies of complex diseases. Ann Appl Stat. 2014; 8:352–76.
Huang YT. Integrative modeling of multiple genomic data from different types of genetic association studies. Biostatistics. 2014; 15(4):587–602.
Wei C, Li M, He Z, Vsevolozhskaya O, Schaid DJ, Lu Q. A weighted Ustatistic for genetic association analyses of sequencing data. Genet Epidemiol; 38(8):699–708.
The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012; 490(7418):61–70.
Hawkins RD, Hon GC, Ren B. Nextgeneration genomics: an integrative approach. Nat Rev Genet. 2010; 11(7):476–86.
Luo L, Zhu Y, Xiong M. Quantitative trait locus analysis for nextgeneration sequencing with the functional linear models. J Med Genet. 2012; 49(8):513–24.
Vsevolozhskaya OA, Zaykin DV, Greenwood MC, Wei C, Lu Q. Functional analysis of variance for association studies. PLOS ONE. 2014; 9(9):e105074.
Fan R, Wang Y, Boehnke M, Chen W, Li Y, Ren H, Lobach I, Xiong M. Gene level metaanalysis of quantitative traits by functional linear models. Genetics. 2015; 200(4):1089–104.
Fan R, Wang Y, Chiu CY, Chen W, Ren H, Li Y, Boehnke M, Amos CI, Moore JH, Xiong M. Metaanalysis of complex diseases at gene level by generalized functional linear models. Genetics. 2015; 202(2):457–70.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011; 89:82–93.
Serfling RJ. Approximation theorems of mathematical statistics. Wiley Series in Probability and Statistics. Hoboken: Wiley; 1981.
Zhang JT. Analysis of Variance for Functional Data. London: Chapman & Hall; 2013.
1000 Genomes Project Consortium, Abecasis GR, et al.A map of human genome variation from populationscale sequencing. Nature. 2010; 467(7319):1061–73.
KerleyHamilton JS, Trask HW, Ridley CJ, Dufour E, Ringelberg CS, Nurinova N, Wong D, Moodie KL, Shipman SL, Moore JH, Korc M, Shworak NW, Tomlinson CR. Obesity is mediated by differential aryl hydrocarbon receptor signaling in mice fed a Western diet. Environ Health Perspect. 2012; 120(9):1252–9.
Han C, Wu W, Ale A, Kim MS, Cai D, 2016. Central Leptin and Tumor Necrosis Factor α (TNF α) in Diurnal Control of Blood Pressure and Hypertension. Int J Biol Chem; 291(29):15131–42.
BrahmaNaidu P, Nemani H, Meriga B, Mehar SK, Potana S, Ramgopalrao S. Mitigating efficacy of piperine in the physiological derangements of high fat diet induced obesity in Sprague Dawley rats. Chem Biol Interact. 2014; 221:42–51.
Correa RJ, Malajian D, Shemer A, Rozenblit M, Dhingra N, Czarnowicki T, Khattri S, Ungar B, Finney R, Xu H, Zheng X, Estrada YD, Peng X, SuarezFarinas M, Krueger JG, GuttmanYassky E. Patients with atopic dermatitis have attenuated and distinct contact hypersensitivity responses to common allergens in skin. J Allergy Clin Immunol. 2015; 135(3):712–20.
TchouWong KM, Kiok K, Tang Z, Kluz T, Arita A, Smith PR, Brown S, Costa M. Effects of nickel treatment on H3K4 trimethylation and gene expression. PLOS ONE. 2011; 6(3):e17728.
Yang AM, Bai YN, Pu HQ, Zheng TZ, Cheng N, Li JS, Li HY, Zhang YW, Ding J, Su H, Ren XW, Hu XB. Prevalence of metabolic syndrome in Chinese nickelexposed workers. Biomed Environ Sci. 2014; 27(6):475–7.
Wei C, Elston RC, Lu Q. A weighted U statistic for association analyses considering genetic heterogeneity. Stat Med. 2016; 35(16):2802–14.
Jiang Y, Li N, Zhang H. Identifying Genetic Variants for Addiction via Propensity Score Adjusted Generalized Kendall’s Tau. J Am Stat Assoc. 2014; 109(507):905–30.
Acknowledgements
Not applicable.
Funding
This project was supported by the National Institute on Drug Abuse (Award No. R01DA043501) and the National Library of Medicine (Award No. R01LM012848).
Availability of data and materials
Data used in this article comes from the Genetic Analysis Workshop.
Author information
Authors and Affiliations
Contributions
PG and QL participate in the design of the study. PG implemented the methods and drafted the manuscript. XT was involved in the data analysis. QL participated in the conception of the study and in editing the manuscript. All authors read and approved the final manuscripts.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
Author Qing Lu is currently acting as an Editorial Board Member for BMC Genetics. All other authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1
The proof of Theorem 2.1 can be found in the Appendix. (PDF 94 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Geng, P., Tong, X. & Lu, Q. An integrative U method for joint analysis of multilevel omic data. BMC Genet 20, 40 (2019). https://doi.org/10.1186/s128630190742z
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128630190742z
Keywords
 Nonparametric method
 Functional data analysis
 Integrative analysis