Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies

Zhang, Xiaoshuai; Xue, Fuzhong; Liu, Hong; Zhu, Dianwen; Peng, Bin; Wiemels, Joseph L; Yang, Xiaowei

doi:10.1186/s12863-014-0130-7

Research article
Open access
Published: 10 December 2014

Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies

Xiaoshuai Zhang²,
Fuzhong Xue²,
Hong Liu³,
Dianwen Zhu¹,
Bin Peng⁴,
Joseph L Wiemels⁵ &
…
Xiaowei Yang¹

BMC Genetics volume 15, Article number: 130 (2014) Cite this article

3244 Accesses
5 Citations
2 Altmetric
Metrics details

Abstract

Background

Genome-wide Association Studies (GWAS) are typically designed to identify phenotype-associated single nucleotide polymorphisms (SNPs) individually using univariate analysis methods. Though providing valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally account for only a small proportion of the total heritability for complex diseases. To solve this “missing heritability” problem, we implemented a strategy called integrative Bayesian Variable Selection (iBVS), which is based on a hierarchical model that incorporates an informative prior by considering the gene interrelationship as a network. It was applied here to both simulated and real data sets.

Results

Simulation studies indicated that the iBVS method was advantageous in its performance with highest AUC in both variable selection and outcome prediction, when compared to Stepwise and LASSO based strategies. In an analysis of a leprosy case–control study, iBVS selected 94 SNPs as predictors, while LASSO selected 100 SNPs. The Stepwise regression yielded a more parsimonious model with only 3 SNPs. The prediction results demonstrated that the iBVS method had comparable performance with that of LASSO, but better than Stepwise strategies.

Conclusions

The proposed iBVS strategy is a novel and valid method for Genome-wide Association Studies, with the additional advantage in that it produces more interpretable posterior probabilities for each variable unlike LASSO and other penalized regression methods.

Background

Over the last decade, Genome-wide Association Studies (GWAS) have identified genetic loci associated for a variety of diseases [1]-[5]. Most studies aim to identify single nucleotide polymorphisms (SNPs) individually using univariate analysis methods [6]. Although current GWAS analyses have provided valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally only account for a small proportion of the total heritability of complex diseases, illustrating a problem commonly referred to as “missing heritability” [7],[8]. Potential explanations for this problem include the underestimation of the effects of alleles identified, the existence of gene-gene joint effects, the contribution of rare variation, the possibility that inherited epigenetic factors lead to correlated phenotypes between relatives, and the possible overestimation of heritability of the complex traits [7],[9],[10]. Many diseases or phenotypes are likely caused by or associated with multiple SNPs, each having small effects individually, but collectively contributing a more significant genetic effect [11]. Therefore, multi-locus SNP models would offer one appealing solution in capturing the underlying genotypic-phenotypic relationship [12]-[14].

A typical GWAS study measures thousands or millions of SNPs, but the number of subjects is usually much smaller. This is known as the P > > N problem [15],[16]. One solution to this problem resorts to dimension reduction by identification of the optimal subset of predictors associated with the outcome variable. Determining the best model or selecting a subset of variables becomes an important statistical task for this method. Bayesian variable selection (BVS) provides a natural approach to solve this problem [17],[18]. Unlike penalized regression approaches, BVS naturally produces easily interpretable measures of confidence for variable selection, i.e., posterior selection probabilities. This is an appealing advantage in GWAS because the primary goal of the analysis is to identify the joint effect of SNPs, and to utilize these identifications to learn about underlying biology. BVS has been successfully applied to GWAS data; see the discussion by Guan [19]. The flexibility of the BVS approach allows for straightforward extensions to analyze both quantitative and qualitative data [20]-[23]. Alternative techniques such as the single-SNP test, Stepwise regression, and LASSO (Least Absolute Shrinkage and Selection Operator) were developed to address this statistical challenge. LASSO is a regression method that involves penalizing the absolute size of the regression coefficients, and Stepwise is a classic scheme for sequentially adding to or removing variables from the model. Many studies show that BVS has better performance than these alternative methods in other contexts [19],[23]-[25].

Diseases with complex inheritance may be influenced by multiple genes that interplay within genetic networks or pathways [26],[27]. Gene products interact with one another and work collaboratively within interconnected pathways explaining or associating with certain diseases. This idea led to the concept of network-based molecular biomarkers. Stingo et al. [28] proposed a Bayesian modeling strategy that addressed this concept by incorporating biological information, which was based on the structure or topology of regulatory gene-gene networks in the analysis of DNA microarray data. The method was further generalized into a 2-step framework, STS (screening, then selection) by our research team [29] where standard methods were applied in the screening step to identify a set of candidate genes which were further explored in the selection step using the BVS strategy. In addition to these two coherent steps, our strategy involves the mapping of genotype data to gene-gene networks constructed from various sources such as protein-protein interaction networks [30],[31]. We call this strategy of Bayesian biomarker discovery “integrated BVS” or iBVS. A partial least squares (PLS) g-prior for regression coefficients is also incorporated to solve the problem of non-positive deterministic covariance matrix when the sample size is smaller than the number of genes selected.

In this paper, we develop the strategy of iBVS for analyzing high dimensional GWAS data sets. The strategy is built upon a three-level hierarchical model as seen in Figure 1, where at the top level the PLS method is used to summarize the joint effect of selected SNPs within each gene. In the middle level, Markov Random Field (MRF) is employed to model the selection of genes in prediction of association with disease status. A focus of this article is on discovering SNPs within specific genes incorporating gene network information in GWAS under case–control design. Identification and prediction performance of this iBVS approach are then compared with those of LASSO and Stepwise selection strategies through simulation studies. We then apply iBVS to a GWAS data set for the prediction of leprosy, a skin disease, among Han Chinese.

Methods

We denote Y = (Y ₁, ⋯ Y _n) ' as independently observed binary outcomes in a GWAS data set, where n is the number of subjects and $Y_{i} = \{\begin{array}{c} 1 \\ 0 \end{array} \begin{array}{c} if i t h subject has target disease \\ otherwise \end{array}$ . Each outcome is associated with a set of predictor variables, which correspond to the genotype data. We denote x _ijk as the genotype of the k ^th SNP of the j ^th gene on the i ^th subject, for i = 1, ⋯, n, j = 1, ⋯, J, k = 1, ⋯, P _j, where P _j is the number of SNPs mapped to the j ^th gene, and $P = \sum_{j = 1}^{J} P_{j}$ denotes the total number of SNPs in the GWAS data set. Let A and a be the major and minor alleles at a SNP. The genotype may be coded according to different types of genetic effects: additive with 0, 1, 2 coding for the genotypes AA, Aa/aA, aa; dominant (with respect to the minor allele) with 0, 1, 1 coding for AA, Aa/aA, aa; Recessive (with respect to the minor allele) with 0, 0, 1 coding for AA, Aa/aA, aa.

iBVS with hierarchical modeling for GWAS data

Figure 1 shows the proposed three-level hierarchical model structure. iBVS for binary phenotypes is accomplished by introducing the latent variable vector Z to the linear regression model. Each component Z _i ~ N(0, 1) is defined such that

Y_{i} = \{\begin{array}{c} 0, \\ 1, \end{array} \begin{array}{c} if Z_{i} \leq 0 \\ if Z_{i} > 0 \end{array} .

In order to select genetic variants in both gene and SNP level simultaneously, we introduce two binary vectors, ξ = (ξ ₁, ⋯, ξ _J) and γ = (γ ₁, ⋯, γ _P), to indicate the selection of genes and SNPs respectively into a model for predicting Z _i, i. e,

ξ_{j} = \{\begin{array}{c} 1 \\ 0 \end{array} \begin{array}{c} if j t h gene is selected \\ otherwise \end{array} (j = 1, \dots, J)

and

γ_{p} = \{\begin{array}{c} 1 \\ 0 \end{array} \begin{array}{c} if p t h S N P is selected \\ otherwise \end{array} (p = 1, \dots, P) .

For GWAS data with iBVS analysis, we propose the following hierarchical model,

Z_{i} = {(T_{(ξ, γ)} β_{(ξ, γ)})}_{i} + ε_{i}, ε_{i} ~ N (0, 1),

where $β_{(ξ, γ)} = (β_{J_{1}, γ}, \dots, β_{J_{|ξ|}, γ}), T_{(ξ, γ)} = (T_{(J_{1}, γ)}, \dots, T_{(J_{|ξ|}, γ)}), |ξ|$ denotes the number of selected genes in predicting $Z_{i}, T_{(J_{l}, γ)}$ denotes the vector of the first PLS component of $X_{J_{l}, γ},$ and J _l(l = 1, ⋯, |ξ|) indexes the selected gene. Note that $X_{J_{l}, γ}$ is a sub-matrix of X, consisting of only the columns that correspond to selected SNPs in the selected genes.

Prior specification

The indicator γ for SNP selection is assumed to follow an independent Bernoulli prior distributions with the same parameter π over all the γ _i values.

p (γ) = \prod_{p = 1}^{P} π^{γ_{p}} {(1 - π)}^{1 - γ_{p}} 0 \leq π \leq 1

Choice of π reflects a user’s prior belief in terms of the numbers of causal SNPs out of P candidates.

Zellner’s g-prior is commonly used for the regression coefficient parameters β [32]. Yang and Song [33] generalized the g-prior by modifying the matrix inverse to the Moore-Penrose generalized matrix inverse. Since the multicollinearity problem is commonly encountered in GWAS data because of strong linkage disequilibrium between SNPs, we took a similar prior as that of Yang and Song,

β_{(ξ, γ)} | ξ, γ ~ N (0, c {(T_{(ξ, γ)}^{'} T_{(ξ, γ)})}^{+}),

where ${(T_{(ξ, γ)}^{'} T_{(ξ, γ)})}^{+}$ denotes the Moore-Penrose generalized inverse of $T_{(ξ, γ)}^{'} T_{(ξ, γ)}$ . This idea was first adopted by our research team for microarray gene expression data [29].

To take into account the pathway membership information for each gene as well as the biological relationships between genes within pathways, we follow Li and Zhang [34] and Stingo et al. [28] to use an MRF to describe the prior distribution on each component of the gene selection indicator, i.e.,

p (ξ_{j} | ξ_{i}, i \in N b_{j}) \propto \exp (ξ_{j} (μ + η \sum_{i \in N b_{j}} ξ_{i}))

where μ and η are tuning parameters, and Nb _j is the set of neighbors of gene j within the selected pathway. The corresponding multivariate form is given by:

P (ξ | μ, η) \propto \exp (μ 1_{J}^{'} ξ + η ξ^{'} R ξ),

where $1_{J}^{'}$ is the vector consisting of J 1’s. We denote matrix R to reflect gene-gene network topological structure, where elements R _ij = 1 if there is a direct edge between the i ^th and j ^th genes, and R _ij = 0 otherwise.

Posterior distributions

The joint posterior distribution of θ = (γ, ξ, β _(ξ,γ), Z) given (Y, X) is

P (γ, ξ, β_{(ξ, γ)}, Z | Y, X) \propto (\prod_{i = 1}^{N} I (A_{i})) \times exp [- \frac{{(Z - T_{(ξ, γ)} β_{(ξ, γ)})}^{'} (Z - T_{(ξ, γ)} β_{(ξ, γ)})}{2}] \times exp [- \frac{β_{(ξ, γ)}^{'} T_{(ξ, γ)}^{'} T_{(ξ, γ)} β_{(ξ, γ)}}{2 c} \prod_{i = 1}^{m_{ξ}} λ_{i}^{- \frac{1}{2}}] \times \prod_{p = 1}^{P} π_{p}^{γ_{p}} {(1 - π_{p})}^{1 - γ_{p}} \times exp (μ 1_{J}^{'} ξ + η ξ^{'} R ξ)

where I(A _i) is the indicator function and A _i is either equal to {Z _i : Z _i > 0} or {Z _i : Z _i ≤ 0} corresponding to Y _i = 1 or Y _i = 0, respectively, and $λ_{1}, \dots λ_{m_{ξ}} (m_{ξ} \leq J)$ are the nonzero eigenvalues of ${(T_{(ξ, γ)}^{'} T_{(ξ, γ)})}^{+}$ .

Since β is a nuisance parameter, we can integrate it out to obtain the joint posterior distribution of (Z, ξ, γ|Y, X) as follows:

P (Z, ξ, γ | Y, X) \propto [\prod_{i = 1}^{n} I (A_{i})] \times \frac{1}{{|Σ_{(ξ, γ)}|}^{1 / 2}} exp (- \frac{Z^{'} Σ_{(ξ, γ)}^{- 1} Z}{2}) \times \prod_{p = 1}^{P} {π_{p}}^{γ_{p}} {(1 - π_{p})}^{1 - γ_{p}} \times exp (μ 1_{J}^{'} ξ + η ξ^{'} R ξ)

with $Σ_{(ξ, γ)}^{- 1} = c T_{(ξ, γ)} {(T_{(ξ, γ)}^{'} T_{(ξ, γ)})}^{+} T_{(ξ, γ)}^{'} + I_{n}$ . From this posterior joint distribution, we can derive the posterior conditional distributions

P (Z | ξ, γ, Y, T) \propto N (0, Σ_{(ξ, γ)}) \prod_{i = 1}^{n} I (A_{i}),

which is a multivariate truncated normal distribution, and

P (ξ, γ | T, Z) \propto N (0, Σ_{(ξ, γ)}) \times \prod_{p = 1}^{P} {π_{p}}^{γ_{p}} {(1 - π_{p})}^{1 - γ_{p}} \times exp (μ 1_{J}^{'} ξ + η ξ^{'} R ξ) .

Posterior inference via MCMC sampling

Markov chain Monte Carlo (MCMC) sampling is used to generate samples for the posterior distribution of the model parameters. The MCMC sampling procedure consists of the following two steps:

I.
Sample ξ and γ from P(ξ, γ|Y, T, Z): the selection parameters (ξ, γ) are updated using a Metropolis-Hastings algorithm which is modified from Stingo’s method [28] (details of the MCMC moves to update (ξ, γ) are given in Additional file 1). The method consists of randomly picking one of the following moves: (1) change the inclusion status of SNP and gene by randomly choosing from adding or removing a gene and a SNP at the same time; (2) change the inclusion status of SNP only by randomly choosing from adding or removing a SNP.
II.
Sample Z from P(Z|Y, T, γ, ξ): directly sampling from this distribution is known to be difficult. In this article, we follow the method given in Devroye [35] to simulate samples from the univariate truncated normal distribution P(Z _i|Z _− i, Y, T, γ, ξ), where Z _− i is the vector of Z without the i ^th element.

Simulation

Simulation studies were conducted to assess the performances of iBVS, LASSO regression, and Stepwise regression using a proprietary set of MatLab codes, an R package glmnet, and the R package lars. We simulated three scenarios of varying different proportion of variance of Z explained by the genetic factors, labeled as follows: (1) H70: genes with network, 70% explained variance; (2) H50: genes with network, 50% explained variance; and (3) H30: genes with network, 30% explained variance.

For each scenario, 50 sets of genotypes without disease status were simulated using software Hapgen2 [36] based on the genotype data from Hapmap project (http://hapmap.ncbi.nlm.nih.gov/). We subsequently generated phenotypes corresponding to each scenario, with 400 individuals and 300 SNPs assorted into 22 genes for each data set. The first 200 individuals were used to fit the iBVS hierarchical model and to evaluate the performance of identifying causal SNPs of the three methods, while the other 200 individuals were used to assess the prediction performance of each method. All the simulations were run under the additive and dominant genetic model respectively to check the model flexibility of the proposed iBVS.

We simulated sets of phenotypes in the following way: First, we specified 10 SNPs x _j(j = 1 … 10) as causal SNPs, which were positioned within 5 genes. In order to add network information to gene levels, a network was simulated between the genes. We then conducted a precision matrix Ω which contains the network relationship between the genes. If there is an edge between p ^th gene and q ^th gene in the network, ω _pq and ω _qp would be assigned with a non-zero value, otherwise 0. The vector t _i was generated from a multivariate normal distribution with zero mean vector and covariance matrix Σ = Ω ^− 1. Then the causal gene score T _k (k = 1…5) was calculated by considering both genotype data and gene network information, $T_{i k} = \sum_{S N P j \in gene k} b_{j} x_{k j} + t_{i k},$ where b _j’s were carefully chosen to indicate different PLS configurations. We subsequently simulated the latent phenotypes score for each individual using z _i = ∑ϕ _jT _ik + ε _i, ε _i ~ N(0, 1). Finally, binary phenotypes Y _i for each individual was generated using $Y_{i} = \{\begin{array}{c} 1 \\ 0 \end{array} \begin{array}{c} if z_{i} \geq 0 \\ if z_{i} < 0 \end{array}$ .

Application

We applied iBVS, LASSO and Stepwise approaches to analyze a GWAS data set designed to identify genetic variants associated with leprosy [37]. The genotype data consisted of 492,109 SNPs from 706 cases and 514 controls after removing genetically unmatched controls, to obviate the need for correction for population stratification. All subjects were Han Chinese from eastern China. In order to select variables and assess the performance of the three variable selection strategies, we randomly divided the data set into two parts: a training set and a testing set, each with 610 samples. The training set was used for SNP selection, while the testing set for validating the selection results and comparing the three methods. The genotype was first coded under the additive genetic model, and we re-coded the genotypes following dominant genetic components and re-analyzed this real data set to check the model flexibility under different genetic effects.

Results

Simulation studies

Performance of variable selection

The average area under the curve (AUC) was calculated to evaluate the performance of casual SNP identification in each scenario. For SSVS, the AUC is calculated using the following formula [38],[39]. $A U C = \frac{1}{n^{D} n^{C}} \sum_{i \in D, j \in C} I \{γ_{i} > γ_{j}\},$ where D is the set of the causal SNPs and C is the set of the non-causal SNPs; n ^D and n ^C are the number of causal and non-causal SNPs, respectively.

For the LASSO method, a simple modification of the Least Angle Regression (LAR) algorithm was implemented that calculates the entire LASSO path, which is an efficient way of computing the solution to any LASSO problem especially when P ≫ N [40]. Using the modified LARS algorithm, one may generate all LASSO solutions corresponding to different values of the penalty parameter. Selecting the active model at a given iteration would give one LASSO solution corresponding to a particular value of the penalty parameter. Hence, one can control the penalty parameter using a cutoff for the number of iterations [38]. For LASSO and Stepwise approaches, the following formula was used to calculate the AUC: $A U C = \frac{1}{n^{D} n^{C}} \sum_{i \in D, j \in C} I \{s_{i} < s_{j}\},$ where s is the iteration at which the i ^th marker enters the model [38].

Table 1 shows the averaged AUC of the three methods in variable selection under the additive genetic model. It can be seen that the AUC of the three methods all increase monotonically by the proportion of variance of Z explained by the genetic factors. Obviously, under scenario H70 and H50, the iBVS has the highest averaged AUC (0.911 and 0.894) followed by Lasso (0.891 and 0.882), while the AUC of Stepwise is relative low (0.869 and 0.853). The AUC of iBVS drops to 0.792 with low explained variance (H30), with LASSO and Stepwise both approximating 0.77. The results revealed that iBVS has superior performance compared with that of LASSO and Stepwise regression. Similar trends can also be found under the dominant genetic model (Additional file 1: Table S2).

Table 1 Average AUC values of iBVS, LASSO and Stepwise

Full size table

Performance of outcome prediction

We subsequently assessed the prediction performance of the three methods using the remaining 200 individuals. Prediction performances were evaluated by considering correctly/incorrectly predicted positive/negative outcomes. We calculated specificity and sensitivity for thresholds from 0.01 to 0.99, with steps of 0.01. Then the ROC was plotted using mean specificity and sensitivity for a given threshold. For iBVS, we use a two-stage strategy. First the posterior probabilities of all the SNPs were estimated by iBVS. The top i SNPs (i = 1…300) were subsequently fitted into a PLS logistic regression model, and 10-fold cross-validation was conducted to choose the optimal number of predictors i. For LASSO and Stepwise approaches, the optimal model was determined by a 10-fold cross-validation.

Figure 2 demonstrates the ROC of the three methods in scenarios H70 and H50. One can see that the prediction performance of iBVS was slightly greater than that of LASSO, and had an obvious superiority to Stepwise regression.

Application

We first conducted screening of all SNPs, one by one by fitting the single-SNP logistic regression model with additive coding. By sorting on the p-values from the univariate analysis, we identified 100 genes, each containing at least one SNP with P-value smaller than 0.0004. We subsequently extracted all of the SNPs in each significant gene to comprise the joint effect of SNPs per one gene. The 100 genes selected above contained a total of 3,388 SNPs.

iBVS was applied using the above 3,388 SNPs. First, we constructed the R matrix using the KEGG database (details please see the Additional file 1). We subsequently specified prior distributions as described in the Methods Section, with hyper parameters set as: π = 0.01, μ = − 2, η = 0.8. Finally, the MCMC was conducted to make posterior inferences with 10,000 iterations as burn-in and 50,000 additional runs. Figure 3a shows the posterior SNP selection probabilities via our iBVS strategy with integrated biological priors.

A ten-fold cross-validation approach was employed to set a cut-off for determining the optimal prediction model. The top i SNPs were added into a PLS logistic model one by one, to estimate the cross-validation error, shown in Figure 3b. It can be seen that the smallest classification error appeared when the top 94 SNPs were selected as predictors. The classification error stabilizes after 94 SNPs were selected into the PLS logistic model, with some slight increase shown. Therefore the top 94 SNPs were selected as ‘significant’ predictors whose information was listed in (see the Additional file 1: Table S1).

In addition, we ran LASSO and Stepwise regression on this leprosy GWAS dataset, with the optimal model determined by 10-fold cross-validation. The LASSO selected 100 SNPs, which only included 24 SNPs selected by iBVS. The Stepwise regression approach yielded a more parsimonious model with only 3 SNPs. Table 2 shows the detailed information of 24 common SNPs selected by both iBVS and Lasso. Specifically, the three SNPs selected by Stepwise also had high corresponding posterior probability in iBVS and relative large coefficient in LASSO, and SNP rs9270984 is most significant in all three methods. Finally, we assessed the ability of the three methods to correctly predict binary responses (case versus control) of the test data set. Figure 4a shows the ROC curves of the three selected models under the additive genetic model, while Figure 4b demonstrates that under the dominant model. This indicates that iBVS has comparable performance to the LASSO model, but has an performance advantage over the Stepwise regression method no matter what the genetic model is.

Table 2 Information of 24 common SNPs selected by both iBVS and Lasso

Full size table

Discussion

GWAS analyses typically approach data as a list of single SNPs, a strategy which has yielded a catalog of susceptibility loci for complex diseases. However, the statistically significant variants detected so far account for only a small proportion of the total phenotypic variation. Gene-based tests for association are increasingly being seen as useful complements to GWASs, demonstrating several attractive features compared with traditional SNP-based analysis [12]-[14]. Beyond gene-based methods, there is an increasing recognition of the potential contributions of pathway-based analyses, in which variants in groups of genes within specific pathways are considered together to predict the phenotype [41]-[43]. In this paper, we followed an integrative biomarker identification scheme to conduct a novel hierarchical model incorporating a gene-gene network or pathway information for SNP identification in GWAS via the Bayesian inference paradigm.

Three scenarios of data sets were simulated, each considering different proportions of variance of outcome explained by genetic factors. Simulation results show our iBVS method outperformed the LASSO and Stepwise methods in identifying causal SNPs in each scenario (Table 1). In addition, we also evaluate the prediction ability of the three methods using additional data sets, and show that iBVS had advantageous performance (Figure 2). The advantages of the proposed iBVS strategy is verified when the real network is known and explicitly employed through a prior specification with the MRF distribution.

After applying iBVS to an actual GWAS dataset, a panel of 94 SNPs were selected as predictors of leprosy. The LASSO method selected 100 SNPs, which included only 24 SNPs in common with iBVS. The Stepwise regression yielded a very parsimonious model with only 3 SNPs. The results indicate that the iBVS method has comparable prediction performance with LASSO, and advantageous with Stepwise. Moreover, all the results are quite stable under the different genetic models. Stepwise regression searches the model space by adding or removing one SNP at a time and therefore the searching is partial, leading to convergence at a local optimum. The reason iBVS does not outperform LASSO may be due to deficiencies in pathway information from existing databases that do not reflect complete signaling pathways. The performance of iBVS can be improved by developing a stochastic inference of the gene-gene networks from the data and merging it into the current iBVS MCMC algorithm, which remains a future goal. Compared with LASSO and other penalized regression methods, which lack appropriate interpretation, iBVS has an additional advantage in that it produces posterior probabilities for each variable. This is a particularly important advantage in GWAS because the primary goal of the analysis is to identify the effect of SNPs. Comparing to metrics like p-values, posterior probabilities have clear interpretation. A reviewer of this article suggested that many important traits are generally quantitative and are often controlled by multiple genes in shared biology pathways; our model could be naturally extended to analyze quantitative data by removing the latent variable Z from the hierarchical model.

Efficiency of stochastic algorithms often diminishes as the total number of variables increases [19],[21]. It would be appealing to remove noisy data points or those with lower quality before using a stochastic search. Therefore, we first screened the total number of SNPs using a conventional SNP-based model to filter the number of SNPs included in the Bayesian hierarchical model. This set of candidate SNPs and their associated genes is called the ‘signature set’ in the sense that they are possibly signaling SNPs/genes (in other words, causal or marker SNPs/genes). We extracted all the SNPs in each significant gene to comprise the joint effect of a gene, leaving the weaker candidate genes out of the iBVS. The screening step should be viewed as a general step not only for dimension reduction but for constructing a functional context before conducting BVS.

A few issues regarding our model choices and computation should be highlighted. We mainly adopted the perspective of subjective Bayesian analysis due to the fact that we want to incorporate informative priors from relevant scientific sources. Choosing an objective prior that satisfies some fundamental principles as summarized in Bayarri et al. [44] would be theoretically appealing in future work. Another issue concerns computational burden. With a large number of parameters in the model, the inference is mainly based on Monte Carlo simulation, which may take a prolonged time. Running over a single computer with 3.3GH CPU computer and 8GB memory, 6 days were required to finish the leprosy data analysis. With the advent of high-speed cluster computers and the existence of cloud computing technologies, it is becoming increasingly feasible to apply full iBVS methods for biomarker identification.

Conclusions

We proposed an iBVS method to analyze high dimensional GWAS data sets based on a hierarchical model that incorporates an informative prior on networked gene interrelationships. Simulation studies showed that our iBVS method had better performance in both biomarker identification and disease prediction than LASSO and Stepwise models. A leprosy GWAS analysis showed iBVS method demonstrated a comparable performance with LASSO, and better than Stepwise methods. iBVS did not outperform LASSO, which may be due to deficiencies in existing signaling pathway databases that are likely to be improved as the knowledge base increases. In summary, the proposed iBVS strategy is a valid method for GWAS, having an additional advantage in the production of posterior probabilities for each variable that are again subject to continued refinement.

Additional file

Abbreviations

iBVS:: Integrative Bayesian variable selection
LASSO:: Least absolute shrinkage and selection operator
STS:: Screening, then selection
AUC:: The average area under the curve

References

Kettunen J, Tukiainen T, Sarin AP, Ortega-Alonso A, Tikkanen E, Lyytikäinen LP, Kangas AJ, Soininen P, Würtz P, Silander K, Dick DM, Rose RJ, Savolainen MJ, Viikari J, Kähönen M, Lehtimäki T, Pietiläinen KH, Inouye M, McCarthy MI, Jula A, Eriksson J, Raitakari OT, Salomaa V, Kaprio J, Järvelin MR, Peltonen L, Perola M, Freimer NB, Ala-Korpela M, Palotie A, et al: Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nat Genet. 2012, 44 (3): 269-276. 10.1038/ng.1073.
Article PubMed CAS PubMed Central Google Scholar
Chasman DI, Schürks M, Anttila V, de Vries B, Schminke U, Launer LJ, Terwindt GM, van den Maagdenberg AM, Fendrich K, Völzke H, Ernst F, Griffiths LR, Buring JE, Kallela M, Freilinger T, Kubisch C, Ridker PM, Palotie A, Ferrari MD, Hoffmann W, Zee RY, Kurth T: Genome-wide association study reveals three susceptibility loci for common migraine in the general population. Nat Genet. 2011, 43 (7): 695-698. 10.1038/ng.856.
Article PubMed CAS PubMed Central Google Scholar
Goode EL, Chenevix-Trench G, Song H, Ramus SJ, Notaridou M, Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Kjaer SK, Birrer MJ, Berchuck A, Schildkraut J, Tomlinson I, Kiemeney LA, Cook LS, Gronwald J, Garcia-Closas M, Gore ME, Campbell I, Whittemore AS, Sutphen R, Phelan C, Anton-Culver H, Pearce CL, Lambrechts D, Rossing MA, Chang-Claude J, Moysich KB, Goodman MT, et al: A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet. 2010, 42 (10): 874-879. 10.1038/ng.668.
Article PubMed CAS PubMed Central Google Scholar
Smyth DJ, Cooper JD, Bailey R, Field S, Burren O, Smink LJ, Guja C, Ionescu-Tirgoviste C, Widmer B, Dunger DB, Savage DA, Walker NM, Clayton DG, Todd JA: A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region. Nat Genet. 2006, 38 (6): 617-619. 10.1038/ng1800.
Article PubMed CAS Google Scholar
Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, Allen JE, Downes K, Barrett JC, Healy BC, Mychaleckyj JC, Warram JH, Todd JA: Meta-analysis of genome-wide association study data identifies additional type 1 diabetes risk loci. Nat Genet. 2008, 40 (12): 1399-1401. 10.1038/ng.249.
Article PubMed CAS PubMed Central Google Scholar
McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9 (5): 356-369. 10.1038/nrg2344.
Article PubMed CAS Google Scholar
Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747-753. 10.1038/nature08494.
Article PubMed CAS PubMed Central Google Scholar
Visscher PM: Sizing up human height variation. Nat Genet. 2008, 40 (5): 489-490. 10.1038/ng0508-489.
Article PubMed CAS Google Scholar
Gibson G: Hints of hidden heritability in GWAS. Nat Genet. 2010, 42 (7): 558-560. 10.1038/ng0710-558.
Article PubMed CAS Google Scholar
Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH: Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010, 11 (6): 446-450. 10.1038/nrg2809.
Article PubMed CAS PubMed Central Google Scholar
Stranger BE, Stahl EA, Raj T: Progress and promise of genome-wide association studies for human complex trait genetics. Genetics. 2011, 187 (2): 367-383. 10.1534/genetics.110.120907.
Article PubMed CAS PubMed Central Google Scholar
Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Investigators AMFS, Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S: A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010, 87 (1): 139-145. 10.1016/j.ajhg.2010.06.009.
Article PubMed CAS PubMed Central Google Scholar
Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, Martin NG, Montgomery GW, Weedon MN, Loos RJ, Frayling TM, McCarthy MI, Hirschhorn JN, Goddard ME, Visscher PM, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium: Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012, 44 (4): 369-375. 10.1038/ng.2213. S1-S3
Article PubMed CAS PubMed Central Google Scholar
Li M, Gui H, Kwan JS, Sham PC: GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011, 88 (3): 283-293. 10.1016/j.ajhg.2011.01.019.
Article PubMed CAS PubMed Central Google Scholar
Prentice RL, Lihong QI: Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics. 2006, 7 (3): 339-354. 10.1093/biostatistics/kxj020.
Article PubMed Google Scholar
Sölkner J: Very many variables and limited numbers of observations; The p>> n problem in current statistical applications. Information Technology Interfaces (ITI). Proceedings of the ITI 2012 34th International Conference 25-28 June 2012. 2012, 13-14.
Google Scholar
Tadesse MG, Sha N, Vannucci M: Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc. 2005, 100 (470): 602-617. 10.1198/016214504000001565.
Article CAS Google Scholar
Mitchell TJ, Beauchamp JJ: Bayesian variable selection in linear regression. J Am Stat Assoc. 1988, 83 (404): 1023-1032. 10.1080/01621459.1988.10478694.
Article Google Scholar
Guan Y, Stephens M: Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann Appl Stat. 2011, 5 (3): 1780-1815. 10.1214/11-AOAS455.
Article Google Scholar
Fridley BL: Bayesian variable and model selection methods for genetic association studies. Genet Epidemiol. 2008, 33 (1): 27-37. 10.1002/gepi.20353.
Article Google Scholar
Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM: Bayesian model search and multilevel inference for SNP association studies. Ann Appl Stat. 2010, 4 (3): 1342-10.1214/09-AOAS322.
Article PubMed PubMed Central Google Scholar
Banerjee S, Yandell BS, Yi N: Bayesian quantitative trait loci mapping for multiple traits. Genetics. 2008, 179 (4): 2275-2289. 10.1534/genetics.108.088427.
Article PubMed PubMed Central Google Scholar
Russu A, Malovini A, Puca AA, Bellazzi R: Stochastic model search with binary outcomes for genome-wide association studies. J Am Med Inform Assn. 2012, 19 (e1): e13-e20. 10.1136/amiajnl-2011-000741.
Article Google Scholar
Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008, 4 (7): e1000130-10.1371/journal.pgen.1000130.
Article PubMed PubMed Central Google Scholar
Kwon S, Wang D, Guo X: Application of an iterative Bayesian variable selection method in a genome-wide association study of rheumatoid arthritis. BMC Proc. 2007, 1 (Suppl 1): S109-10.1186/1753-6561-1-s1-s109.
Article PubMed PubMed Central Google Scholar
Torkamani A, Schork NJ: Pathway and network analysis with high-density allelic association data. Methods Mol Biol. 2009, 563: 289-301. 10.1007/978-1-60761-175-2_16.
Article PubMed CAS Google Scholar
Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BMJ, Kappos L, Polman CH: Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009, 18 (11): 2078-2090. 10.1093/hmg/ddp120.
Article PubMed CAS PubMed Central Google Scholar
Stingo FC, Chen YA, Tadesse MG, Vannucci M: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 2011, 5 (3): 1978-2002. 10.1214/11-AOAS463.
Article PubMed Google Scholar
Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X: An Integrative Framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLoS One. 2013, 8 (7): e67672-10.1371/journal.pone.0067672.
Article PubMed CAS PubMed Central Google Scholar
Chuang H, Lee E, Liu Y, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140-10.1038/msb4100180.
Article PubMed PubMed Central Google Scholar
Lee E, Chuang H, Kim J, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008, 4 (11): e1000217-10.1371/journal.pcbi.1000217.
Article PubMed PubMed Central Google Scholar
Zellner A: On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference Decision Techniques. 1986, 6: 233-243.
Google Scholar
Ai-Jun Y, Xin-Yuan S: Bayesian variable selection for disease classification using gene expression data. Bioinformatics. 2010, 26 (2): 215-222. 10.1093/bioinformatics/btp638.
Article Google Scholar
Li F, Zhang NR: Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J Am Stat Assoc. 2010, 105 (491): 1202-1214. 10.1198/jasa.2010.tm08177.
Article CAS Google Scholar
Devroye L: Sample-based Non-uniform random variate generation. Proceedings of the 18th conference on Winter simulation. ACM. 1986, 260-265. 10.1145/318242.318443.
Chapter Google Scholar
Su Z, Marchini J, Donnelly P: HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011, 27 (16): 2304-2305. 10.1093/bioinformatics/btr341.
Article PubMed CAS PubMed Central Google Scholar
Zhang FR, Huang W, Chen SM, Sun LD, Liu H, Li Y, Cui Y, Yan XX, Yang HT, Yang RD: Genomewide association study of leprosy. New Engl J Med. 2009, 361 (27): 2609-2618. 10.1056/NEJMoa0903753.
Article PubMed CAS Google Scholar
Srivastava S, Chen L: Comparison between the stochastic search variable selection and the least absolute shrinkage and selection operator for genome-wide association studies of rheumatoid arthritis. BMC Proc. 2009, 3 (Suppl 7): S21-10.1186/1753-6561-3-s7-s21.
Article PubMed PubMed Central Google Scholar
Ma S, Huang J: Combining multiple markers for classification using ROC. Biometrics. 2007, 63 (3): 751-757. 10.1111/j.1541-0420.2006.00731.x.
Article PubMed Google Scholar
Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Appl Stat. 2004, 32 (2): 407-499. 10.1214/009053604000000067.
Article Google Scholar
Ramanan VK, Shen L, Moore JH, Saykin AJ: Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet. 2012, 28 (7): 323-332. 10.1016/j.tig.2012.03.004.
Article PubMed CAS PubMed Central Google Scholar
Consortium IMSG: Network-based multiple sclerosis pathway analysis with GWAS data from 15,000 cases and 30,000 controls. Am J Hum Genet. 2013, 92 (6): 854-10.1016/j.ajhg.2013.04.019.
Article Google Scholar
Mukherjee S, Kim S, Ramanan VK, Gibbons LE, Nho K, Glymour MM, Ertekin-Taner N, Montine TJ, Saykin AJ, Crane PK: Gene-based GWAS and biological pathway analysis of the resilience of executive functioning. Brain Imaging Behav. 2014, 8 (1): 110-118. 10.1007/s11682-013-9259-7.
Article PubMed PubMed Central Google Scholar
Bayarri MJ, Berger JO, Forte A, García-Donato G: Criteria for Bayesian model choice with application to variable selection. Ann Appl Stat. 2012, 40 (3): 1550-1577. 10.1214/12-AOS1013.
Article Google Scholar

Download references

Acknowledgments

This research was supported by Eunice Kennedy Shriver National Institute of Child Health & Human Development, NIH, USA [5R01HD061404]; National Institute on Drug Abuse, NIH, USA [R44DA026683]; National Natural Science Foundation of China, China [81273177]. The authors would like to thank the editor and reviewers for their valuable comments and suggestions to improve the quality of the paper.

Author information

Authors and Affiliations

Bayessoft, Inc., 2221 Caravaggio Drive, Davis, 95618, CA, USA
Dianwen Zhu & Xiaowei Yang
School of Public Health, Shandong University, Jinan, 250012, Shandong, China
Xiaoshuai Zhang & Fuzhong Xue
Shandong Provincial Institute of Dermatology and Venereology, Shandong Academy of Medical Science, Jinan, 250022, Shandong, China
Hong Liu
School of Public Health, Chongqing Medical University, Chongqing, 400016, China
Bin Peng
Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, 94158, CA, USA
Joseph L Wiemels

Authors

Xiaoshuai Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Fuzhong Xue
View author publications
You can also search for this author in PubMed Google Scholar
Hong Liu
View author publications
You can also search for this author in PubMed Google Scholar
Dianwen Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Bin Peng
View author publications
You can also search for this author in PubMed Google Scholar
Joseph L Wiemels
View author publications
You can also search for this author in PubMed Google Scholar
Xiaowei Yang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaowei Yang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

XSZ, FZX, DWZ, BP and XWY conceptualized the study, analyzed the data and prepared for the manuscript. HL acquired and interpreted the data. JLW contributed on the manuscript revision and thoroughly copyedited the manuscript. All authors approved the final manuscript.

Electronic supplementary material

12863_2014_130_MOESM1_ESM.pdf

Additional file 1: MCMC Scheme for Sampling ( $ξ$ , y ). Table S1. Information of the selected 94 SNPs. Table S2. Additional simulation results under dominant genetic model. Figure S1. Pairwise correlation coefficient R square of the 94 top SNPs. Figure S2. Posterior selection probabilities of SNPs under dominant genetic model in leprosy GWAS analysis. Constructing R Matrix Using KEGG. (PDF 200 KB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Zhang, X., Xue, F., Liu, H. et al. Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies. BMC Genet 15, 130 (2014). https://doi.org/10.1186/s12863-014-0130-7

Download citation

Received: 20 May 2014
Accepted: 17 November 2014
Published: 10 December 2014
DOI: https://doi.org/10.1186/s12863-014-0130-7

Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies

Abstract

Background

Results

Conclusions

Background

Methods

iBVS with hierarchical modeling for GWAS data

Prior specification

Posterior distributions

Posterior inference via MCMC sampling

Simulation

Application

Results

Simulation studies

Performance of variable selection

Performance of outcome prediction

Application

Discussion

Conclusions

Additional file

Abbreviations

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Authors’ contributions

Electronic supplementary material

12863_2014_130_MOESM1_ESM.pdf

Authors’ original submitted files for images

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomic Data

Contact us