Empirical Bayesian LASSO-logistic regression for multiple binary trait locus mapping

Background Complex binary traits are influenced by many factors including the main effects of many quantitative trait loci (QTLs), the epistatic effects involving more than one QTLs, environmental effects and the effects of gene-environment interactions. Although a number of QTL mapping methods for binary traits have been developed, there still lacks an efficient and powerful method that can handle both main and epistatic effects of a relatively large number of possible QTLs. Results In this paper, we use a Bayesian logistic regression model as the QTL model for binary traits that includes both main and epistatic effects. Our logistic regression model employs hierarchical priors for regression coefficients similar to the ones used in the Bayesian LASSO linear model for multiple QTL mapping for continuous traits. We develop efficient empirical Bayesian algorithms to infer the logistic regression model. Our simulation study shows that our algorithms can easily handle a QTL model with a large number of main and epistatic effects on a personal computer, and outperform five other methods examined including the LASSO, HyperLasso, BhGLM, RVM and the single-QTL mapping method based on logistic regression in terms of power of detection and false positive rate. The utility of our algorithms is also demonstrated through analysis of a real data set. A software package implementing the empirical Bayesian algorithms in this paper is freely available upon request. Conclusions The EBLASSO logistic regression method can handle a large number of effects possibly including the main and epistatic QTL effects, environmental effects and the effects of gene-environment interactions. It will be a very useful tool for multiple QTLs mapping for complex binary traits.


Additional file Derivation of equation (12)
First we note that Summarizing the results in three cases, we obtain * i  given in (12).

Replicates 2 and 3 for the simulations with only main effects
The genotypes of m = 481 markers of n = 500 individuals were generated independently for Replicate 2 and 3 using the procedure described in the main text. Twenty QTLs and their IDs and effect sizes were the same as those of Replicate 1 given in Table 1. The two replicates were analyzed with the EBLASSO-NE, EBLASSO-NEG, LASSO, HyperLasso, BhGLM, RVM as well as the single QTL method.
The EBLASSO-NE method: The two-step procedure for the EBLASSO-NE both identified the optimal values λ = 0.0224 for replicates 2 and 3. A summary of results for several values of λ and the corresponding logL is given in Table S3. Using the optimal values of λ, the EBLASSO-NE detected 13 true and 1 false, and 12 true and 0 false positive effects that have pvalue ≤ 0.05 for replicates 2 and 3. The estimated sizes of the true effects and their standard errors for the two replicates are given in Tables S1 and S2, respectively.
The EBLASSO-NEG method: The three-step procedure for the EBLASSO-NEG identified the optimal pair of parameters that maximized logL at (a, b) = (-0.5, 10) and (-0.4, 6) for the replicates 2 and 3, respectively. A summary of results for several pairs of a and b and the corresponding logL is given in Table S3. The whole dataset was then analyzed with the optimal parameters, which identified 13 true and 2 false positive effects, 12 true and 2 false positive effects that have p-value ≤ 0.05 for the two replicates, respectively. The estimated sizes of the true effects and their standard errors are depicted in Tables S1and S2 respectively for the two replicates.
The LASSO method: Cross validation identified the optimal values λ = 0.0197 and 0.0213 for replicates 2 and 3, respectively. The whole dataset was then analyzed using the optimal λ to estimate QTL effects, which identified 38 and 35markers with nonzero regression coefficients for the two replicates, respectively. After refitted with an ordinary logistic regression model, 9 and 11 true positive effects, 4 and 2 false positive effects were among those markers with a pvalue 0.05  for the two replicates, respectively. A summary of results from cross validation for the EBLASSO-NE, the EBLASSO-NEG and LASSO is given in Table S3. The estimated sizes of true effects and their standard errors are depicted along with all 20 true effects in Tables S1 and S2 for the two replicates, respectively.
The HyperLasso method: Hyperparameters for the HyperLasso and the corresponding number of effects identified for the two replicates are presented in Table S4. The best results in Table S4 include 10 true and 0 false positive effects for replicates 2 and 3. The estimated sizes of true effects and their standard errors for the best results in Table S4 are depicted along with all 20 true effects in Tables S1and S2 for the two replicates.
The BhGLM method: The 15 different pairs of values for ν i and τ i as listed in Table S5 were examined for the BhGLM method. With ν i = 10 -3 , 10 -4 and 10 -5 , τ i = 10 -4 , the best results include 11 true and 0 false positive effects for replicates 2, and 9 true and 1 false positive effects for replicate 3. The corresponding estimated sizes of the true effects and their standard errors are depicted in Tables S1 and S2 for the two replicates, respectively.
The RVM method: The MATLAB implementation of the RVM identified 30 and 32 nonzero effects with a p-value ≤ 0.05 for the two replicates, respectively. Among these effects, we obtained 15 true and 15 false positive for replicate 1, 16 true and 16 false positive for replicate 2. The estimated sizes of true effects and their standard errors are depicted in Tables S1and S2. When we reduced the number of false positive effects to three, at a level slightly higher than that of our EBLASSO-NEG and EBLASSO-NE, by reducing the cutoff for the p-value, we obtained 11 and 10 positive effects for the two replicates, which are smaller than the number of true effects identified with our EBLASSO-NEG and EBLASSO-NE.
The single QTL method: After Bonferroni correction, effects with a p-value ≤ 0.05/481 = 1.04×10 -4 were considered as significant, which identified 7 true and 22 false positive effects for replicate 2, and 7 true and 13 false positive effects for replicate 3. The estimated sizes of true effects and their standard errors are depicted in Tables S1and S2, respectively, for the two replicates. When we increased the number of true positive effects to 13, at a level same as our EBLASSO-NEG or EBLASSO-NE, by increasing the cutoff for the p-value to 3.10×10 -3 for replicate 2 and 2.68 ×10 -2 for replicate 3, we obtained 27 and 41 false positive effects, respectively.
As shown in Tables S1 and S2, the EBLASSO-NE and the EBLASSO-NEG identified more true effects than four other methods except the RVM, and yielded a number of false positive effects comparable to those of the LASSO, the HyperLasso and the BhGLM but much smaller than those of the RVM and the single-QTL mapping method.
All simulations were performed on a personal computer (PC) with a 3.4 GHz Intel PentiumD CPU and 2Gb memory running Windows XP, except that the HyperLasso was ran on an IBM BladeCenter cluster including computing nodes with 2.6 GHz Xeon or 2.2 GHz Opteron CPU running Linux. The speeds of the EBLASSO-NEG, the LASSO and the HyperLasso were comparable and faster than the other methods. The speeds of the EBLASSO-NE, the BhGLM and the single-QTL mapping method were comparable but are much faster than the RVM.

Replicates 2 and 3 for the simulations with main and epistatic effects
In this simulation setup, 10 main and 10 epistatic effects were simulated. The genotypes of m = 481 markers of n = 1,000 individuals were generated independently using the procedure described in the main text. Locations and effects of the QTLs and QTL pairs are shown in about 116 times of the sample size, and the design matrix X was of size 1000 × 115,921. QTL mapping was performed with all methods described in the previous simulations. However, like the simulations for replicate 1, the BhGLM method did not converge and the RVM ran out of memory due to a large number of nonzero effects included in the model, and thus, they did not yield any result.
The same cross-validation procedures described earlier were performed to choose the optimal values of the hyperparameters for the EBLASSO-NE, the EBLASSO-NEG and LASSO, and the results for several values of hyperparameters are presented in Table S8. Optimal values of the hyperparameters were then used to analyze the whole dataset. The number of true and false positive effects identified and the estimated effect sizes of the detected true effects are given in Tables S6 and S7, respectively, for the two replicates.
For the HyperLasso, we again examined 12 pairs of values for hyperparameters a and b as listed in Table S9, and identified a = 0.1 and b = 4.9×10 -4 as the optimal values that gave best tradeoff between the true and false positive effects. For the two-locus test, effects with a p-value ≤ 0.05/k = 4.31×10 -7 were considered as significant. Detailed QTL mapping results for the HyperLasso and the single-QTL mapping method are also given in Tables S6 and S7, respectively, for the two replicates.
As shown in Tables S6 and S7, the EBLASSO-NE and the EBLASSO-NEG detected the same or a larger number of true effects but a smaller number of false positive than the other methods, which demonstrates that our EBLASSO-NE and EBLASSO-NEG offer the best overall performance in terms of power of detection and the false positive rate. The LASSO is the fastest, and the EBLASSO-NEG and the HyperLasso are the second and the third fastest.   The estimated marker effect is denoted by  and the standard deviation is denoted by ŝ  . b The estimated marker effect was obtained from a neighboring marker (≤ 20 cM) rather than from the marker with true effect.     The average log likelihood and standard error were obtained from ten-fold cross validation.
c The optimal log likelihood and corresponding parameter(s) chosen for comparison with other methods.     The estimated marker effect is denoted by  and the standard deviation is denoted by ŝ  . b The estimated marker effect was obtained from a neighboring marker (≤ 20 cM) rather than from the marker with true effect.  The estimated marker effect is denoted by  and the standard deviation is denoted by ŝ  . b The estimated marker effect was obtained from a neighboring marker (≤ 20 cM) rather than from the marker with true effect.  The average log likelihood and standard error were obtained from ten-fold cross validation.
c The optimal log likelihood and corresponding parameter(s) chosen for comparison with other methods. 9/0 9/0 a Effects with p-value ≤ 0.05 were considered as significant different from zero.