 Methodology article
 Open access
 Published:
Multipopulation genomic prediction using a multitask Bayesian learning model
BMC Genetics volume 15, Article number: 53 (2014)
Abstract
Background
Genomic prediction in multiple populations can be viewed as a multitask learning problem where tasks are to derive prediction equations for each population and multitask learning property can be improved by sharing information across populations. The goal of this study was to develop a multitask Bayesian learning model for multipopulation genomic prediction with a strategy to effectively share information across populations. Simulation studies and real data from Holstein and Ayrshire dairy breeds with phenotypes on five milk production traits were used to evaluate the proposed multitask Bayesian learning model and compare with a singletask model and a simple data pooling method.
Results
A multitask Bayesian learning model was proposed for multipopulation genomic prediction. Information was shared across populations through a common set of latent indicator variables while SNP effects were allowed to vary in different populations. Both simulation studies and real data analysis showed the effectiveness of the multitask model in improving genomic prediction accuracy for the smaller Ayshire breed. Simulation studies suggested that the multitask model was most effective when the number of QTL was small (n = 20), with an increase of accuracy by up to 0.09 when QTL effects were lowly correlated between two populations (ρ = 0.2), and up to 0.16 when QTL effects were highly correlated (ρ = 0.8). When QTL genotypes were included for training and validation, the improvements were 0.16 and 0.22, respectively, for scenarios of the low and high correlation of QTL effects between two populations. When the number of QTL was large (n = 200), improvement was small with a maximum of 0.02 when QTL genotypes were not included for genomic prediction. Reduction in accuracy was observed for the simple pooling method when the number of QTL was small and correlation of QTL effects between the two populations was low. For the real data, the multitask model achieved an increase of accuracy between 0 and 0.07 in the Ayrshire validation set when 28,206 SNPs were used, while the simple data pooling method resulted in a reduction of accuracy for all traits except for protein percentage. When 246,668 SNPs were used, the accuracy achieved from the multitask model increased by 0 to 0.03, while using the pooling method resulted in a reduction of accuracy by 0.01 to 0.09. In the Holstein population, the three methods had similar performance.
Conclusions
Results in this study suggest that the proposed multitask Bayesian learning model for multipopulation genomic prediction is effective and has the potential to improve the accuracy of genomic prediction.
Background
Genomic prediction has become a new tool for selection of candidates based on genomic estimated breeding values (GEBV) through the use of dense markers covering the whole genome [1]. To predict GEBV, a training data set with genotypes and phenotypes is used to derive the prediction equations, where all marker effects are estimated simultaneously. GEBV for selection candidates that have genotypes are then predicted by summing up all the marker effects. The accuracy of GEBV is affected by several factors [2, 3], of which the number of individuals in the training data set and the marker density are of crucial importance [2, 3].
In Holstein dairy cattle, genomic prediction has been successfully applied using the Illumina BovineSNP50 single nucleotide polymorphism (SNP) panel [4, 5]. For smaller populations such as Ayrshire in dairy cattle, acquisition of a large number of animals to be included in the training data set for genomic prediction still remains a challenge. One strategy is to combine data of the small populations with data from other populations to increase the size of the training set. However, simply pooling data from different populations may result in unfavorable accuracies if the marker density is low or the populations have diverged for a long time [6–8]. Increasing the marker density is one of the possible solutions because the linkage disequilibrium (LD) phase persistence between markers and quantitative trait loci (QTL) among different populations would likely to improve. However, a recent study in Jersey and Holstein dairy cattle reported a very limited advantage by using a very high density SNP panel [9]. Few studies have been dedicated to research new methods or strategies other than simply pooling data together for genomic prediction. Brondum et al. [10] proposed an approach called BayesRS for multipopulation genomic prediction, where a location specific genetic variance derived in one population were used as priors for another population. They found that for some traits, BayesRS might be advantageous compared to the approach of simply pooling training data sets for distantlyrelated populations; but for closely related populations the method did not perform better than simply pooling data together.
Multitask learning, the term first coined by R Caruana [11], aims to improve learning performance by simultaneously learning from related tasks. In text and speech recognition, image reconstruction and many other areas where data are collected from multiple sources, multitask learning has been successfully applied [12–14]. Recently, the multitask learning has attracted a growing interest in biological science for sequence and gene expression analyses [15–17] as well as genomewide association studies (GWAS) [18]. To our knowledge, the application of multitask learning in genomic selection has not been reported so far.
Bayesian learning models using stochastic search variable selection (SSVS) has been widely used and proven effective for genomic prediction in a single population [1, 19–21]. Normally, SSVS uses some types of spike and slab distributions as priors for SNP effects. A latent indicator variable (0 or 1) is associated with each SNP, with 0 indicating that the SNP is irrelevant to the trait and is excluded from the model, and 1 indicating that the SNP is associated to the trait phenotype and is included in the model. In this study, it was assumed that multiple populations share the same set of latent indicator variables which can be learned jointly. The goal was to develop a multitask Bayesian learning model for multipopulation genomic prediction and to evaluate its performance on both simulated and real data.
Methods
In this section, singletask Bayesian learning model, the simple data pooling method, and the multitask Bayesian learning model were introduced. A Gibbs sampling algorithm was designed to implement the multitask Bayesian learning model. Monte Carlo simulation studies were conducted to evaluate the performance of the proposed multitask model. Real data on five milk production traits from Holstein and Ayrshire dairy breeds were also used to test the effectiveness of the multitask model.
Singletask bayesian learning model
In a single reference population of n animals with genotypes on m SNP markers, the statistical model can be written as:
Where y_{ i } is the phenotypic value for the i^{th} animal (i = 1,…,n), μ is the intercept, x_{ ij } is the genotype for the j^{th} SNP locus (j = 1,…,m) of the i^{th} animal, which is coded 0, 1 or 2, depending on the number of copies from a specified allele, a_{ j } is the regression coefficient for the j^{th} SNP (allele substitution effect), and e_{ i } is the random residual error.
A flat prior distribution is assigned to μ. a_{ j } is assumed a mixture of a normal distribution \mathit{N}\left(0,{\mathit{\sigma}}_{\mathit{a}}^{2}\right) and a point mass density at zero (denoted by a Dirac delta function δ_{0}(a_{ j }) hereinafter). The weights for the two distributions are (1w) and w, respectively, so that \left({\mathit{a}}_{\mathit{j}}\mathit{w},{\mathit{\sigma}}_{\mathit{a}}^{2}\right)~\left(1\mathit{w}\right)\mathit{N}\left(0,{\mathit{\sigma}}_{\mathit{a}}^{2}\right)+\mathit{w}{\mathit{\delta}}_{0}\left({\mathit{a}}_{\mathit{j}}\right).\phantom{\rule{0.5em}{0ex}}\mathit{w} follows a uniform prior distribution. A latent indicator variable γ_{ i } is introduced for each SNP, so that when γ_{ i }=1 , {\mathit{a}}_{\mathit{j}}~\mathit{N}\left(0,{\mathit{\sigma}}_{\mathit{a}}^{2}\right) , and when γ_{ i }=0 , a_{ j }=0. Prior distribution for each γ_{ i } is assumed i.i.d. and follows Bernoulli distribution with probability (1w). So the joint prior density for γ is \mathit{f}\left(\mathit{\gamma}\mathit{w}\right)={\displaystyle \prod _{\mathit{j}}}{\mathit{w}}^{\left(1{\mathit{\gamma}}_{\mathit{j}}\right)}{\left(1\mathit{w}\right)}^{{\mathit{\gamma}}_{\mathit{j}}}. Residual errors are assumed from a multivariate normal distribution \mathit{N}\left(0,\mathit{I}{\mathit{\sigma}}_{\mathit{e}}^{2}\right) . The prior distribution for {\mathit{\sigma}}_{\mathit{a}}^{2}\phantom{\rule{0.25em}{0ex}}\left({\mathit{\sigma}}_{\mathit{e}}^{2}\right) is a scaled inverse Chisquare distribution with degree of freedom v_{ a }(v_{ e }) and scale factor {\mathit{s}}_{\mathit{a}}^{2}\left({\mathit{s}}_{\mathit{e}}^{2}\right).
Simple data pooling method
Suppose animals are from a number of c different populations. In a simple data pooling method, animals from multiple populations are pooled together to form a single training data set. It is assumed that the population origin for each individual is known prior to the analysis. Population origin is included as a fixed effect. The effect of each SNP is assumed to be the same across populations.
Multitask Bayesian learning model
For c populations with n_{ k } animals in the kth population, the statistical model can be written as:
In matrix notation, this can be written as:
where y_{ ik } is the phenotypic value for the i^{th} animal in the k^{th} population; μ_{ k } is the general mean of population k; x_{ ijk } is the genotype for the j^{th} SNP locus of the i^{th} animal in the k^{th} population; a_{ jk } is the j^{th} SNP effect in population k, and e_{ ik } is the random residual effect; and in the matrix notation, y_{k}, a_{k}, and e_{k} are vectors of phenotypic values, SNP effects, and residual effects, respectively; 1 is a vector with all elements set to 1; and X_{k} is the design matrix relating y_{k} to a_{k}. In the model, a_{ jk } allows the j^{th} SNP effect to have a different value in population k. To share information among different populations, a common latent indicator variable indicating whether SNP j is associated with a QTL is used across populations. Accommodating these features into a Bayesian model produces the multitask Bayesian learning model.
The following prior distributions for the unknown parameters and hyperparameters are assumed in the multitask Bayesian learning model:
The likelihood function of the whole data given all the parameters in the model is:
So the joint posterior density function is:
Gibbs sampling algorithm
A Gibbs sampling algorithm was designed to draw samples for unknown (hyper) parameters from their full conditional posterior distributions. To avoid reducibility of Markov chain, γ_{ j } and a_{ jk } are jointly sampled by first sampling γ_{ j } from \mathit{f}\left({\mathit{\gamma}}_{\mathit{j}}{\mathit{\theta}}_{{\mathit{j}}^{}},\mathit{y}\right) followed by sampling a_{ jk } from \mathit{f}\left({\mathit{a}}_{\mathit{jk}}{\mathit{\gamma}}_{\mathit{j}},{\mathit{\theta}}_{{\mathit{j}}^{}},\mathit{y}\right), where {\mathit{\theta}}_{{\mathit{j}}^{}} represents all parameters except γ_{ j } and a_{ jk }. Full conditional posterior distributions for {\mathit{\mu}}_{\mathit{k}},\mathit{w},{\mathit{\sigma}}_{\mathit{ak}}^{2}\phantom{\rule{0.25em}{0ex}}\mathrm{and}\phantom{\rule{0.25em}{0ex}}{\mathit{\sigma}}_{\mathit{ek}}^{2}\phantom{\rule{0.5em}{0ex}} can be derived by picking up the relevant parts from the joint posterior distribution. Derivation for the density function \mathit{f}\left({\mathit{\gamma}}_{\mathit{j}}{\mathit{\theta}}_{{\mathit{j}}^{\mathbf{}}},\mathit{y}\right) and sampling of γ_{ j } are described in the Appendix. The Gibbs sampling steps are described as below:
Step 1. Initialize the parameters
Step 2. For j=1,···, m

a.
Sample γ_{ j } from Bernoulli distribution with probability 1/(1+q _{ j }),
{\mathit{q}}_{\mathit{j}}=\frac{\mathit{w}}{1\mathit{w}}\sqrt{{\displaystyle \prod _{\mathit{k}}\left({\mathbf{x}}_{\mathit{jk}}^{\mathbf{\prime}}{\mathbf{x}}_{\mathit{jk}}\frac{{\mathit{\sigma}}_{\mathit{ak}}^{2}}{{\mathit{\sigma}}_{\mathit{ek}}^{2}}+1\right)}}exp\left(\frac{1}{2}{\displaystyle \sum _{\mathit{k}}\frac{{\widehat{\mathit{\mu}}}_{{\mathit{a}}_{\mathit{jk}}}^{2}}{{\widehat{\mathit{\sigma}}}_{{\mathit{a}}_{\mathit{jk}}}^{2}}}\right)
in which
and
(see Appendix for details).

b.
For k=1,···c, sample a _{ jk } from
\mathit{f}\left({\mathit{a}}_{\mathit{jk}}{\mathit{\gamma}}_{\mathit{j}},{\mathbf{\theta}}_{{\mathit{j}}^{}},\mathbf{y}\right)=\left\{\begin{array}{cc}\hfill {\mathit{\delta}}_{0}\left({\mathit{a}}_{\mathit{jk}}\right)\hfill & \hfill \mathit{if}{\mathit{\gamma}}_{\mathit{j}}=0\hfill \\ \hfill \mathit{N}\left({\widehat{\mathit{\mu}}}_{{\mathit{a}}_{\mathit{jk}}},{\widehat{\mathit{\sigma}}}_{{\mathit{a}}_{\mathit{jk}}}^{2}\right)\hfill & \hfill \mathit{if}{\mathit{\gamma}}_{\mathit{j}}=1\hfill \end{array}\right.
Step 3. Sample w from Beta distribution\mathit{f}\left(\mathit{w}\mathbf{\gamma}\right)=\frac{1}{\mathit{B}\left(\mathit{\alpha},\mathit{\beta}\right)}{\mathit{w}}^{\mathit{\alpha}1}{\left(1\mathit{w}\right)}^{\mathit{\beta}1}, where α = m  ∑ γ_{ j } and β = ∑ γ_{ j }.
Step 4. For k=1,···, c, sample {\mathit{\sigma}}_{\mathit{ak}}^{2} from scaled inverse Chisquare distribution
Step 5. For k=1,···, c, sample {\mathit{\sigma}}_{\mathit{ek}}^{2} from scaled inverse Chisquare distribution
Step 6. For k=1,···, c,
Sample μ_{ k } from \mathit{N}\left(\frac{{\left({\mathbf{y}}_{\mathit{k}}{\mathbf{X}}_{\mathit{k}}{\widehat{\mathbf{a}}}_{\mathit{k}}\right)}^{\text{'}}\left({\mathbf{y}}_{\mathit{k}}{\mathbf{X}}_{\mathit{k}}{\widehat{\mathbf{a}}}_{\mathit{k}}\right)}{{\mathit{n}}_{\mathit{k}}},\frac{{\widehat{\mathit{\sigma}}}_{{\mathit{e}}_{\mathit{k}}}^{2}}{{\mathit{n}}_{\mathit{k}}}\right) Repeat Step 2 to 6 until a set number of iterations are reached.
It can be shown in step 2 that information from all populations are used to generate the latent variable γ_{ j }, and the SNP effect a_{ jk } is generated for each population.
Computer program
Computer programs were written in C language to implement the multitask Bayesian learning model, singletask Bayesian learning and the simple data pooling method. Programs and source codes are available upon request.
Monte Carlo simulations
The aim of the simulation was to evaluate the performance of the proposed multitask Bayesian learning model and to compare with the singletask model and the simple data pooling method. Different scenarios were considered that differ in number of QTL affecting the trait, correlations of the QTL effects between different populations, and the density of SNP panels used for genomic prediction.
Real genotypes from 458 Ayrshire and 2,298 Holstein bulls were used for simulations. All Ayrshire animals and 690 Holstein animals were genotyped on the Illumina BovineHD BeadChip (800 k) SNP panel, and the remaining 1,608 Holstein animals were genotyped on the Illumina BovineSNP50 BeadChip (50 k) SNP panel. SNPs meeting one of the following criteria were excluded: minor allele frequency (MAF) lower than 0.05, missing genotype rate greater than 0.10, highly correlated with any other SNP genotype (95% genotypes from two loci identical or in complementary). SNP locations were determined against Bovine genome assembly UMD3.1 [22]. SNPs with unknown locations or on sex chromosomes were discarded. SNPs filtered in one breed were also removed from the other breed. After editing, 246,668 SNPs from the 800 k panel and 28,206 SNPs from the 50 k panel were kept for analyses.
For ease of computation, only SNPs from the first 10 chromosomes were used. Four scenarios were considered with combinations of QTL number being either 20 or 200 and the correlation of QTL effects between the two populations being low (ρ = 0.2) or high (ρ = 0.8). For each scenario, QTL were randomly sampled from the 50 k panel. For each QTL j, allele substitution effects in the two populations, α_{j1} and α_{j2}, were sampled from a bivariate normal distribution with mean 0 and variancecovariance structure \mathit{\sum}=\left(\begin{array}{cc}\hfill 1\hfill & \hfill \mathit{\rho}\hfill \\ \hfill \mathit{\rho}\hfill & \hfill 1\hfill \end{array}\right){\mathit{\sigma}}^{2} in which σ^{2} = 1. Breeding value of each animal i, was calculated as {\mathit{a}}_{\mathit{i}}={\displaystyle \sum _{\mathit{j}}{\mathit{X}}_{\mathit{j}}{\mathit{\alpha}}_{\mathit{j}}} where X_{ j } is the QTL genotypes coded as 0, 1 or 2, number of copies on an arbitrarily chosen allele. Total additive genetic variance were calculated within each breed as {\mathit{\sigma}}_{\mathit{a}}^{2}={\displaystyle \sum _{\mathit{j}}2{\mathit{p}}_{\mathit{j}}\left(1{\mathit{p}}_{\mathit{j}}\right){\mathit{\alpha}}_{\mathit{j}}^{2}}, where p_{ j } is the QTL allele frequency. For each animal i, a residual effect e_{ i }, was sampled from a normal distribution with mean 0 and variance {\mathit{\sigma}}_{\mathit{e}}^{2}, so that phenotypic value of the animal y_{ i }=a_{ i }+e_{ i }. {\mathit{\sigma}}_{\mathit{e}}^{2} was equal to {\mathit{\sigma}}_{\mathit{a}}^{2} to give a trait with heritability of 0.5. Each scenario was replicated 10 times.
Four SNP panels of different density, the original 800 k panel, and the mimicked 400 k, 200 k, 100 k by selecting every 2nd, 4th, and 8th SNP, respectively, from the 800 k panel were used for genomic prediction. For each scenario described above, the simulated QTL were removed from the SNP panels. A special scenario was designed to keep all QTL genotypes in the 800 k panel for genomic prediction. The 50 k genotypes of 1,608 Holstein animals were imputed to 800 k using genotype imputation software FImpute developed by Sargolzaei et al. [23]. Imputation accuracy was evaluated on a set of 126 animals that have been genotyped on both the 50 k and 800 k panel, and the ratio of the genotypes that were correctly imputed was 0.9930.
For training and validation purposes, 393 Ayrshire and 2,084 Holstein animals born before 2004 were used as training set, 65 Ayrshire and 214 Holstein animals born in and after 2004 were used for validation. The number of animals used for genomic prediction is shown in Table 1. The simulated phenotypic values for training animals were used to derive SNP effects using different models. Degree of freedom of the inverse chisquare distributions for variances of SNP effects and residual effects were set to 4 and 10, respectively. The scale parameter {\mathit{S}}_{\mathit{a}}^{2} was derived from the expected value of a scaled inverse chisquare distributed random variable, i.e., E(σ^{2}) = S^{2}v/(v  2) and hence {\mathit{S}}_{\mathit{a}}^{2}={\mathit{\sigma}}_{\mathit{a}}^{2}\left({\mathit{v}}_{\mathit{a}}2\right)/{\mathit{v}}_{\mathit{a}} where {\mathit{\sigma}}_{\mathit{a}}^{2} is the true additive genetic variance. {\mathit{S}}_{\mathit{e}}^{2} was derived similarly. The Gibbs chain was run for 50,000 cycles with the first 10,000 discarded as burnin. Marker effects were estimated as averages from all post burnin samples. Genomic breeding values for animals in the validation set were estimated as the sum of population mean and all marker effects. Accuracy was evaluated as Pearson’s correlation coefficient between genomic estimated breeding values and true breeding values for validation animals, i.e. r(GEBV, TBV). Regression of true breeding value on genomic estimated breeding values, b(TBV, GEBV), was also calculated to evaluate the bias of the genomic estimated breeding values.
Real data
Five milk production traits including milk yield, fat yield, protein yield, fat percentage and protein percentage were used for the same set of animals in the simulation study. Bull proofs (estimated breeding values from progeny testing, EBV) from April 2008 were used as phenotypes for training animals, and bull proofs from December 2011 were used for validation animals. All proofs were provided by Canadian Dairy Network (CDN) and had reliability above 0.65.
Two SNP panels with 28,206 SNPs from the 50 k panel and 246,668 SNPs from the 800 k panel were used for genomic prediction. Degree of freedom of the inverse chisquare distributions for variances of SNP effects and residual effects were set to 4 and 10, respectively. Scale parameters for the two distributions were derived in the same way as described in the simulation study, but instead of using true additive genetic variance and residual variance, estimated variances were used. Estimated additive genetic variances and residual variance were obtained from ASReml [24] by fitting an animal model with a population mean, animal effect, and random residual effect. The pedigree contained 15,731 animals for the Holstein population and 4,926 animals for Ayrshire. For 28,206 SNPs, the Gibbs sampling procedure was run for 50,000 iterations with the first 10,000 discarded as burnin; and for 246,668 SNPs, the Gibbs sampling procedure was run for 100,000 iterations with the first 50,000 discarded as burnin. Burnin period was determined by visually inspecting the Gibbs chain. All samples after the burnin period were kept. SNP effects were estimated by averaging all samples after the burnin period. After the estimation of SNP effects, the GEBV was calculated for animals in the validation set by summing up the population mean and all the SNP effects. Accuracy was measured as Pearson’s correlation coefficient between GEBV and the 2011 bull proofs for validation animals.
Results
Monte Carlo simulation study
Table 2 shows the accuracy of genomic prediction with simulated 20 QTL. In the Ayrshire validation set, multitask Bayesian learning model performed the best among the three methods within each SNP panel used under the scenario with either a low (ρ = 0.2) or high (ρ = 0.8) correlation of simulated QTL effects between Ayrshire and Holstein populations. The greatest increase of accuracy was observed when ρ was 0.8 and when density of the SNP panel was the highest (accuracy increased from 0.67 for the singletask method to 0.83 for the multitask method). Simply pooling data together substantially reduced the prediction accuracy in Ayshire when ρ was 0.2. The greatest reduction of accuracy was observed when ρ was 0.2 and when density of the SNP panel was the highest (accuracy decreased from 0.71 for the singletask method to 0.56 for the pooling method). When ρ was 0.8, the pooling method had an increased accuracy in Ayrshire compared with the singletask method. The pooling method produced substantially lower accuracy than the multitask model in the Ayrshire validation set, especially when QTL effects were lower correlated between the two populations. In the Holstein validation set, the multitask model performed similar or slightly better than the singletask model. The pooling method had a slightly reduction of accuracy when ρ was 0.2, and had a similar accuracy compared with the singletask method when ρ was 0.8. Within each method, increasing density of the SNP panel generally improved prediction accuracy in both the Ayrshire and Holstein populations. Table 2 also shows the slopes of regression of true breeding values on the GEBV. Regression coefficients of true breeding values on GEBV were less than one for the pooling method indicating that the GEBV were inflated.
Table 3 shows the accuracy of genomic prediction with simulated 200 QTL. In the Ayrshire validation set, the multitask model had slightly higher prediction accuracy within each SNP panel compared with the singletask model under either scenario with low (ρ =0.2) or high (ρ =0.8) correlated QTL effects between the Ayrshire and Holstein populations. Pooling method performed similar or slightly worse compared with singletask model when ρ was 0.2, and generally performed better when ρ was 0.8. The pooling method also performed slightly better than the multitask model when ρ was 0.8 and when density of the SNP panel was relatively high. When ρ was 0.2, regression coefficients of true breeding values on GEBV were less than one for the pooling method indicating that the GEBV were inflated. The three methods performed similar in the Holstein validation set. Overall, the accuracy was lower compared with scenarios when only 20 QTL were simulated regardless of the methods used.
To evaluate the performance of the three methods under situations where all QTL genotypes can be acquired and included for genomic prediction, Table 4 shows the accuracy of genomic prediction when QTL genotypes are included together with SNP marker genotypes for training and validation. When 20 QTL were simulated, using multitask model improved the accuracy by 0.16 and 0.22 in the Ayrshire validation set for scenarios where correlation of QTL effects between the Ayrshire and Holstein populations was 0.2 and 0.8, respectively. Using the pooling method reduced the accuracy in Ayrshire by 0.12 when ρ was 0.2, but increased the accuracy by 0.17 when ρ was 0.8. When 200 QTL were simulated, using the multitask model increased the prediction accuracy by 0.03 and 0.07, respectively, for ρ equal to 0.2 and 0.8. The pooling method reduced the accuracy by 0.04 when ρ was 0.2, and increased the accuracy by 0.11 when ρ was 0.8. The pooling method outperformed the multitask model only for the scenario where 200 QTL were simulated with their effects highly correlated between the two populations. In the Holstein validation set, the multitask model had similar or slightly higher accuracy compared with the singletask model. The pooling method had similar accuracy as the singletask model when ρ was 0.8. When ρ was 0.2, the pooling method had slightly lower accuracy compared with the singletask model. Regression coefficients of true breeding values on GEBV were less than one for the pooling method except for the scenario of 200 simulated QTL with effects highly correlated between the two populations, indicating that the GEBV predicted by the pooling method were inflated.
Real data analysis
Table 5 shows the accuracy of genomic prediction for milk production traits using real data. For the Ayrshire validation set, the multitask model increased the accuracy by up to 0.07 compared with the singletask model, while the simple data pooling method resulted in reduced accuracy in general. The greatest increase in accuracy using multitask model compared with singletask model was for fat percentage (0.06) and protein percentage (0.07) when 28,206 SNPs were used. For the Holstein validation set, the singletask model, simple data pooling method, and the multitask model performed similar regardless of the traits studied.
Discussion
Traditionally, genomic prediction with data from multiple populations were implemented either by running genomic prediction within each population (singletask) or by simply pooling data together. Singletask genomic prediction cannot utilize information from other populations and therefore, the accuracy of genomic prediction is largely determined by the size of training data set [2, 3]. For breeds with only a small number of animals having both DNA marker genotypes and phenotype data, the accuracy of genomic prediction can be low [25]. Combining the data with other breed populations has the potential to improve the prediction accuracy. It is however, difficult to effectively account for the differences of SNP effects among different populations by simply pooling data together. If the marker density is low or the populations are divergent from each other, simply pooling data together may result in unfavorable prediction accuracies [6]. The multitask Bayesian learning model proposed in this study uses information from all populations simultaneously while allowing the SNP effects to vary in different populations. Different populations share information through a common set of latent indicator variables. When the target trait has a similar genetic background in related populations, it is reasonable to assume that some shared QTL affecting a common trait in different populations. However, the linkage disequilibrium phase between SNP markers and QTL are likely to be inconsistent, especially when the marker density is low. Therefore, the multitask Bayesian learning model is more flexible about the SNP effects and is likely to have better performance than a simple data pooling method.
Results from simulation studies support the use of multitask Bayesian model for multipopulation genomic prediction especially when there are a few QTL affecting the trait. For the scenario where a few QTL affect the trait, the increase of accuracy by using multitask model was greater when QTL effects had a higher correlation between two populations. The accuracy was further increased when QTL genotypes were included for training and validation. These results are expected as the higher the correlation of QTL effects between two populations, the more informative information the two populations can share. Including QTL genotypes also increased the amount of information to be shared between the two populations. Results suggest that the proposed multitask Bayesian learning model is effective in combining information from multipopulations to improve accuracy of genomic prediction.
Results from simulation also showed that simply pooling data together may reduce the accuracy of genomic prediction if QTL effects were lowly correlated between two populations or if a relatively low density SNP panel was used. When QTL effects had a high correlation between two populations and the SNP panel was high, the pooling method increased the accuracy compared with singletask model. When the correlation of QTL effects between two populations was high, the pooling method was inferior to the multitask model if the number of QTL was small, but outperformed the multitask model if the number of QTL was large. These results are reasonable since the pooling method assumes the same SNP effects among different populations. This assumption will be violated if the correlation of QTL effects is low between two populations and may result in poor performance of the pooling method. The multitask model proposes to share information through the same latent indicator variables. Correlations of QTL effects among different populations are not explicitly used and therefore, information across populations may not be fully utilized if such correlations are high. This might be why the pooling method still outperformed the multitask model when the number of QTL is large and the correlation of QTL effects between two populations was high.
Results from real data analyses in this study showed that the multitask Bayesian learning model produced a similar or higher accuracy compared to the singletask model, and that simply pooling data together resulted in a reduced accuracy when the marker density was low. SNP effects could be different across populations, especially when lower density markers were used [6]. These results are in agreement with results from simulation studies.
Gains of accuracy by using the multitask model were higher for fat percentage and protein percentage traits than for other traits. This is likely due to that large QTL or genes such as DGAT1, have larger influence on the percentage traits than on the yield traits [26, 27]. These results agreed with simulation studies which showed that the multitask model performed better when there are fewer QTL affecting the trait.
In this study, only two dairy cattle breeds were considered, and the multitask Bayesian learning model was shown to be effective and more beneficial to the population with a smaller data size. For the larger population, using multitask model did not produce much improvement. The little improvement in the larger population could be due to that the smaller population is too small to be able to have a significant impact on the larger population. In practice, there are situations where many populations are to be combined for analysis with each one contributing a small amount of data, a typical example as in beef cattle production. It would be interesting to test the performance of the multitask model under such scenarios.
The current multitask model considered additive genetic effects only. Nonadditive genetic effects, which have gained growing interests with availability of genomics information [28–30], can also be accommodated into the multitask model. A similar strategy used in this study to sharing information across populations for additive genetic effects may also be applied to nonadditive genetic effects. Inclusion of nonadditive genetic effects in the multitask model warrants further investigation.
The proposed multitask Bayesian learning model used a spike and slab mixture distribution to conduct variable selection and shrinkage for SNP effects. This prior setting is similar to that in the BayesCπ method proposed by Habier et al. [21]. Other types of mixture distributions currently being used for genomic prediction [1, 19, 20], can also be adapted to the multitask model. The strategy used in the multitask Bayesian learning model allows different populations to share information through a common latent variable assumed for each SNP. Such strategy has been shown effective in both simulation and real data studies; however, other strategies may also be exploited, for example, by modeling the joint distribution of the SNP effects among different populations. Further investigations are required to evaluate alternative strategies for sharing information across populations.
In this study, the Bayesian models were implemented via a Gibbs sampling algorithm adopting the residualupdate computing strategy proposed by Legarra and Misztal [31]. Computing time for this algorithm is proportional to the number of animals and number of SNPs in the model. For the training sample size of 2,477 animals and genotypes of 246,668 SNPs, the proposed multitask model requires 44 CPU hours to complete 150,000 Gibbs sampling cycles on a Linux cluster system with Intel X5675 3.07GHz CPU. With increased training sample size and increasing marker density to a very high density panel or even sequence data, computing burdens would become a concern. A recent study [32] has improved the residualupdate algorithm resulting in the CPU time reduced by 35.3 to 43.3%. The authors in the same study [32] also proposed an alternative algorithm which reduced CPU time by 74.5 to 93.0%. Approximation algorithms based on expectationmaximization (EM) [33–35] and variational Bayes [36] have also been proposed to replace the time consuming Markov chain Monte Carlo (MCMC) sampling based algorithms. Adapting these algorithms to accommodate the multitask Bayesian learning model would be of great interest.
Conclusions
A multitask Bayesian learning model was proposed for multipopulation genomic prediction. The multitask model shares information across populations through a common set of latent indicator variables while allowing the SNP effects to vary in different populations. Simulation studies and real data analysis suggest that the proposed multitask Bayesian learning model is effective and beneficial to populations where a small number of training animals are available. Accuracy of genomic prediction in small populations can be improved by using the multitask model especially for traits affected by a few QTL with large effects.
Appendix
Sampling of r _{ j } in the multitask Bayesian learning model
With the assumption of independence between {\mathit{\theta}}_{{\mathit{j}}^{}} and r_{ j }, by Bayes’ theorem, one has
where
Similarly,
Next, the conditional likelihoods \mathit{f}\left({\mathbf{y}}_{\mathbf{k}}{\mathit{r}}_{\mathit{j}}=0,{\mathbf{\theta}}_{{\mathbf{j}}^{\mathbf{}}\mathbf{k}}\right) and \mathit{f}\left({\mathbf{y}}_{\mathbf{k}}{\mathit{r}}_{\mathit{j}}=1,{\mathbf{\theta}}_{{\mathbf{j}}^{\mathbf{}}\mathbf{k}}\right) will be derived.
Suppose one is at the (l+1)^{th} Gibbs sampling iteration, and wants to sample γ_{ j } and a_{ jk }, the linear regression model given all parameters except γ_{ j } and a_{ jk } can be written as:
Where
Given the priors that \left({\mathit{a}}_{\mathit{jk}}{\mathit{\gamma}}_{\mathit{j}}\right)~{\mathit{\gamma}}_{\mathit{j}}\mathit{N}\left(0,{\mathit{\sigma}}_{\mathit{ak}}^{2}\right)+\left(1{\mathit{\gamma}}_{\mathit{j}}\right){\mathit{\delta}}_{0}\left({\mathit{a}}_{\mathit{jk}}\right) and \left({\mathbf{e}}_{\mathbf{k}}{\mathit{\sigma}}_{\mathit{ek}}^{2}\right)~\mathit{N}\left(\mathbf{0},\mathbf{I}{\mathit{\sigma}}_{\mathit{ek}}^{2}\right), the conditional likelihood of y_{ k } can be written as:
And
Where {\mathit{V}}_{{\mathit{y}}_{\mathit{k}}^{*}}={\mathbf{x}}_{\mathbf{jk}}{\mathbf{x}}_{\mathbf{jk}}^{\mathbf{\text{'}}}{\mathit{\sigma}}_{\mathit{ak}}^{2}+\mathit{I}{\mathit{\sigma}}_{\mathit{ek}}^{2} is the (co) variance matrix of {\mathbf{y}}_{\mathbf{k}}^{\mathbf{*}} given that r_{ j }=1.
Then, \left{\mathbf{V}}_{{\mathit{y}}_{\mathit{k}}^{*}}\right and {\mathbf{V}}_{{\mathit{y}}_{\mathit{k}}^{*}}^{1} are derived.
By applying Sylvester’s determinant theorem, one has
It can be easily verify that {\mathit{V}}_{{\mathit{y}}^{\mathit{*}}}^{1} is:
Substituting A.5 and A.6 into A.4, one has
Denote {\widehat{\mathit{\mu}}}_{{\mathit{a}}_{\mathit{jk}}}=\frac{{\mathbf{x}}_{\mathbf{jk}}^{\mathbf{\text{'}}}\left({\mathbf{y}}_{\mathbf{k}}{\mathbf{X}}_{\mathbf{1}\mathbf{k}\mathbf{:}\mathbf{jk}\mathbf{}\mathbf{1}}{\widehat{\mathbf{a}}}_{\mathbf{1}\mathbf{k}\mathbf{:}\mathbf{jk}\mathbf{}\mathbf{1}}^{\mathbf{l}\mathbf{+}\mathbf{1}}{\mathbf{X}}_{\mathbf{jk}\mathbf{+}\mathbf{1}\mathbf{:}\mathbf{mk}}{\widehat{\mathbf{a}}}_{\mathbf{jk}\mathbf{+}\mathbf{1}\mathbf{:}\mathbf{mk}}^{\mathbf{l}}\right)}{{\mathbf{x}}_{\mathbf{jk}}^{\mathbf{\text{'}}}{\mathbf{x}}_{\mathbf{jk}}+{\mathit{\sigma}}_{\mathit{ek}}^{2}/{\mathit{\sigma}}_{\mathit{ak}}^{2}},{\widehat{\mathit{\sigma}}}_{{\mathit{a}}_{\mathit{jk}}}^{2}=\frac{{\mathit{\sigma}}_{\mathit{ek}}^{2}}{{\mathbf{x}}_{\mathbf{jk}}^{\mathbf{\text{'}}}{\mathbf{x}}_{\mathbf{jk}}+{\mathit{\sigma}}_{\mathit{ek}}^{2}/{\mathit{\sigma}}_{\mathit{ak}}^{2}}, and substitute (A.7) and (A.3) into (A.2), one gets
Finally, γ_{ j } can be drawn from a Bernoulli distribution with probability 1/(1+q_{ j }).
References
Meuwissen TH, Hayes BJ, Goddard ME: Prediction of total genetic value using genomewide dense marker maps. Genetics. 2001, 157 (4): 18191829.
Goddard M: Genomic selection: prediction of accuracy and maximisation of long term response. Genetica. 2009, 136 (2): 245257. 10.1007/s1070900893080.
Hayes BJ, Goddard ME: Technical note: prediction of breeding values using markerderived relationship matrices. J Anim Sci. 2008, 86 (9): 20892092. 10.2527/jas.20070733.
Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME: Invited review: genomic selection in dairy cattle: progress and challenges (vol 92, pg 433, 2009). J Dairy Sci. 2009, 92 (3): 13131313.
VanRaden PM, Van Tassell CP, Wiggans GR, Sonstegard TS, Schnabel RD, Taylor JF, Schenkel FS: Invited review: Reliability of genomic predictions for north american holstein bulls. J Dairy Sci. 2009, 92 (1): 1624. 10.3168/jds.20081514.
de Roos APW, Hayes BJ, Goddard ME: Reliability of genomic predictions across multiple populations. Genetics. 2009, 183 (4): 15451553. 10.1534/genetics.109.104935.
Hayes B, Bowman P, Chamberlain A, Verbyla K, Goddard M: Accuracy of genomic breeding values in multibreed dairy cattle populations. Genet Sel Evol. 2009, 41 (1): 51
Pryce JE, Gredler B, Bolormaa S, Bowman PJ, EggerDanner C, Fuerst C, Emmerling R, Solkner J, Goddard ME, Hayes BJ: Short communication: genomic selection using a multibreed, acrosscountry reference population. J Dairy Sci. 2011, 94 (5): 26252630. 10.3168/jds.20103719.
Erbe M, Hayes BJ, Matukumalli LK, Goswami S, Bowman PJ, Reich CM, Mason BA, Goddard ME: Improving accuracy of genomic predictions within and between dairy cattle breeds with imputed highdensity single nucleotide polymorphism panels. J Dairy Sci. 2012, 95 (7): 41144129. 10.3168/jds.20115019.
Brondum RF, Su GS, Lund MS, Bowman PJ, Goddard ME, Hayes BJ: Genome position specific priors for genomic prediction. BMC Genomics. 2012, 13 (1): 54310.1186/1471216413543.
Caruana R: Multitask learning. Mach Learn. 1997, 28 (1): 4175.
Li X, Bilmes J: Regularized adaptation of discriminative classifiers. Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 1419 May 2006; Toulouse. 2006, 237240. IEEE 2006(vol. 1), IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): 1419 May 2006; Toulouse
Lu Y, Lu F, Sehgal S, Gupta S, Du J, Tham CH, Green P, Wan V: Multitask Learning In Connectionist Speech Recognition. Proceedings of the Tenth Australian International Conference on Speech Science & Technology: 810 December 2004; Sydney. Edited by: Cassidy S, Cox F, Mannell R, Palethorpe S. 2004, Canberra: Australian Speech Science and Technology Association Inc, 312315.
Yuan XT, Yan S: Visual classification with multitask joint sparse representation. Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 1318 June 2010; San Francisco. 2010, IEEE, 34933500.
Jacob L, Vert JP: Efficient peptide–mhci binding prediction for alleles with few known binders. Bioinformatics. 2008, 24 (3): 358366. 10.1093/bioinformatics/btm611.
Widmer C, Leiva J, Altun Y, Rätsch G: Leveraging sequence classification by taxonomybased multitask learning. Research in Computational Molecular Biology. 2010, Heidelberg: Springer Berlin, 522534.
Yang WH, Dai DQ, Yan H: Finding correlated biclusters from gene expression data. Knowl Data Eng, IEEE T. 2011, 23 (4): 568584.
Puniyani K, Kim S, Xing EP: Multipopulation gwa mapping via multitask regularized regression. Bioinformatics. 2010, 26 (12): i208i216. 10.1093/bioinformatics/btq191.
Calus MPL, Meuwissen THE, de Roos APW, Veerkamp RF: Accuracy of genomic selection using different methods to define haplotypes. Genetics. 2008, 178 (1): 553561. 10.1534/genetics.107.080838.
Verbyla KL, Hayes BJ, Bowman PJ, Goddard ME: Accuracy of genomic selection using stochastic search variable selection in australian holstein friesian dairy cattle. Genet Res. 2009, 91 (5): 307311. 10.1017/S0016672309990243.
Habier D, Fernando RL, Kizilkaya K, Garrick DJ: Extension of the bayesian alphabet for genomic selection. BMC Bioinformatics. 2011, 12 (1): 186
Zimin AV, Delcher AL, Florea L, Kelley DR, Schatz MC, Puiu D, Hanrahan F, Pertea G, Van Tassell CP, Sonstegard TS: A wholegenome assembly of the domestic cow, bos taurus. Genome Biol. 2009, 10 (4): R4210.1186/gb2009104r42.
Sargolzaei M, Chesnais JP, Schenkel FS: Fimpute  an efficient imputation algorithm for dairy cattle populations. J Dairy Sci. 2011, 94 (ESuppl. 1): 421
Gilmour AR, Gogel BJ, Cullis BR, Thompson R: Asreml user guide release 3.0. Hemel Hempstead, HP1 1ES. 2009, UK: VSN International Ltd
Brito FV, Neto JB, Sargolzaei M, Cobuci JA, Schenkel FS: Accuracy of genomic selection in simulated populations mimicking the extent of linkage disequilibrium in beef cattle. BMC Genet. 2011, 12 (1): 8010.1186/147121561280.
Grisart B, Farnir F, Karim L, Cambisano N, Kim JJ, Kvasz A, Mni M, Simon P, Frere JM, Coppieters W, et al: Genetic and functional confirmation of the causality of the dgat1 k232a quantitative trait nucleotide in affecting milk yield and composition. Proc Natl Acad Sci U S A. 2004, 101 (8): 23982403. 10.1073/pnas.0308518100.
Pimentel Eda C, Erbe M, Konig S, Simianer H: Genome partitioning of genetic variation for milk production and composition traits in holstein cattle. Front Genet. 2011, 2: 19
Su G, Christensen OF, Ostersen T, Henryon M, Lund MS: Estimating additive and nonadditive genetic variances and predicting genetic merits using genomewide dense single nucleotide polymorphism markers. PLoS One. 2012, 7 (9): e4529310.1371/journal.pone.0045293.
Toro MA, Varona L: A note on mate allocation for dominance handling in genomic selection. Genet Sel Evol. 2010, 42: 3310.1186/129796864233.
Vitezica ZG, Varona L, Legarra A: On the additive and dominant variance and covariance of individuals within the genomic selection scope. Genetics. 2013, 195 (4): 12231230. 10.1534/genetics.113.155176.
Legarra A, Misztal I: Technical note: computing strategies in genomewide selection. J Dairy Sci. 2008, 91 (1): 360366. 10.3168/jds.20070403.
Calus MP: Righthandside updating for fast computing of genomic breeding values. Genet Sel Evol. 2014, 46 (1): 24
Hayashi T, Iwata H: Em algorithm for bayesian estimation of genomic breeding values. BMC Genet. 2010, 11: 3
Meuwissen THE, Solberg TR, Shepherd R, Woolliams JA: A fast algorithm for bayesb type of prediction of genomewide estimates of genetic value. Genet Sel Evol. 2009, 41: 110.1186/12979686411.
Shepherd RK, Meuwissen TH, Woolliams JA: Genomic selection and complex trait prediction using a fast em algorithm applied to genomewide markers. BMC Bioinformatics. 2010, 11 (1): 52910.1186/1471210511529.
Li ZT, Sillanpaa MJ: Estimation of quantitative trait locus effects with epistasis by variational bayes algorithms. Genetics. 2012, 190 (1): 231249.
Acknowledgements
The authors thank the financial supports from the Natural Sciences and Engineering Research Council of Canada, L’Alliance Boviteq Inc., the AAFC Livestock Genetics & Genomics Program funds, and the AAFC Abase peerreviewed research project (RBPI1139). Canadian dairy network was acknowledged for providing data for this study. The authors also thank Drs Mehdi Sargozaei and Niel Karrow from University of Guelph, Dr Yuanming Zhang from journal of BMC Genetics and the other anonymous reviewer for their critical review and comments on the manuscript. This research has been enabled by the use of computing resources provided by WestGrid and Compute/Calcul Canada.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
LC, CL and FS contributed to the design of the study. LC, CL, SM and FS contributed to the development of the methods and discussions of analysis and results. LC performed the data analysis and drafted the manuscript. All authors read, provided comments and agreed on the contents of the manuscript.
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
About this article
Cite this article
Chen, L., Li, C., Miller, S. et al. Multipopulation genomic prediction using a multitask Bayesian learning model. BMC Genet 15, 53 (2014). https://doi.org/10.1186/147121561553
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121561553