Simulated datasets
Two random mating populations in linkage equilibrium were crossed generating a population (of size 5,000, coming from 100 families) with LD, which was subjected to five generations of random mating without mutation, selection or migration. The resultant population is an advanced generation composite, which presents Hardy-Weinberg equilibrium and LD. According to [29], the LD value in a composite population is \( {\Delta}_{\mathrm{a}\mathrm{b}}\kern0.5em =\kern0.5em \left(\frac{1\kern0.5em -\kern0.5em 2{\uptheta}_{\mathrm{a}\mathrm{b}}}{4}\right)\;\left(\;{\mathrm{p}}_{\mathrm{a}}^1\kern0.5em -\kern0.5em {\mathrm{p}}_{\mathrm{a}}^2\right)\;\left(\;{\mathrm{p}}_{\mathrm{b}}^1\kern0.5em -\kern0.5em {\mathrm{p}}_{\mathrm{b}}^2\right) \), where a and b are two SNPs, two QTLs, or one SNP and one QTL, θ is the frequency of recombinant gametes, and p1 and p2 are the allele frequencies in the parental populations (1 and 2). Notice also that the LD value depends on the allele frequencies in the parental populations. Thus, regardless of the distance between the SNPs and/or QTLs, if the allele frequencies are equal in the parental population, Δ = 0. The LD is maximized (|Δ| = 0.25) when θ = 0 and |p1 - p2| = 1. In this case, the LD value is positive with coupling and negative with repulsion [30].
From the advanced generation of the composite, 1,000 individuals were generated with diploid genomes having a length of 200 centimorgans (cM) (L = 2 Morgans) and assuming ten equally sized chromosomes, each one with two haplotypes. We simulated a marker density by assigning 2,000 equidistant SNP markers that were separated by 0.1 cM across the ten chromosomes. One hundred of the 2,000 markers were actually genes (QTL). A total of 1,000 individuals that came from the same generation and from 20 full-sib families (each one with 50 individuals) were genotyped and phenotyped. This simulation provides a typical small effective population size (Ne = 39.22) and a large LD in the breeding populations. Ne of approximately 40 and the use of 50 individuals per family are typical values in elite breeding populations of plant species.
The QTLs were distributed in the regions covered by the SNPs. For each trait, we informed the degree of dominance (d/a) and the direction of dominance (positive and/or negative). The obtained genotypic values for homozygotes were within the limits of Gmax = 100(m + a) and Gmin = 100(m - a), which are the maximum and minimum values, respectively.
Goddard et al. [31] presented the realized proportion (r
2
mq
) of genetic variation explained by the markers as \( {r}_{mq}^2=\frac{n}{n+{n}_{QTL}} \), where n
QTL
is the number of QTL. With n = 2,000 markers and n
QTL
= 100, we have r
2
mq
= 0.95. An alternative [14] takes n
QTL
= 2NeL = 2 39.22 2 = 156.88, producing r
2
mq
= 0.93. Another approach [32] provides r
2
mq
as \( {r}_{mq}^2=\frac{1}{1+4NeS}=\frac{1}{1+4\kern0.5em 39.22\kern0.5em 0.001}=0.86. \) L is the total length of the genome, and S is the spacing between markers (both in Morgans). These values reveal that the genome was sufficiently saturated by markers.
Traits with two genetic architectures were simulated, one following the infinitesimal model and the other with five major effects genes accounting for 50 % of the genetic variability. For the former, to each of 100 QTL one additive effect of small magnitude on the phenotype was assigned (under the Normal Distribution setting). For the latter, small additive effects were assigned to the remaining 95 loci. The effects were normally distributed with zero mean and variance, allowing the desired heritability level. The phenotypic value was obtained by adding to the genotypic value a random deviate from a normal distribution N (0, σ
2
e
), where the variance σ
2
e
was defined according to two levels of broad-sense heritability, 0.30 and 0.50, associated with narrow-sense heritabilities of approximately 0.20 and 0.35, respectively. Heritability levels were chosen to represent one trait with low heritability and another with moderate heritability, which addressed the cases where genomic selection is expected to be superior to phenotypic selection. The magnitudes of the narrow-sense and broad-sense heritabilities are associated with an average degree of dominance level (d/a) of approximately 1 (complete dominance) in a population with intermediate allele frequencies. Simulations assumed independence of additive and dominance effects, with dominance effects having the same distribution as the additive effects (both were normally distributed with zero mean). In the simulation, it was also observed that marker alleles had MAF (minor allele frequency) greater than 5 %.
Scenarios
For the populations of full-sib families, four scenarios were studied: two broad-sense heritability levels (approximately 0.30 and 0.50) × two genetic architectures. The scenarios were analyzed using 10 statistical methods (Table 1).
Additive-dominance model for the REML/G-BLUP method
A mixed linear model for individual additive breeding values (u
a
) and dominance deviations (u
d
) is as follows: y = Xb + Zu
a
+ Zu
d
+ e, with the variance structure given by \( {u}_a\sim N\left(0,{G}_a{\sigma}_{u_a}^2\right) \); \( {u}_d\sim N\left(0,{G}_d{\sigma}_{u_d}^2\right) \); e ~ N(0, Iσ
2
e
). An equivalent model [33] at the marker level is given by y = Xb + ZWm
a
+ ZSm
d
+ e, where:
$$ \begin{array}{l}{u}_a=W{m}_a;\kern0.5em \\ {}Var\left(W{m}_a\right)=WI{\sigma}_{m_a}^2W\hbox{'}=WW\hbox{'}{\sigma}_{m_a}^2;\\ {}{u}_d=S{m}_d;\\ {}\kern0.5em Var\left(S{m}_d\right)=SI{\sigma}_{m_d}^2S\hbox{'}=SS\hbox{'}{\sigma}_{m_d}^2.\end{array} $$
W and S are the incidence matrices for the vectors of additive (m
a
) and dominance (m
d
) marker genetic effects. The variance components associated to these effects are \( {\sigma}_{m_a}^2 \) and \( {\sigma}_{m_d}^2 \), respectively. G
a
and G
d
are the genomic relationship matrices for the additive and dominance effects. The quantity m
a
in one locus is the allele substitution effect and is given by m
ai
= α
i
= a
i
+ (q
i
− p
i
)d
i
, where pi and qi are allelic frequencies and a
i
and d
i
are the genotypic values for one homozygote and heterozygote, respectively, at locus i. In turn, the quantity m
d
can be directly defined as m
di
= d
i
.
The matrices W and S, which will be defined later, are based on the values 0, 1 and 2 for the number of one of the alleles at the i marker locus (putative QTL) in a diploid individual. Several parameterizations are available, and the one that matches well with classical quantitative genetics theory [34] is as follows [5, 24, 25, 35].
The correct parameterization of W and S is as follows, according to the marker genotypes at a locus m.
$$ W=\left\{\begin{array}{l}\mathrm{If}\kern0.24em \mathrm{M}\mathrm{M},\ \mathrm{then}\kern0.48em 2-2p\kern0.6em \to 2q\\ {}\mathrm{If}\;\mathrm{M}\mathrm{m},\ \mathrm{then}\kern0.5em 1\kern0.36em -2p\kern0.48em \to q-p\\ {}\mathrm{If}\kern0.24em \mathrm{m}\mathrm{m},\ \mathrm{then}\kern0.36em 0-2p\kern0.6em \to -2p\end{array}\right. $$
(1)
$$ S=\left\{\begin{array}{l}\mathrm{If}\kern0.24em \mathrm{M}\mathrm{M},\ \mathrm{then}\kern0.24em 0\kern0.48em \to -2{q}^2\\ {}\mathrm{If}\kern0.24em \mathrm{M}\mathrm{m},\ \mathrm{then}\kern0.48em 1\to 2pq\\ {}\mathrm{If}\kern0.24em \mathrm{m}\mathrm{m},\ \mathrm{then}\kern0.6em 0\kern0.36em \to -2{p}^2\end{array}\right. $$
(2)
The covariance matrix for the additive effects is given by G
a
σ
2
a
= Var(Wm
a
) = WW ' σ
2
ma
, which leads to \( {G}_a=WW\hbox{'}/\left({\sigma}_{ma}^2/{\sigma}_a^2\right)=WW\hbox{'}/{\displaystyle \sum_{i=1}^n\left[2{\mathrm{p}}_i\left(1-{p}_i\right)\right]} \), as \( {\sigma}_a^2={\displaystyle \sum_{i=1}^n\left[2{\mathrm{p}}_i\left(1-{p}_i\right)\right]}{\sigma}_{ma}^2 \). The covariance matrix for the dominance effects is given by G
d
σ
2
d
= Var(Sm
d
) = SS ' σ
2
md
. Thus, \( {G}_d=SS\hbox{'}/\left({\sigma}_{md}^2/{\sigma}_d^2\right)=SS\hbox{'}/{\displaystyle \sum_{i=1}^n{\left[2{\mathrm{p}}_i\left(1-{p}_i\right)\right]}^2} \), as \( {\sigma}_d^2={\displaystyle \sum_{i=1}^n{\left[2{\mathrm{p}}_i\left(1-{p}_i\right)\right]}^2}{\sigma}_{md}^2 \).
The additive-dominance G-BLUP method was fitted using GVC-BLUP software [26] via REML through mixed model equations.
Bayesian Ridge Regression (BRR) method
A Bayesian additive-dominance G-BLUP or Bayesian Ridge Regression (BRR) method was fitted using GS3 software [36] via MCMC-REML/BLUP assigning flat (i.e., with degrees of freedom equal to −2, which turns the inverted chi-square into a uniform distribution) prior distributions for variance components. (The a priori flat is a the noninformative one).
BayesA and BayesB methods
The BayesA and BayesB methods, described by [1], are advantageous because they can potentially provide information on the genetic architecture of the quantitative trait.
In these methods, specific variances are allowed at each locus. Additionally, BayesB performs variable selection because the majority of the markers are not in LD with the genes. Thus, a set of markers associated with a trait must be identified. The BayesB method subjectively determines π, the proportion of markers having effects. Using the indicator variable I, in the BayesA and BayesB models, the additive genetic effect of an individual j is defined as \( {a}_j={\displaystyle \sum_{i=1}^n{m}_{ai}{w}_{ij}{I}_{ai}} \), where I
ai
= (0, 1). The distribution of I
a
= (I
a1 … I
an
) is binomial with a probability π, which is 1 for BayesA and is subjectively determined for BayesB. The quantities of w
ij
are elements of the marker genotype matrix W. Dominance effects are coded in a similar way: \( {d}_j={\displaystyle \sum_{i=1}^n{m}_{di}{s}_{ij}{I}_{di}} \).
These Bayesian methods assume that the conditional distribution of each marker effect (given its variance) follows a normal distribution, i.e., m
ai
|σ
2
mai
~ N(0, σ
2
mai
). The variances of the marker effects are assumed to be a scaled inverse chi-square distribution with v degrees of freedom and scale parameter \( {S}_{m_a}^2 \), i.e., σ
2
mai
~ χ
− 2(ν
ma
, S
2
ma
). This assumption implies that a larger number of markers has small effects and a small number of markers has large effects, which leads to a univariate t-distribution of the marker effects with mean zero [37]. Gianola et al. [2] proved that fitting a variance by locus in this way is equivalent to postulating a t distribution for all loci. Thus, the identification of relevant marker effects is more likely in the t-BayesA model than in the normal-RR-BLUP model.
For the Bayes methods, the marginal prior distribution for additive marker effects is \( {m}_{ai}\Big|{\nu}_{m_a},{S}_{m_a}^2\sim t\left(0,{\nu}_{m_a},{S}_{m_a}^2\right) \). The combination of normal (for marker effects) and inverse chi-square distributions (for variances) leads to a t distribution for m
ai
, and thus a longer tail than that for normal distribution. In this paper, the values 6 and 8 were assigned for v to provide sufficiently thick tails associated to t distributions [38], and \( {S}_{m_a}^2 \) was calculated from the additive variance according to the method of [39].
For dominance effects at the intra-population level, the distributions are similar to what was described for additive effects. Thus, m
di
|σ
2
mdi
~N(0, σ
2
mdi
) for the marker dominance effects; σ
2
mdi
~χ
− 2(ν
md
, S
2
md
) for the marker dominance variance, with the marginal of the prior distribution for marker dominance effects given by \( {m}_{di}\Big|{\nu}_{m_d},{S}_{m_d}^2\sim t\left(0,{\nu}_{m_d},{S}_{m_d}^2\right) \).
Additive and dominance variances are given by \( {\sigma}_a^2={\displaystyle \sum_{i=1}^n2{\mathrm{p}}_i\left(1-{p}_i\right)}{m}_{ai}^2 \) and \( {\sigma}_d^2={\displaystyle \sum_{i=1}^n{\left[2{\mathrm{p}}_i\left(1-{p}_i\right)\right]}^2}{m}_{di}^2 \), respectively, according to the parameterizations in W and S. The full conditional distributions for the parameters of the BayesA and BayesB models were presented in detail by [18].
BayesA*B* or IBLASSOt method
According to [40], a strong influence of prior parameters on predictive ability was observed in the BayesA and BayesB models. Variation in the scale parameters S
2
ma
and S
2
md
in these methods had a strong impact on prediction. An overlarge scale (S
2
ma
or S
2
md
) for the prior distribution of variance led to overfitting of the data, while a scale parameter that was too small led to underfitting due to excessive shrinkage of the effects. In both cases, the predictive ability is considerably reduced. Consequently, to obtain good predictive abilities, an appropriate choice of hyperparameters is necessary to prevent both over- and underfitting.
The differences between the explicit regression GWS methods are mainly due to the type and extent of the shrinkage imposed by the method, the ability to learn from the data, and the influence of prior distributions. In the case of N < <<n (n is the number of markers and N is the number of individual observations), learning from the data is difficult to verify because the data (likelihood) do not dominate the posterior distribution. Thus given the same sampling model postulated by the methods, likelihood shrinkage properties are not very different. Thus, any differences in posterior inferences between these methods must be because priors are influential and very different [38]. Based on this analysis, it can be asserted that different methods can be fitted with the same machinery only by somehow drastically altering the prior distribution.
The Bayesian Lasso method provides better learning from the data than BayesA and BayesB [2, 38]. The difference between the Bayesian LASSO and the Bayesian approaches (BayesA and BayesB) developed by [1] is derived from the different specifications of the prior variance of the marker-specific regression coefficient as well as the type and extent of shrinkage effected.
For this reason, we chose to implement BayesA using the BLASSO framework by specifying the prior distribution through appropriate degrees of freedom (6 and 8) for the scaled inverse chi-square distribution associated with marker genetic variance (and then with the penalization parameter λ). This produces a t-like distribution, which is an intermediate between the normal (of the RR-BLUP) and double exponential (of the Lasso) distributions and provides desirable shrinkage estimates for the QTL effects, as does BayesA.
By fitting in this way (via BLASSO), BayesA has better learning properties. This improved BayesA can be called BayesA* and can turn out to be Bayes B* if the BLASSO machinery effectively leads a large number of markers to zero effects. In this case, the method will be called Bayes A*-B* (or t-Bayesian Lasso) because it conjugates the priors of BayesA and the type and extent of shrinkage (covariable selection) of the Blasso method. In their fast BayesB method, [41] changed the prior distribution of marker effects from a Student-t distribution to a double exponential of Laplace, which improved the model and perhaps made it closer to the BLASSO method. Kärkkäinen and Sillanpää [42] discussed the interchange of Student-t and Laplace (DE) as prior distributions of marker effects. Another possible name for Bayes A*-B* is t-BLASSO, meaning Bayesian Lasso [43] with a t distribution as the prior for marker effects.
Bayes A*-B* methods were fitted using GS3 software [36] via MCMC assigning with 6 and 8° of freedom for the inverted chi-square distribution for genetic variance (and then with the penalization parameter λ), which converts the prior for marker effects into a t distribution. This approach is expected to produce results similar to the Bayes methods of [1] but with the learning ability of the BLASSO method. Additionally, the BLASSO is asymptotically free of prior information and more consistent than BayesB and does not require tuning.
BLASSO and IBLASSO methods
In the Bayesian Lasso [44], the prior assigned to marker effects is a Laplace (double exponential, DE) distribution. All marker effects are assumed to be independently and identically distributed as a DE. This prior assigns the same variance or prior uncertainty to all marker effects, but it possesses thicker tails than the normal or Gaussian prior. Comparative discussions of the DE prior are in [45] and [46].
With two variance components (σ
2
e
and σ
2
ma
), the model is called an improved Bayesian Lasso (IBLASSO) [43]. The practical implementation of this model via Gibbs sampling, including the full posterior conditional distributions, was described by [43]. For dominance effects, similar distributions hold as described for additive effects.
Concerning the IBLASSO of [43], [38] criticizes the choice of a uniform flat prior on the regularization parameter λ. Because of this criticism, our paper used two alternative priors: a similar flat prior and also a prior with 4° of freedom on the parameter λ, as in the case of the BLASSO. Computations were performed in the GS3 Software.
Ridge Regression with heterogeneity of variances (RR-HET)
An additive-dominance Ridge Regression (RR-BLUP) method can also be implemented that considers the heterogeneity of variances between markers, called RR-HET. In our paper, the matrices with specific variances for each marker, D
a
= diag(τ
21a
, τ
22a
, …, τ
2
na
) and D
d
= diag(τ
21d
, τ
22d
, …, τ
2
nd
), were obtained by the BLASSO method (4, −2) using GS3 software.
Fitting models
Each type of population was simulated 10 times under the same parameter settings, which preserved the same features and provided samples that were effectively of the same conceptual population. Nine replicates were used as training populations, and one replicate was used as a validation population. The estimations based on each of the nine replicates were validated by obtaining estimates of the parameters accuracy and bias. Validation and reference individuals belonged to the same population but to different families.
In each replicate, marker effects were estimated and used to estimate the genetic values of individuals in the tenth population. These estimated genetic values were correlated with the parametric genetic values of individuals of the tenth population, providing the accuracy values. The results from the nine analyses were averaged across replicates to obtain final accuracies and heritabilities for each scenario.
Methods for computing parametric accuracies under the additive-dominance models were derived following the method of [6]. The following formulas were obtained:
Additive accuracy: \( {r}_{a\widehat{a}}=\sqrt{\frac{r_{mq}^2\left(N{r}_{mq}^2{h}_a^2/{n}_{QTL}\right)}{1+N{r}_{mq}^2{h}_g^2/{n}_{QTL}}} \)
Dominance accuracy:
\( {r}_{d\widehat{d}}=\sqrt{\frac{r_{mq}^2\left(N{r}_{mq}^2{h}_d^2/{n}_{QTL}\right)}{1+N{r}_{mq}^2{h}_g^2/{n}_{QTL}}} \)
Genotypic accuracy:
\( {r}_{g\widehat{g}}=\sqrt{r_{a\widehat{a}}^2+{r}_{d\widehat{d}}^2} \),
where nQTL is the number of QTL, N is the number of individuals in the estimation dataset, and h
2
a
, h
2
d
and h
2
g
are additive, dominance and total heritability, respectively.
For Bayesian methods, we used 120,000 iterations for the MCMC algorithms of the different models, with the first 20,000 iterations discarded as burn in. After every set of 10 iterations (thin) were performed, a sample was retained to calculate a posteriori statistics. Hence, 10,000 MCMC samples were used to construct the posterior densities. The convergence of the Markov chains was checked with a [47] diagnostic and also by visualizing the trace plot and running repeated progressive analyses until convergence was met. Posterior distributions were plotted (Fig. 1) to view the Bayesian learning of the methods. A summary of the fitted models is presented in Table 1.
Decomposing the quantitative genetic information
The three types of quantitative-genetic information can be defined as in [28]:
Linkage disequilibrium: refers to founder alleles from different loci in the same gamete, and the loci are in LD (not sampled independently, i.e., in population level disequilibrium) and describe genetic relationships between founders.
Co-segregation: refers to non-founder alleles (not in LD and not identical by descent from the base population) from different loci in the same gamete, and the loci are linked (not transmitted independently, i.e., in population level equilibrium but in within-family level disequilibrium).
Genetic relationships: statistical dependency between alleles from the same locus in different gametes. This kind of information is of three types: When associated with markers, it refers to parentage only on the marker loci and does not involving a linkage between markers and QTL; when associated with the pedigree of individuals in a model with both markers and pedigree, it refers to residual polygenic effects; when associated with the pedigree of individuals only, it refers to total polygenic effects.
G-BLUP makes use of the following: (i) co-segregation of QTL and markers due to linkage; (ii) pedigree genetic relationships between markers not linked to QTL; and (iii) LD between markers and genes to capture relationships at QTL [28]. The genomic relationship matrix is called the realized relationship, as it describes IBD at SNP, assuming an ancient founder population. However, only genetic relationships at QTL matter.
The genomic relationship matrix includes LD, co-segregation and pedigree genetic relationships between markers not linked to QTL (for example, in structured populations). Habier et al. [28] derived formulas for proving that all three sources of information are used by G-BLUP.
The data sets analyzed were as follows: overall (raw or without any correction of the phenotypes); within-family deviations across families (with correction of the phenotypes for family effects and analyzing families altogether); and within each family with posterior averaging (with correction of the phenotypes for family effects and analyzing one family at a time). The accuracy of genomic selection in the analysis using the within-each-family with posterior averaging dataset is due to LD and co-segregation. In the analysis using the dataset from within-family deviations across families, the accuracy is due only to LD, while the accuracy of the analysis with the overall dataset is due to family IBD relationships, LD and co-segregation.