Observed data
Phenotypes for three traits were available from 34,425 pure-bred Duroc boars that were part of the Danish pig-breeding system. All boar testing was conducted at the national test station Bøgildgaard (Pig Research Centre, Danish Agriculture and Food Council, Denmark). The phenotypic records included average daily gain (ADG; g/day) from 30 kg–100 kg body weight, feed efficiency (FE; feed units/kg gain), and lean meat percentage (LMP). At the end of the test period, all boars were weighed and back-fat was measured by ultrasound and used to predict LMP. The pedigree was traced back to 1984, consisted of 419,961 animals, and included 256 unknown parents (base animals).
Genotypes were obtained for 3,085 of the phenotyped animals using either Illumina’s Porcine SNP60 BeadChip or Illumina’s 8.5 K GGP-Porcine Low Density Bead SNP chip. Genotypes of animals genotyped with the 8.5 K SNP chip were imputed to the SNP60 chip as described by [23]. A total of 33,029 of the 60 K SNPs fulfilled the following editing criteria and were used in our analyses: call rate of SNPs greater than 90 %, minor-allele frequency greater than 0.01, showed Hardy Weinberg expectations (p(χ
21
) > 10− 7), and allocated a chromosomal position on build Sscrofa10.2 [24]. All animal samples had call rates greater than 80 %.
Adjusted phenotypes used in genomic model analyses
The phenotypes used in the genomic model analyses were derived from phenotypic records of growth traits adjusted for relevant environmental factors using the following linear mixed model:
$$ \mathbf{y}=\mathbf{X}\mathbf{b}+{\mathbf{Z}}_{\mathrm{p}}\mathbf{p}+{\mathbf{Z}}_l\mathbf{l}+{\mathbf{Z}}_{\mathrm{a}}\mathbf{a}+\mathbf{e}\kern3em \left({\mathrm{M}}_{\mathrm{a}}\right) $$
where y is a vector of phenotypic observations; X is a design matrix for the fixed effects (starting weight, year, and section); Z
p is a design matrix for the random effect of pen; Z
l
is a design matrix for the random effect of litter; Z
a is a design matrix for the random additive genetic effect of animal (inter-individual variation determined from pedigree information); b is the vector of fixed effects; p, l, and a are vectors of random pen effects, litter effects, and animal effects, respectively; and e represents the residuals. The random effects and residuals were assumed to be independent normally distributed variables described as follows: p ~ N(0, I
pσ
2p
), l ~ N(0, I
lσ
2l
), a ~ N(0, Aσ
2a
), and e ~ N(0, Iσ
2
e
). The relationship matrix A was constructed using pedigree information. The variance components σ
2p
, σ
2l
, σ
2a
, and σ
2e
were estimated using an average information REML procedure [25]. The adjusted phenotypes used as response variables for genomic model analysis were calculated as the sum of the estimated residuals e and additive genetic effects a. This procedure enabled the use of all available phenotypes to estimate the fixed and random environmental effects, regardless of whether the animal was genotyped.
Statistical analyses using genomic models
We performed analyses using two different genomic models: GBLUP and GFBLUP using prior information on genomic features. These models were compared based on their predictive abilities, the proportion of phenotypic variance explained by genomic effects, and the precision of the estimated genomic parameters. Analyses utilized both observed and simulated phenotypic data.
The GFBLUP model was based on a linear mixed model including two random genomic effects:
$$ \mathbf{y}\hbox{'}=\mu +\mathbf{Z}\mathbf{f}+\mathbf{Z}\mathbf{r}+\mathbf{e}\kern3em \left({\mathrm{M}}_{\mathrm{GF}}\right) $$
where y is the vector of adjusted phenotypes, µ is an overall mean, Z is the design matrix linking observations to genomic values, f is the vector of genomic values captured by genetic markers linked to the genomic feature of interest, r is the vector of genomic values captured by the remaining set of genetic markers, and e is the vector of residuals. The random genetic effects and the residuals were assumed to be independent normally distributed values described as follows: f ~ N(0, G
f
σ
2
f
), r ~ N(0, G
r
σ
2
r
), and e ~ N(0, I
σ
2
e
).
The GBLUP model was based on a linear mixed model including only one random genomic effect:
$$ \mathbf{y}\hbox{'}=\mu +\mathbf{Zg}+\mathbf{e}\kern3em \left({\mathrm{M}}_{\mathrm{G}}\right) $$
where y is the vector of phenotypic observations, µ is an overall mean, Z is the design matrix linking observations to genomic values, g is the vector of genomic values captured by all genetic markers, and e is the vector of residuals. The random genomic values and the residuals were assumed to be independent normally distributed values described as follows: g ~ N(0, G
σ
2
g
) and e ~ N(0, I
σ
2
e
).
The additive genomic relationship matrix G was constructed using all genetic markers [2] as follows: G = WW
'/m, where W is the centered and scaled genotype matrix, and m is the total number of markers. Each column vector of W was calculated as follows: \( {\boldsymbol{w}}_{\boldsymbol{i}}=\frac{{\boldsymbol{m}}_{\boldsymbol{i}}-2{p}_i}{\sqrt{2{p}_i\left(1-{p}_i\right)}} \), where p
i
is the minor allele frequency of the i
th genetic marker, and m
i
is the i
th column vector of the allele count matrix M, which contains the genotypes coded as 0, 1, or 2 depending on the number of copies of the minor allele. The G
f
and G
r
was constructed similarly using only the genetic marker set defined by the genomic feature and the remaining set of markers, respectively.
Estimation of genomic parameters
The variance components \( {\hat{\upsigma}}_{\mathrm{f}}^2,{\hat{\upsigma}}_{\mathrm{r}}^2,{\hat{\upsigma}}_{\mathrm{g}}^2,\mathrm{and}\ {\hat{\upsigma}}_{\mathrm{e}}^2 \) were estimated using an average information REML procedure [25], as implemented in DMU [26]. For this process, we used the generalized inverse of the genomic relationship matrices. This was necessary because these matrices were not full rank due to centring, as well as in cases where the number of genetic markers was smaller than the number of phenotypic records. From these variance components, inferences on genomic heritability were based on the following ratios: \( {\hat{h}}_{GBLUP}^2=\frac{{\hat{\upsigma}}_{\mathrm{g}}^2}{{\hat{\upsigma}}_{\mathrm{g}}^2+{\hat{\upsigma}}_{\mathrm{e}}^2} \), for GBLUP, and \( {\hat{h}}_{GBLUP}^2=\frac{{\hat{\upsigma}}_{\mathrm{f}}^2 + {\hat{\upsigma}}_{\mathrm{r}}^2}{{\hat{\upsigma}}_{\mathrm{f}}^2+{\hat{\upsigma}}_{\mathrm{r}}^2+{\hat{\upsigma}}_{\mathrm{e}}^2} \) for GFBLUP. Inferences on partitioning of genomic variance in GFBLUP were based on the following ratios: \( {\hat{h}}_f^2=\frac{{\hat{\upsigma}}_{\mathrm{f}}^2}{{\hat{\upsigma}}_{\mathrm{f}}^2+{\hat{\upsigma}}_{\mathrm{r}}^2} \) and \( {\hat{h}}_r^2=\frac{{\hat{\upsigma}}_{\mathrm{f}}^2}{{\hat{\upsigma}}_{\mathrm{f}}^2+{\hat{\upsigma}}_{\mathrm{r}}^2} \). These ratios quantified the proportions of total genomic variance explained by the genetic markers in the genomic feature, and by the remaining set of genetic markers not part of the genomic feature.
Model statistics for comparing genomic models
The predictive abilities of the models were assessed using bootstrap validations. The training population included 1,814 of the animals born in 1998–2010 and for which we had both phenotypes and genotypes. To ensure a gap of at least one generation from the training population, the validation population comprised 1,271 genotyped boars that were born between 2012 and 2014. We evaluated the models’ predictive abilities by calculating the correlation between the observed phenotype y and the total genomic value—which was \( \hat{\mathbf{g}\ } \) for GBLUP, and \( \hat{\mathbf{g}} = {\hat{\mathbf{g}}}_f+{\hat{\mathbf{g}}}_r \) for GFBLUP. This was completed by first randomly sampling 1/5 of the animals in the validation set, and then calculating the correlation between the observed phenotype and the total genomic value. This procedure was repeated 100 times and the predictive ability was defined as the average correlation of 100 bootstrap samples (± standard error).
GBLUP approach for identifying genomic features associated with phenotypes
To identify phenotype-associated genomic features, we used a GBLUP-derived procedure for evaluating the collective action of a set of genetic markers. This approach is based on computing a summary statistic for the set of genetic markers that measures the degree of association between the genetic feature and the phenotypes. This summary statistics can be computed several ways using single-marker effects and test statistics.
Single-marker effects and test statistics
The single-marker effects \( \hat{\mathbf{s}\ } \) can be computed from the predicted genomic effect \( \hat{\mathbf{g}\ } \) [25, 27] as follows:
$$ \hat{\mathbf{s}} = \mathbf{W}\hbox{'}{\left(\mathbf{W}\mathbf{W}\hbox{'}\right)}^{-1}\hat{\mathbf{g}\ } $$
The variance of the single-marker effects can be calculated with the following equation:
$$ Var\left(\hat{\mathbf{s}\ }\right)=\mathbf{W}\hbox{'}{\left(\mathbf{W}{\mathbf{W}}^{\hbox{'}}\right)}^{-1}\mathrm{V}\mathrm{a}\mathrm{r}\left(\hat{\mathbf{g}\ }\right){\left(\mathbf{W}\mathbf{W}\hbox{'}\right)}^{-1}\mathbf{W}\hbox{'} $$
In this expression, \( \mathrm{V}\mathrm{a}\mathrm{r}\left(\hat{\mathbf{g}\ }\right) \) is the variance of the predicted genomic effect [28], which can be derived from the inverse of the coefficient matrix of the mixed model equations as G − C
gg, where C
gg corresponds to the genomic effects.
A test statistic for a single genetic marker effect is computed as follows:
$$ {t}_{{\hat{\mathbf{s}}}_{\boldsymbol{j}}}=\frac{{\hat{\mathbf{s}}}_{\boldsymbol{j}}}{\sqrt{Var\left({\hat{\mathbf{s}}}_{\boldsymbol{j}}\right)}} $$
where \( Var\left({\hat{\mathbf{s}}}_{\boldsymbol{j}}\right) \) is the estimate of variance of the j’th element of \( \hat{\mathbf{s}\ } \), obtained from the j’th element of the diagonal of the (co)variance matrix of the single-marker effects. Under the null hypothesis that \( {\hat{\mathbf{s}}}_{\boldsymbol{j}} = 0 \), it is assumed that \( {t}_{{\hat{\mathbf{s}}}_{\boldsymbol{j}}} \) follows a t distribution with dfe residual degrees of freedom [29]. The residual degrees of freedom dfe is computed as tr(I–H), which is equivalent to n-tr(H) where n is the total number of phenotypic observations and tr(H) represents the degrees of freedom occupied by the penalised fit (e.g. the linear mixed model fit). The hat matrix H transforms y into \( \hat{\mathbf{y}} \) [30]. Although the individual p values calculated using this method differ from those obtained via traditional methods, the ranking of the p values will be the same.
Summary statistic for a genomic feature derived from single-marker statistics
For each genomic feature, we constructed an appropriate summary statistic that measured the degree of association between the marker set and the phenotypes. We considered two different summary statistics. The first summary statistic was based on counting the genetic markers in the feature that were associated with the trait phenotype, as follows:
$$ {\mathrm{T}}_{\mathrm{count}} = {\displaystyle \sum_{i=1}^{m_f}}\mathrm{I}\left({t}_i>{t}_0\right) $$
where mf is the number of markers in the feature, ti is the i’th single-marker test statistic (e.g. t-statistic), t0 is an arbitrarily chosen threshold for the single-marker test statistics, and I is an indicator function that has a value of 1 if t
i
> t
0. However, no matter how the threshold is selected for determining “significant associations,” it is somewhat arbitrary, and genetic markers with slightly differing test statistics may be treated completely differently. By design, this test has high power to detect association if the genomic feature harbours genetic markers with large effects, but it will not detect a genomic feature with many genetic markers having small to moderate effects [31]. In such a case, it would be more powerful to use a summary statistic, such as the mean or sum of the test statistic for all genetic markers belonging to the same genomic feature. Thus, we also utilized a second summary statistic based on summing the single genetic marker test statistics in the feature, as follows:
$$ {\mathrm{T}}_{\mathrm{sum}}={\displaystyle \sum_{i=1}^{m_f}}{t}_i^2 $$
where ti represents the i’th single variant test statistics, e.g. marker effects or t-statistics.
Testing for association between a genomic feature and a phenotype
A genomic feature was considered significant if the associated summary statistics were more extreme than the cut-off set based on an empirical distribution of random marker sets of same size as the genomic feature. This was tested using a competitive null hypothesis, i.e. that the degree of association of the feature set was the same as that of a random marker set [32]. To this end, we obtained an empirical distribution of the test statistic by sampling random marker sets. A null hypothesis is only competitive if the parameters influencing the summary statistic are identical to the alternative hypothesis. Thus, there must be an equal number of markers for the random set and the true set, and the correlation structure among markers (due to linkage disequilibrium) should be retained. The empirical distribution of the summary statistics was obtained using the following permutation procedure. First, the observed test statistic was ordered accordingly to the physical position of the SNPs, and an element (i.e. one test statistic) was randomly selected from this vector. All elements were then shifted to new positions—such that the selected one became the first element, with the remaining SNPs shifted to new positions, but maintaining the original order. A new summary statistic was then computed based on the original position of the genomic features. This uncouples any associations between SNPs and the genomic feature, while retaining the correlation structure among test statistics. The permutation was repeated 1,000 times for each set in the feature class, and empirical p values were obtained through one-tailed tests of the proportion of randomly sampled summary statistics larger than that observed.
Genomic feature classes
Several strategies were used to define genetic marker sets that formed different classes of genomic features used in GBLUP and GFBLUP model analyses.
First, genomic features were derived from single-marker association test statistics (single-marker sets). A standard t-test was used to assess the single-marker statistical significance of the regression effect for individual SNPs. When an SNP was determined to be significantly associated with the genomic value based on a pre-specified significance cut-off level, the corresponding genome regions were then considered to define a “genomic feature.” These steps were repeated with decreasing significance cut-offs, thereby increasing the genomic region of the feature (SNP set).
Second, including or excluding SNPs from a genomic feature based on single-marker association tests can result in over-fitting of the data [33]. To ameliorate this risk, we created block sets of 50 markers that were physical adjacent on the genome, and we tested the associations of these marker sets with the trait using the above-described summary statistics. The significance of the association between the marker sets and the trait was determined using a pre-defined set of cut-off levels. Marker sets with p values below the cut-off were included in the genomic feature set.
Third, to assess the benefit of including prior data in GFBLUP models, we derived genomic features from the summary statistics of a group of genetic markers defined by a previously identified QTL region (a QTL set). The QTLs recorded in the Pig QTL database [11] are organized based on trait ontology, and we used the 167 traits listed in the Vertebrate Trait Ontology column. A trait can have multiple associated QTLs originating from several sources. We utilized the QTLs comprising the QTL set for the selected trait. The markers of our data set were grouped according to the genomic locations of QTL sets for the 167 trait categories downloaded from the database. The genomic region spanned by each individual QTL was standardized to 250 kb on each side of the QTL midpoint. Only QTL sets spanning >2 SNPs were used in the analysis. A marker set containing the SNPs that was not included in any of the QTL sets and a set containing all markers was added to this genomic feature class, resulting in a total of 169 tested marker sets. The number of SNPs in each QTL set is shown in Additional file 2.
Simulated data
We also established a series of simulation studies to investigate factors influencing the power to detect genomic features affecting the trait phenotype, estimation of genomic parameters, and prediction ability of the two tested linear mixed models. We used the method described in [34] pp. 98. The genetic values and residuals were simulated in R using the function mvrnorm from the library MASS [35]. The factors varied in the simulations included genomic heritability (h
2), proportion of genomic variance explained by causal SNPs in the genomic feature (h
2
f
), proportion of non-causal SNPs in the genetic marker set defined by the genomic feature (dilution), genome distribution of causal SNPs (causal model) (i.e. how the causal SNPs were physically distributed on the genome: random or clustered), and the number of phenotypic observations available for analysis (Nobs).
Genotypes
The simulations were based on the real genotype data set including 3,085 individuals and 33,029 SNPs. In all scenarios, the number of causal SNPs was equal to 1,000. Causal sets were divided into two subsets. The first subset C
1 included 100 SNPs and was used as the causal SNP set in the genomic feature that explains 10 %, 20 %, 30 %, or 50 % of the genomic variance. The second subset C
2 included 900 SNPs and explained the remaining genomic variance. To mimic relevant genetic scenarios, the genome distribution of the causal SNPs in the genomic feature was simulated using two different causal models: a random and a cluster model. The cluster model illustrated causal SNPs among connected genes in QTL regions. On the other hand, the random model provides an example of a trait with causal variants distributed in genes, which are linked to many different processes such that the pattern seems random. For the clustered causal model, the 100 causal SNPs in C
1 were chosen from 20 randomly selected genomic regions spanning 50 SNPs each, and the remaining 900 SNPs in C
2 were randomly selected from the complete SNP set. For the random causal model, the SNPs in C
1 and C
2 were randomly selected from the complete SNP set. To investigate the effects of non-causal SNPs within the causal sets, we added an increasing number of non-causal SNPs (100, 200, …, 1,900, 2,000), to the causal sets, in a process referred to as dilution. To determine the false-positive rate, 50 marker sets (referred to as a non-causal SNP set) of varying sizes (100, 500, 1,000, and 5,000) were sampled among the non-causal SNPs.
Phenotypes
Phenotypes were simulated using the following linear model: y = g
1 + g
2 + e, where g1 ~ N(0, G
1 * σ
2
g1
), g2 ~ N(0, G
2 * σ
2
g2
), and e ~ N(0, I * σ
2
e
). G1 and G2 are the genomic relationship matrices for causal SNPs in C1 and C2, respectively. The total phenotypic variance σ
2P
= σ
2g1
+ σ
2g2
+ σ
2e
was 100 in all scenarios. We simulated data under additive genomic heritabilities \( \left({h}^2=\frac{\sigma_{g1}^2+{\sigma}_{g2}^2}{\sigma_{g1}^2+{\sigma}_{g2}^2+{\sigma}_e^2}\right) \) of 0.1, 0.2, or 0.3, to analyse scenarios with low to intermediate heritabilities, reflecting those observed in the real data. To analyse scenarios with non-uniform SNP effects, the proportion of additive genomic variance explained by the causal SNPs in C
1
\( \left({h}_f^2=\frac{\sigma_{g1}^2}{\sigma_{g1}^2+{\sigma}_{g2}^2}\right) \) was varied across scenarios: 0.1, 0.2, 0.3, or 0.5. These parameters were investigated for three population sizes (Nobs): 1000 (1 K), 2000 (2 K), and 3000 (3 K). These variations resulted in a total of 72 individual simulated data sets [3 (Nobs) × 3 (h2) × 4(h
2f
) × 2 (causal model)], which were each replicated 50 times. Table 2 presents an overview of the factors included in the simulation. The simulated data were analysed using the above-described linear mixed models, permutation, and cross validation procedures.
Ethics
The present study was not subject to ethical approval since it was based on pre-existing data belonging to the Danish Agriculture and Food Council, Pig Research Centre, and did not require the application of additional experimental procedures. The simulated data is available upon request.
Availability of data and materials
The genotypic and phenotypic data on the Danish Duroc population used in this study is private property of the Danish pig breeders and the authors are not at liberty to disclose them in the public domain. However, the simulated data are available upon request.