- Methodology article
- Open access
- Published:
Design of experiments for fine-mapping quantitative trait loci in livestock populations
BMC Genetics volume 21, Article number: 66 (2020)
Abstract
Background
Single nucleotide polymorphisms (SNPs) which capture a significant impact on a trait can be identified with genome-wide association studies. High linkage disequilibrium (LD) among SNPs makes it difficult to identify causative variants correctly. Thus, often target regions instead of single SNPs are reported. Sample size has not only a crucial impact on the precision of parameter estimates, it also ensures that a desired level of statistical power can be reached. We study the design of experiments for fine-mapping of signals of a quantitative trait locus in such a target region.
Methods
A multi-locus model allows to identify causative variants simultaneously, to state their positions more precisely and to account for existing dependencies. Based on the commonly applied SNP-BLUP approach, we determine the z-score statistic for locally testing non-zero SNP effects and investigate its distribution under the alternative hypothesis. This quantity employs the theoretical instead of observed dependence between SNPs; it can be set up as a function of paternal and maternal LD for any given population structure.
Results
We simulated multiple paternal half-sib families and considered a target region of 1 Mbp. A bimodal distribution of estimated sample size was observed, particularly if more than two causative variants were assumed. The median of estimates constituted the final proposal of optimal sample size; it was consistently less than sample size estimated from single-SNP investigation which was used as a baseline approach. The second mode pointed to inflated sample sizes and could be explained by blocks of varying linkage phases leading to negative correlations between SNPs. Optimal sample size increased almost linearly with number of signals to be identified but depended much stronger on the assumption on heritability. For instance, three times as many samples were required if heritability was 0.1 compared to 0.3. An R package is provided that comprises all required tools.
Conclusions
Our approach incorporates information about the population structure into the design of experiments. Compared to a conventional method, this leads to a reduced estimate of sample size enabling the resource-saving design of future experiments for fine-mapping of candidate variants.
Background
Genomewide association studies (GWAS) help exploring the relationship between genetic and phenotypic variation. Genetic variation is often expressed in terms of genomic markers such as single nucleotide polymorphisms (SNPs). Identified variants may or may not be part of known genes. In a candidate-gene approach, variants are then assigned to the closest known gene and their functional importance can be studied further (e.g., [1]). The functional meaning of a variant may be differently interpreted if, due to statistical uncertainty, it was identified a few kbp upstream or downstream of its position. In general, it could be a complicated task to detect single loci as reported by Sahana et al. [2] in a study on udder health in dairy cattle. Instead of identifying important SNPs for clinical mastitis, only target regions were found. For instance, a window of about 1 Mbp length was detected on BTA6. A statistical reason for this complication lies in the high multicollinearity among predictor variables due to linkage and linkage disequilibrium (LD) between SNPs (e.g., [3]). Region-based aggregation tests in biologically relevant regions (e.g., genes; [4]) or fine-mapping approaches in independent partitions of the genome [5] have been suggested as powerful options. To eventually unravel which of the variants in a target region might be truly related to the trait, a follow-up experiment is recommended. The experimental design should account for the dependence between SNPs to ensure sufficient statistical power. This will be reflected in the sample size required. Statistical tools for the design of experiments (e.g., QUANTO; [6]) could not provide this until now. However, the denser the SNP chip is, the higher will be the correlation between SNPs. For instance, the target region on BTA6 of Sahana’s paper covers 17 SNPs using a 50k SNP panel, 192 SNPs based on a 700k SNP panel and 21 796 SNPs in case of DNA sequence [2, 7].
In theory, it can be determined what sample size is needed for discovering a new variant in a single-locus model at a given power, e.g., 80 %. Such investigations are based on ANOVA (one way classification; [8]) and can also account for a hypothetical degree of LD between causative variant and SNP [9, 10]. Proposals for an optimum experimental design have been made for mapping of a quantitative trait locus (QTL) in different population structures (e.g., F2, backcross or daughter design; [11]). However, it is not clear what sample size is required to distinguish multiple independent signals of a QTL using dense marker data.
Moreover, the power of association analysis depends not only on sample size and population parameters (e.g., heritability) but also on the underlying statistical model. Among myriad options for whole genome regression models, SNP-BLUP is an obvious choice for estimating genetic effects captured by all SNPs simultaneously. Also because of its direct relationship to GBLUP (e.g., [12]), it is widely used in livestock (e.g., [13, 14]) and beyond (e.g., [15, 16]). Being enormously relevant in practice, it has been upgraded to comprise information of individuals with and without genotypic data in the framework of single-step methods [17, 18]. Though directly or indirectly estimated SNP effects are tested for being significantly different from zero [18], reports on statistical power of the underlying study design are lacking.
This paper addresses the question how to design a follow-up experiment based on a SNP-BLUP approach knowing that the predictor variables are so highly correlated. Our objective is the theoretical inference of optimal sample size to fine-map a QTL signal or to find evidence for multiple independent signals in a specified chunk of DNA. Eventually, it should be possible to detect variants at their actual position with high power. This paper concentrates on the case study of paternal half-sib families which is a typical family structure in livestock (e.g., dairy cattle). But the methodology developed enables sample size calculation for any population structure (e.g., full siblings, half siblings, mixture of both, unrelated individuals). Given the number of families, SNPs, signals of QTL and heritability, the optimal sample size is then presented as overall number of progeny. We validated our approach using simulated data. Furthermore, publicly available bovine HD SNP chip data helped verifying that the simulated linkage blocks resemble the genome structure in dairy cattle. A discussion of our achievements complements this study.
Methods
The design of experiment requires a statistical model that combines phenotype with genotype data. Here, we assume a multiple-SNP approach that considers information of as many SNPs as desired simultaneously. For comparing the outcome with a conventionally used approach, a single-SNP model is specified.
Multi-SNP model
For a joint association analysis of p SNPs with additive effects, a regression model is fitted to a phenotype y=(y1,…,yn)′ of n individuals,
The n×p design matrix X contains the genotype codes: Xj,k∈{1,0,−1} for j=1,…,n and k=1,…,p. The columns of X and the vector y are centered within family and scaled afterwards to ensure \(\frac {1}{n}X_{.,k}'X_{.,k}=1 \forall k\) and \(\frac {1}{n}y'y=1\). This way the model becomes independent of allele frequency. The residual error term is \(e\sim N(0,I_{n}\sigma _{e}^{2})\). Then the coefficient vector β is estimated using a ridge regression approach as
This step requires a penalty term λ which is practically obtained via cross-validation or REML approach.
Next, we investigate a multiple testing problem. For each SNP k, k=1,…,p, it is tested
with a suitable test statistic which is defined as the estimator of the k-th regression coefficient over its standard deviation, i.e.,
The calculation of power π requires the distribution of Tk under HA, then
where q1−α/2 and qα/2 denote the upper and lower threshold, respectively, of the distribution of Tk under H0 with respect to a type-I error α. Due to the ridge approach, requirements for fulfilling a t distribution do not hold ([19], p. 57). Hence the distribution of Tk is approximated as normal with mean μk and variance 1. The distribution mean μk is obtained from the expectation and variance of the estimator \(\widehat \beta _{k}\). The moments are
The central point of our investigation is to substitute the correlation matrix \(\frac {1}{n}X'X\) to be observed in the progeny generation by the theoretical correlation matrix R. For any SNP pair k,l∈{1,…,p}, \(\frac {1}{n}X_{.,k}'X_{.,l}\) is a plausible approximation to its expectation E(Xj,kXj,l)=cor(Xj,k,Xj,l) because of centered and scaled genotype codes. The derivation of R is shown in the Appendix; it requires a genetic map and genetic information of parents.
Then the mean of the test statistic becomes
Under H0, μk=0. In order to calculate the optimal sample size, the experimenter has to specify a set of parameters: number of SNPs (p) in the investigated window of DNA, number of QTL signals to be detected (κ), proportion of variance explained by the QTL signals in that window (h2) and number of families (e.g., N sires). The input parameters for statistical power calculation are inferred from this experimental set-up:
-
1.
R requires haplotypes of N sires (plus genetic map and maternal LD in general).
-
2.
We assume that all variants corresponding to the QTL signals contribute equally to the genetic variance. Hence the relative effect size is determined at κ QTL signals as
$$\begin{array}{@{}rcl@{}} {}\frac{\beta_{l}}{\sigma_{e}}&=&\sqrt{\frac{h^{2}}{\kappa(1-h^{2})}}\quad\text{for \textit{l} in the set of QTL signals}\,. \end{array} $$(4)The remaining β’s are 0.
-
3.
The shrinkage parameter is derived corresponding to Hoerl et al. [20],
$$\begin{array}{@{}rcl@{}} \lambda&=&p\frac{\sigma_{e}^{2}}{\beta'\beta}\ \\ &=&p\frac{1-h^{2}}{h^{2}}\,. \end{array} $$This is a rough approximation assuming linkage equilibrium between variants corresponding to the QTL signals.
We circumvent doing any assumption about the unknown positions of QTL signals by taking a random sample of κ positions. Then the optimal sample size is calculated over a range of n’s (e.g., 1−5 000) employing the method of bisection. The minimum n that exceeds the given power is selected as “optimal” and denoted as nopt. Here, we considered a power level of 80 % which is arbitrary but often used for statistical analysis (e.g., [21]). In order to get a reliable estimate of optimal sample size, sampling is repeated 100 times, and the median of nopt is suggested as final \(n_{\text {opt}}^{*}\). The overall type-I error was α=0.01.
Single-SNP model
For comparison, we consider a single SNP k∈{1,…,p} in a sliding window over the target region. Using the parameter definitions as above, the linear model in its simplest form is
Then the regression coefficient is estimated via ordinary least squares as
The null hypothesis testing problem (1) and the corresponding test statistic (2) also apply in the single-SNP analysis. The test statistic is t-distributed with n−1 degrees of freedom and non-centrality parameter δk ([19], pp. 110),
This approach neglects any impact of the other SNPs in the target region on y. Thus, a reduced pointwise error level (αk) has to be employed to keep the overall type-I error at α. Knowing the effective number of independent tests (Meff), a suitable type-I-error correction is
In accordance with the simple\(\mathcal {M}\) method of Gao et al. [22], we suggest using R instead of \(\frac {1}{n}X'X\) for calculating Meff. More precisely, the number of eigenvalues of R that contribute at least 99.5 % to the sum of all eigenvalues yields Meff.
Data and validation study
The software R version 3.6.1 [23] was employed in this study. Unless otherwise stated, we implemented own R scripts.
Population genetic data were simulated using the R package AlphaSimR version 0.11.0 [24]. In total, 300 SNPs were uniformly spread in a chunk of DNA of 1 cM length corresponding approximately to 1 Mbp. The SNP density roughly resembled HD data. Five traits were simulated simultaneously, one for each number of QTL signals affecting the trait, κ=1,…,5. The founder population comprised 2 000 individuals (gender ratio 1:1) and constituted the parent generation. Other population parameters were kept at default settings (e.g., effective population size 100, mutation rate 2.5·10−8). Using this information, the coalescent simulation program MaCS [25] was internally called: it generated parental haplotypes with realistic amount of LD. As the data simulation yielded no consistent pattern of SNP dependence, the simulation of the parent population was repeated 100 times. The maternal LD in terms of r2 between adjacent SNPs was on average 0.45 and reflected high multicollinearity. In each repetition, N=10 sires were selected with best phenotypes with respect to κ=1 and mated with 1 000 dams. In order to resemble dairy cattle, one progeny per cross was simulated yielding 100 half-siblings per family. At each SNP, the major allele was coded as reference. Then, haplotypes of all selected sires, or a subset thereof if N=1 or N=5, and maternal haplotypes of 1 000 progeny were used to set up the R matrix. Few loci with no variation were disregarded. Optimal sample size was estimated based on R. Separately for each κ but using the same parent generation, N males were selected based on their phenotype as sires of half-siblings in the progeny generation. The number of dams was determined according to optimal sample size required and assuming balanced family sizes. The simulation of the progeny generation was also repeated 100 times to estimate and test SNP effects for validation purposes; this yielded 100×100 data sets in total. Heritability was h2∈{0.1,0.2,0.3} which was partitioned into κ QTL effects of equal size. QTL positions were drawn at random out of the segregating sites. For each h2, data sets were simulated independently.
Additionally, to explore a direct relationship between positions of QTL signals and nopt, we selected arbitrarily a single repetition of simulation with h2=0.1 and N=10. For this particular data set, we determined nopt for each SNP position (i.e., assuming one QTL signal) and for all possible SNP pairs (i.e., assuming two QTL signals).
The R package asreml version 3.0 [26] was used for association analysis. Other suitable R packages, such as rrBLUP [27] or ridge [28], had difficulties to converge or produced almost zero variance components due to the high multicollinearity of predictor variables. The multi-SNP model was applied to all simulated scenarios as described in Multi-SNP model section. Unlike in Single-SNP model section, the single-SNP model considered an additional factor \(u~\sim N(0,A\sigma _{a}^{2})\) that accounts for background genetic effects due to the relationship between individuals. This was modeled similarly, e.g., in EMMAX [29] but we used the numerator relationship matrix A for computational convenience. The pointwise testing of SNP effects was followed by p-value correction according to Benjamini & Hochberg [30]. P-values from the multi-SNP model were not altered. The outcome was used to assess sensitivity and specificity of the multi-SNP and single-SNP model. For this, a window of 0.01 cM to both sides of a QTL signal (covering 2-3 SNPs) was specified in order to accept a significant SNP as a true positive result. Then, the true-positive rate (TPR) reflected sensitivity. Specificity was obtained as 1− the false-positive rate (FPR), and ROC curves were produced from TPR and FPR.
To evaluate how realistically the simulation of genetic data worked, empirical HD SNP chip data from the Dryad repository have been used [31]. These data included 1 151 dairy cows with no pedigree specification. We selected an arbitrary window on BTA7 comprising 300 SNPs on 1.16 Mbp and phased haplotypes of all animals using AlphaPhase [32]. We selected randomly 10 animals and marked them as sires in order to set up a matrix R. Because of the high SNP density, genetic distances were approximated linearly, i.e., 1 Mbp≈1 cM. Maternal LD was roughly approximated from haplotype frequencies of all animals. Furthermore, this R matrix was used for the inspection of optimal sample size assuming κ=1,…,5 QTL signals and h2=0.1 and following the workflow of Multi-SNP model section.
Results
The optimal sample size suggested by the single-SNP model required the effective number of independent tests which was on average Meff=53 if h2=0.1 and rather constant for R set up from N=1,5 or 10 sires (h2=0.2: Meff=54; h2=0.3: Meff=56). Hence results are reported for Meff based on N=10. Table 1 presents the median of \(n_{\text {opt}}^{*}\) from 100 repetitions of simulation. The median increased almost linearly with number of QTL signals but reduced with increasing heritability, and it was rather unaffected by the number of families. As an example, 127 individuals were required to fine-map a single QTL signal based on the multi-SNP model if h2=0.1. Almost twice as much were required to distinguish two signals if h2=0.1 or only 34 individuals were required to detect a single signal correctly when h2=0.3 instead of h2=0.1. Optimal sample size suggested by the multi-SNP model was 17 % to 39 % less than estimated from the single-SNP model. Figure S.1 (Additional file 1) visualizes the dependence of optimal sample size estimated from the single-SNP model on heritability. It also shows that a much larger sample was required if QTL heritability was less than 0.2.
In case of h2=0.1, the distribution of nopt is represented in Fig. 1; a separate panel is shown for each number of QTL signals to be detected. Based on 100×100 estimates of nopt, we derived a bimodal distribution of optimal sample size in the multi-SNP model. The median of nopt was consistently less than sample size estimated from single-SNP investigations. With increasing heritability, the first mode approached the median of nopt but was still less than optimal sample size based on the single-SNP model, see Figures S.2 (h2=0.2) and S.3 (h2=0.3) (Additional file 1). The second mode appeared due to strong negative correlations between SNPs. Particularly this outcome was observed when all possible pairs of SNPs were evaluated for detecting two QTL signals in a single repetition of simulation. Figure 2a shows the correlation matrix for a single data set. Those entries of R have been selected that belonged to 10 % of the highest estimates of sample size, i.e., nopt≥864 (h2=0.1). Correspondingly, Fig. 2b indicates that, with few exceptions, negative correlations caused this outcome. The separation of SNP dependence into maternal and paternal contribution revealed further insight, and most often negative maternal LD was the driving term (Fig. S.4, Additional file 1). The distance between two QTL signals hardly influenced nopt (Fig S.5, Additional file 1); any possible association was overlaid by the strong impact of correlation between loci. An additional inspection of the relationship between position of a single QTL signal and nopt was not conclusive. Neither extreme maternal allele frequency nor missing sire heterozygosity led to obviously increased nopt for detecting one QTL signal (Fig. S.6, Additional file 1).
The association analysis of data sets of optimal sample size was validated in terms of sensitivity and specificity of testing SNP effects. The shape of ROC curves was similar for all investigated simulation scenarios. As an example, if N=10 and κ=2, the median of \(n_{\text {opt}}^{*}\) was 273, and the outcome is displayed in Fig. 3. The analysis showed superiority of the multi-SNP model over the single-SNP model. In general, it was observed that the smaller \(n_{\text {opt}}^{*}\) was estimated, the larger both TPR and FPR turned out for the single-SNP model. The multi-SNP model performed rather robust against changes in sample size. However, the flat appearance of the ROC curve complicates fine-mapping of QTL signals based on the suggested multi-SNP approach. For instance, a TPR of 80 % is accompanied with a FPR larger than 20 %.
Blocks of varying linkage phases, as shown in Fig. 2, might be an artifact of data simulation. Based on empirical bovine HD SNP chip data, a possible R matrix was set up, see Fig. 4. The blocking structure was less pronounced. Using this R for estimating \(n_{\text {opt}}^{*}\) led to results being similar to the simulation study for one and two QTL signals but larger samples were required to detect more QTL signals: \(n_{\text {opt}}^{*}\) was 123 (1 signal), 288 (2 signals), 516 (3 signals), 800 (4 signals) and 1 342 (5 signals) if h2=0.1. The number of repetitions of randomly drawing the positions of QTL signals did not substantially affect the final \(n_{\text {opt}}^{*}\). For instance, the median deviated less than 4 % if nopt was calculated 1 000 instead of 100 times.
Discussion
Our investigation contributes to the design of powerful experiments for fine-mapping of causative variant(s) in a genomic target region. We incorporated the expected dependence among SNPs in this region and estimated optimal sample size based on a SNP-BLUP approach. The outcome was compared to a single-SNP model. Negative correlations between SNPs, which were mainly due to negative maternal LD, caused essentially inflated sample size estimates. In case of positive correlations, the majority of sample size estimates was less than sample size estimated from the single-SNP approach. The less the heritability, the higher the deviation between models was.
Population parameters
Our approach is applicable to any population structure. The matrix K of covariance between SNPs can be set up for any kind of family stratification by adapting the derivations of the Appendix or, in case of unrelated individuals, by using population LD in K.
Due to the way of model parametrization (columns Xk have been scaled), the dependence on allele frequency has been excluded. For instance, in a random mating population, the column-scaling term is \(\sqrt {2p_{k}(1-p_{k})}\) with allele frequency pk at SNP k. Likewise, a scaling term can be derived for half-sib families as the square root of Eq. (7) in the Appendix by investigating maternally and paternally inherited SNP alleles separately. Results of our association analyses suggested that there was no clear relationship between high nopt and maternal allele frequency or sire heterozygosity (Fig. S.6, Additional file 1). However, regions with large or low variation have to be taken into account when selecting sires for fine-mapping of QTL signals in a follow-up experiment. The lower sire heterozygosity or maternal minor allele frequency is, the lower the effect size βk on the model scale will be and, consequently, higher \(n_{\text {opt}}^{*}\) is required in order to detect QTL signals. To investigate this, we employed the relationship \(X_{k}\beta _{k}=X_{k}^{(o)}\beta _{k}^{(o)}\) at SNP k. Here \(X_{j,k}^{(o)}\) is the allele count at SNP k for individual j, and \(\beta _{k}^{(o)}\) is the coefficient on the observed genotype scale, i.e., \(\beta _{k}^{(o)} s_{k}=\beta _{k}\) with scaling term \(s_{k}=\sqrt {V(X_{j,k})}\). The relationship between allele frequency and optimal sample size for detecting one QTL signal based on the single-SNP model is presented in Figure S.7 (Additional file 1). Extreme alleles, roughly spoken with major allele frequency >0.95, require drastically increased sample size.
In the simulation study, equal effects of QTL signals were simulated on the observed genotype scale. For comparison with relative effect sizes derived from heritability (Eqn. 4), allele frequencies in the (random mating) founder population and the residual variance component were used to calculate the corresponding relative effect sizes: \(\beta _{k}^{(o)}\sqrt {2p_{k}(1-p_{k})}/\sigma _{e}\). Figure 5 shows the effect sizes of all repetitions of simulation with h2=0.1 separated by the number of simulated QTL signals (κ). As expected, the relative effect size decreased with increasing κ but a high fluctuation has been observed which was due to the high variation of allele frequencies in the simulated data. This observation underlines the difficulty of detecting multiple QTL signals at given h2 – the lower the effect size, the higher nopt required.
The suggested optimal sample size is divided into N sires which are selected for most heterogeneity in the target region. The actual number of sires is of minor importance. The choice of individuals depends on the objectives of the follow-up study. Sires can be chosen independently from the GWAS population in order to confirm and fine-map QTL signal(s). However, if the initial study indicated the presence of rare variants, sires under suspicion should be re-used. Selective genotyping is an option to increase power [11] but this might have negative impact on reproducibility of the study design [4]. In our investigation of paternal half-sib families, mothers are treated as random samples from a dam population. Thus, the choice of dams for future matings is not addressed here but is definitely an issue for other family designs.
Being equally important for fine-mapping of QTL signals is the positive correlation between SNPs. Positive correlatedness is a matter of genotype coding. Coding has to be consistent throughout the target region to avoid unnecessary sign changes in correlation. We employed coding in terms of counting the major allele in the population. But in regions of intermediate frequency, the coding might not be appropriate and hence a dynamic approach of coding the SNP alleles can circumvent negative correlations. A strategy on this is worth further investigation.
Power calculations are needed to quantify and judge the prospects of identifying causative variants with a hypothesized effect size in a particular population. In practice, however, experiments are usually not planned to obtain maximum power but data are regularly collected for purposes of breeding as a standard routine. Thus, experimental designs being theoretically optimal could be compared with available field data to understand the possible shortcomings of such data and to understand differences between theoretical/expected and actually achieved power. Based on the results, decisions can be made whether the amount of data is sufficient or, in case of underpowered experiments, more data should be acquired.
Necessity of fine-mapping of QTL signals using an appropriate design
The QTL databases of livestock species [33] contain information on several thousands QTL for a wide range of traits. This shows that the variability of most of the traits studied has a polygenic origin, with multiple QTL contributing to the overall genetic variance. Despite the number of QTL, only a handful of causal mutations could be detected and verified in the different livestock species [34]. This is partly due to the fact that GWAS show considerable weaknesses in the fine-mapping of QTL signals which are related to the SNP panel requirements for a genomewide distribution and high LD to neighboring markers [5]. Accordingly, these SNPs are usually indicative of a large genomic region that likely comprises the unmeasured causal SNP but does not provide information about the causal variant itself. Statistical methods for fine-mapping have been designed to overcome these issues and perform fine-mapping using the available SNP information from a SNP-chip or GBS (summarized by [5]). However, even these methods require a high SNP density in the region of interest, which favors a targeted sequencing strategy that enable the dissection of QTL regions and increase the chance of detecting causal variants [35]. Major factors to be considered for designing a targeted sequencing study are effect size, the number of causal SNPs, local LD structure and sample size [5]. The approach proposed in this study incorporates information on κ, h2 and R derived from the data to estimate the optimal sample size (\(n_{\text {opt}}^{*}\)) and thus provides all the information needed to design a fine-mapping experiment.
Currently, several fine-mapping studies are based on imputation strategies or the integration of results with functional enrichment analysis to identify promising candidate genes and QTNs (e.g., [36–38]). These approaches largely depend on imputation accuracy and the status of genome annotation, thus limiting the ability to detect causal variants, especially those with a low minor allele frequency [39]. Specific examples for the fine-mapping of important genomic regions with a resequencing strategy are still rare nowadays. Fraser et al. [40] focused in their study on collagenous lectins in horses by resequencing 658 kb DNA consisting of different candidate genes and regulatory regions. Therefore, a case-control design with pooled samples was used and with this approach 113 variants were identified, which differed between the groups. Although the results are promising, the authors concluded that a large-scale genotyping of individual samples is necessary for deeper insights. In this context, and considering that targeted sequencing for a reasonable set of samples is becoming increasingly affordable, an accurate estimate of sample size is advisable.
Other random effects
Association analysis of empirical data with certain pedigree structure requires an additional model term to account for genetic effects beyond the target window (Zu). Then u=(u1,…,un), \(u\sim N(0,G\sigma _{u}^{2})\), is the vector of individual genetic effects with suitable relationship matrix G. The calculation of optimal sample size should consider the presence of additional random effects (genetic or environmental) for the design of experiments. For instance, the coefficients of the single-SNP model could be estimated via BLUE. This affects sample size calculation because the variance of the estimator \(\widehat \beta _{k}\),
has impact on the distribution of the test statistic. Accounting for V in the denominator of test statistic increases the denominator of non-centrality parameter. However, in order to keep it simple, it would be sufficient to increase \(\sigma _{e}^{2}\) or reduce h2 appropriately without any other alterations.
It is possible to consider other kinds of genetic effects with the proposed methods. For instance, exploring dominance genetic effects requires only one modification. Instead of coding SNP genotypes for additive effects via Xj,k∈{1,0,−1}, genotypes can be coded as Xj,k∈{0,1,0} to account for dominance effects. The covariance between dominance effects has been worked out by Bonk et al. [41]. Feeding the Equation (3) with the corresponding dominance correlation matrix will provide estimates of optimal sample size to fine-map QTL signals with dominance effect.
Conclusion
For planning the design of experiment, we recommend a multi-SNP approach which considers the expected dependence among SNPs. Compared to a conventional approach, this leads to a reduced estimate of sample size and thus promises a more efficient use of animal resources. The benefit depends strongly on heritability: the lower heritability, the more resources can be saved. In general, optimal sample size increases almost linearly with the number of QTL signals to be detected. This study constitutes a frequentist framework for the design of experiments in specific populations that may be characterized by family stratification. It will help differentiating independent signals in QTL regions that can be further examined for cellular and molecular properties.
Appendix: Derivation of correlation matrix
We study the dependence between pairs of SNPs, each with two alleles A and B, in a population consisting of N paternal half-sib families. Let Xj,k be the genotype code at SNP k∈{1,…,p} of individual j∈{1,…,n} being progeny of sire s and dam d. Homozygous genotypes A/A and B/B are coded as 1 and -1, respectively, and the heterozygous genotype A/B is indicated as 0. The family-specific (i.e., sire-specific) covariance between SNPs k and l of individual j is, according to Bonk et al. [41] and Wittenburg et al. [42],
a function of maternal and paternal contribution and depends on the sire diplotype \(\mathcal {S}_{s}\). The \(D_{k,l}^{d}\) denotes the LD of maternal gametes in a dam population. The sire term depends on the phase of paternal haplotypes and recombination rate (θk,l). It is determined as
To achieve the covariance between a pair of SNPs, we employ conditioning on families,
and approximate \(\Pr (\mathcal {S}_{s})\) by family weights \(w_{s}=\frac {n_{s}}{n}\) with \(\sum _{s=1}^{N} n_{s}=n\). The aim is now to derive an expression that depends on already known terms. For instance, using
yields
We exploit the separation into independently inherited maternal and paternal SNP alleles: Xj,k=Xj,k,s+Xj,k,d, where Xj,k,s and Xj,k,d take a value of \(\frac {1}{2}\) if the A allele was inherited but \(-\frac {1}{2}\) otherwise. Then
where pk denotes the maternal allele frequency at SNP k. Furthermore,
Putting it all together,
This reduces to
and this is evaluated using the sire-specific terms in (5) and (6).
Now the variance of genotype codes at SNP k is derived explicitly – it also serves as a scaling term in a regression model for association analysis. The second moment of the paternally inherited SNP allele is constant \(E(X_{{j,k,s}}^{2}|\mathcal {S}_{s})=\frac {1}{4}\) for all s. Hence
Then, the variance at SNP k is
Finally, the correlation matrix R={Rk,l}k,l=1,…,p is calculated by scaling the entries correspondingly,
Note that the covariance based on non-centered genotype codes (as derived above) is identical to the one based on centered genotype codes (as used in Methods section). Centering is used to study within-family genetic effects, and it allows the direct estimation of allele substitution effects [43].
The R package hscovar for the calculation of K and R is provided at CRAN repository.
Availability of data and materials
R scripts used to simulate and analyze data are available in Additional files 2 and 3. However, an exact reproduction of the results of our simulation study is not possible because AlphaSimR relies on random number generation which cannot be initialized yet. The R package hscovar version 0.2.1 is available at CRAN; it provides tools for setting up the covariance or correlation matrix as well as for performing power calculations. The empirical bovine HD SNP chip data are accessible through the Dryad repository https://doi.org/10.5061/dryad.519bm [31].
Abbreviations
- ANOVA:
-
Analysis of variance
- BLUE:
-
Best linear unbiased estimation
- BTA:
-
Bos taurus autosome
- cM:
-
CentiMorgan
- DNA:
-
Deoxyribonucleic acid
- FPR:
-
False positive rate
- (G)BLUP:
-
(Genomic) best linear unbiased prediction
- GBS:
-
Genotyping by sequencing
- GWAS:
-
Genomewide association study
- HD:
-
High density
- LD:
-
Linkage disequilibrium
- Mbp:
-
Mega base pairs
- QTL:
-
Quantitative trait locus
- QTN:
-
Quantitative trait nucleotide
- ROC:
-
Receiver operating characteristic
- SNP:
-
Single nucleotide polymorphism
- TPR:
-
True positive rate
References
Reyer H, Hawken R, Murani E, Ponsuksili S, Wimmers K. The genetics of feed conversion efficiency traits in a commercial broiler line. Sci Rep. 2015; 5:16387.
Sahana G, Guldbrandtsen B, Thomsen B, Holm LE, Panitz F, Brøndum RF, et al.Genome-wide association study using high-density single nucleotide polymorphism arrays and whole-genome sequences for clinical mastitis traits in dairy cattle. J Dairy Sci. 2014; 97(11):7258–75.
Hampel A, Teuscher F, Gomez-Raya L, Doschoris M, Wittenburg D. Estimation of recombination rate and maternal linkage disequilibrium in half-sibs. Front Genet. 2018; 9:186.
Lee S, Abecasis GR, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet. 2014; 95(1):5–23.
Schaid DJ, Chen W, Larson NB. From genome-wide associations to candidate causal variants by statistical fine-mapping. Nat Rev Genet. 2018; 19(8):491–504.
Gauderman J, Morrison J. QUANTO Version 1.2. 2007. Retrieved June 10, 2015. Available from: http://biostats.usc.edu/Quanto.html.
Schnabel R. ARS-UCD1.2 Cow Genome Assembly: Mapping of all existing variants. 2018. Retrieved Sep 21, 2018. Available from: https://www.animalgenome.org/repository/cattle/UMC_bovine_coordinates/.
Luo Z. Detecting linkage disequilibrium between a polymorphic marker locus and a trait locus in natural populations. Heredity. 1998; 80(2):198.
Pritchard JK, Przeworski M. Linkage disequilibrium in humans: models and data. Am J Hum Genet. 2001; 69(1):1–14.
Khatkar MS, Nicholas FW, Collins AR, Zenger KR, Cavanagh JA, Barris W, et al.Extent of genome-wide linkage disequilibrium in Australian Holstein-Friesian cattle based on a high-density SNP panel. BMC Genomics. 2008; 9(1):187.
Weller J. Quantitative trait loci analysis in animals: CABI Publishing; 2001. https://doi.org/10.1079/9781845934675.0000.
Gualdrón Duarte JL, Cantet RJ, Bates RO, Ernst CW, Raney NE, Steibel JP. Rapid screening for phenotype-genotype associations by linear transformations of genomic evaluations. BMC Bioinf. 2014; 15(1):246. Available from: https://doi.org/10.1186/1471-2105-15-246.
Koivula M, Strandén I, Su G, Mäntysaari EA. Different methods to calculate genomic predictions—Comparisons of BLUP at the single nucleotide polymorphism level (SNP-BLUP), BLUP at the individual level (G-BLUP), and the one-step approach (H-BLUP). J Dairy Sci. 2012; 95(7):4065–73.
Mucha S, Mrode R, MacLaren-Lee I, Coffey M, Conington J. Estimation of genomic breeding values for milk yield in UK dairy goats. J Dairy Sci. 2015; 98(11):8201–8.
Maier R, Moser G, Chen GB, Ripke S, Absher D, Agartz I, et al.Joint analysis of psychiatric disorders increases accuracy of risk prediction for schizophrenia, bipolar disorder, and major depressive disorder. Am J Hum Genet. 2015; 96(2):283–94.
Kristensen PS, Jahoor A, Andersen JR, Cericola F, Orabi J, Janss LL, et al.Genome-wide association studies and comparison of models and cross-validation strategies for genomic prediction of quality traits in advanced winter wheat breeding lines. Front Plant Sci. 2018; 9:69.
Taskinen M, Mäntysaari EA, Strandén I. Single-step SNP-BLUP with on-the-fly imputed genotypes and residual polygenic effects. Genet Sel Evol. 2017; 49(1):36.
Aguilar I, Legarra A, Cardoso F, Masuda Y, Lourenco D, Misztal I. Frequentist p-values for large-scale-single step genome-wide association, with an application to birth weight in American Angus cattle. Genet Sel Evol. 2019; 51(1):28.
Searle S. Linear models. New York: Wiley; 1971.
Hoerl AE, Kennard RW, Baldwin KF. Ridge regression: some simulations. Commun Stat Theor M. 1975; 4(2):105–23.
Cohen J. Statistical power analysis for the social sciences. Hillsdale: Erlbaum; 1988.
Gao X, Starmer J, Martin ER. A multiple testing correction method for genetic association studies using correlated single nucleotide polymorphisms. Genet Epidemiol. 2008 May; 32:361–9.
R Core Team. R: A Language and Environment for Statistical Computing. Vienna; 2019. Retrieved Dec 16, 2019. Available from: https://www.R-project.org/.
Faux AM, Gorjanc G, Gaynor RC, Battagin M, Edwards SM, Wilson DL, et al.AlphaSim: software for breeding program simulation. Plant Genome. 2016; 9(3):1–14. Available from: https://doi.org/10.3835/plantgenome2016.02.0013.
Chen GK, Marjoram P, Wall JD. Fast and flexible simulation of DNA sequence data. Genome Res. 2009; 19(1):136–42.
Butler D, Cullis BR, Gilmour A, Gogel B. ASReml-R reference manual. Brisbane: The State of Queensland, Department of Primary Industries and Fisheries; 2009.
Endelman JB. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome. 2011; 4(3):250–5.
Cule E, Vineis P, De Iorio M. Significance testing in ridge regression for genetic data. BMC Bioinf. 2011; 12:372.
Kang HM, Sul JH, Service SK, Zaitlen NA, Kong Sy, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010; 42(4):348.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B. 1995; 57(1):289–300.
Bermingham ML, Bishop SC, Woolliams JA, Pong-Wong R, Allen AR, McBride SH, et al. Data from: Genome-wide association study identifies novel loci associated with resistance to bovine tuberculosis. Dryad, Dataset. 2013. Available from: https://doi.org/10.5061/dryad.519bm.
Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JH. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011; 43(1):12.
Hu ZL, Park CA, Wu XL, Reecy JM. Animal QTLdb: an improved database tool for livestock animal QTL/association data dissemination in the post-genome era. Nucleic Acids Res. 2012; 41(D1):D871—9.
Andersson L, Georges M. Domestic-animal genomics: deciphering the genetics of complex traits. Nat Rev Genet. 2004; 5(3):202.
Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, et al.Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010; 7(2):111.
Jiang J, Cole JB, Freebern E, Da Y, VanRaden PM, Ma L. Functional annotation and Bayesian fine-mapping reveals candidate genes for important agronomic traits in Holstein bulls. Commun Biol. 2019; 2(1):212.
Cai Z, Guldbrandtsen B, Lund MS, Sahana G. Weighting sequence variants based on their annotation increases the power of genome-wide association studies in dairy cattle. Genet Sel Evol. 2019; 51(1):20.
Liu Z, Wang T, Pryce JE, MacLeod IM, Hayes BJ, Chamberlain AJ, et al.Fine-mapping sequence mutations with a major effect on oligosaccharide content in bovine milk. Sci Rep. 2019; 9(1):2137.
Dadaev T, Saunders EJ, Newcombe PJ, Anokian E, Leongamornlert DA, Brook MN, et al.Fine-mapping of prostate cancer susceptibility loci in a large meta-analysis identifies candidate causal variants. Nat Commun. 2018; 9(1):2256.
Fraser RS, Arroyo LG, Meyer A, Lillie BN. Identification of genetic variation in equine collagenous lectins using targeted resequencing. Vet Immunol Immunopathol. 2018; 202:153–63.
Bonk S, Reichelt M, Teuscher F, Segelke D, Reinsch N. Mendelian sampling covariability of marker effects and genetic values. Genet Sel Evol. 2016; 48(1):36.
Wittenburg D, Teuscher F, Klosa J, Reinsch N. Covariance between genotypic effects and its use for genomic inference in half-sib families. G3 Genes Genom Genet. 2016; 6:2761–72.
Abecasis GR, Cardon LR, Cookson W. A general test of association for quantitative traits in nuclear families. Am J Hum Genet. 2000; 66(1):279–92.
Acknowledgments
Special thanks are given to N. Reinsch (Leibniz Institute for Farm Animal Biology, Dummerstorf), J. Hartung (University of Hohenheim, Stuttgart) and V. Liebscher (University of Greifswald), who contributed invaluable ideas to the project, and to C. Gaynor (Roslin Institute, Edingburgh) for giving handy insight into AlphaSimR. We also thank the reviewers for their helpful comments.
Funding
The project was funded by the German Research Foundation (DFG, WI 4450/1-1). The funder had no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Author information
Authors and Affiliations
Contributions
DW developed the theory, implemented the statistical methods, performed the analysis, and wrote the manuscript. SB contributed to the research on covariance between SNPs, MD was involved in theoretical investigations. HR contributed to the discussion. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Additional file 1
PDF file containing additional figures.
Additional file 2
R script for data simulation and analysis. Running this script performs the complete data analysis and evaluation. It calls functions from Additional file 3.
Additional file 3
R script including functions used for data simulation and analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visithttp://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wittenburg, D., Bonk, S., Doschoris, M. et al. Design of experiments for fine-mapping quantitative trait loci in livestock populations. BMC Genet 21, 66 (2020). https://doi.org/10.1186/s12863-020-00871-1
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12863-020-00871-1