Genotypes
High density genotypes for approximately 30 sire/dam/offspring trios were obtained and phased for each of four breeds of pigs (Duroc, Hampshire, Landrace, Yorkshire) in a previous study [23]. To ensure accurate phasing, the reference panel for imputation used in this study was the 128 haplotypes from the Yorkshire sire/dam pairs previously genotyped as parents in those trios. Animal protocols were approved by the Michigan State University All University Committee on Animal Use and Care (AUF# 03/09-046-00). The haplotypes of these animals are freely available at https://www.msu.edu/~steibelj/JP_files/LD_estimate.html.
Detailed information about data cleaning procedures, descriptive statistics of LD, and correlation of phase between Yorkshire and other US pig breeds can be found in Badke et al. [23]. In addition, DNA samples were collected from 920 Yorkshire pigs and sent to a commercial laboratory (GeneSeek, a Neogen Company, Lincoln, NE) to be genotyped on the Illumina PorcineSNP60 (Number of markers M = 62,163) Genotyping BeadChip (Illumina Inc.) [6]. Only animals with more than 90% genotype call rate were considered for analysis, resulting in 889 animals used as the testing panel for this study. All SNP included in the 128 haplotype Yorkshire reference panel were used for analysis. All data from this study is available at https://www.msu.edu/~steibelj/JP_files/imputation.html.
In our previous study [23] we reported breed specific LD and persistence of phase among breeds for Duroc, Hampshire, Landrace, and Yorkshire pigs. We found that persistence of phase between Yorkshire and the other breeds ranged between 0.42 and 0.57 for SNP spaced approximately 1MB apart [23]. As a result the amount of LD within the Yorkshire breed that could be recovered through haplotypes from another breed ranges between 0.18 and 0.33, such that adding haplotypes of a second breed to impute Yorkshire genotypes did not appear to be beneficial. For genomic selection, a simulation study conducted by de Roos et al. [24] found that persistence of phase between breeds needs to be much larger than the reported value between Yorkshire and any of the other three breeds to implicate any advantage for the use of mixed breed training panels. For this reason we decided to use only Yorkshire haplotypes in the reference panel for imputation in this paper.
Genotype imputation and estimation of imputation accuracy
All imputations in this study were done using BEAGLE version 3.3.1 [13], a genotype imputation software that uses a reference panel of haplotypes to estimate phase and impute missing genotypes in a set of unrelated individuals. Beagle was run separately for each chromosome using 128 reference haplotypes from the trio design [23] (phased file) to phase and impute genotypes of the 889 un-phased testing animals. All SNP, except tagSNP, were masked as missing in the testing set. Beagle was run for ten iterations of the phasing algorithm, drawing four samples per iteration. Previous results from another study [16], as well as a short experiment conducted in this study (data not shown) found no increase in imputation accuracy when the number of iterations or samples per iteration were increased. The output files from BEAGLE contained the most likely imputed genotypes (AA, AB, BB), posterior genotype probabilities (P(A A), P(A B), P(B B)), and posterior expected allelic dosage of the B allele derived from the posterior genotype probabilities (i.e. 0∗P(A A) + 1 ∗ P(A B) + 2 ∗ P(B B)) [25].
Imputation accuracy was estimated using three different measures that reflect different influences of MAF and error counting. The proportion of correctly imputed alleles was computed as
(1)
where g
i
j
is the observed allelic dosage of SNP i in individual j, ĝ
ij
is the corresponding posterior expected allelic dosage obtained from BEAGLE output, M is the total number of imputed SNP, and N
i
is the number of individuals with called genotypes for SNP i. This overall measure of imputation accuracy can be further decomposed into SNP-specific accuracy () and animal specific accuracy (). This measure of imputation accuracy will be biased upwards, especially for SNP with low MAF, because even if imputation ignores LD information and is based solely on allele frequency, the major allele would be correctly imputed for a large proportion of genotypes [15, 16]. As tagSNP density decreases imputation accuracy of rare alleles further decreases as rare haplotypes become harder to identify due to longer sequences of SNP missing [15]. Estimating the total percentage of correctly imputed alleles for SNP with low MAF will be biased due to the large number of correctly imputed major alleles masking the small number of misspecified minor alleles, which can be overcome through the use of a more sensitive measure of accuracy for these SNP [15]. In addition, if individuals carrying the minor allele are not correctly identified and their phenotype cannot be matched for GWAS this relatively small proportion of incorrectly imputed alleles will further decrease power. A variety of measures have been introduced to obtain estimates of imputation accuracy unbiased by MAF [15, 16, 26]. We estimated the proportion of correctly imputed alleles adjusted for MAF using the formula presented by Hayes et al. [16]:
(2)
where IA is computed as described in equation (1) and I A
F
r
e
q
is the accuracy of imputation based on genotypic frequencies estimated as:
(3)
where , , and are the observed frequencies for genotypes AA, AB, and BB for SNP i in the reference haplotypes and , , and are the predicted genotypic frequencies in the testing population for SNP i. I A
F
r
e
q
can be interpreted as the expected probability of correctly imputing a genotype in the testing population by assigning a randomly sampled genotype from the haplotypes in the reference panel. This measure was computed on a SNP-wise basis and averaged across all SNP. To account for a slightly different number of genotypes observed within each SNP (due to missing at random) the average was obtained by weighting the accuracy of each SNP by the number of individuals with observed genotypes within each SNP.
Alternatively, another measure of imputation accuracy robust to MAF is the squared correlation between the observed and imputed allelic dosage [15]. The correlation was obtained on a SNP by SNP basis using the correlation function in R [27]. SNP wise correlation measures were weighted by the number of available observations within the SNP to obtain an overall average imputation accuracy.
Methods of tagSNP selection
TagSNP were selected using three approaches: 1) evenly spaced based on physical position, 2) based on minimum pairwise LD with non-tagSNP (statistical selection), and 3) based on marker predictive ability to accurately impute non-observed SNP genotypes (predictive selection).
To select evenly spaced SNP the total length of each chromosome was partitioned into segments corresponding to the total number of tagSNP to be selected. Then, within each segment the SNP closest to the segment center was identified and added as a tagSNP. If a given segment was empty, no tagSNP was selected in that segment.
To implement a statistical search for tagSNP [19] we used the freely available software package FESTA [18]. FESTA performed a greedy search, where each SNP i was either an element of the tagSNP set or in LD higher than a threshold () with an existing element of the tagSNP set. FESTA was run repeatedly for increasing ranging from 0.1 to 0.9 in 0.1 increments using estimates of LD based on 128 reference haplotypes.
To implement predictive tagSNP selection, we applied the following forward search algorithm: First, we split the 64 Yorkshire reference animals into a randomly sampled set of 10 individuals (training set) and 54 individuals (reference haplotypes). Second, all SNP except one tagSNP in the training set were masked and imputed using the reference haplotypes. Third, accuracy of all imputed SNP was estimated and saved. Steps two and three were then repeated until all estimates of imputation accuracy were available for all potential tagSNP (at first, the potential tagSNP are all SNP on the chromosome). Fourth, the SNP that yielded the highest average accuracy of imputation among those not already chosen as tagSNP was selected as a new tagSNP. Steps two through four were repeated until the maximum number of tagSNP or a target imputation accuracy were reached. Because of the high computational demand of this methodology, this approach was only applied to the smallest available chromosome (SSC18), selecting tagSNP from 786 candidate SNP.
Concurrent to this research, a set of 9390 tagSNP (Release Date: April 2012) was assembled by GeneSeek (Lincoln, NE) for the development of a commercial platform for low density genotyping in swine. This assay has been marketed as the GeneSeek Genomic Profiler for Porcine LD (GGP-Porcine; GeneSeek, Lincoln, NE). After production, the GGP-Porcine contains approximately 8500 tagSNP (Jeremy Walker, personal communication). TagSNP covering the entire genome were selected based on MAF in 13 commercial lines of pigs represented by four breeding companies and four purebred populations. The MAF were provided by the breeding companies (identified simply as company A, B, C, and D). The number of lines provided by these companies were 1, 1, 4, and 7. Additional estimates of MAF used to identify tagSNP were obtained from our previous study [23] of four pure breeds: Duroc, Hampshire, Landrace, and Yorkshire. The freely available SNPspace software (C.P. Van Tassell, unpublished data) was used to select tagSNP. SNPspace was initially developed to select SNP for the Illumina BovineSNP50 beadchip [2]. The conceptual framework of SNPspace is briefly described in that study [2], but additional features have been added since that time. Relative weights on lines or breeds of pigs ranged from 0.00625 to 0.25. SNPspace is based on a greedy algorithm, where SNP scores account for breed or line specific MAF, region of the genome, and position of SNP relative to previously selected tagSNP. Density of tagSNP was doubled within 5 Mbp of the chromosomal extremes, which has been shown to improve average accuracy of imputation compared to tagSNP evenly spaced across the entire chromosome [1, 28].
Increasing reference panel size
To assess the effect of the number of reference haplotypes on imputation accuracy, we split the available sample of 889 Yorkshire pigs into two groups: 1) a 200 animal testing panel, and 2) a 689 animal set of supplemental reference sires. Assignment to the two panels was random.
To obtain imputation accuracy for a decreased set of reference haplotypes, we split the original 128 reference haplotypes obtained from 64 Yorkshire animals into two groups of 64 reference haplotypes (corresponding to 32 animals) and estimated average imputation accuracy in the 200 animal testing set. Then, we split the two groups of 64 reference haplotypes further into two groups of 32 reference haplotypes and obtained four estimates of imputation accuracy that were averaged into a single measure.
Subsequently, we compared imputation accuracy using trio based reference panels to imputation accuracy based on randomly sampled reference panels. To this end, we randomly sampled 16 animals from the 689 animal supplemental reference set and continued to add individuals at random to obtain reference sets of 24, 32, 48, 64, 96, 128, 256, and 512 animals. Each of these sets was phased individually using BEAGLE [13] and then those haplotypes were used as reference panel to impute the 200 testing animals.
Finally, we assembled reference panels of haplotypes combining the original 128 haplotypes from trios, with an increasing number of supplemental reference sires. To form these reference panels 64, 128, 192, and 448 supplemental reference sires were randomly selected and phased using the trio haplotypes as a reference panel. Both, the trio reference haplotypes and an increasing number of supplementary reference haplotypes were then used to impute the 200 animal testing set.
Because imputation accuracy was constant across chromosomes (see Results, section 3.1) we conducted this experiment on chromosome SSC14, a medium sized chromosome that has uniform coverage of SNP across its length. We expect results to extrapolate to all other chromosomes.