Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels

Gualdrón Duarte, Jose L; Bates, Ronald O; Ernst, Catherine W; Raney, Nancy E; Cantet, Rodolfo JC; Steibel, Juan P

doi:10.1186/1471-2156-14-38

Research article
Open access
Published: 08 May 2013

BMC Genetics volume 14, Article number: 38 (2013)

Abstract

Background

F₂ resource populations have been used extensively to map QTL segregating between pig breeds. A limitation associated with the use of these resource populations for fine mapping of QTL is the reduced number of founding individuals and recombinations of founding haplotypes occurring in the population. These limitations, however, become advantageous when attempting to impute unobserved genotypes using within family segregation information. A trade-off would be to re-type F₂ populations using high density SNP panels for founding individuals and low density panels (tagSNP) in F₂ individuals followed by imputation. Subsequently a combined meta-analysis of several populations would provide adequate power and resolution for QTL mapping, and could be achieved at relatively low cost. Such a strategy allows the wealth of phenotypic information that has previously been obtained on experimental resource populations to be further mined for QTL identification. In this study we used experimental and simulated high density genotypes (HD-60K) from an F₂ cross to estimate imputation accuracy under several genotyping scenarios.

Results

Selection of tagSNP using physical distance or linkage disequilibrium information produced similar imputation accuracies. In particular, tagSNP sets averaging 1 SNP every 2.1 Mb (1,200 SNP genome-wide) yielded imputation accuracies (IA) close to 0.97. If instead of using custom panels, the commercially available 9K chip is used in the F₂, IA reaches 0.99. In order to attain such high imputation accuracy the F₀ and F₁ generations should be genotyped at high density. Alternatively, when only the F₀ is genotyped at HD, while F₁ and F₂ are genotyped with a 9K panel, IA drops to 0.90.

Conclusions

Combining 60K and 9K panels with imputation in F₂ populations is an appealing strategy to re-genotype existing populations at a fraction of the cost.

Background

The search for regions in the genome containing genetic variants that affect production traits requires experimental populations to identify the segregating QTL within and between parental populations [1]. The F₂ design is commonly used to map QTL segregating in divergent parental lines [2, 3]. To produce reliable analyses of association or genetic evaluations using genomic information, a great number of individuals with phenotypes and high density (HD) genotypes are required [4]. However, HD genotypes for large numbers of animals are expensive to obtain [5, 6]. A way of reducing cost is to genotype individuals from base generations (parents) in HD, and their more numerous descendants at low density (LowD) [6, 7]. Then, using selected SNP from the HD panel, called tagSNP, the non-typed SNP are imputed with high accuracy [7]. Imputing HD genotypes of progeny from LowD genotypes, conditional on grandparental and parental HD genotypes, may result in higher imputation accuracies than those obtained using a reference panel from unrelated individuals [7–9]. This is because HD genotypes from base generations can be traced within family by means of co-segregation or descendant probabilities [6] while searching for the phase of parental alleles [7].

Most studies on genotype imputation of livestock species have been performed with purebreds [4, 7, 9–13], and genotype imputation from crossbreds has been largely absent. With regard to agricultural plant species, studies on genotype imputation have used inbred lines [8], recombinant inbred lines (RILs) in Nested Association Mapping (NAM) designs [14, 15], and Multiparent Advanced Generation Inter-Cross studies (MAGIC) [16]. Genotype imputation has also been employed in human studies of genome-linkage analysis to test the association of candidate transcriptional regulators with gene expression [17]. In the mouse, a model organism in biomedical research, imputation of genotypes from crosses of inbred lines was used to identify candidate genes for complex disease [18, 19].

Imputing genotypes in humans, plants, livestock, or model organisms is similar in the sense that a small number of founding individuals can be genotyped at high density, and the bulk of the mapping population can be genotyped at low density using linkage information. In this paper we focus on imputing F₂ individuals from a three generation (F₀, F₁ and F₂) population of Duroc × Pietrain crossbred pigs. The F₀ and F₁ animals were genotyped at HD (60K). The F₀ populations used to map QTL in pigs are typically composed of a small number of animals (in our case, 4 males and 15 females) [1, 20–22]. As few recombinations are expected to occur in the first generations, these populations have low resolution to map QTL [23]. However, and for the same reason, there is a potential for attaining high accuracy of imputation. The latter effect can be exploited for imputing HD genotypes from inexpensive LowD F₂ genotypes, which subsequently allows combining existing data from experimental populations in a meta-analysis for association. There are several reasons for this strategy to be attractive. First, several of these populations have been recently created [21, 22, 24, 25] and DNA from these animals is available. Second, extensive datasets of phenotypes have been recorded for these populations, including for traits that are expensive or difficult to measure, such as the content of intramuscular fat and composition of fatty acids [25], age at puberty in gilts [22], and meat tenderness [26]. Finally, these populations are generally developed from breeds that are divergent for some traits of interest such as fat/lean content, meat quality or reproductive efficiency, for example: Duroc × Pietrain [1, 21], Duroc × Landrace [24], Duroc × Large-White [25], White-Duroc × Erhualian [22], Meishan × Duroc [27], Berkshire × Duroc [20].

Therefore, imputation of F₂ LowD to HD genotypes with high accuracy would be useful and convenient, providing a cost effective strategy as a first step for association analyses or meta-analyses. Different methods have been employed to select tagSNP in LowD panels. Two of them are: 1) imposing restrictions on the minimum value of linkage disequilibrium (LD) or r_t² between markers [28], and 2) selecting tagSNP that are evenly spaced using the physical distance between markers [4, 11, 12]. In addition, commercial chips are also available with medium density segregating SNP selected from several populations, for example for bovine [29] and pig [10]. A question arises of how many SNP are needed to attain a high accuracy of imputation for a given F₂ population. Another question is whether a specific chip has to be custom designed, or whether current commercially available chips can be used. Finally, it is important to determine whether both the F₀ and F₁ have to be genotyped at HD, or if just genotyping the F₀ is adequate to obtain a high accuracy of imputation in the F₂.

The goal of this research was to estimate the accuracy of imputation at HD (60K), from LowD F₂ genotypes for a Duroc × Pietrain population, using different genotyping schemes. The strategies were evaluated by means of Monte Carlo simulation, conditional on the genotypes from animals in the first two generations (F₀ and F₁). In doing so, two methods of tagSNP selection were considered and their results were compared to those obtained from a commercial panel chip (9K). In addition to simulations, accuracy of imputation was evaluated using experimental data, taking advantage of a reduced number of F₂ animals that were genotyped at HD.

Results

Linkage disequilibrium and selection of tagSNP

Table 1 displays the number of tagSNP selected with different values of LD in an intermediate size chromosome (SSC12), reflected by the measure r_t². As the value of r_t² increases, more tagSNP are selected and IA increases. As an example, when r_t² = 0.2, 79 tagSNP were selected at an average distance of 0.79 Mb and at an accuracy of 0.970. On the other hand for r_t² = 0.5, 399 tagSNP were selected, positioned at an average distance of 0.16 Mb with IA being equal to 0.982 (Table 1).
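The role of the threshold r_t² can be illustrated with a small sketch. The code below is a simple greedy selection written for illustration only; it is not the search implemented in FESTA, and the function names (`pairwise_r2`, `greedy_tag_snps`) and the toy haplotype matrix are assumptions. It keeps adding tagSNP until every SNP is either a tagSNP or has r² at or above the threshold with at least one tagSNP, so a higher threshold leads to more tagSNP.

```python
import numpy as np

def pairwise_r2(haplotypes):
    """Squared correlation between all SNP pairs (rows: haplotypes, columns: SNP, alleles 0/1).
    Assumes every SNP is segregating in the supplied haplotypes."""
    return np.corrcoef(haplotypes, rowvar=False) ** 2

def greedy_tag_snps(r2, threshold):
    """Greedy selection: each SNP ends up as a tag or with r2 >= threshold to some tag."""
    n = r2.shape[0]
    covers = r2 >= threshold            # covers[i, j]: SNP i can tag SNP j
    np.fill_diagonal(covers, True)      # a SNP always tags itself
    untagged = np.ones(n, dtype=bool)
    tags = []
    while untagged.any():
        gain = (covers & untagged).sum(axis=1)   # newly tagged SNP per candidate
        best = int(np.argmax(gain))
        tags.append(best)
        untagged &= ~covers[best]
    return sorted(tags)

# Toy data (hypothetical): 6 haplotypes x 4 SNP; SNP 0 and 1 are in complete LD.
haps = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 0],
                 [1, 1, 0, 1],
                 [1, 1, 1, 0],
                 [0, 0, 0, 0],
                 [1, 1, 1, 1]])
r2 = pairwise_r2(haps)
tags_strict = greedy_tag_snps(r2, threshold=0.5)
tags_loose = greedy_tag_snps(r2, threshold=0.1)
```

In this toy example the stricter threshold requires three tagSNP while the looser one requires a single tagSNP, mirroring the trend in Table 1.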

Table 1 Accuracy of imputation using tagSNP selected for different values of r_t² on chromosome 12

Evenly spaced SNP

The IA obtained with tagSNP selected using LD information and with evenly spaced tagSNP were similar. For example, the IA of non-typed SNP on SSC12 were 0.973 and 0.970, respectively, for 80 evenly spaced SNP as compared with 79 tagSNP selected with r_t² = 0.2 (Figure 1). Results for other densities of tagSNP were similar (Figure 1). Moreover, evenly spaced tagSNP sets of comparable density across chromosomes yielded similar accuracies. For example, at an average inter-marker distance of 2.1 Mb, 140 tagSNP on chromosome 1 and 30 tagSNP on chromosome 12 produced IA of 0.969 and 0.968, respectively (Figure 2). In summary, a minimum of 1,200 evenly spaced tagSNP across the genome (average distance = 2.1 Mb) are needed in this F₂ population to attain imputation accuracy IA ≥ 0.97 when the F₀ and F₁ are genotyped with a SNP60 chip.

Imputed genotypes in experimental F₂ animals

9K commercial chip

The values of IA were calculated for two scenarios and for each chromosome, using a 9K SNP list that was developed for producing a commercial LowD panel (GeneSeek, Inc., Lincoln, NE, USA; described in Badke et al. [10]).

Imputation accuracies (IA) were 0.90 and 0.99 when the F₁ was genotyped at low or high density, respectively (Figure 3). In the latter case, although the accuracy was high in all chromosomes (0.99), SNP in some regions were imputed with lower accuracy (Figure 4). High IA in the F₂ were obtained across all SNP when the F₁ was genotyped at HD (Figure 4a,b). However, when the F₁ was genotyped at LowD, IA in F₂ individuals decreased along the whole chromosome (Figure 4c,d). A logical question to consider is the following: how much accuracy is gained by including pedigree information, compared with using population-wide LD as the only source of information? To answer this, the imputation was performed again using as reference panel the genotypes of F₀ and F₁ animals and the F₂ at LowD, but without specifying the pedigree of the F₂. In other words, the F₂ animals were assumed unrelated and their parents were unknown. For chromosome 1 the results are displayed in Figure 5. Notice that the average IA in the F₂ was equal to 0.90. Therefore, the IA was lower than when the information on relationships was used (0.99, Figure 4a,b). This indicates that including HD genotypes from related animals and explicitly specifying parentage greatly increases accuracy of imputation.

The IA from both genotyping scenarios (Figures 3 and 4) reflect an average drop of 0.1 when the F₁ is genotyped at LowD. To gain further insight, the simulated haplotypes of two families were used to calculate accuracy of imputation in each scenario. When the F₁ is genotyped at LowD, the results showed that the phase error among the SNP that are not tagSNP increased. This loss of accuracy in determining the SNP phase can be traced back to the F₀ generation in which the non-tagSNP are also phased with low accuracy. Furthermore, the proportion of SNP with uncertain phase in the F₁ genotyped at HD was 4%, and the ensuing accuracy of haplotyping was 0.97. However, when the F₁ was genotyped at LowD the proportion of SNP with uncertain phase increased to 30%, and the corresponding accuracy of haplotyping for the non-tagSNP of F₁ genotypes dropped to 0.85. In a further analysis with the F₁ generation genotyped at HD and used as a reference population (ignoring F₀ genotypes), this resulted in 43% of non-tagSNP with uncertain phase in the F₁ at HD, and the haplotyping accuracy was even lower (0.78). These results suggest that, in order to have a high accuracy of imputation for non-tagSNP in F₂ genotypes, certainty of the phase in the F₁ genotypes is required. Such accurately estimated phase is guaranteed when two generations of HD genotypes (F₀, F₁) are available.

A closer look at Figures 4 and 5 indicates that the position of the SNP had some effect over IA. Therefore, we investigated the relationship between single SNP imputation accuracy and each SNP’s MAF, distance to the nearest tagSNP, and allelic frequency difference between founding breeds.

Minor allele frequency (MAF)

The measure of accuracy based on counting the number of alleles correctly imputed is sensitive to the allelic frequency [8, 12, 30]. In the current study, the square of the correlation (R²) between observed and imputed genotypes was used as a robust measure of accuracy of imputation. It is worth noting that the scale of this measure is somewhat different from the one derived from IA (Table 2).
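The difference in scale between the two measures can be seen in a small numerical sketch. The code below assumes genotypes coded as allele counts (0, 1, 2) and defines the allele-count accuracy as one minus half the mean absolute difference between observed and imputed genotypes; this exact formula, the function names and the toy genotypes are assumptions made for illustration and may differ from the computation used in the study.

```python
import numpy as np

def allelic_accuracy(observed, imputed):
    """Proportion of alleles imputed correctly (genotypes coded 0/1/2); assumed definition."""
    observed = np.asarray(observed, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return 1.0 - np.abs(observed - imputed).mean() / 2.0

def squared_correlation(observed, imputed):
    """Square of the Pearson correlation between observed and imputed genotypes."""
    r = np.corrcoef(observed, imputed)[0, 1]
    return r ** 2

# Hypothetical low-MAF SNP: 18 homozygotes and 2 heterozygotes among 20 animals.
observed = np.array([0] * 18 + [1, 1])
imputed = np.array([0] * 18 + [1, 0])    # one heterozygote imputed as homozygote

ia = allelic_accuracy(observed, imputed)
r2 = squared_correlation(observed, imputed)
```

A single wrong allele in 40 leaves the allele-count accuracy at 0.975, while the squared correlation falls to about 0.47, which is why allele counting is hard to compare across SNP with different MAF.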

Table 2 Imputation accuracy of SNP on chromosome 12 measured by IA or by R²

MAF using the 9K panel in the F₂

Figure 6 shows that the MAF of the imputed SNP was not related to R² in these data. Notice also that alleles with extreme frequencies (MAF < 0.1) can be imputed with accuracy similar to those SNP at intermediate frequencies (MAF > 0.3).

Distance to the closest tagSNP

No differences in R² were found for the range of distances between non-tagSNP and tagSNP observed (average was equal to 0.936 Mb). Therefore, for an average density between tagSNP of 0.26 Mb, R² is similar for a SNP that is in the middle of the interval than for a SNP that is close to the tagSNP (Figure 7). This observation suggests that the density of tagSNP was enough to attain a reasonably equal R² for all SNP within the interval.

Effect of the difference in allelic frequencies in the F₀

The difference in allelic frequency between founding populations does not seem to affect the R². This means that even SNP that segregate at very different frequencies in founders can be imputed with high accuracy, as revealed in Figure 8. Moreover, the apparent drop in R² for MAF differences over 0.75 presented in Figure 8 is largely an artefact of the very small number of SNP used in the smoothing line fit.

Discussion

SNP selection methods and accuracy of imputation

A main goal of the present research was to evaluate accuracy of imputation in an F₂ cross of pigs (Duroc × Pietrain) using different genotyping scenarios. In a first stage, IA was calculated from simulated F₂ data. An ideal situation for linkage based imputation would be to select SNP equally spaced based on genetic distance, as the possibility of recombination between imputed SNP and tagSNP would be minimal. However, this is not possible in the absence of a high resolution linkage map. Consequently, to position the tagSNP we used two proxies: a) physical spacing, and b) LD-based selection. For our simulated population, the two proxies produced the same results, most likely because a uniform recombination rate of 1 cM = 1 Mb was assumed. Therefore, in this simulated population, the average distance between tagSNP throughout the genome proved to be a good indicator of accuracy of imputation (IA), as values greater than or equal to 0.97 were obtained using average distances among tagSNP that were less than or equal to 2.1 Mb. Next, the selection of tagSNP using the LD method was compared to choosing SNP located at regularly spaced intervals throughout the genome. In the first method, LD was measured by r_t², the minimum threshold of r² between any non-tagSNP and at least one tagSNP. It was observed that when r_t² increased, the number of selected tagSNP and IA also increased. The accuracy was between 0.960 (r_t² = 0.1) and 0.982 (r_t² = 0.5), with average distance between tagSNP of 1.86 Mb and 0.16 Mb, respectively. Xu et al. [28] used r_t² = 0.8 to select a set of tagSNP for genome-wide association analyses in humans. Their use was slightly different from ours in that they were selecting SNP to tag causative variants for genome-wide association using population level LD information only.
On the other hand, we wanted to use this method to select SNP that were more evenly spaced in terms of genetic distance, as done previously with outbred pig populations [10], but this time exploiting within and between family LD. Consequently, low levels of r_t² were used in the current study, as we found that with a threshold of r_t² ≥ 0.6 many tagSNP were selected with only marginal increases of IA. The second method employed to select tagSNP consisted of dividing the chromosome into segments of equal size, and then choosing the SNP that lay closest to the center of the segment. Other studies have used evenly spaced tagSNP by selecting one SNP every given number of markers [12], or by choosing in each segment the SNP with the largest MAF [4, 11, 12]. The fact that we had available a sizable number of SNP throughout the genome, i.e. 60K, made it possible to select approximately evenly spaced SNP with a wide range of MAF, as long as those SNP were segregating in the population. The values of IA calculated while using tagSNP chosen at evenly spaced segments were similar to those obtained using the LD method. This similarity of results may be due to an assumption made in the method of SNP selection at evenly spaced intervals, i.e. that the distribution of LD along the genome is almost uniform and there are no large blocks of LD. In the current research, the haplotypes of F₁ animals are sampled from two populations: Duroc and Pietrain. The resulting LD was relatively high and uniformly distributed, except for a few blocks with extremely high LD: blocks with at least 7 consecutive SNP with r² ≥ 0.8. For this reason, evenly spaced tagSNP and tagSNP selected based on the LD method produced similar imputation accuracy at equivalent density. Although we indeed simulated assuming uniform recombination rates, these results seem to agree also with experimental data, where the two methods of selection used here produced virtually the same accuracy in an outbred pig population [10].
Designing custom low density SNP panels for each population of interest would not be cost effective. Consequently, we investigated the imputation accuracy obtained using a commercially available SNP chip with markers selected based on physical position and MAF [10].

Imputation using 9K panel and genotyping scenarios

Data from a 9K chip (average distance between SNP = 0.30 Mb) were used as a LowD panel to impute to a HD 60K panel. Using the experimental data from F₂ individuals, different genotyping scenarios were tested. In the first scenario, data consisted of F₀ and F₁ genotypes at HD and F₂ at LowD, and average IA was 0.99. Similarly, Weigel et al. [13] imputed 8K genotypes to 43K using information of the sire, dam, and grandsires (paternal and maternal), and obtained a value of IA > 0.95.

Our second scenario included the F₁ genotyped at 9K, between the generations of grandparents and grandoffspring, and it was observed that IA of F₂ decreased to 0.9. In our last scenario F₀ and F₁ were genotyped at HD and F₂ at LowD but the relationships between the F₂ and the reference panel were ignored, resulting in an average accuracy of imputation of 0.9. Badke et al. [10] used the genotypes of a reference population formed by trios to impute genotypes of an unrelated population, and obtained values of IA of 0.90 and 0.95 using reference groups of 16 and 64 animals, respectively.

Habier et al. [6] indicated that the reasons for the decay in accuracy of imputation are two-fold: 1) the accuracy of haplotyping the tagSNP flanking the non-tagSNP; 2) the accuracy of haplotyping the imputed non-tagSNP, conditional on a correct haplotyping of the tagSNP. Therefore, the impact of both factors under the first two scenarios, taking into account the relationships between the individuals in the F₂ and in the reference population, was evaluated by means of simulated data. Accuracies of haplotyping were calculated from the number of erroneous inferences of phase between consecutive heterozygous markers, as in Druet and Georges [31]. In all scenarios, it was observed that the phases of tagSNP were correct, thus the uncertainty was due to the grandparental origin of the non-tagSNP that were flanked by the tagSNP. The next step was to quantify the fraction of non-tagSNP with uncertain phase. When F₀ and F₁ were genotyped at HD and F₂ was genotyped at LowD, the fraction of non-tagSNP with uncertain phase was 4%, whereas this statistic was 30% when the F₀ was genotyped at HD, and the F₁ and F₂ were genotyped at LowD. The corresponding IA were 0.97 and 0.85, respectively. These results suggest that accuracies of imputation in the current study were affected by knowledge of the phase of non-tagSNP. Moreover, when the amount of genotypes from related individuals (i.e., F₁ at HD) increases, the accuracy of haplotyping goes up, and the accuracy of imputation also increases. These results apply to genotyping designs with a pedigree with a small number of founder individuals genotyped in HD and a large number of progeny genotyped in LowD. If the phase is known in the founders, it is easy to accurately follow transmission of chromosomal segments to the remainder of the population using linkage information. In practice, however, the phase needs to be ascertained using LD information.
Such information is very limited in cases such as our F₀ because of reduced sample size. In that case, the researchers can follow two paths. First, as presented with large pedigrees, having extra animals from the same founding population(s) can help in using LD to accurately phase those animals. Second, as presented here, two consecutive generations can be genotyped in HD to use the information in grand-parents (F₀) to accurately phase the parents (F₁) and then use linkage information to impute genotypes within the progeny (F₂). For such approaches to work, full pedigree information (three generations) and two generations of HD genotypes are needed. The approach is still cost effective in typical F₂ populations [6, 32]. These results are partially reaffirmed in large pedigree based imputation.
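The haplotyping accuracy discussed in this section is based on counting erroneous phase inferences between consecutive heterozygous markers. A minimal sketch of one common way to compute such a switch-based accuracy is given below. The function name `switch_accuracy`, the toy haplotypes and the convention of reporting one minus the switch rate are assumptions for illustration, and the code assumes that the inferred genotype agrees with the true genotype at every heterozygous marker.

```python
def switch_accuracy(true_haps, inferred_haps):
    """One minus the proportion of phase switches between consecutive heterozygous markers.

    true_haps and inferred_haps are pairs of allele sequences (the two haplotypes
    of one animal). Homozygous markers carry no phase information and are skipped.
    """
    het = [i for i, (a, b) in enumerate(zip(*true_haps)) if a != b]
    if len(het) < 2:
        return 1.0
    # Orientation at each heterozygous marker: does inferred haplotype 0 match true haplotype 0?
    same = [inferred_haps[0][i] == true_haps[0][i] for i in het]
    switches = sum(1 for x, y in zip(same, same[1:]) if x != y)
    return 1.0 - switches / (len(het) - 1)

# Hypothetical animal with four heterozygous markers and one homozygous marker.
true_haps = ([0, 1, 0, 1, 1], [1, 0, 1, 0, 1])
inferred_haps = ([0, 1, 1, 0, 1], [1, 0, 0, 1, 1])   # phase flips after the second marker

accuracy = switch_accuracy(true_haps, inferred_haps)
```

Here one switch occurs among three consecutive heterozygous pairs, so the accuracy is 2/3. Because only relative orientation matters, swapping the labels of the two inferred haplotypes does not change the result.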

MAF effect

The measures of accuracy of imputation that are based only on allelic counts are not useful for comparing SNP having different values of MAF. This is due to the fact that imputation errors are highly sensitive to the value of the allelic frequencies [8, 12, 30]. To overcome this restriction, two alternative measures of accuracy of imputation have been proposed: 1) the correlation between imputed and observed genotypes [8]; and 2) an accuracy of imputation corrected to its expected value [12, 30]. The second method consists of adjusting the calculated accuracy of imputation by the difference between the observed accuracy and an estimate of the expected value under random sampling. There are several possible ways of calculating the accuracy under this method. Regardless of the measure being used to calculate the accuracy, a trend for the accuracy of imputation to drop when MAF < 0.15 has been observed. For example, in maize Hickey et al. [8] observed a decrease in R² when MAF < 0.10, and the drop was higher when the masked genotypes were >84% of total SNP. Similarly, Lin et al. [30] used human data with the correction for expected accuracy and observed a marked decrease in accuracy of imputation when MAF < 0.15. Hayes et al. [12] used the same correction as Lin et al. [30] with sheep data and found highly variable accuracies of imputation that tended to decrease whenever MAF < 0.10. The squared correlation between observed and imputed genotypes (R²) was employed in the current research to evaluate the effect of MAF on imputation accuracy. Our results showed that markers with MAF < 0.10 in the founders were imputed with reasonably good accuracy in the F₂ (Figure 6), a result different from those previously discussed. This is not unexpected considering we used both LD and linkage (pedigree information) as sources of information from our crossbred population.
Therefore, the allele frequency in the F₀ does not matter as long as in that generation the two alleles are segregating. Moreover, whenever the F₁ is genotyped at HD, SNP with low MAF can be observed in the F₀ and F₁. Coupled to the fact that all family relationships are known, this simplifies the imputation of F₂ animals.

Possible effects in association

In the current research we compared allelic dosage of observed and imputed genotypes to find accurate genotyping designs and imputation methods for LowD genotypes in an F₂ population. Zhen et al. [33] reported that the regression of phenotype on allelic dosage was an accurate method to evaluate QTL effects. Moreover, they observed that when accuracies of imputation were high, the power for the association test was high. For example, accuracies of imputation > 0.95 were associated with values of power > 0.85. In the current study, the accuracy of imputation obtained with the 9K panel was R² = 0.94, which suggests that the power for an association test is high. Other studies also found that imputation improved the power for association tests. Using data from humans, Hao et al. [34] compared the power for GWAS analysis of four different strategies involving imputation: (1) directly testing for associations using the Illumina 317K SNPs; (2) testing for associations using the entire imputed HapMap SNP set based on the Illumina 317K genotype data; (3) directly testing for associations using the Illumina 650Y SNPs; and (4) testing for associations using the entire imputed HapMap SNP set based on Illumina 650Y genotype data. It was observed that genome-wide imputation (strategies 2 and 4) improved power by 5.5% for the Illumina 317K, or 3.3% for Illumina 650Y, compared to the analyses with assayed SNPs only (strategies 1 and 3, respectively). Similar results were obtained by Anderson et al. [5] for the 300K and 550K platforms.

The cost of genotyping is an important consideration. At present, the cost of commercial HD genotyping (60K) for pigs is more than twice as much as the cost of genotyping with the 9K chip. Assuming a population with a structure similar to the one used here (approximately 20 F₀, 56 F₁ and 1000 F₂), one can genotype 1.9 times more individuals in a scenario with F₀ and F₁ at HD, and F₂ at LowD than in a scenario with F₀, F₁ and F₂ at HD. The imputed genotypes can then be used for association or for meta-analysis studies.
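The arithmetic behind the 1.9 figure can be checked with a short sketch. It assumes, for illustration only, that an HD genotype costs exactly twice as much as a 9K genotype (the text states "more than twice"); the function name and the cost unit are hypothetical.

```python
def relative_cost(n_hd, n_low, hd_to_low_ratio=2.0):
    """Total genotyping cost in units of one LowD (9K) genotype.

    hd_to_low_ratio is an assumed price ratio between the 60K and 9K chips.
    """
    return n_hd * hd_to_low_ratio + n_low

n_f0, n_f1, n_f2 = 20, 56, 1000          # approximate population structure from the text

all_hd = relative_cost(n_f0 + n_f1 + n_f2, 0)    # every animal at HD
mixed = relative_cost(n_f0 + n_f1, n_f2)         # F0 and F1 at HD, F2 at LowD
ratio = all_hd / mixed                           # animals affordable under the mixed design
```

With a price ratio of exactly 2 the mixed design costs 1,152 units against 2,152 units for the all-HD design, a ratio of about 1.87. A price ratio slightly above 2 brings this closer to the 1.9 quoted above.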

Conclusions

Designing custom SNP panels for each F₂ population to be imputed will likely not be cost effective due to the relatively large number of SNP needed to attain reasonable imputation accuracies, and the high development costs for each SNP panel. In particular, for our population we would need a minimum of M = 1,200 markers with average distance of 2.1 Mb to have IA over 0.97 in the F₂. On the other hand, using the 9K panel as tagSNP (LowD) resulted in IA of 0.99 when the F₀ and F₁ were genotyped at HD and the F₂ at LowD. The cost of such a genotyping scheme would be roughly half the cost of using HD genotypes for all individuals. The correlation between observed and imputed genotypes was high (R² = 0.94), so that the power for future association studies would be high. Thus, under a genotyping strategy of high accuracy of imputation (i.e., F₀ and F₁ at HD, F₂ at LowD), information on imputed genotypes from more animals that is similar to that from a HD panel can be obtained at a lower cost. These results apply to the imputation of markers in the SNP60 beadchip, in populations where a small number of founders can be genotyped at HD and phase of parents of imputed animals can be derived with certainty. Translation of LD-based results, on the other hand, is constrained to pig populations showing similar levels of LD as in the founding animals [35].

Methods

Animals

The experimental population was raised at the Michigan State University Swine Teaching and Research Farm, East Lansing, MI [1]. Parents from the initial generation (F₀) were four unrelated Duroc boars mated to 15 Pietrain sows by artificial insemination. From all resulting F₁ animals, 50 females and 6 males (progeny of 3 F₀ sires) were selected as parents for the F₂ generation, by avoiding full or half sib matings. A total of 1,259 F₂ piglets were born alive from 142 litters out of 11 farrowing groups. Animal protocols were approved by the Michigan State University All University Committee on Animal Use and Care (AUF# 09/03-114-00).

Genotyping and data editing

DNA was isolated from white blood cells using standard procedures as we have previously described for this population [1]. Quantity and quality of DNA samples were determined using a Qubit fluorometer (Invitrogen by Life Technologies, Carlsbad, CA, USA). The number of genotyped animals was N = 411 (4 F₀ Duroc boars, 15 F₀ Pietrain sows, 6 F₁ males, 50 F₁ females and 336 F₂ pigs). Genotyping was performed at a commercial laboratory (GeneSeek, a Neogen Company, Lincoln, NE, USA) using the Illumina PorcineSNP60 beadchip [36]. Out of M = 62,163 SNP, 6,422 SNP were eliminated as their physical positions were unknown. Mendelian inconsistencies (≤ 0.01%) were taken as missing genotypes, and 12 animals (1 F₁ and 11 F₂) with more than 10% of SNP missing were not used in any analysis. By similar consideration, 3,038 SNP were removed from the analyses due to presenting more than 10% missing data. Additionally, 10,139 SNP were excluded as their minor allele frequency (MAF) was below 0.01. These editing policies resulted in a data set comprising 399 pigs with 45,003 SNP per animal. This editing procedure followed that of Badke et al. [35] and the program PLINKv1.07 [37] was used. Additionally, starting with genotypes for F₀ and F₁ animals, genotypes for 932 F₂ animals were simulated conditional on the real pedigree using a gene-dropping model. Simulated genotypes were used to assess alternative tagSNP selection procedures while experimental genotypes on a subset of animals (n = 336) were used to assess imputation accuracy using a SNP list for a 9K commercial chip that has recently been publicly released by GeneSeek Inc. (Lincoln, NE, USA; described in Badke et al. [10]).
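The editing rules above can be summarized as a short sketch. The study used PLINK for this step; the code below is only a NumPy illustration of the same thresholds, with genotypes coded 0/1/2 and missing calls as NaN. The function name, the order in which animals and SNP are filtered, and the toy matrix are assumptions.

```python
import numpy as np

def quality_filter(genotypes, max_missing=0.10, min_maf=0.01):
    """Return boolean masks of animals and SNP that pass the editing thresholds.

    genotypes: animals x SNP array of allele counts (0/1/2), NaN for missing calls.
    """
    g = np.asarray(genotypes, dtype=float)
    # 1) drop animals with more than 10% missing genotypes
    keep_animals = np.isnan(g).mean(axis=1) <= max_missing
    g = g[keep_animals]
    # 2) drop SNP with more than 10% missing genotypes among retained animals
    keep_snps = np.isnan(g).mean(axis=0) <= max_missing
    # 3) drop SNP with minor allele frequency below 0.01
    freq = np.nanmean(g, axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    keep_snps &= maf >= min_maf
    return keep_animals, keep_snps

# Toy matrix (hypothetical): 10 animals x 20 SNP, all heterozygous to start with.
g = np.ones((10, 20))
g[:, 1] = 0            # SNP 1 is monomorphic
g[0, :] = np.nan       # animal 0 failed genotyping
g[1:3, 2] = np.nan     # SNP 2 is missing in two animals

keep_animals, keep_snps = quality_filter(g)
```

In the toy matrix one animal and two SNP are removed: the failed animal, the monomorphic SNP, and the SNP with 2 of 9 calls missing.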

Genotype simulation

A stochastic simulation was performed to evaluate the effect of two different methods of selecting tagSNP on the accuracy of the resulting imputed F₂ genotypes. The genotypes of 932 F₂ animals were simulated using gene-dropping theory [38], by conditioning on a real pedigree and on the haplotypes of the 55 F₁ parents (6 males and 49 females) from the real F₂ population. The haplotypes were estimated at a high accuracy from the genotypes of the F₁ parents and 19 F₀ ancestors (4 Duroc boars and 15 Pietrain sows), using the software MERLIN [39]. The number of recombinations in the F₁ haplotypes was drawn from a Poisson distribution with mean equal to the length of the given chromosome in Morgans (M) by assuming 1 Mb = 1 cM [40]. The positions of the recombinations were simulated from a uniform distribution using Haldane’s mapping function [41, 42]. As an example, there were 1,405 SNP on chromosome 12 that were spread over 64.2 Mb, and the ensuing average distance between markers was 0.04573 Mb. By assuming a recombination rate of 1 cM per Mb [38], the number of recombinations in chromosome 12 was drawn from a Poisson distribution with parameter equal to 64.2 / 100 = 0.642. The next step was to assign the resulting gametes carrying these recombinations of the F₁ genotypes to their F₂ progeny.
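The gamete-generation step can be sketched as follows. This is a minimal illustration under the assumptions stated in the text (1 cM per Mb, Poisson number of crossovers, uniformly distributed positions, no interference); it is not the simulation code used in the study, and the function name, the random choice of the starting haplotype and the toy haplotypes are assumptions.

```python
import numpy as np

def simulate_gamete(hap_a, hap_b, positions_mb, chrom_length_mb, rng):
    """Recombine the two haplotypes of an F1 parent into one gamete."""
    # Number of crossovers: Poisson with mean equal to chromosome length in Morgans
    n_cross = rng.poisson(chrom_length_mb / 100.0)
    # Crossover positions: uniform along the chromosome
    breaks = np.sort(rng.uniform(0.0, chrom_length_mb, n_cross))
    # Count crossovers to the left of each SNP; parity decides the source haplotype
    n_left = np.searchsorted(breaks, positions_mb)
    start = rng.integers(2)                 # gamete starts on haplotype a (0) or b (1)
    from_b = (n_left + start) % 2 == 1
    return np.where(from_b, hap_b, hap_a)

# Chromosome 12 dimensions from the text; haplotypes are hypothetical.
rng = np.random.default_rng(1)
positions = np.linspace(0.0, 64.2, 1405)
hap_a = np.zeros(1405, dtype=int)
hap_b = np.ones(1405, dtype=int)

gamete = simulate_gamete(hap_a, hap_b, positions, 64.2, rng)
```

Under these assumptions the expected number of crossovers per gamete on chromosome 12 is 0.642, so most simulated gametes are copies of one parental haplotype or carry a single crossover.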

TagSNP selection using simulated dataset

Two methods were used for tagSNP selection. 1) The first was a statistical search implemented in the software FESTA [43], which uses information on LD [44]. In this method, each SNP is either an element of the tagSNP set or in LD with an element of the tagSNP set at a value equal to or larger than a specified threshold (r_t²) [10]. A minimum level of r_t², based on pair-wise LD of the F₁ haplotypes, was chosen, so that all SNP above the chosen threshold were selected as tagSNP. 2) The second method consisted of selecting evenly spaced markers. Each chromosome was divided into k segments of equal length, and the SNP closest to the center of each segment was selected. When no SNP lay in a segment, none was selected for it, so the number of tagSNP per chromosome was ≤ k, at approximately even spacing.
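The evenly spaced selection (method 2) is simple enough to sketch directly. The function below is an illustration under stated assumptions, not the authors' code: positions are given in Mb from the start of the chromosome, ties are resolved in favor of the first SNP encountered, and the function name is hypothetical. The LD-based search of FESTA is not reproduced here.

```python
def evenly_spaced_tagsnp(positions_mb, chrom_length_mb, k):
    """Pick at most k SNP: one per equal-length segment, nearest its center.

    positions_mb: SNP positions on one chromosome (any order).
    Returns the sorted indices (into positions_mb) of the selected SNP.
    Segments that contain no SNP contribute nothing, so len(result) <= k.
    """
    seg_len = chrom_length_mb / k
    best = {}  # segment index -> (distance to center, SNP index)
    for idx, pos in enumerate(positions_mb):
        # a SNP sitting exactly at the chromosome end goes to the last segment
        seg = min(int(pos / seg_len), k - 1)
        center = (seg + 0.5) * seg_len
        dist = abs(pos - center)
        if seg not in best or dist < best[seg][0]:
            best[seg] = (dist, idx)
    return sorted(idx for _, idx in best.values())
```

With `k` chosen as chromosome length divided by the target spacing, this produces panels such as the one-SNP-every-2.1-Mb set mentioned in the abstract, apart from segments without markers.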

Genotype imputation

For simulated data, F₂ genotypes of non-typed markers were imputed using the algorithm of Lander and Green [39] that predicts the non-tagSNP by conditioning on the observed markers. For computational reasons the pedigree was analyzed on a per litter basis. Thus, for each F₂ litter, a three generation pedigree was built [45] using the four F₀ grandparents, the two F₁ parents, and up to a maximum of 10 F₂ animals. When the litter had more than 10 progeny, a new “family” was formed with the four F₀ grandparents, the two F₁ parents and the remaining F₂ animals. The resulting “families” were analyzed separately and genotypes were imputed with MERLIN [39]. Breaking the pedigree in this way produces some loss of information, but simulation results (data not shown) suggested that the loss was negligible.
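The per-litter splitting described above can be sketched as follows. This is only an illustration of the bookkeeping; the dictionary layout and the function name are assumptions, and the sketch generalizes the rule to chunks of at most 10 F₂ animals when a litter is larger than 20, which the text does not state explicitly.

```python
def split_litter(f0_grandparents, f1_parents, f2_litter, max_f2=10):
    """Return the list of small pedigrees ("families") for one litter.

    Every family repeats the four F0 grandparents and the two F1 parents
    and carries at most max_f2 of the litter's F2 animals.
    """
    families = []
    for start in range(0, len(f2_litter), max_f2):
        families.append({
            "F0": list(f0_grandparents),
            "F1": list(f1_parents),
            "F2": list(f2_litter[start:start + max_f2]),
        })
    return families
```

A litter of 13 animals, for example, yields two families with 10 and 3 F₂ animals, each analyzed separately with the same six ancestors.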

For experimental data, F₂ genotypes of non-typed markers were imputed using the algorithm implemented in the software AlphaImpute [4]. This algorithm uses information on population-wide and within-family LD, and it required some tuning. In particular, we set the core length parameter to 100, 150, 400 and 600 SNP and the haplotype tail parameter to 300, 400, 600 and 800 SNP, respectively. Likewise, the genotype error percentage parameter was set to 0%, so as to obtain a high percentage of alleles assigned to the correct phase [46]. The algorithm was run on the entire pedigree, as there was no computing restriction in this case.

Calculation of the accuracy of imputation

Irrespective of data generation (simulation or experimental), the accuracy of genotype imputation in F₂ individuals for all methods was evaluated using two different statistics. First, the mean of the difference between observed and imputed allelic dosage was calculated [9, 13] as follows:

$$IA = 1 - \frac{1}{2N} \sum_{i}^{N} \sum_{j}^{M_{i}} |g_{ij} - \hat{g}_{ij}|$$

In this expression, N is the total number of animals imputed, M_i is the number of markers with observed genotype in animal i, g_ij is the observed (experimental or simulated) allelic dosage in animal i at SNP j, and $\hat{g}_{ij}$ is the corresponding imputed allelic dosage. Allelic dosage was defined as the number of copies of a reference allele, taking values 0, 1 and 2 for homozygous non-reference, heterozygous and homozygous reference genotypes, respectively. The second measure of imputation accuracy was the square of the correlation between observed and imputed genotypes at each SNP, the R² statistic of Huang et al. [47]. Denoting by $\bar{\hat{g}}$ the average value of the imputed genotypes and by $\bar{g}$ the average value of the observed genotypes, the R² statistic was calculated as follows:

$$R^{2} = \left( \frac{\sum_{i=1}^{N} (\hat{g}_{ij} - \bar{\hat{g}}) (g_{ij} - \bar{g})}{\sqrt{\sum_{i=1}^{N} (\hat{g}_{ij} - \bar{\hat{g}})^{2} \sum_{i=1}^{N} (g_{ij} - \bar{g})^{2}}} \right)^{2}$$

The statistic is interpreted as a squared correlation coefficient.
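Both accuracy statistics can be computed with a few lines of code. The sketch below is an illustration, not the authors' implementation. One assumption should be stated plainly: the IA expression as printed shows no divisor for the sum over markers, whereas the text describes IA as a mean difference, so the sketch averages the absolute differences over the M_i observed markers of each animal, which keeps IA between 0 and 1. Function names and the use of `None` for an unobserved genotype are hypothetical.

```python
import math


def imputation_accuracy(observed, imputed):
    """IA = 1 - (1 / 2N) * sum over animals of the mean |g - g_hat|.

    observed, imputed: lists (animals) of lists (markers) of dosages.
    None in `observed` marks a genotype that was not observed; it is skipped.
    Assumption: the per-animal sum is divided by M_i (see lead-in).
    """
    per_animal = []
    for g_row, ghat_row in zip(observed, imputed):
        diffs = [abs(g - gh) for g, gh in zip(g_row, ghat_row)
                 if g is not None]
        if diffs:
            per_animal.append(sum(diffs) / len(diffs))
    return 1.0 - sum(per_animal) / (2.0 * len(per_animal))


def r_squared(observed, imputed, j):
    """Squared Pearson correlation of observed and imputed dosage at SNP j."""
    pairs = [(g[j], gh[j]) for g, gh in zip(observed, imputed)
             if g[j] is not None]
    n = len(pairs)
    mean_g = sum(g for g, _ in pairs) / n
    mean_gh = sum(gh for _, gh in pairs) / n
    cov = sum((g - mean_g) * (gh - mean_gh) for g, gh in pairs)
    var_g = sum((g - mean_g) ** 2 for g, _ in pairs)
    var_gh = sum((gh - mean_gh) ** 2 for _, gh in pairs)
    if var_g == 0 or var_gh == 0:
        return math.nan  # undefined when either vector is constant
    return cov ** 2 / (var_g * var_gh)
```

The two statistics behave differently at low MAF: IA can stay high when a SNP is almost monomorphic, while R² becomes unstable or undefined, which is why the sketch returns NaN for a constant vector instead of forcing a value.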

References

1. Edwards DB, Ernst CW, Tempelman RJ, Rosa GJM, Raney NE, Hoge MD, Bates RO: Quantitative trait loci mapping in an F2 Duroc x Pietrain resource population: I. Growth traits. J Anim Sci. 2008, 86: 241-253.
2. Haley CS, Knott SA, Elsen JM: Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics. 1994, 136: 1195-1207.
3. Choi I, Steibel JP, Bates RO, Raney NE, Rumph JM, Ernst CW: Application of alternative models to identify QTL for growth traits in an F2 Duroc x Pietrain pig resource population. BMC Genet. 2010, 11: 97.
4. Hickey JM, Kinghorn BP, Tier B, van der Werf JH, Cleveland MA: A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol. 2012, 44: 9. 10.1186/1297-9686-44-9.
5. Anderson CA, Pettersson FH, Barrett JC, Zhuang JJ, Ragoussis J, Cardon LR, Morris AP: Evaluating the Effects of Imputation on the Power, Coverage, and Cost Efficiency of Genome-wide SNP Platforms. Am J Hum Genet. 2008, 83: 112-119. 10.1016/j.ajhg.2008.06.008.
6. Habier D, Fernando RL, Dekkers JCM: Genomic selection using low-density marker panels. Genetics. 2009, 182: 343-353. 10.1534/genetics.108.100289.
7. Huang Y, Hickey JM, Cleveland MA, Maltecca C: Assessment of alternative genotyping strategies to maximize imputation accuracy at minimal cost. Genet Sel Evol. 2012, 44: 25. 10.1186/1297-9686-44-25.
8. Hickey JM, Crossa J, Babu R, de los Campos G: Factors affecting the accuracy of genotype imputation in populations from several maize breeding programs. Crop Sci. 2012, 52: 654-663. 10.2135/cropsci2011.07.0358.
9. Zhang Z, Druet T: Marker imputation with low-density marker panels in Dutch Holstein cattle. J Dairy Sci. 2010, 93: 5487-5494. 10.3168/jds.2010-3501.
10. Badke YM, Bates RO, Ernst CW, Schwab C, Fix J, Van Tassell CP, Steibel JP: Methods of tagSNP selection and other variables affecting imputation accuracy in swine. BMC Genet. 2013, Accepted
11. Huang Y, Maltecca C, Cassady JP, Alexander LJ, Snelling WM, Macneil MD: Effects of reduced panel, reference origin, and genetic relationship on imputation of genotypes in Hereford cattle. J Anim Sci. 2012, 59301: 1-17.
12. Hayes BJ, Bowman PJ, Daetwyler HD, Kijas JW, van der Werf JHJ: Accuracy of genotype imputation in sheep breeds. Anim Genet. 2011, 43: 72-80.
13. Weigel KA, Tassell CPV, O’Connell JR, VanRaden PM, Wiggans GR: Prediction of unobserved single nucleotide polymorphism genotypes of Jersey cattle using reference panels and population-based imputation algorithms. J Dairy Sci. 2010, 93: 2229-2238. 10.3168/jds.2009-2849.
14. Tian F, Bradbury PJ, Brown PJ, Hung H, Sun Q, Flint-Garcia S, Rocheford TR, McMullen MD, Holland JB, Buckler ES: Genome-wide association study of leaf architecture in the maize nested association mapping population. Nat Genet. 2011, 43: 159-162. 10.1038/ng.746.
15. Poland JA, Bradbury PJ, Buckler ES, Nelson RJ: Genome-wide nested association mapping of quantitative resistance to northern leaf blight in maize. Proc Natl Acad Sci USA. 2011, 108: 6893-6898. 10.1073/pnas.1010894108.
16. Kover PX, Valdar W, Trakalo J, Scarcelli N, Ehrenreich IM, Purugganan MD, Durrant C, Mott R: A multiparent advanced generation inter-cross to fine-map quantitative traits in Arabidopsis thaliana. PLoS Genet. 2009, 5: e1000551. 10.1371/journal.pgen.1000551.
17. Burdick JT, Chen W-M, Abecasis GR, Cheung VG: In silico method for inferring genotypes in pedigrees. Nat Genet. 2006, 38: 1002-1004. 10.1038/ng1863.
18. Bouxsein ML, Uchiyama T, Rosen CJ, Shultz KL, Donahue LR, Turner CH, Sen S, Churchill G, Müller R, Beamer WG: Mapping quantitative trait loci for vertebral trabecular bone volume fraction and microarchitecture in mice. Journal of bone and mineral research: the official journal of the American Society for Bone and Mineral Research. 2004, 19: 587-599.
19. Leduc MS, Hageman RS, Verdugo RA, Tsaih S-W, Walsh K, Churchill GA, Paigen B: Integration of QTL and bioinformatic tools to identify candidate genes for triglycerides in mice. J Lipid Res. 2011, 52: 1672-1682. 10.1194/jlr.M011130.
20. Stearns TM, Beever JE, Southey BR, Ellis M, Mckeith FK, Rodriguez-Zas SL: Evaluation of approaches to detect quantitative trait loci for growth, carcass, and meat analyses. J Anim Sci. 2005, 83: 1481-1493.
21. Liu G, Jennen DGJ, Tholen E, Juengst H, Kleinwächter T, Hölker M, Tesfaye D, Ün G, Schreinemachers H-J, Murani E, Ponsuksili S, Kim J-J, Schellander K, Wimmers K: A genome scan reveals QTL for growth, fatness, leanness and meat quality in a Duroc-Pietrain resource population. Anim Genet. 2007, 38: 241-252. 10.1111/j.1365-2052.2007.01592.x.
22. Yang G, Ren J, Li S, Mao H, Guo Y, Zou Z, Ren D, Ma J, Huang L: Genome-wide identification of QTL for age at puberty in gilts using a large intercross F2 population between White Duroc x Erhualian. Genetics. 2008, 40: 529-539.
23. Mackay TFC: The genetic architecture of quantitative traits. Annu Rev Genet. 2001, 35: 303-339. 10.1146/annurev.genet.35.102401.090633.
24. Nonneman D, Lindholm-Perry AK, Shackelford SD, King DA, Wheeler TL, Rohrer GA, Bierman CD, Schneider JF, Miller RK, Zerby H, Moeller SJ: Predictive markers in calpastatin for tenderness in commercial pig populations. J Anim Sci. 2011, 89: 2663-2672. 10.2527/jas.2010-3556.
25. Sanchez MP, Iannuccelli N, Basso B, Bidanel J-P, Billon Y, Gandemer G, Gilbert H, Larzul C, Legault C, Riquet J, Milan D, Le Roy P: Identification of QTL with effects on intramuscular fat content and fatty acid composition in a Duroc x Large White cross. BMC Genet. 2007, 8: 55.
26. Meyers SN, Rodriguez-Zas SL, Beever JE: Fine-mapping of a QTL influencing pork tenderness on porcine chromosome 2. BMC Genet. 2007, 8: 69.
27. Sato S, Oyamada Y, Atsuji K, Nade T, Sato S-i, Kobayashi E, Mitsuhashi T, Nirasawa A, Komatsuda Y, Saito S, Terai T, Hayashi T, Sugimoto Y: Quantitative trait loci analysis for growth and carcass traits in a Meishan × Duroc F2 resource population. J Anim Sci. 2003, 81: 2938-2949.
28. Xu Z, Kaplan NL, Taylor JA: TAGster: efficient selection of LD tag SNPs in single or multiple populations. Bioinformatics. 2007, 23: 3254-3255. 10.1093/bioinformatics/btm426.
29. Wiggans GR, Cooper TA, Vanraden PM, Olson KM, Tooker ME: Use of the Illumina Bovine3K BeadChip in dairy genomic evaluation. J Dairy Sci. 2012, 95: 1552-1558. 10.3168/jds.2011-4985.
30. Lin P, Hartz SM, Zhang Z, Saccone SF, Wang J, Tischfield JA, Edenberg HJ, Kramer JR, Goate A, Bierut LJ, Rice JP: A new statistic to evaluate imputation reliability. PLoS One. 2010, 5: e9697. 10.1371/journal.pone.0009697.
31. Druet T, Georges M: A hidden markov model combining linkage and linkage disequilibrium information for haplotype reconstruction and quantitative trait locus fine mapping. Genetics. 2010, 184: 789-798. 10.1534/genetics.109.108431.
32. Druet T, Farnir FP: Modeling of identity-by-descent processes along a chromosome between haplotypes and their genotyped ancestors. Genetics. 2011, 188: 409-419. 10.1534/genetics.111.127720.
33. Zheng J, Li Y, Abecasis GR, Scheet P: A comparison of approaches to account for uncertainty in analysis of imputed genotypes. Genet Epidemiol. 2011, 35: 102-110. 10.1002/gepi.20552.
34. Hao K, Chudin E, McElwee J, Schadt EE: Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies. BMC Genet. 2009, 10: 27.
35. Badke YM, Bates RO, Ernst CW, Schwab C, Steibel JP: Estimation of linkage disequilibrium in four US pig breeds. BMC Genomics. 2012, 13: 24. 10.1186/1471-2164-13-24.
36. Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P, Hansen MS, Hedegaard J, Hu Z-L, Kerstens HH, Law AS, Megens H-J, Milan D, Nonneman DJ, Rohrer GA, Rothschild MF, Smith TPL, Schnabel RD, Van Tassell CP, Taylor JF, Wiedmann RT, Schook LB, Groenen MAM: Design of a high density SNP genotyping assay in the pig using SNPs identified and characterized by next generation sequencing technology. PLoS One. 2009, 4: e6524. 10.1371/journal.pone.0006524.
37. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, Sham PC: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007, 81: 559-575. 10.1086/519795.
38. Ledur MC, Navarro N, Pérez-Enciso M: Large-scale SNP genotyping in crosses between outbred lines: how useful is it?. Heredity. 2010, 105: 173-182. 10.1038/hdy.2009.149.
39. Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin–rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
40. Laval G, Excoffier L: SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004, 20: 2485-2487. 10.1093/bioinformatics/bth264.
41. Cheema J, Dicks J: Computational approaches and software tools for genetic linkage map estimation in plants. Brief Bioinform. 2009, 10: 595-608. 10.1093/bib/bbp045.
42. Haldane JBS: The combination of linkage values and the calculation of distances between the loci of linked factors. J Genet. 1919, 8: 299-309.
43. Qin ZS, Gopalakrishnan S, Abecasis GR: An efficient comprehensive search algorithm for tagSNP selection using linkage disequilibrium criteria. Bioinformatics. 2006, 22: 220-225. 10.1093/bioinformatics/bti762.
44. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA: Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet. 2004, 74: 106-120. 10.1086/381000.
45. Haley CS, Knott SA, Elsen JM: Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics. 1994, 136: 1195-1207.
46. Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JHJ: A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011, 43: 12. 10.1186/1297-9686-43-12.
47. Huang L, Li Y, Singleton AB, Hardy JA, Abecasis G, Rosenberg NA, Scheet P: Genotype-imputation accuracy across worldwide human populations. Am J Hum Genet. 2009, 84: 235-250. 10.1016/j.ajhg.2009.01.013.

Acknowledgements

This project was supported by Agriculture and Food Research Initiative Competitive Grant no. 2010-65205-20342 from the USDA National Institute of Food and Agriculture, and by funding from the National Pork Board Grant no. 11–042. Partial funding was also provided by the US Pig Genome Coordinator. Computer resources were provided by the Michigan State University High Performance Computing Center (HPCC). JLGD and RJCC were funded by UBACyT 20020100100861 from Universidad de Buenos Aires (Argentina). We acknowledge Yvonne M. Badke (Ph.D. student) and Yijian Huang, Ph.D. (postdoctoral researcher) of Michigan State University for help with programming.

Author information

Authors and Affiliations

Department of Animal Science, Michigan State University, East Lansing, Michigan, USA
Jose L Gualdrón Duarte, Ronald O Bates, Catherine W Ernst, Nancy E Raney & Juan P Steibel
Departamento de Producción Animal, Facultad de Agronomía, UBA-CONICET, Buenos Aires, Argentina
Jose L Gualdrón Duarte & Rodolfo JC Cantet
Department of Fisheries and Wildlife, Michigan State University, East Lansing, Michigan, USA
Juan P Steibel

Corresponding author

Correspondence to Juan P Steibel.

Additional information

Authors’ contributions

JPS, RJCC, JLGD: performed and supervised statistical and simulation analyses and wrote the manuscript. ROB, CWE: designed the resource population and led collection of phenotypic data. CWE, NER: performed DNA extraction and coordinated genotyping with commercial laboratory. JPS, ROB, CWE: designed high density genotyping scheme. All authors read and approved the paper.

Rights and permissions

Open Access. This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Gualdrón Duarte, J.L., Bates, R.O., Ernst, C.W. et al. Genotype imputation accuracy in a F2 pig population using high density and low density SNP panels. BMC Genet 14, 38 (2013). https://doi.org/10.1186/1471-2156-14-38

Received: 01 December 2012
Accepted: 13 April 2013
Published: 08 May 2013
DOI: https://doi.org/10.1186/1471-2156-14-38
