Evaluation of approaches for identifying population informative markers from high density SNP Chips
BMC Genetics volume 12, Article number: 45 (2011)
Genetic markers can be used to identify and verify the origin of individuals. Motivation for the inference of ancestry ranges from conservation genetics to forensic analysis. High density assays featuring Single Nucleotide Polymorphism (SNP) markers can be exploited to create a reduced panel containing the most informative markers for these purposes. The objectives of this study were to evaluate methods of marker selection and determine the minimum number of markers from the BovineSNP50 BeadChip required to verify the origin of individuals in European cattle breeds. Delta, Wright's FST, Weir & Cockerham's FST and PCA methods for population differentiation were compared. The level of informativeness of each SNP was estimated from the breed specific allele frequencies. Individual assignment analysis was performed using the ranked informative markers. Stringency levels were applied by log-likelihood ratio to assess the confidence of the assignment test.
A 95% assignment success rate for the 384 individually genotyped animals was achieved with < 80, < 100, < 140 and < 200 SNP markers (with increasing stringency threshold levels) across all the examined methods for marker selection. No further gain in power of assignment was achieved by sampling in excess of 200 SNP markers. The marker selection method that required the lowest number of SNP markers to verify the animal's breed origin was Wright's FST (60 to 140 SNPs depending on the chosen degree of confidence). Certain breeds required fewer markers (< 100) to achieve 100% assignment success. In contrast, closely related breeds require more markers (~200) to achieve > 95% assignment success. The power of assignment success, and therefore the number of SNP markers required, is dependent on the levels of genetic heterogeneity and pool of samples considered.
While all SNP selection methods produced marker panels capable of breed identification, the power of assignment varied markedly among analysis methods. Thus, with effective exploration of available high density genetic markers, a diagnostic panel of highly informative markers can be produced.
The identification and verification of the origin of individuals is useful in a variety of biological contexts and the practical applications of individual assignment protocols are extensive [1–3]. Topical issues in population, conservation and evolutionary biology can benefit from the inference of ancestry of individuals. In an applied context, genetic identification can shed light on issues such as the contribution of source populations in mixed fisheries [3, 4], meat traceability or brand authentication, translocated or migrant individuals, structure and levels of discrimination amongst populations [7, 8], anthropological forensic investigations and tracking the trade routes of illegally poached animals.
Where there is sufficient genetic heterogeneity amongst populations, genetic markers can be used to identify and verify the origin of individuals. Customarily, hypervariable microsatellite loci have been the genetic markers used in individual assignment studies (e.g. [4, 5, 7]). However, with the advent of genome-wide analytical technologies, microsatellites are now being widely replaced by Single Nucleotide Polymorphism (SNP) markers. SNPs are increasingly favoured as population genetic markers because they are highly abundant and widespread in the genome, homoplasy is virtually absent, methods to discover markers are reliable and subsequent automated genotyping through assay design can be easily implemented [10, 11]. Numerous SNPs have been identified in the genomes of domestic animals, for example, in the dog (> 2.5 million), chicken (~ 2.8 million) and cattle (> 2 million). This has led to the technological development of standard products commonly termed 'SNP Chips', which enable the rapid automated large-scale production of genomic data. SNP Chips are now commercially available for many animal species (e.g., sheep and pigs), including the Illumina BovineSNP50 BeadChip (Illumina Inc., San Diego, CA) for cattle [17, 18].
These new resources are highly informative; the BovineSNP50 BeadChip has already been used in genetic studies investigating population genetic structure, mapping for marker assisted selection of economically important traits [20, 21] and unravelling the patterns of signatures of selection [19, 22].
Dense genome-wide data is valuable but is relatively costly to produce and time-consuming or computationally expensive to analyse; it is therefore often desirable to reduce the number of markers by screening and selecting according to their information content to create reduced panels for population genetic analyses [23, 24]. Several statistical selection methods are available to determine which genetic markers contain the most information to discriminate among populations. The statistic, delta, which measures allele frequency differences, is commonly used in the field of human genetics to assess marker information content [25, 26]. Bowcock et al. suggested that informative genetic markers may be identified using Wright's FST and its derivatives. Principal Component Analysis (PCA) has also been more recently proposed as an alternative method to determine population informative SNP markers. Other algorithms have been developed to optimize the combination of loci selected (e.g., BELS and references therein); however, these approaches are computationally intensive and their execution may be prohibitively slow with large datasets.
The objective of this study was to examine methods for selecting population informative SNP loci. To achieve this we set out to determine the minimum number of SNP markers from the Illumina BovineSNP50 BeadChip (Illumina Inc., San Diego, CA) that is required for individual genetic assignment to discriminate a set of European cattle breeds (Table 1). This was approached in a two-stage manner. First, several SNP selection methods were evaluated to determine the genetic information content of each SNP marker and markers were ranked by decreasing level of informativeness for each of the methods. Second, the likelihood of assigning individual genotypes to their known breed origin was estimated by cumulatively increasing the number of SNP markers, according to the ranked estimates of each SNP marker's informativeness for each selection method.
Comparison of the marker selection methods
Frequency histograms of the level of genetic information in the SNP markers are shown for each selection method (Figure 1). A predominantly left-skewed distribution was produced for each selection method, except delta, which produced a fairly symmetric distribution. The majority of the markers contained low to medium levels of genetic information and a small proportion had high levels of genetic information (Figure 1).
To assess the level of similarity of the estimates of genetic information contained in each SNP marker across the different selection methods, a Spearman's rank correlation was calculated between the different estimates from the selection methods. High levels of correlation were observed between delta, pairwise Wright's FST, pairwise W&C's FST and PCA (Table 2). Similarly, there was a substantial amount of overlap (> 200) in the top ranked 500 SNP markers between these four selection methods (Table 2). In contrast, the level of correlation was lower between global FST and the other selection methods (Table 2). There was far less overlap (< 200) in the top ranked 500 SNP markers between the global FST estimates and the other selection methods (Table 2).
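The two comparisons described above (rank correlation between methods and overlap among top-ranked markers) can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study; the function names and the five example scores are hypothetical, and Spearman's rho is computed as the Pearson correlation of average ranks.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def top_overlap(x, y, n_top):
    """Count markers shared by the n_top highest-scoring markers of x and y."""
    top_x = set(sorted(range(len(x)), key=lambda i: -x[i])[:n_top])
    top_y = set(sorted(range(len(y)), key=lambda i: -y[i])[:n_top])
    return len(top_x & top_y)


# Hypothetical informativeness scores for five SNPs from two methods
delta_scores = [0.10, 0.40, 0.35, 0.05, 0.22]
fst_scores = [0.02, 0.30, 0.33, 0.01, 0.09]
rho = spearman_rho(delta_scores, fst_scores)
shared_top2 = top_overlap(delta_scores, fst_scores, 2)
```

With real data, `n_top` would be 500 to mirror the overlap reported in Table 2.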
To further explore the conflicting results produced by global Wright's and W&C's FST, the observed breed allele frequencies for the top ranked 50 SNP markers for each selection method were displayed in a boxplot [Additional file 1: Supplemental Figure S1]. The boxplot is an effective visual representation of both the central tendency and dispersion of data. Delta, pairwise Wright's FST, pairwise W&C's FST and PCA selected SNP markers with median allele frequency between 0.2 and 0.8 and with large interquartile ranges indicating a high level of dispersion amongst the observed allele frequencies [Additional file 1: Supplemental Figure S1]. In comparison, the majority of the top-ranked SNP markers selected by global Wright's FST had median allele frequencies near 0 or 1 and low levels of dispersion. The global W&C's FST resulted in the selection of SNPs with a higher level of dispersion amongst the observed allele frequencies than global Wright's FST but, nonetheless, also included markers with many outliers and smaller interquartile ranges than the other selection methods. The global FST methods resulted in the selection of many SNP markers specific for the single most genetically distinct population.
Assignment precision: overall assessment
The accuracy of assignment of individual genotypes to known breed origin was evaluated by cumulatively adding 20 markers, in descending order of estimated marker informativeness for each selection method. No population genetic differentiation was detected between the American and British Angus populations (Table 1), consequently the two populations were pooled together and treated as a single breed in subsequent analyses.
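The cumulative marker addition described above can be sketched as below. This is a hedged illustration only: the function name, its parameters and the toy scores are hypothetical, and the default cap of 400 markers simply mirrors the largest panel size mentioned in the results.

```python
def cumulative_panels(scores, step=20, max_markers=400):
    """Return nested marker panels of size step, 2*step, ... in rank order.

    scores[i] is the informativeness of marker i; higher means more informative.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    limit = min(max_markers, len(ranked))
    return [ranked[:n] for n in range(step, limit + 1, step)]


# Toy example with four markers, adding two at a time
toy_scores = [0.1, 0.9, 0.5, 0.3]
panels = cumulative_panels(toy_scores, step=2, max_markers=4)
```

Each panel would then be passed to the assignment test, giving one assignment success rate per panel size.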
The success of assignment of the 384 individual genotypes to breed of origin at the four stringency level thresholds for four of the selection methods (delta, pairwise Wright's FST, pairwise W&C's FST and PCA) is presented in Figure 2. Strikingly, > 50% assignment success is achieved for all selection methods at stringency level LLR > 0 using just the first 20 SNP markers. Overall, pairwise Wright's FST required the smallest number of SNP markers to reach 90%, 95% and 98% correct assignment at the four stringency threshold levels (Table 3). Of the four selection methods, PCA was the poorest performer, requiring > 190 SNP markers to attain 95% assignment success (Figure 2; Table 3). The power of assignment using PCA as a selection method decreased considerably across all the stringency thresholds when a 98% assignment success was imposed (Figure 2; Table 3).
Full results are not shown for assignment precision using ranked SNP markers for global FST because they performed comparatively poorly. For global Wright's FST, 90% assignment success was obtained with 230 and 380 SNP markers at the stringency levels of LLR > 0 and LLR > 3, respectively. Using up to 400 markers, 95% assignment success was not achieved at any stringency level. For global W&C's FST, 90% assignment success was obtained with 80 and 230 SNP markers at the stringency levels of LLR > 0 and LLR > 3, respectively. The global W&C's FST had greater assignment accuracy over global Wright's FST, but still performed worse than the other four selection methods (Table 3).
Randomly chosen SNP sets performed worse than ranked informative SNP markers in individual assignment analysis (Figure 2). Neither an asymptote nor 95% assignment success was reached using up to 400 markers (average across 20 sets of randomly chosen SNPs at LLR > 3).
Individual assignment analysis using a training set and a holdout set was performed in order to evaluate the power of assignment for samples not included in the reference population. This cross-validation analysis reported slightly worse power of assignment than the main analysis [Additional file 1: Supplemental Figure S2]. The assignment power for breeds with large sample sizes N > 50 was comparable to the results of the main analysis (results not shown). However, certain breeds with a low sample size had worse assignment power in the cross-validation analysis. For example, poor assignment power was observed in Red Angus and Norwegian Red, two breeds of low sample size and for which closely related breeds were included in the dataset (Angus and Finnish Ayrshire, respectively) (results not shown).
Assignment precision: individual breeds
The SNP selection methods differed for power of assignment in individual breeds, but no one method consistently outperformed any other in all breeds (Table 4). No substantial further gain in power of assignment in individual breeds was observed beyond ~ 200 SNP markers. Certain breeds required relatively few SNP markers to attain > 95% assignment success (Table 4). For example, the Jersey breed required < 50 SNPs to achieve 100% individual assignment; even when strict stringency levels were applied. In contrast, the Charolais breed required ~100 SNP markers to achieve > 95% individual assignment and power was severely compromised with increasing stringency level.
There was a significant positive correlation between the percentage of correctly assigned individuals and a breed's average level of genetic differentiation (Figure 3; Spearman's rank correlation, rho = 0.635, p = 0.0082).
Type I (false positives) and II errors (false negatives) that occurred in the individual assignment analysis, using pairwise Wright's FST at the lowest stringency threshold level (LLR > 0) were calculated [Additional file 1: Supplemental Table S1]. Using 50 SNP markers, 5 breeds were assigned with 100% assignment success, and the remaining breeds had type I errors of < 15%. The type I error rate was highest for Angus (14.6%), followed closely by Red Angus (13.3%), whereby if an individual was not assigned to its correct origin it was assigned to the other breed. Using 50 SNP markers, eight breeds had no individuals assigned from other breeds, and the remaining breeds displayed a type II error of < 17% (except for the Red Angus breed, where 35% of the assigned individuals were Angus; and this may have been inflated by the relatively low sample size of Red Angus breed (15), compared to Angus (41)). The type I and II error rates decreased to < 5% by 150 SNP markers.
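The per-breed error rates above can be illustrated with a short sketch. It rests on an interpretation of the text that is an assumption here: the type I rate of a breed is taken as the proportion of its individuals assigned to another breed, and the type II rate as the proportion of individuals assigned to the breed that actually originate from another breed. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def error_rates(true_breeds, assigned_breeds):
    """Return {breed: (type_I_rate, type_II_rate)} under the assumed definitions."""
    n_true = Counter(true_breeds)
    n_assigned = Counter(assigned_breeds)
    wrong_from = Counter()  # individuals of a breed assigned elsewhere
    wrong_into = Counter()  # individuals assigned to a breed but from elsewhere
    for truth, assigned in zip(true_breeds, assigned_breeds):
        if truth != assigned:
            wrong_from[truth] += 1
            wrong_into[assigned] += 1
    rates = {}
    for breed in n_true:
        type_one = wrong_from[breed] / n_true[breed]
        # Guard against breeds to which nothing was assigned
        type_two = wrong_into[breed] / n_assigned[breed] if n_assigned[breed] else 0.0
        rates[breed] = (type_one, type_two)
    return rates


# Toy example: one of four "A" animals is assigned to "B"
truth = ["A", "A", "A", "A", "B", "B"]
assigned = ["A", "A", "A", "B", "B", "B"]
rates = error_rates(truth, assigned)
```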
The SNP markers on the BovineSNP50 BeadChip were discovered through various breed sources. The majority of the markers were discovered from Angus, Holstein and Hereford breeds (others included Charolais, Limousin, Red Angus, Simmental, Jersey and Norwegian Red, but fewer SNPs were found through these breeds). The inclusion of few representative sources could influence the level of SNP informativeness and individual assignment power, such that breeds used in the discovery process show higher SNP variability. Although Jersey was one of the breeds used for SNP discovery, it had the lowest average minor allele frequency (MAF) (Table 5). MAF values for Angus, Hereford and Holstein were relatively high but lower than for Charolais and Simmental. The power of assignment at a breed level revealed that the breeds represented during the SNP discovery process were not amongst those (except for Jersey) that required comparatively fewer markers to achieve 100% assignment success (Table 4).
The top 500 SNP markers ranked by decreasing informativeness were listed with their corresponding SNP discovery method (7 in total) [Additional file 2: Supplemental Table S2]. A χ²-test revealed that the proportions of SNP discovery methods represented in the pairwise Wright's FST 500 top SNP markers [Additional file 2: Supplemental Table S2] were not significantly different from those of the overall BovineSNP50 set (χ², df = 36, NS).
The principal goal of this study was to evaluate marker selection methods and determine the minimum number of SNP markers from the BovineSNP50 BeadChip required to effectively and confidently assign individual genotypes to European cattle breeds. While all SNP selection methods yielded reduced marker panels capable of breed identification, the power of assignment varied markedly among analysis methods.
Behaviour of the marker selection methods
The pairwise Wright's FST selection method marginally outperformed other selection methods in the individual assignment analysis (Table 3, Figure 2). Nonetheless, three other selection methods, delta, pairwise W&C's FST and PCA, did not perform poorly at ranking markers or for assignment success rates. Across these selection methods, to achieve 95% assignment success, < 80, < 100, < 140 and < 200 SNP markers were required at the stringency threshold levels of LLR > 0, LLR > 1, LLR > 2 and LLR > 3, respectively (Table 3, Figure 2). These four selection methods (delta, pairwise Wright's FST, pairwise W&C's FST and PCA) to a large extent agreed on the most informative SNP markers. The resulting estimates of genetic informativeness of each SNP marker were highly correlated across the four selection methods and there was a large degree of overlap among the top-ranked 500 SNP markers (Table 2). This was to be expected because all methods were applied to individual SNP marker allele frequencies. In addition, it has been demonstrated that delta and Wright's FST function similarly. However, PCA exhibited the poorest correlation with the other methods and lowest overall individual assignment power. Paschou et al. advocated using PCA to determine marker informativeness because PCA renders an overall estimate for a SNP marker, as compared with other selection methods where it is necessary to estimate an average from pairwise calculations when the number of populations (K) > 2. PCA is an approach used to characterise the structure of a set of variables (in this case SNPs). The inferred relationships between objects (e.g., populations/breeds) are determined by the structure of the covariance matrix between the marker allele frequencies. Thus, the informativeness of a given marker will depend on the other markers included in the analysis and this could influence the informative markers that PCA identified. In contrast, delta and FST do not take into account the relationships amongst markers and the level of information of each marker is estimated independently of the others.
The remaining two selection methods, global Wright's and W&C's FST, performed comparatively poorly in the individual assignment test. As similarly observed by Kersbergen et al., global FST may not be appropriate to assess the level of genetic information in SNP markers when K > 2, as the method could result in the selection of SNP markers which are specific in distinct populations [Additional file 1: Supplemental Figure S1]. The selected SNP markers that were specific for only the most distinct breed were not segregating in the majority of the other breeds [Additional file 1: Supplemental Figure S1], and thus the expected heterozygosity would be low. Indeed, it is suggested that genetic markers with high expected heterozygosity are informative and therefore useful in individual assignment analysis [15, 33], such as those identified using pairwise Wright's FST, delta, pairwise W&C's FST and PCA. As a result the performance of individual assignment tests using global FST selected markers may be compromised compared to the other selection methods. Consequently, when K > 2 it is preferable to estimate FST, either Wright's or W&C's, on a population pairwise basis and then estimate the average across the pairwise comparisons to obtain an overall estimate for a marker.
Assignment precision: minimum number of markers required
Since pairwise Wright's FST outperformed the other selection methods (Table 3) this selection method was subsequently adopted to estimate the minimum number of SNP markers required to achieve the desired assignment success. At the most commonly used stringency threshold (LLR > 0) and the accepted level of appropriate assignment success (95%), < 60 SNP markers were required for the correct assignment of the 384 individual genotypes. When stricter stringency threshold levels are applied, the number of SNP markers required to attain 95% assignment success increased (Table 3). Depending on the chosen degree of confidence, the required number of markers ranges from 60 to 140 SNPs (80, 105 and 140 at LLR > 1, LLR > 2 and LLR > 3, respectively). While the percentage of assignment success decreases with increasing stringency thresholds, so too does the risk of false assignment. Consequently, there is greater confidence in the estimated genotype likelihoods and LLR calculations if a strict stringency threshold (LLR > 3) is adopted.
It is difficult to compare the results obtained here to other studies conducted on individual assignment analysis in cattle breeds. First, most previous studies used microsatellite markers and, second, these studies had only a limited number of markers (e.g., [5, 8]). These studies also primarily focused on the practicality of assigning individuals among cattle breeds with the available markers and were not concerned with how many markers would be required to achieve confident assignment of individual genotypes. In a study of French cattle breeds, Maudet et al. found that using 23 microsatellite loci > 93% of individuals could be assigned to their breed origin. A more recent study used SNP markers but did not have a large dataset at their disposal and could, again, only address the practicality of individual assignment with the limited set of available markers. Using 90 SNP markers genotyped in 24 European cattle breeds they were able to correctly assign 85% of individuals to their breed origin. McKay et al. used STRUCTURE to assess the number of loci required to estimate the number of ancestral populations in 6 Bos taurus breeds. The use of 150 randomly chosen loci (from a dataset of 2,641 loci) yielded the correct number of clusters in only 40% of cases, consistent with reduced assignment power for randomly-selected markers found in the current study (Figure 2). The lower assignment power in those studies was most probably a direct consequence of using an insufficient number of informative loci. The comparatively high assignment power of fewer SNP markers in the current study was probably due to the availability of > 40,000 SNP markers and the benefit of selecting markers that contain the most genetic information with respect to the reference populations. Only a few highly polymorphic microsatellite loci are required in individual assignment studies. However, dense SNP panels are now available for many species and SNP markers possess numerous advantages, including cost, throughput and reliability, making them a favourable choice over microsatellites.
Assignment success: individual breeds
It is evident that certain breeds in this study require far fewer markers to achieve > 95% assignment success than others, regardless of the selection method used (Table 4, Figure 3). For example, the Jersey, Brown Swiss, Guernsey and Piedmontese breeds achieved 100% assignment success, even at stricter stringency thresholds using 50 SNP markers (pairwise Wright's FST, LLR > 2, Table 4). In contrast, the French breeds like the Charolais, Limousin and Simmental achieved ~ 90% assignment success at LLR > 0, which fell to < 50% with increasing stringency threshold using 50 SNP markers (Table 4). Similarly, the breeds that exhibited a lower power of assignment success (Table 4) also had higher type I and II error rates (Table S1).
A problem associated with the use of SNP markers in population genetics is ascertainment bias, which could influence population genetic estimates and may contribute to differences in assignment performance for individual breeds. Heterogeneity amongst sample representatives can introduce ascertainment bias and breeds not included in the SNP discovery process could have lower minor allele frequencies (MAF) [15, 36]. The average MAF was lowest in the Brown Swiss, Guernsey and Jersey breeds (Table 5), only one of which (Jersey) was represented in the SNP discovery process, and the three breeds that were central to the process (Angus, Hereford, Holstein) did not have the highest average MAF values. In addition, no one particular SNP discovery method was over-represented in the top identified SNP markers [Additional file 2: Supplemental Table S2] as the discovery method proportions were similar to those represented on the BovineSNP50 assay. SNP ascertainment bias would have been more pronounced if B. t. indicus breeds had been included in this study. Morin et al. concluded that ascertainment bias may be an issue in the assessment of population size and demographic changes. It is least important for individual identification and assignment tests, where the intentional selection of informative markers provides greater power than do randomly chosen markers.
A factor that could affect the power of assignment success and variation in power of assignment between breeds is the level of pairwise genetic differentiation amongst the breeds. It is known that the number of markers required to obtain a high accuracy of assignment is influenced by the level of population genetic differentiation [8, 37]. That is, it depends closely on the populations under consideration and respective levels of genetic heterogeneity. As demonstrated in Figure 3, the level of genetic differentiation of a breed, measured by FST, is correlated with power of assignment success. Low breed genetic differentiation was observed in Charolais and Simmental, which similarly showed higher rates of Type I and II errors (Figure 3, [Additional file 1: Supplemental Table S1]). False positive assignments also occurred between breeds of known recent ancestry, for example, Angus and Red Angus, and Finnish Ayrshire and Norwegian Red. In addition, cases of mistaken assignment occurred between Charolais, Simmental, Limousin and Shorthorn, where the pairwise FST values amongst these breeds were < 0.1. In a study on individual assignment using microsatellites, Ciampolini et al. reported that of the four breeds under consideration, Charolais and Limousin had the lowest level of pairwise genetic differentiation and were the most difficult to discriminate between (FST = 0.041). As assignment success is a function of both the number of markers and population genetic differentiation, the level of breed genetic differentiation is indicative of the potential number of SNP markers necessary to attain high levels of power in individual assignment tests [6, 37].
Informative marker panels in population genetics
Evaluation of the selection methods revealed that only a small proportion of the markers from the BovineSNP50 BeadChip were highly informative for discriminating among 17 breeds, and the majority contained medium to low levels of genetic information (Figure 1). This is consistent with the development of the assay in which SNPs with high MAF across B. t. taurus breeds were preferentially selected in the assay design. Consequently, sets of randomly chosen SNP markers contained sufficient genetic information to produce moderate levels of individual assignment power (Figure 2). In contrast, a substantially reduced set of highly informative SNP markers was capable of precisely discriminating amongst the European cattle breeds (Figure 2).
Studies have shown that a reduced set of selected informative markers can effectively capture the genetic structure of human populations [23, 24]. For instance, Lao et al. found that 10 SNP markers from a 10K SNP array contained enough genetic information to differentiate individuals from Africa, Europe, Asia and America and additional loci contributed very little extra information. Indeed, it is generally considered that uninformative markers (i.e., monomorphic loci) may add noise to the results and compromise power of population genetic studies [38, 39]. It could be useful to create a minimum panel of maximum power, particularly when using Bayesian genotypic clustering software such as STRUCTURE to elucidate population structure, because these approaches are computationally demanding (which intensifies as the number of markers increases). Consequently, it is practical and cost-effective to apply a selection method to dense assays to isolate the highly diagnostic markers and increase the power of analysis.
The number of markers required for population assignment will depend on the species, the populations under consideration, their respective level of genetic differentiation and the desired stringency of assignment. For instance, within dogs 27% of the genetic variation is found between breeds, whereas for humans the level between populations is only 5%-10%. As a result, the number of SNP markers required for individual assignment and discrimination amongst populations (breeds) will differ between species under consideration.
Although the marker selection methods explored in this study agreed to a large extent on which SNPs were the most informative, there were significant differences in the power of assignment produced by the resulting ranked SNP panels, with pairwise Wright's FST outperforming all other approaches. These results illustrate that with effective exploration it is possible to identify the most informative markers and produce an optimal minimum set of markers that can differentiate among populations.
Allele frequencies from 17 cattle breeds representing the 'reference' populations and a total of 384 individual genotypes of known breed origin, sampled from the reference populations, were available (Table 1). Information on the sampling of the reference populations is given in Table 1. Decker et al. selected 40,843 SNPs from the BovineSNP50 BeadChip after a strict quality screening where "Loci selected for analysis were all located on autosomes, had a call rate of at least 80% in 36 (75%) B. t. taurus breeds, and were not monomorphic in all breeds...". Since only B. t. taurus breeds were used in the current study, the set of SNP markers selected by Decker et al. was adopted. Detailed information on the genotyping procedure can be found in Decker et al.
Selection methods to determine the most informative markers
The breed-specific allele frequencies for the 40,843 SNPs were used to estimate the genetic information contained in each SNP marker using the following selection methods: delta, Wright's FST, Weir and Cockerham's FST and PCA. The larger the estimated value, the more informative the marker is at genetically discriminating the sampled populations. All analyses were conducted in the R statistical environment.
One of the most commonly used measures of marker informativeness is delta. For a biallelic marker the delta value is given by $\delta = |p_{Ai} - p_{Aj}|$, where $p_{Ai}$ and $p_{Aj}$ are the frequencies of allele A in the $i$th and $j$th populations, respectively. Delta can only be estimated between pairs of populations (K = 2). Since K = 17 in this study, values were averaged across all pairwise comparisons to produce an estimated value for each SNP marker.
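The averaging of delta over all population pairs can be sketched as follows. This is a minimal Python illustration, not the authors' R code; the function name and example frequencies are hypothetical.

```python
from itertools import combinations


def mean_pairwise_delta(freqs):
    """Average |p_i - p_j| over all population pairs for one SNP.

    freqs holds the frequency of allele A in each population.
    """
    pairs = list(combinations(freqs, 2))
    return sum(abs(p_i - p_j) for p_i, p_j in pairs) / len(pairs)


# Toy example: allele A frequencies in three populations
example_delta = mean_pairwise_delta([0.1, 0.5, 0.9])
```

For K = 17 breeds there are 17 × 16 / 2 = 136 pairwise comparisons per SNP.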
Wright introduced F-statistics to describe the proportion of genetic diversity within and among populations. Wright's FST statistic has been extended by several authors and a preferable statistic based on the analysis of variance of allele frequencies is Weir and Cockerham's (W&C) FST. For both methods unbiased estimates of FST were first calculated over all populations (global FST) and on a pairwise basis (pairwise FST), with the latter values being averaged over all pairs to produce an estimated information content value for each SNP marker.
Wright's FST was estimated as $F_{ST} = \mathrm{var}(p_A) / [\bar{p}_A (1 - \bar{p}_A)]$, where $\mathrm{var}(p_A)$ is the variance of the allele frequency among breeds and $\bar{p}_A$ is the mean allele frequency across the breeds.
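A minimal sketch of the global and averaged pairwise versions of Wright's FST for a single SNP is given below, assuming the standard form FST = var(p) / [p̄(1 − p̄)]. Two choices here are assumptions rather than details from the study: the variance is the population variance (divisor K), and a marker that is monomorphic across the compared breeds is given a value of 0. The function names are hypothetical.

```python
from itertools import combinations


def wright_fst(freqs):
    """Wright's FST for one SNP from allele A frequencies in K populations."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    denom = p_bar * (1 - p_bar)
    if denom == 0:
        return 0.0  # assumption: no differentiation if the marker is fixed
    var_p = sum((p - p_bar) ** 2 for p in freqs) / k  # assumed divisor K
    return var_p / denom


def mean_pairwise_fst(freqs):
    """Average of two-population FST over all population pairs."""
    pairs = list(combinations(freqs, 2))
    return sum(wright_fst(pair) for pair in pairs) / len(pairs)


global_value = wright_fst([0.0, 1.0, 0.5])
pairwise_value = mean_pairwise_fst([0.0, 1.0, 0.5])
```

Calling `wright_fst` on all breeds at once gives the global estimate, while `mean_pairwise_fst` gives the averaged pairwise estimate used to rank markers.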
Unbiased estimates of W&C's FST were estimated as functions of variance components as detailed in Akey et al. Estimated FST can be negative if alleles drawn at random from within a population are less similar to one another than those drawn from different populations (FST < 0) [43, 44]. In this study negative FST estimates were retained as negative.
Principal Component Analysis (PCA)
PCA is a statistical technique that can be used to reduce the dimension of a multivariate dataset. The original variables are linearly transformed by PCA into a set of underlying variables ("principal components") ranked in terms of their variance, such that most of the original variability may be contained in a smaller number of variables. Each new variable has an associated eigenvalue that measures the respective amount of explained variance. The coefficients ("loadings") used in the linear transformation of the original variables into new variables determine the proportion of variance that each original variable contributes to a given principal component.
PCA was performed following Paschou et al., but on the breed-specific allele frequency matrix rather than the individual genotypes. To determine which principal components were significant, 100 random matrices were created by sampling allele frequencies with replacement within each SNP marker across all breeds. The first eight principal components for the actual data contained more information than the randomly generated components (i.e., their eigenvalues were greater) and therefore the first eight principal components were used to calculate marker informativeness. The loadings for each SNP marker were squared and summed over the eight significant principal components to produce an estimate of informativeness.
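In symbols, and assuming $v_{sc}$ denotes the loading of SNP marker $s$ on principal component $c$ (notation introduced here for illustration only), the informativeness score described above can be written as

$$I_s = \sum_{c=1}^{8} v_{sc}^{2}$$

so that a marker scores highly when it loads strongly on any of the eight retained components.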
Individual Assignment Analysis
Several genetic assignment approaches are available [6, 7, 37]. The Bayesian implementation developed by Rannala and Mountain has been found to be more effective at individual assignment than other methods. However, the method of Paetkau et al. is equally effective at individual assignment when the levels of genetic differentiation between reference populations are high. Comparison of the two methods for a subset of cattle breeds from the current study revealed similar performance levels (results not shown). Consequently, the method of Paetkau et al. was employed as it is easier to implement than that of Rannala and Mountain and is most frequently employed in empirical studies.
Allele frequencies of zero were replaced by a value of 1 × 10^-5 because log(0) is not defined. Likewise, if an observed allele frequency was 1, it was replaced by a value of 0.99999.
Genotype likelihoods were calculated for each individual in each reference population based on the observed allele frequencies for each marker. Let $p_{ijk}$ denote the frequency of the kth allele (k = 1, 2) at the jth locus (j = 1 .. J) in the ith population (i = 1 .. I). Let $g_{jkk'}$ denote an individual's diploid genotype at the jth locus, and let the Mendelian transmission probability of $g_{jkk'}$ arising in the ith population be $T(g_{jkk'} \mid i)$:

$$T(g_{jkk'} \mid i) = \begin{cases} p_{ijk}^{2}, & k = k' \\ 2\,p_{ijk}\,p_{ijk'}, & k \neq k' \end{cases}$$

where a genotype is homozygous if k = k' and heterozygous otherwise, under the assumption of random union of gametes. Next, let g denote an individual's multilocus genotype. The likelihood of an individual's diploid genotype at each locus occurring in a particular population was estimated as above, as the square of the observed allele frequency for homozygotes or twice the product of the two allele frequencies for heterozygotes. Under the assumption of independence between the J loci, the multilocus likelihood $T(g \mid i)$ is

$$T(g \mid i) = \prod_{j=1}^{J} T(g_{jkk'} \mid i)$$
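The likelihood calculation can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: genotypes are coded as the number of copies of allele A (0, 1 or 2), logarithms are taken to base 10, and the function names are hypothetical. Working on the log scale turns the product over loci into a sum and avoids numerical underflow when many markers are used.

```python
import math

EPS = 1e-5  # replacement value for frequencies of exactly 0 (and 1 - EPS for exactly 1)


def adjust(p):
    """Replace frequencies of exactly 0 or 1 so that log(0) never occurs."""
    if p == 0.0:
        return EPS
    if p == 1.0:
        return 1.0 - EPS
    return p


def log10_likelihood(genotype, pop_freqs):
    """log10 of the multilocus genotype likelihood in one population.

    genotype:  copies of allele A (0, 1 or 2) at each locus (assumed coding).
    pop_freqs: frequency of allele A at each locus in the population.
    """
    total = 0.0
    for copies, p in zip(genotype, pop_freqs):
        p = adjust(p)
        q = 1.0 - p
        if copies == 2:
            prob = p * p        # homozygous for allele A
        elif copies == 1:
            prob = 2.0 * p * q  # heterozygous
        elif copies == 0:
            prob = q * q        # homozygous for the other allele
        else:
            raise ValueError("genotype must be coded 0, 1 or 2")
        total += math.log10(prob)  # independence between loci: sum of logs
    return total


# Toy genotype at three loci and allele frequencies for one breed (illustrative only).
print(log10_likelihood([2, 1, 0], [0.9, 0.5, 0.0]))
```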
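To assess the performance of the assignment procedure, log-likelihood ratios (LLR) were calculated by comparing the likelihood of an individual being assigned to its population of origin and the likelihood of it being assigned to another population:

$$\mathrm{LLR} = \log_{10} \frac{T(g \mid i_{o})}{\max_{i \neq i_{o}} T(g \mid i)}$$

where $i_{o}$ denotes the individual's population of origin and the maximum is taken over all other reference populations.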
Different stringency thresholds can be applied as confidence levels of assignment precision. Four stringency levels are commonly used: LLR > 0, LLR > 1, LLR > 2 and LLR > 3 [4, 25, 26, 34]. The LLR > 1, LLR > 2 and LLR > 3 levels mean that a multilocus genotype has to be 10, 100 or 1000 times more likely, respectively, in one population than in any other. The LLR > 0 level requires only that the genotype be more likely in one population than in any other. The correct assignment of an individual genotype to its known origin occurred when the calculated LLR was greater than the selected stringency level. If the LLR was lower than the selected stringency level, the individual genotype failed to be assigned to its origin and was instead assigned to the reference population that yielded the highest overall log-likelihood.
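The stringency rule can be illustrated with a short sketch. It assumes that the LLR is the base-10 log-likelihood in the population of origin minus the highest base-10 log-likelihood among the other reference populations, which is consistent with the 10-, 100- and 1000-fold interpretation above; the function name and the toy values are hypothetical.

```python
def assign(log10_liks, origin, threshold=0.0):
    """Apply an LLR stringency threshold to one individual.

    log10_liks: dict mapping population name -> log10 genotype likelihood.
    origin:     the individual's known population of origin.
    Returns (llr, assigned_population, correctly_assigned).
    """
    best_other = max(v for pop, v in log10_liks.items() if pop != origin)
    llr = log10_liks[origin] - best_other
    if llr > threshold:
        return llr, origin, True
    # Below the threshold the individual is not counted as correctly assigned;
    # it goes to the population with the highest overall log-likelihood.
    best_pop = max(log10_liks, key=log10_liks.get)
    return llr, best_pop, False


# Toy log-likelihoods for three breeds (illustrative values, not study data).
liks = {"A": -10.0, "B": -12.5, "C": -15.0}
for level in (0, 1, 2, 3):
    print(level, assign(liks, "A", level))
```

In this toy case the genotype is about 10^2.5 times more likely in its breed of origin than in the next best breed, so it passes the LLR > 0, LLR > 1 and LLR > 2 levels but fails LLR > 3.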
To obtain an estimate of the number of SNP markers required to achieve 90%, 95% and 98% correct assignment success of the 384 individual genotypes for each of the selection methods, at each of the 4 threshold levels, a non-linear regression model was fitted to the curves of correct assignment percentage against cumulative markers. An asymptotic regression model ($y = a + b\,e^{cx}$, where parameter a represents the value of the asymptote, parameter b represents the difference between the value of y when x = 0 and the upper asymptote and parameter c represents the natural logarithm of the rate of exponential increase) was found to best fit the data. When a > 0, b < 0 and c < 0 the model represents the law of diminishing returns in which the rate of increase of y declines with successive equal increments of x.
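Once such a curve has been fitted, the number of markers needed for a target success rate follows by inverting the model: x = ln((y - a) / b) / c. The sketch below takes the model literally as y = a + b·exp(c·x); the parameter values are invented for illustration and are not fitted estimates from the study.

```python
import math


def asymptotic(x, a, b, c):
    """Asymptotic regression curve y = a + b * exp(c * x)."""
    return a + b * math.exp(c * x)


def markers_needed(target, a, b, c):
    """Solve a + b * exp(c * x) = target for x (diminishing-returns case).

    Returns None when the target lies at or above the asymptote a,
    i.e. when it cannot be reached with any number of markers.
    """
    if b >= 0 or c >= 0:
        raise ValueError("expects b < 0 and c < 0")
    if target >= a:
        return None
    ratio = (target - a) / b  # positive because both terms are negative
    if ratio >= 1.0:
        return 0.0  # target already met at x = 0
    return math.log(ratio) / c


# Hypothetical parameters: asymptote at 99% success, starting from 0% at x = 0.
a, b, c = 99.0, -99.0, -0.05
for target in (90.0, 95.0, 98.0):
    x = markers_needed(target, a, b, c)
    print(target, math.ceil(x))
```

Rounding up with `math.ceil` gives a whole number of markers; a target above the fitted asymptote has no solution, which corresponds to a plateau in assignment success.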
To test whether the level of genetic differentiation of a breed corresponded to the power of assignment, a Spearman's rank correlation was calculated between the percentage of correctly assigned individuals for the 20 top ranked SNP markers for each breed (selection method = pairwise Wright's FST, LLR > 0) and the average FST for each breed (pairwise Wright's FST values across all breeds, based on 40,843 SNP markers, averaged to provide an estimate for each breed).
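Spearman's rank correlation is the Pearson correlation of the ranks of the two variables. A self-contained sketch is shown below, with tied values given their average rank; the per-breed numbers are made up for illustration.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical per-breed values: mean pairwise FST and % correctly assigned.
mean_fst = [0.08, 0.12, 0.15, 0.21, 0.30]
pct_correct = [55.0, 70.0, 68.0, 90.0, 100.0]
print(spearman_rho(mean_fst, pct_correct))
```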
A negative control to individual assignment analysis was applied by analysing 20 sets of 400 randomly selected SNPs. The average individual assignment success was estimated across the 20 random SNP sets at the stringency level LLR > 3.
In order to evaluate the power of assignment for samples of unknown origin, the individual assignment analysis was evaluated by cross-validation whereby a training sample was used to identify the informative loci and a holdout sample from each of the breeds was used to test the power of the resulting panel and the reference training sample. For breeds with a reference sample size > 50 (Table 1) the holdout sample comprised all the individuals to be assigned (those in column n); these were removed from their respective reference breed and allele frequencies of the reference breeds were re-estimated. For breeds with a reference sample size < 50 (Table 1) five random individual genotypes of the individuals assigned in the main analysis (those in column n) were designated as the holdout sample; these were removed from their respective reference breed and allele frequencies were re-estimated. The individual assignment analysis was repeated with the new training samples and the hold-out samples.
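The hold-out step can be sketched as follows. This illustration assumes that genotypes of the reference animals are available as allele-A counts so that frequencies can be re-estimated after removal; the function names are hypothetical, and a reference sample size of exactly 50 (not covered by the description above) is treated here as a small breed.

```python
import random


def choose_holdout(assigned_indices, reference_size, rng, small_n=5):
    """Pick the hold-out animals for one breed.

    Breeds with reference size > 50: all assigned animals are held out.
    Otherwise: small_n assigned animals are drawn at random.
    """
    assigned = list(assigned_indices)
    if reference_size > 50:
        return assigned
    return rng.sample(assigned, min(small_n, len(assigned)))


def reestimate_frequency(ref_genotypes, holdout_indices):
    """Allele-A frequency at one SNP after removing the hold-out animals.

    ref_genotypes: copies of allele A (0, 1 or 2) for each reference animal.
    """
    holdout = set(holdout_indices)
    training = [g for i, g in enumerate(ref_genotypes) if i not in holdout]
    if not training:
        raise ValueError("no training animals left for this breed")
    return sum(training) / (2.0 * len(training))


# Toy breed with eight reference animals, six of which were in the assignment set.
rng = random.Random(1)
genotypes = [2, 2, 1, 0, 1, 2, 0, 1]
holdout = choose_holdout(range(6), reference_size=len(genotypes), rng=rng)
print(sorted(holdout), reestimate_frequency(genotypes, holdout))
```

Removing the hold-out animals before re-estimating frequencies keeps the test genotypes from contributing to the reference they are assigned against.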
References
Waser PM, Strobeck C: Genetic signatures of interpopulation dispersal. Trends Ecol Evol. 1998, 13 (2): 43-44. 10.1016/S0169-5347(97)01255-X.
Davies N, Villablanca FX, Roderick GK: Determining the source of individuals: multilocus genotyping in nonequilibrium population genetics. Trends Ecol Evol. 1999, 14 (1): 17-21. 10.1016/S0169-5347(98)01530-4.
Manel S, Gaggiotti OE, Waples RS: Assignment methods: matching biological questions with appropriate techniques. Trends Ecol Evol. 2005, 20 (3): 136-142. 10.1016/j.tree.2004.12.004.
Roques S, Duchesne P, Bernatchez L: Potential of microsatellites for individual assignment: the North Atlantic redfish (genus Sebastes) species complex as a case study. Mol Ecol. 1999, 8 (10): 1703-1717. 10.1046/j.1365-294x.1999.00759.x.
Ciampolini R, Cetica V, Ciani E, Mazzanti E, Fosella X, Marroni F, Biagetti M, Sebastiani C, Papa P, Filippini G: Statistical analysis of individual assignment tests among four cattle breeds using fifteen STR loci. Journal of Animal Science. 2006, 84 (1): 11-19.
Rannala B, Mountain JL: Detecting immigration by using multilocus genotypes. Proceedings of the National Academy of Sciences of the United States of America. 1997, 94 (17): 9197-9201. 10.1073/pnas.94.17.9197.
Paetkau D, Calvert W, Stirling I, Strobeck C: Microsatellite analysis of population structure in Canadian polar bears. Mol Ecol. 1995, 4 (3): 347-354. 10.1111/j.1365-294X.1995.tb00227.x.
Maudet C, Luikart G, Taberlet P: Genetic diversity and assignment tests among seven French cattle breeds based on microsatellite DNA analysis. Journal of Animal Science. 2002, 80 (4): 942-950.
Negrini R, Nicoloso L, Crepaldi P, Milanesi E, Colli L, Chegdani F, Pariset L, Dunner S, Leveziel H, Williams JL: Assessing SNP markers for assigning individuals to cattle populations. Anim Genet. 2009, 40 (1): 18-26. 10.1111/j.1365-2052.2008.01800.x.
Morin PA, Luikart G, Wayne RK, Grp SNPW: SNPs in ecology, evolution and conservation. Trends Ecol Evol. 2004, 19 (4): 208-216. 10.1016/j.tree.2004.01.009.
Kim S, Misra A: SNP genotyping: Technologies and biomedical applications. Annu Rev Biomed Eng. 2007, 9: 289-320. 10.1146/annurev.bioeng.9.060906.152037.
Lindblad-Toh K, Wade CM, Mikkelsen TS, Karlsson EK, Jaffe DB, Kamal M, Clamp M, Chang JL, Kulbokas EJ, Zody MC: Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature. 2005, 438 (7069): 803-819. 10.1038/nature04338.
Wong GKS, Liu B, Wang J, Zhang Y, Yang X, Zhang ZJ, Meng QS, Zhou J, Li DW, Zhang JJ: A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature. 2004, 432 (7018): 717-722. 10.1038/nature03156.
Eck SH, Benet-Pages A, Flisikowski K, Meitinger T, Fries R, Strom TM: Whole genome sequencing of a single Bos taurus animal for single nucleotide polymorphism discovery. Genome Biology. 2009, 10 (8): 8-
Kijas JW, Townley D, Dalrymple BP, Heaton MP, Maddox JF, McGrath A, Wilson P, Ingersoll RG, McCulloch R, McWilliam S: A Genome Wide Survey of SNP Variation Reveals the Genetic Structure of Sheep Breeds. PLoS One. 2009, 4 (3): 13-
Ramos AM, Crooijmans RPMA, Affara NA, Amaral AJ, Archibald AL, Beever JE, Bendixen C, Churcher C, Clark R, Dehais P: Design of a High Density SNP Genotyping Assay in the Pig Using SNPs Identified and Characterized by Next Generation Sequencing Technology. PLoS One. 2009, 4 (8): Article No.: e6524
Van Tassell CP, Smith TPL, Matukumalli LK, Taylor JF, Schnabel RD, Lawley CT, Haudenschild CD, Moore SS, Warren WC, Sonstegard TS: SNP discovery and allele frequency estimation by deep sequencing of reduced representation libraries. Nature Methods. 2008, 5 (3): 247-252. 10.1038/nmeth.1185.
Matukumalli LK, Lawley CT, Schnabel RD, Taylor JF, Allan MF, Heaton MP, O'Connell J, Moore SS, Smith TPL, Sonstegard TS: Development and Characterization of a High Density SNP Genotyping Assay for Cattle. PLoS One. 2009, 4 (4): 13-
Gautier M, Flori L, Riebler A, Jaffrezic F, Laloe D, Gut I, Moazami-Goudarzi K, Foulley JL: A whole genome Bayesian scan for adaptive genetic divergence in West African cattle. BMC Genomics. 2009, 21 (10): 550-
Hayes BJ, Bowman PJ, Chamberlain AJ, Savin K, van Tassell CP, Sonstegard TS, Goddard ME: A Validated Genome Wide Association Study to Breed Cattle Adapted to an Environment Altered by Climate Change. PLoS One. 2009, 4 (8):
Qanbari S, Pimentel ECG, Tetens J, Thaller G, Lichtner P, Sharifi AR, Simianer H: The pattern of linkage disequilibrium in German Holstein cattle. Anim Genet. 2010, 41 (4): 346-356.
Flori L, Fritz S, Jaffrezic F, Boussaha M, Gut I, Heath S, Foulley JL, Gautier M: The Genome Response to Artificial Selection: A Case Study in Dairy Cattle. PLoS One. 2009, 4 (8):
Lao O, van Duijn K, Kersbergen P, de Knijff P, Kayser M: Proportioning whole-genome single-nucleotide-polymorphism diversity for the identification of geographic population structure and genetic ancestry. American Journal of Human Genetics. 2006, 78 (4): 680-690. 10.1086/501531.
Paschou P, Ziv E, Burchard EG, Choudhry S, Rodriguez-Cintron W, Mahoney MW, Drineas P: PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genet. 2007, 3 (9): 1672-1686.
Shriver MD, Smith MW, Jin L, Marcini A, Akey JM, Deka R, Ferrell RE: Ethnic-affiliation estimation by use of population-specific DNA markers. American Journal of Human Genetics. 1997, 60 (4): 957-964.
Smith MW, Lautenberger JA, Shin HD, Chretien JP, Shrestha S, Gilbert DA, O'Brien SJ: Markers for mapping by admixture linkage disequilibrium in African American and Hispanic populations. American Journal of Human Genetics. 2001, 69 (5): 1080-1094. 10.1086/323922.
Bowcock AM, Ruizlinares A, Tomfohrde J, Minch E, Kidd JR, Cavallisforza LL: High resolution of human evolutionary trees with polymorphic microsatellites. Nature. 1994, 368 (6470): 455-457. 10.1038/368455a0.
Wright S: The genetical structure of populations. Annals Eugenics. 1951, 15: 323-
Weir BS, Cockerham CC: Estimating F-statistics for the analysis of population structure. Evolution. 1984, 38 (6): 1358-1370. 10.2307/2408641.
Bromaghin JF: BELS: backward elimination locus selection for studies of mixture composition or individual assignment. Mol Ecol Resour. 2008, 8 (3): 568-571. 10.1111/j.1471-8286.2007.02010.x.
Rosenberg NA, Li LM, Ward R, Pritchard JK: Informativeness of genetic markers for inference of ancestry. American Journal of Human Genetics. 2003, 73 (6): 1402-1422. 10.1086/380416.
Kersbergen P, van Duijn K, Kloosterman AD, den Dunnen JT, Kayser M, de Knijff P: Developing a set of ancestry-sensitive DNA markers reflecting continental origins of humans. BMC Genetics. 2009, 10: 13-
Rosenberg NA, Burke T, Elo K, Feldman MW, Freidlin PJ, Groenen MAM, Hillel J, Maki-Tanila A, Tixier-Boichard M, Vignal A: Empirical Evaluation of Genetic Clustering Methods Using Multilocus Genotypes From 20 Chicken Breeds. Genetics. 2001, 159 (2): 699-713.
Campbell D, Duchesne P, Bernatchez L: AFLP utility for population assignment studies: analytical investigation and empirical comparison with microsatellites. Mol Ecol. 2003, 12 (7): 1979-1991. 10.1046/j.1365-294X.2003.01856.x.
McKay SD, Schnabel RD, Murdoch BM, Matukumalli LK, Aerts J, Coppieters W, Crews D, Dias E, Gill CA, Gao C: An assessment of population structure in eight breeds of cattle using a whole genome SNP panel. BMC Genetics. 2008, 9: 9-
Decker JE, Pires JC, Conant GC, McKay SD, Heaton MP, Chen KF, Cooper A, Vilkki J, Seabury CM, Caetano AR: Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the United States of America. 2009, 106 (44): 18644-18649. 10.1073/pnas.0904691106.
Cornuet JM, Piry S, Luikart G, Estoup A, Solignac M: New methods employing multilocus genotypes to select or exclude populations as origins of individuals. Genetics. 1999, 153 (4): 1989-2000.
Liu N, Chen L, Wang S, Oh C, Zhao H: Comparison of single-nucleotide polymorphisms and microsatellites in inference of population structure. BMC Genetics. 2005, 6 (Suppl 1):
Smouse PE, Spielman RS, Park MH: Multiple-locus allocation of individuals to groups as a function of the genetic variation within and differences among human populations. American Naturalist. 1982, 119 (4): 445-463. 10.1086/283925.
Parker HG, Ostrander EA: Canine genomics and genetics: Running with the pack. PLoS Genet. 2005, 1 (5): 507-513.
R Development Core Team: R: A language and environment for statistical computing. 2010, R Foundation for Statistical Computing, Vienna, Austria
Holsinger KE, Weir BS: Genetics in geographically structured populations: defining, estimating and interpreting F-ST. Nature Reviews Genetics. 2009, 10 (9): 639-650. 10.1038/nrg2611.
Akey JM, Zhang G, Zhang K, Jin L, Shriver MD: Interrogating a high-density SNP map for signatures of natural selection. Genome Research. 2002, 12 (12): 1805-1814. 10.1101/gr.631202.
Weir BS: Genetic Data Analysis II: Methods for Discrete Population Genetic Data. 1996, Sinauer Assoc., Inc.: Sunderland, MA, USA
Consortium TBH: Genome-Wide Survey of SNP Variation Uncovers the Genetic Structure of Cattle Breeds. Science. 2009, 324 (5926): 528-532.
The work was funded by the UK Food Standards Agency, Department of Environment, Food and Rural Affairs and a Genesis Faraday (KTN) SPARK award. S. Wilkinson acknowledges funding from BBSRC and Rare Breeds Survival Trust. JFT is supported by National Research Initiative Grant no. 2008-35205-04687 from the USDA Cooperative State Research, Education, and Extension Service, and Agriculture and Food Research Initiative grant number 2009-65205-05635 from the USDA National Institute of Food and Agriculture.
SW participated in the study design, wrote the computer code, carried out the statistical analysis and drafted the manuscript. PW participated in the study design and manuscript preparation. ALA was a co-PI, involved in project design and manuscript preparation. AL provided bioinformatics support. RDS called the genotypes in BeadStudio, performed QA/QC analyses and estimated allele frequencies.
SDM genotyped all of the samples. JFT collected the samples and DNAs were extracted and genotyped in his laboratory. RO was the Principal Investigator on the project responsible for its conception, funding and implementation. All authors contributed to the writing and read and approved the final version of the manuscript.
Electronic supplementary material
Additional file 1: Figure S1: A boxplot of the observed breed allele frequencies for the top ranked 50 SNP markers for each selection method. Figure S2: A plot of the percentage assignment success with cumulative number of top-ranked SNP markers at the 4 stringency threshold levels. The results of this individual assignment test are for the training set and hold-out set where the selection method implemented was Wright's pairwise FST. Table S1: Type I (false positives) and II errors (false negatives). The table details the error rates that occurred in the individual assignment analysis, using pairwise Wright's FST at the lowest stringency threshold level (LLR > 0). (DOC 445 KB)
Additional file 2: Table S2: Top 500 SNP markers. The genetic markers are ranked by decreasing informativeness and the corresponding SNP discovery methods are listed with each SNP marker. (XLS 208 KB)
Cite this article
Wilkinson, S., Wiener, P., Archibald, A.L. et al. Evaluation of approaches for identifying population informative markers from high density SNP Chips. BMC Genet 12, 45 (2011). https://doi.org/10.1186/1471-2156-12-45
- Principal Component Analysis
- Single Nucleotide Polymorphism Marker
- Individual Assignment
- Assignment Power
- Single Nucleotide Polymorphism Discovery