Skip to main content

An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels



Case-control genetic studies of complex human diseases can be confounded by population stratification. This issue can be addressed using panels of ancestry informative markers (AIMs) that can provide substantial population substructure information. Previously, we described a panel of 128 SNP AIMs that were designed as a tool for ascertaining the origins of subjects from Europe, Sub-Saharan Africa, Americas, and East Asia.


In this study, genotypes from Human Genome Diversity Panel populations were used to further evaluate a 93 SNP AIM panel, a subset of the 128 AIMS set, for distinguishing continental origins. Using both model-based and relatively model-independent methods, we here confirm the ability of this AIM set to distinguish diverse population groups that were not previously evaluated. This study included multiple population groups from Oceana, South Asia, East Asia, Sub-Saharan Africa, North and South America, and Europe. In addition, the 93 AIM set provides population substructure information that can, for example, distinguish Arab and Ashkenazi from Northern European population groups and Pygmy from other Sub-Saharan African population groups.


These data provide additional support for using the 93 AIM set to efficiently identify continental subject groups for genetic studies, to identify study population outliers, and to control for admixture in association studies.


As we and others have previously discussed, ancestry informative markers (AIMs) can be used as a tool to minimize bias due to population stratification in case-control association studies [14]. These AIMs are not necessary in genome-wide association studies (GWAS) since the data contains a wealth of SNP information that can define and control for population stratification [3]. However, AIMs may be particularly valuable for follow-up studies to confirm GWAS results or for focused candidate gene studies. These may include studies examining different continental populations as well as studies examining populations of mixed ancestry. Thus, it is timely to identify sets of AIMs that can be used to either pre-define subject groups or control for ancestry. Recently, our group has demonstrated the application of a set of SNP AIMs for discerning continental population information[4]. These studies using a total of 128 SNPs and subsets derived from this panel showed the ability of small sets of SNPs to separate a variety of self-identified subjects of European, Amerindian, East Asian, and sub-Saharan African ancestry. Using these SNPs we were able to provide admixture information for sub-Saharan African, European and Amerindian admixed populations, and perform structured association testing in the context of mixed or admixed population groups. In addition, these studies showed that a subset of 96 AIMs performed well in TaqMan® assays, thus enabling potential wide application of these SNPs.

In the current study, we further examine the ability of a set of 93 AIMs to ascertain the ancestry of diverse population groups. This study was facilitated by the recent publically available Human Genome Diversity Panel (HGDP) genotypes[5] that include our 128 SNP AIM set. For the current study, we chose the set of 96 TaqMan optimized SNP AIMs that were the most informative AIMs from our previous study that had clear profiles in TaqMan assays[4]. Of these 96 AIMs, 93 (Additional File 1) had HGDP genotypes that passed quality filters (see Methods). These additional data allow the assessment of this SNP AIM set in Oceana populations and multiple additional African, South Asian, Amerindian, and European population groups. In addition, since our previous studies relied on several population groups (East Asian, South Asian and European) that were derived from collections in the United States, it was important to further validate the previous results using samples that were collected from specific countries of origin.


Populations studied

The individuals used in these studies include those from the HGDP, HapMap, the New York Cancer Project (NYCP) [6] and samples collected in the United States (Houston, Sacramento), Guatemala, Peru, Sweden. and West Africa. For the HGDP and HapMap the genotypes were available from online databases. For the other sample sets the genotyping was performed at Feinstein Institute for Medical Research (North Shore LIJ Health System) using Illumina 300 K array or using TaqMan assays as previously described [4]. Of the total of 1620 individual participant genotypes, 825 were included in our previous studies[4].

The previous genotypes included those from 128 European Americans, 88 African Americans, 60 CEPH Europeans (CEU), 56 Yoruban sub-Saharan Africans (YRI), 19 Bini sub-Saharan Africans, 23 Kanuri West Africans, 50 Mayan Amerindians, 26 Quechuan Amerindians, 29 Nahua Amerindians, 40 Mexican Americans (MAM), 26 Mexicans from Mexico City, 28 Puerto Rican Americans, 43 Chinese (CHB), 43 Chinese Americans, 43 Japanese (JPT), 3 Japanese Americans, 8 Vietnamese Americans, 1 Korean American, 45 Filipino Americans, 2 unspecified East Asian Americans, and 64 South Asian Indian Americans. The Maya-Kachiquel were Maya from the Kachiquel language group as previously described[7] and is from a collection distinct from the HGDP Maya group.

The additional subject sets in the current study included the following HGDP genotypes: Adygei (also known as Adyghe)(14 individuals), Balochi (15), Bantu from Kenya (12), Bantu from South Africa (8), Basque (13), Bedouin (47), Biaki Pygmy (32), Burusho (7), Cambodian (10), Columbian (7), Daur (10), Druze (43), Kalash (18), Lahu (8), Mandenka (24), Maya (13), Mbuti Pygmy (15), Melanesian (17), Mongolian (9), Mozabite (30), Palestinian (26), Papuan (16), Pima (11), Russian (13), San (7), Uygur (10), Yakut (15), Yi (10), and Yoruba (25). Other additional samples not previously studied included Ashkenazi Jewish (40), Swedish (40), Irish (40) and other European Americans (190) from the NYCP. Genotypes from the HGDP subjects were obtained from the NIH Laboratory of Neurogenetics

For all subjects, blood cell samples were obtained according to protocols and informed-consent procedures approved by institutional review boards, and were labeled with an anonymous code number linked only to demographic information.

Data Filters

SNPs and individual samples with less than 90% complete genotyping information from any data set were excluded from analyses. SNPs that showed extreme deviation from Hardy-Weinberg equilibrium (p < 0.00001) in individual population groups were also excluded from these analyses.

Statistical Analyses

Fst was determined using Genetix software[8] that applies the Weir and Cockerham algorithm[9]. Hardy-Weinberg equilibrium was determined using HelixTree 5.0.2 software (Golden Helix, Bozeman, MT, USA).

Population structure was examined using STRUCTURE v2.1[10, 11] parameters and AIMs previously described[4]. Briefly, each analysis was performed without any prior population assignment and was performed at least 3 times with similar results using >200,000 replicates and >100,000 burn-in cycles under the admixture model. For all analyses reported, we used the "infer α" option with a separate α estimated for each population (where α is the Dirichlet parameter for degree of admixture). Runs were performed under the λ = 1 option where λ estimates the prior probability of the allele frequency and is based on the Dirichlet distribution of allele frequencies.

PCA was performed using the EIGENSTRAT statistical package[12].


AIMs Show Increased Ability to Differentiate Between Continental Population Groups

Wright's F statistic, Fst, was used as a common measure of population differentiation and calculates the inter-population compared to intra-population variation. Using the Weir and Cockerham algorithm[9] (see Methods) we compared the Fst values of selected population groups between random marker sets and the 93 SNP AIMs. The studies included samples derived from HapMap [13, 14], HGDP[5], and samples collected in the United States, Guatemala and Nigeria (see Methods). The random SNP Fst values were obtained using three random non-overlapping sets of 3500 SNPs distributed over the autosomal genome (minimum of 50 kb distance between SNPs). The small differences in these triplicate independent samplings (mean SD for all paired Fst values = 0.0023; median SD = 0.0014; mean coefficient of variance for all Fst values = 0.0023) indicate that this approach resulted in good estimations of paired Fst values.

The 93 SNP AIM subset had substantially larger intercontinental paired Fst values than the random SNPs for any of the pairs of population groups from the sub-Saharan African, Amerindian, East Asian, Oceana and European (excluding the South Asian populations) continental groups (Table 1). For the South Asian populations, the paired Fst values showed a similar pattern, with the exception of those between the South Asian and Oceana groups in which the paired Fst values using the AIM panel were similar to those determined using the random SNPs. In contrast, the paired Fst values within continental groups (European, Amerindian, East Asia, and Oceana) were very similar when comparing the AIM and random SNP sets (mean intra-continental group Fst difference = 0.008). Overall, the Fst values determined using the 93 AIM set were highly correlated with the Fst values determined using the random SNPs (r2 = 0.70) (Additional file 2).

Table 1 Paired Fst values using 93 AIMs and random sets of 3500 SNPsa

Examination of Population Structure Using Non-Hierachical Clustering

The population genetic structure of 1620 subjects was examined using the STRUCTURE program [10, 11] that applies a Bayesian non-hierarchical clustering method. The genotypes were from 795 new subjects not previously studied, and 825 subjects from our previous studies [4]. All subjects were examined under different assumptions of the number of population groups (clusters) ranging from one to twelve (K = 1, K = 2 ... K = 12) without any pre-assignment of population affiliation. The estimation of Ln probability of the data modestly favored the assumption of K = 9 (Fig 1) and strongly suggested that more than 5 population groups best fit these data. As shown in Fig 2, the population groups corresponded to different self-identified ethnic groupings of specific continental origins and particular sub-continental groupings. When large numbers of replicates were used (see Methods), multiple runs of this data set showed stable results at K = 5 and K = 6. When larger numbers of groups were assumed (i.e. K > 6) there was variation in the results. In particular runs various cluster groups would be present or absent. These included some runs in which Bedouin subjects corresponded to an individual cluster group, and others in which the South Asian populations was represented as a single cluster rather than two clusters as shown for K = 9 in Fig 2. Consistent with our previous studies, we observed one or more clusters that showed a high proportion in South Asian population groups and low membership of all other ethnic groups. Interestingly, we also consistently observed a splitting of European populations into two or more clusters that appears to correlate with a distinction between individuals of northern European ancestry and those of ethic groups derived from the Middle-East region. Thus, Palestinian, Bedouin, Druze, and Ashkenazi populations had many individuals with a large membership in a second European cluster (Fig 2, K = 9). Similarly, a division within the sub-Saharan African populations was observed with the majority of San and Mbuti Pygmy individuals showing a high proportion of membership in a second sub-Saharan African cluster. The sub-Saharan Africa results are consistent with observations in a variety of previous studies [5, 1517].

Figure 1
figure 1

Probability estimations for the number of cluster groups (K) using STRUCTURE. The ordinate show the Ln probability corresponding to the number of cluster (K). STRUCTURE analyses were performed using the F model (admixture) as described in Methods using the 93 SNP AIM set. The Ln probability closest to zero corresponds to the most likely number of clusters or population groups that explain the population structure.

Figure 2
figure 2

Analysis of population genetic structure using 93 SNP AIMs. Each horizontal line represents an individual subject. Each self identified population group is shown along the ordinate. Analyses were performed using STRUCTURE without any prior population assignment (see Methods). The number of cluster groups is shown for each panel. The color code corresponds to individual cluster groups that were named according to the continental group with the largest membership in that group.

The STRUCTURE analyses using the 93 SNP AIM set were also compared with results obtained using random sets of 3500 SNPs for K = 6 (Fig 3). In this comparison we used HGDP, HapMap and Maya (Kachiquel) samples. The individual membership in each cluster group was highly correlated: overall r2 = 0.94; for the African cluster, r2 = 0.99; for the Amerindian cluster, r2 = 0.99; for the East Asian cluster, r2 = 0.99; for the European cluster, r2 = 0.90; for the South Asian cluster, r2 = 0.56; and for the Oceana cluster, r2 = 0.97. The weakest correlations were observed for the Burusho, Balochi, and Kalash South Asian ethnic groups, and the Adygei individuals. For these particular ethnic groups the membership in the European and South Asian groups was substantially different using the 93 SNP AIM set than the result obtained using 3500 random SNPs (Fig. 3D and 3F).

Figure 3
figure 3

Correlations between population structure results. Results of STRUCTURE analyses using the 93 SNP AIM Set and 3500 random SNPs are shown. The fraction of membership for each individual analyzed is indicated for each of six clusters (panels A-F) for the 93 SNP AIM set (ordinate) and a 3500 SNP set (abscissa). The population clusters named according to the continental group with the largest membership in that group. The subjects included each of the HGDP, HapMap and the Maya (Kachiquel) individuals (see methods). The results show a single 3500 random SNP set is shown. However each of three independent 3500 random SNP sets show very similar results. For panels D and F, the South Asian ethnic groups (open circles) and Adygei ethnic group (grey circles) are shown with different symbols to highlight the differences between the 93 SNP AIM set and 3500 random SNPs.

There was also a correlation between the 93 AIM set and 3500 random SNP set when the splitting of the European and Sub-Saharan African populations was observed (e.g. when K = 9 analyses were performed). The correlation in membership in the two European clusters (r2 = 0.50) and two sub-Saharan African clusters (r2 = 0.44) provides support for the ability of the 93 SNP AIM set to partially discern these additional aspects of population substructure.

The performance of the 93 SNP AIM set was also compared with results obtained using different numbers of random SNPs. Overall, the 93 SNP AIM set showed marginally higher correlations with group membership determined using 3500 random SNPs (r2 = 0.94), than did random sets of 500 SNPs (r2 = 0.90). The performance of the 93 SNP AIM set was also examined using restricted sample groups (European, sub-Saharan African and Mozabites) to assess the European and sub-Saharan African contribution in the Mozabite ethnic group. For this comparison, the STRUCTURE analyses were performed using K = 2. Here, the correlation between the 93 SNP AIM set and 3500 random SNPs was stronger than the correlation observed for 500 random SNP sets and nearly comparable to 1000 random SNPs. The 93 SNP AIM set showed an r2 = 0.85 with a 3500 random SNP set. Each of three independent random 500 SNP sets showed lower r2 values with the 3500 random SNPs (r2 values = 0.64, 0.75 and 0.77, respectively, for three independent 500 SNP sets). For sets of 1000 random SNPs, very high correlations were observed (r2 = 0.90, 0.87, and 0.91 for three independent random 1000 SNP sets). As a comparison, 93 random SNP sets showed much lower correlations (r2 values = 0.44, 0.27, and 0.36 for three independent random SNP sets). Together, these data suggest that the 93 SNP AIMs capture more ancestry information than random sets of 500 SNPs and are nearly comparable to using 1000 random SNPs.

To further assess the correspondence of the cluster groups to geographic ancestry, we examined the K = 6 STRUCTURE results comparing the presumed European, East Asian, sub-Saharan African, Oceana, and Amerindian population clusters with each of the self identified or regionally collected groups (Table 2). For the purposes of these analyses, we considered South Asian origin as a distinct group separate from "European" populations (see Discussion). Using >0.85 membership in a cluster as the criterion for inclusion, most of the subjects within each self-defined or collected population group corresponded to the expected continental group (Table 2). For example, of the European subjects 91.9% were members of the "European" cluster group and none were members of the other five cluster groups.

Table 2 Ascertainment of Continental Ancestry Using 93 SNP AIM Panel

With respect to the South Asian population groups, overall 7.7% (8/104) of the individuals were included in the European group and none of the South Asian individuals were included in any of the other continental categories. Of these eight South Asian individuals included in the European group, five belonged to the Kalash ethnic group. For the South Asian subjects, the criterion of >0.85 membership in a specific cluster included only a minority (27/104) of these subjects. Decreasing the criterion to >0.50 membership resulted in the inclusion of the majority of the South Asian individuals within a single cluster (67/104). Using this criterion, there were only 6 individuals of other ethnic groups that were included in the South Asia group (2/10 Uygur, 1/40 Ashkenazi, 1/26 Palestinian, 1/399 EURA, and 1/43 Druze individuals).

As expected, the vast majority of individuals from admixed populations (African American, Mexican American, and Mexican) were not included in any of the continental groupings. In addition, the Uygur individuals from central Asia and all but one of the Mozabite subjects were excluded from any of the continental groups using the >0.85 cluster group membership criterion.

Principal Component Analyses of Diverse Population Groups Using AIMs

The same data set was also examined using principal component analyses (PCA). Nearly all of the variance detected using the 93 AIM set was defined by the first four principal components (PCs) (Fig 4). Similar to the results from STRUCTURE, the first four PCs (Fig 5) show the separation of the 5 continental populations as well as those of two admixed populations (African American and Mexican American groups) that were included in the analyses.

Figure 4
figure 4

Eigenvalue distribution for principal components. The eigenvalues for each PC are shown. Comparing the eigenvalue of each PC shows the relative amount of variation that is explained by the different PCs. The plateau in eigenvalues generally corresponds to variation that can not be attributable to discernable groupings of subjects.

Figure 5
figure 5

Principal component analysis of diverse population groups. The analysis used the same data set indicated in Figure 2. The population groups are shown by the color coded symbols. The sub-Saharan African groups are designated S-S African in this figure. A, shows the PC1 and PC2 results from the different ethnic groups excluding the admixed populations. C, shows the PC1 and PC2 results for the African American and Mexican American subjects that were run together with the individual subjects shown in A. B, and D show the results for the same subject groups for PC3 and PC4.

Also similar to the cluster groups defined by STRUCTURE, putative subject groups corresponding to continents or admixed population groups could be assigned using the individual subject eigenvector scores. Here, we used the self-identified or collected European, East Asian, sub-Saharan African, Oceana, and Amerindian populations groups to define the groups. The criterion for inclusion was the mean +/- two standard deviations (SD) of the eigenvector scores for each of the first four PCs (i.e. individuals with eigenvector scores > mean + 2 SD or < mean - 2SD for PC1, PC2, PC3, or PC4 were excluded). Using this definition, most of the subjects within each self-defined or collected population group were included within a self-matching group (Table 2). Similar to the STRUCTURE results, with the exception of a single Mozabite individual, none of the subjects that were a priori considered to be of other continental populations (excluding South Asian and admixed population groups) were included in any of the other continental groups.

For South Asians, this criterion (mean +/- 2 SD) included 87 of the 104 self-identified subjects in the South Asian grouping. However, 170 of the remaining 1516 subjects (not self-identified as South Asian) were also included in this group. When the criterion was changed to the mean +/- 1 SD, only 25 of the 104 self-identified South Asian subjects were included in this group and 8 other non-South Asian subjects were also included. Thus, for the 93 AIM set data, the PCA analyses did not perform well with respect to identifying South Asian subjects.

Using PCA, we further examined these individual population groups. There was partial grouping of certain ethnic groups when only those subjects within individual continental groups were analyzed separately (Fig 6). This was most distinct for the Mbuti Pygmy group within the sub-Saharan African populations. In addition, southern European population groups could be partially distinguished other European population groups (e.g. Palestinian compared to Russian). However, clustering of different East Asian populations groups was not observed using this set of AIMs selected for continental differences (see Discussion).

Figure 6
figure 6

Principal component analysis of sub-Saharan African, European and East Asian populations. The analysis was performed using only the individual continental groups. The populations included each of the sub-Saharan African (A), European (B), and East Asian (C) ethnic groups. The color code highlights specific ethnic groups.


The current study provides additional validation for the use of a set of AIMs in human genetic studies. We show that a set of 93 SNP AIMs can distinguish a wide variety of diverse population groups in a sampling that includes the most populous groups in the United States as well as many groups from each continent with the exception of Australia. For example, even very diverse tribal groups within Africa can be readily distinguished from other continental groups. In addition, since the AIMs were in large part initially selected to distinguish between Amerindian, European, and African ancestry, the same set of AIMs provides good information for individual admixture in the largest admixed population groups (African American and Mexican American) in the United States. This AIM set also distinguished most of the South Asian and Central Asian individuals from those of European or East Asian origin and was also effective in grouping Oceana populations. Although there were specific limitations (e.g. for distinguishing South Asian populations), overall, the data suggest that this AIM set performs better than 500 random SNPs for distinguishing continental population differences.

Although many previous studies have identified AIMs that distinguish particular combinations of continental groups [2, 1820], the current AIM set has several important features. These include: 1) validation using many different population groups from all continents with the exception of Australia; and 2) widely available genotyping results that can be readily incorporated in analyses. The latter includes the previously published individual genotypes accompanying our initial study of these SNPs [4], and any subject sets genotyped using the Illumina 300 K or larger SNP platforms. Importantly, both the HGDP and HapMap Illumina genotypes are publically available. In fact, the performance of the current AIM set could not be directly compared with other recently described AIM sets [2, 21] because of limited public availability of individual subject HGDP genotypes for these SNPs. The use of previous genotyped data sets in analyses can enhance the performance of analyses using either clustering algorithms or PCA, both of which are influenced by the inclusion of different population groups [22]. Finally, the current set of AIMs has been selected for performance on the widely used TaqMan® platform that can be efficiently applied in small laboratory settings and is commercially available as a marker set = 606102.

We have previously discussed and provided general guidelines for the application of AIMs [3, 4]. In the current study, we have used specific criteria for both STRUCTURE outputs and PCA eigenvector scores to analyze a SNP AIM panel using additional subjects of diverse ethnic group affiliation. Marginally better correspondence with self-identified ethnic affiliation was observed in this data set using the model dependent clustering algorithm applied in STRUCTURE compared to PCA (Table 2). However, PCA may offer substantial computational advantages if the AIMs are used for controlling population structure and substructure in association studies [3, 12]. Thus, at present, we would suggest using STRUCTURE results for limiting analyses to particular subject groups and using PCA or multidimensional scaling for association testing. The application of multi-dimensional scaling showed nearly identical results to those using PCA (data not shown).

It is worth noting that this current AIM panel excludes nearly all South Asian subjects from "other" European populations. As has been noted in previous studies, South Asian populations are much closer to European than East Asian or other continental groups and South Asian ethnic populations are variably grouped together with European population groups [23]. When comparing population differentiation using paired Fst values there is no clear distinction between these different European and South Asian ethnic groups (Table 1). For example, the following Fst values using random SNPs were observed: Balochi/Ashkenazi = 0.018, Balochi/Palestinian = 0.016, Balochi/Swedish = 0.021, Palestinian/Swedish = 0.020, Palestinian/Ashkenazi = 0.010, Ashkenazi/Swedish = 0.012. However, the current STRUCTURE results that show South Asian specific clusters, previous STRUCTURE analyses [20, 23], and PCA analyses using thousands of SNPs [5] indicate substantial differences in the allele patterns of South Asian compared to European subjects. Thus, it may be advantageous to exclude South Asian subjects in European association studies to reduce genetic heterogeneity. The current suggested criteria (Table 2) will probably exclude most South Asian individuals, although with the caveat that many South Asian ethnic groups have not been studied.

This 93 SNP AIM set also showed a partial ability to discern additional population substructure. For both Europeans and sub-Saharan Africans, there was apparent grouping of certain ethnic groups in additional clusters. This was most clear for K = 9 in the STRUCTURE analysis but was also suggested by the graphic representation in the PCA analysis (Fig 6). Thus, the differences between Arab and Ashkenazi European and northern European ethnic groups, and the difference between certain sub-Saharan African groups (e.g. Mbuti Pygmy) are partially discerned. However, previous studies by multiple groups indicate that additional panels of SNPs are necessary to most effectively control for differences in European population substructure [22, 2426]. In addition, the 93 SNP AIM panel did not show any substructure within the East Asian populations. Recent studies using HGDP and other sample sets show substructure within East Asian population groups further emphasizing the potential limitations of the 93 SNP AIM panel [5, 27]. The current AIM panel is designed to address continental differences and we caution that controlling for population stratification within particular continental groups requires additional panels of SNPs to further reduce false positive or negative results in association tests [3, 22, 2428]. Importantly, the current AIM set performs well with respect to ascertaining admixture proportions in African Americans and in Hispanic populations [4]. The need for utilizing additional SNPs for addressing population stratification will be highly dependent on the populations being used in a particular study and whether other strategies including demographic information are used for matching cases and controls.


The current study provides additional confidence that a panel of 93 AIMs can be effectively used to ascertain population genetic structure that results from the inclusion of subjects of diverse continental origins. Using either highly supervised clustering algorithms or largely unsupervised PCA, these SNP AIMS can be used to 1) identify continental subject groups for genetic studies, 2) identify study population outliers, and 3) control for admixture in association studies.


  1. Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM: Control of confounding of genetic associations in stratified populations. Am J Hum Genet. 2003, 72 (6): 1492-1504. 10.1086/375613.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  2. Halder I, Shriver M, Thomas M, Fernandez JR, Frudakis T: A panel of ancestry informative markers for estimating individual biogeographical ancestry and admixture from four continents: utility and applications. Hum Mutat. 2008, 29 (5): 648-658. 10.1002/humu.20695.

    Article  CAS  PubMed  Google Scholar 

  3. Tian C, Gregersen PK, Seldin MF: Accounting for ancestry: population substructure and genome-wide association studies. Hum Mol Genet. 2008, 17 (R2): R143-150. 10.1093/hmg/ddn268.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  4. Kosoy R, Nassir R, Tian C, White PA, Butler LM, Silva G, Kittles R, Alarcon-Riquelme ME, Gregersen PK, Belmont JW, et al: Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America. Hum Mutat. 2009, 30 (1): 69-78. 10.1002/humu.20822.

    Article  PubMed Central  PubMed  Google Scholar 

  5. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, et al: Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008, 319 (5866): 1100-1104. 10.1126/science.1153717.

    Article  CAS  PubMed  Google Scholar 

  6. Mitchell MK, Gregersen PK, Johnson S, Parsons R, Vlahov D: The New York Cancer project: rationale, organization, design, and baseline characteristics. J Urban Health. 2004, 81 (2): 301-310.

    Article  PubMed Central  PubMed  Google Scholar 

  7. Tian C, Hinds DA, Shigeta R, Adler SG, Lee A, Pahl MV, Silva G, Belmont JW, Hanson RL, Knowler WC, et al: A genomewide single-nucleotide-polymorphism panel for Mexican American admixture mapping. Am J Hum Genet. 2007, 80 (6): 1014-1023. 10.1086/513522.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  8. Belkhir K, Borsa P, Chikhi L, Raufaste N, Bonhomme F: GENETIX, software under WindowsTM for the genetic of populations. 2001, Montpellier, France: Laboratory Genome, Populations, Interactions CNRS UMR 5000, University of Montpellier II, 4.02

    Google Scholar 

  9. Weir B, Cockerham C: Estimating F-statistics for the analysis of population structure. Evolution. 1984, 38: 1358-1370. 10.2307/2408641.

    Article  Google Scholar 

  10. Pritchard JK, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics. 2000, 155 (2): 945-959.

    PubMed Central  CAS  PubMed  Google Scholar 

  11. Falush D, Stephens M, Pritchard JK: Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003, 164 (4): 1567-1587.

    PubMed Central  CAS  PubMed  Google Scholar 

  12. Price AL, Patterson NJ, Plenge RM, Weinblatt ME, Shadick NA, Reich D: Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet. 2006, 38 (8): 904-909. 10.1038/ng1847.

    Article  CAS  PubMed  Google Scholar 

  13. The International HapMap Project. Nature. 2003, 426 (6968): 789-796. 10.1038/nature02168.

  14. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P: A haplotype map of the human genome. Nature. 2005, 437 (7063): 1299-1320. 10.1038/nature04226.

    Article  Google Scholar 

  15. Bamshad MJ, Wooding S, Watkins WS, Ostler CT, Batzer MA, Jorde LB: Human population genetic structure and inference of group membership. Am J Hum Genet. 2003, 72 (3): 578-589. 10.1086/368061.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  16. Witherspoon DJ, Marchani EE, Watkins WS, Ostler CT, Wooding SP, Anders BA, Fowlkes JD, Boissinot S, Furano AV, Ray DA, et al: Human population genetic structure and diversity inferred from polymorphic L1(LINE-1) and Alu insertions. Hum Hered. 2006, 62 (1): 30-46. 10.1159/000095851.

    Article  CAS  PubMed  Google Scholar 

  17. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW: Genetic structure of human populations. Science. 2002, 298 (5602): 2381-2385. 10.1126/science.1078311.

    Article  CAS  PubMed  Google Scholar 

  18. Parra EJ, Kittles RA, Shriver MD: Implications of correlations between skin color and genetic ancestry for biomedical research. Nat Genet. 2004, 36 (11 Suppl): S54-60. 10.1038/ng1440.

    Article  CAS  PubMed  Google Scholar 

  19. Salari K, Choudhry S, Tang H, Naqvi M, Lind D, Avila PC, Coyle NE, Ung N, Nazario S, Casal J, et al: Genetic admixture and asthma-related phenotypes in Mexican American and Puerto Rican asthmatics. Genet Epidemiol. 2005, 29 (1): 76-86. 10.1002/gepi.20079.

    Article  PubMed  Google Scholar 

  20. Yang N, Li H, Criswell LA, Gregersen PK, Alarcon-Riquelme ME, Kittles R, Shigeta R, Silva G, Patel PI, Belmont JW, et al: Examination of ancestry and ethnic affiliation using highly informative diallelic DNA markers: application to diverse and admixed populations and implications for clinical epidemiology and forensic medicine. Hum Genet. 2005, 118 (3–4): 382-392. 10.1007/s00439-005-0012-1.

    Article  PubMed  Google Scholar 

  21. Phillips C, Salas A, Sanchez JJ, Fondevila M, Gomez-Tato A, Alvarez-Dios J, Calaza M, de Cal MC, Ballard D, Lareu MV, et al: Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Sci Int Genet. 2007, 1 (3–4): 273-280. 10.1016/j.fsigen.2007.06.008.

    Article  CAS  PubMed  Google Scholar 

  22. Tian C, Plenge RM, Ransom M, Lee A, Villoslada P, Selmi C, Klareskog L, Pulver AE, Qi L, Gregersen PK, et al: Analysis and application of European genetic substructure using 300 K SNP information. PLoS Genet. 2008, 4 (1): e4-10.1371/journal.pgen.0040004.

    Article  PubMed Central  PubMed  Google Scholar 

  23. Rosenberg NA, Mahajan S, Gonzalez-Quevedo C, Blum MG, Nino-Rosales L, Ninis V, Das P, Hegde M, Molinari L, Zapata G, et al: Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genet. 2006, 2 (12): e215-10.1371/journal.pgen.0020215.

    Article  PubMed Central  PubMed  Google Scholar 

  24. Seldin MF, Shigeta R, Villoslada P, Selmi C, Tuomilehto J, Silva G, Belmont JW, Klareskog L, Gregersen PK: European Population Substructure: Clustering of Northern and Southern Populations. PLoS Genetics. 2006, 2 (9): 1339-1351. 10.1371/journal.pgen.0020143.

    Article  CAS  Google Scholar 

  25. Bauchet M, McEvoy B, Pearson LN, Quillen EE, Sarkisian T, Hovhannesyan K, Deka R, Bradley DG, Shriver MD: Measuring European population stratification with microarray genotype data. Am J Hum Genet. 2007, 80 (5): 948-956. 10.1086/513477.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  26. Price AL, Butler J, Patterson N, Capelli C, Pascali VL, Scarnicci F, Ruiz-Linares A, Groop L, Saetta AA, Korkolopoulou P, et al: Discerning the ancestry of European Americans in genetic association studies. PLoS Genet. 2008, 4 (1): e236-10.1371/journal.pgen.0030236.

    Article  PubMed Central  PubMed  Google Scholar 

  27. Tian C, Kosoy R, Lee A, Ransom M, Belmont JW, Gregersen PK, Seldin MF: Analysis of East Asia genetic substructure using genome-wide SNP arrays. PLoS ONE. 2008, 3 (12): e3862-10.1371/journal.pone.0003862.

    Article  PubMed Central  PubMed  Google Scholar 

  28. Seldin MF, Price AL: Application of ancestry informative markers to association studies in European Americans. PLoS Genet. 2008, 4 (1): e5-10.1371/journal.pgen.0040005.

    Article  PubMed Central  PubMed  Google Scholar 

Download references


We thank Stephen Johnson and Robert Lundsten for informatics support on the New York Cancer Project samples. We also thank Anthony Liew and Houman Khalili for expert assistance with genotyping. We thank the volunteers from the different populations for donating blood samples. This work was supported by NIH grants DK071185, AR050267, and AR44422.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Michael F Seldin.

Additional information

Competing interests

PAW and FMDLV are employees of Applied Biosystems.

Authors' contributions

The study was conceived by MFS, FMDLV and designed by RN, RKo, and MFS. PKG, MEAR, RKi, LMB, GS, and JWB recruited subjects, and obtained and prepared DNA samples used in these studies. Genotyping was performed by RN and PKG. Analyses were performed by RN, RKo, CT and MFS. The manuscript was written by RN and MFS with contributions from RKo, CT, LMB, PAW, FMDLV and PKG. All authors have read and approved the final manuscript.

Electronic supplementary material


Additional file 1: SNP Ancestry Informative Markers. List of SNPs including sequence context and chromosomal locations. (XLS 34 KB)


Additional file 2: Correlation of F st values between 93 SNP AIM set and 3500 random SNPs. Figure showing the correlation between interpopulation Fst values calculated using the 93 SNP AIM set compared with the result using random 3500 SNPs (mean from three independent sets). (TIFF 473 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Nassir, R., Kosoy, R., Tian, C. et al. An ancestry informative marker set for determining continental origin: validation and extension using human genome diversity panels. BMC Genet 10, 39 (2009).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: