We consider the problem of inferring the ancestral-population origin of the two phased haploid genomes of sampled individuals who can trace their recent ancestry to k genetically distinct populations. This is accomplished by training classifiers on k′≥ k putative ancestral populations where these k′population must contain either the true k ancestral populations or good proxy populations for them. For each population multiple individuals have to be sampled at the same N polymorphic sites as the admixed individuals. Every ancestral and admixed haploid genome, i, we represent as a vector where -1 indicates the minor allele and 1 the major allele for the N sampled polymorphic sites. In addition every ancestral haploid genome is associated with a scalar y
i
∈ 1.. k′identifying the population of the haploid genome.
To find region-, or locus-specific ancestry assignments in admixed individuals, the vectors of genotypes, x
i
are divided into windows, , of w consecutive loci where j indexes the N/wwindows in individual haploid genome i. The inference problem can therefore be formalized as finding the appropriate label , for each window , in an admixed individual i. We use support vector machines (SVM) [21] for this classification problem and apply a hidden Markov model (HMM) to the output of the SVMs to smooth the population assignment across windows.
Training for ancestry classification with support vector machines
For each window of the genome we train a unique and independent SVM to optimize the classification of the k′ancestral populations. For an excellent tutorial to SVMs we refer the reader to the article by Asa Ben-Hur [22]. Specifically a linear discriminant function is trained on each window, j, consisting of w markers. For the case k′= 2 (more on k′ > 2 below), the labels y
i
become -1 and 1 and the discriminant function is found by the following constrained optimization problem:
where wis a vector and b a scalar that together define a w-1 dimensional hyperplane. This hyperplane optimally separates the two populations, in the w dimensional genotype space, subject to a penalty of misclassified individual samples proportional to the "slack" variable C and the distance, ε
i
, on the wrong side of the plane of sample i. The determined discriminant function classifies an unknown genome window by determining of the window. For k′ > 2 the standard one-against-one trick for binary classifiers is used, that is all k′(k′− 1)/2 pairs of populations are trained against each other using the above standard binary SVM [23]. Then the unknown admixed window is tested using all the k′(k′− 1)/2 classifiers and the final ancestry is assigned the most common ancestry assignment of all these classifiers.
In addition to training the SVMs on the ancestral data, accuracy of the SVM classification is estimated with a three-way cross-validation. This was accomplished by subdividing the individuals in the ancestral populations into three independent sets, where SVMs were trained on two of the subsets and tested on the remaining third subset three different times. The accuracies of these three sub-samplings were averaged and taken as a measure of the success rate for each window across the k′populations. This success rate is used heuristically to determine w, the size of the window necessary to discriminate the different ancestral populations while keeping w as small as possible to preclude recombination in the admixed genomes within each window and is also used as a parameter for the HMM.
Application of SVMs for classification using hidden Markov models
The output classification of each admixed individual is used as the input to an HMM to extract the ancestral origins of each window. Each SVM is independent from every other SVM and as such, there is no assumed correlation between ancestral states across window boundaries. Biologically this correlation generally exists within genomic blocks as a consequence of linkage disequilibrium (LD). By introducing a Markov condition across windows, we reintroduce this correlation. Specifically the HMM has k′ hidden states and k′output states. Assuming recombination occurs at every generation, recombination points are modeled along the chromosome as a Poisson process dependent on the recombination rate and number of generations, g, since admixture [24]. The transition probabilities between the hidden states is therefore modeled as (1 − e − gd)/(k′− 1) where d is the genetic distance between windows, measured in Morgans. The emission probabilities are p for the corresponding hidden state and (1−p)/(k′− 1) for the other possible states, where p is the success rate of the SVM at the corresponding window as determined by the cross-validation (described in previous section). This HMM, in addition to accounting for LD, has the advantage of smoothing out the classifications returned by the SVM, so as to limit the effect of regions with poor information content. The posterior probability of the ancestry at each window location is determined using the forward-backward algorithm and the most probable ancestry is used as the estimated ancestry.
The genetic distance d between neighboring windows was estimated from the combined HapMap genetic map[25]. To test whether variations in the genetic map drastically affects the results a subset of simulations were repeated with a fixed d across the entire genome, where d was chosen to match the mean across the genetic map.
Genotype samples from Qatar and the HGDP-Ceph panel
The 156 unrelated Qataris were processed on the Affymetrix 5.0 genotype array according to the protocols described by Hunter-Zinck et al. [4]. The 440,794 genotyped autosomal SNPs were pruned using PLINK v1.06 [26] to 327,044 SNPs with a minor allele frequency greater than 5%, a missing rate less than 5% and a Hardy-Weinberg equilibrium (HWE) deviation p-value of no less than 0.001.
The remaining SNPs were phased using Beagle [27]. For the examination of the Qatari sample using STRUCTURE the data was further pruned for pairwise linkage disequilibrium with a threshold of 0.5 in any 100 SNP window using PLINK's –indep-pairwise command and thinned to 20% of SNPs using –thin, resulting in 28,457 SNPs.
To assign segments of the genomes of each Qatar individual to possible ancestral populations, the HGDP sample data [20] was used. For these putative ancestral populations, we filtered for a minor allele frequency of 5% and removed related, unannotated and poor quality samples, retaining 886 individuals from 55 populations, all of which were phased using Beagle [27]. As the sample sizes of the different HGDP populations were very different the analysis was also repeated by subsampling 9 individuals from each HGDP population to verify that any results were robust to sample size changes. As the HGDP and the Qatar samples were run on different platforms only 71,982 SNPs remained for the analysis using SupportMix and PCA.
Generation of in silico admixed population samples
A total of 651 unrelated individuals from 31 global populations in the HGDP-CEPH dataset that each have at least 10 sampled individuals (20 phased haploid genomes) were used for in silico analysis. The other HGDP populations were excluded as there were not enough individuals to simulate admixture and also exclude the ancestral indiviudals when training SupportMix. Admixed populations were generated by in silico mating of individuals from the different populations using a method similar to the one proposed by Price et al. [15] but extended to handle more than two ancestral populations. Specifically admixture between two ancestral populations in proportions α :(1 − α) was achieved by sampling l haploid chromosomes from each of the ancestral populations and mating each chromosome i ∈ (1..l) with the corresponding chromosome in the other ancestral population using a recombination model. The origin of the admixed genomic regions was determined using two steps: 1. the initial origin was determined by probability α to be from population one, 2. the recombination spots along the genome were determined using a Poisson process such that the expected run lengths follow 1 −e−gd, where g is the number of generations since admixture and d is the distance along the genome measured in Morgans as determined by HapMap [25]. At each recombination site the originating genome was choosen with probabilities in the ratio α to (1 − α). For three or more populations, the admixture was determined by choosing the origin populations in ratios of α1: α2 : α3 : .... For each admixed population, four haploid genomes were generated and the "ancestral" haploid genomes were discarded before the training of SupportMix and LAMP-ANC.
For admixture between two populations, 465 admixed populations were generated using all 31 ancestral populations with α = 0.5, g = 5. To explore the effects of α, g, window size and genetic map used, seven simulated admixed populations were used that spanned the range of population structure in these 465 populations. For three way admixture, mixture between Yoruban, French, and each of the other HGDP populations, as well as Yoruba, Bedouin and each of the other HGDP populations were simulated separately, with equal proportions for the populations (i.e. equal values of α) and g = 5.
Comparison to other methods
SupportMix was compared to LAMP-ANC for in silico data and to PCA and STRUCTURE for the Qatar sample. The PCA was carried out on the 71,982 genotype SNPs common to both Qatar and HGDP using the singular value decomposition and STRUCTURE, using the built in admixture model [12], was run on the thinned-genotype dataset for k = 3 using 20,000 burn-in iterations and 10,000 iterations after burn-in. LAMP is a very accurate method for ancestral deconvolution of genotype data that has consistently been shown to do better than most previously published methods, while being able to handle more than two ancestral populations [17]. We did a quick comparison between LAMP-ANC and STRUCTURE run in the Linkage mode on in-silico generated African Americans. LAMP-ANC assigned 5% more loci correctly with the same number of loci used by both methods. LAMP is over an order of magnitude slower than SupportMix so to compare accuracy, a broad range of populations with different degrees of population structure were chosen. Specifically French-Bedouin, Bedouin-Yoruba, Han-Bedouin, French-Yoruba, Han-Yoruba, Papuan-Yoruba and Papuan-Karitiana admixture were examined with LAMP and SupportMix. LAMP was run in LAMP-ANC mode as described by Pasaniuc et al. [17].