POPRES project
POPRES is a project fostering large Population Reference Samples of different ethnic origins [22]. The original POPRES project contains nearly 5,000 individuals of African-American, East Asian, South Asian, Mexican and European origin. Individuals included in the POPRES study are collected from different study groups all over the world. POPRES performed Genome-wide genotyping of these individuals on the Affymetrix (Mountain View, CA) GeneChip 500 K Array set with the published protocol for 96-well-plate format. Sample collection and methods for POPRES are described elsewhere [22]. The datasets used for the present analyses were obtained from dbGaP [23] through dbGaP accession number phs000145.v4.p2.
Datasets
We considered chromosome 22 from the POPRES dataset for our research. This dataset originally consisted of 5,637 SNPs measured in individuals from 35 different populations. To avoid biases due to different sample sizes and to include as many populations as possible into our analysis, we considered an equal number of individuals for each sub-population (N = 40). If more than 40 individuals are available, a random sample of N = 40 was drawn to rule out effects caused by differing sample sizes. Population groups with less than 40 members were discarded resulting in a total of 20 different ethnic subsets, namely 15 populations of Caucasian origin: Australian, Canadians, German, French, Swiss-French, Swiss-German, Swiss, Italian, Spanish, Irish, British, Belgish, Portuguese, former Yugoslavia, a mixed group of east European origin (a mixture of people from Czech-republic, Hungary, Poland); two populations of South-Asian origin: Indians and Punjabis, one east-Asian population: Japanese, one Mexican population: Mexican, and finally, a mixed-population of African-Americans (AfAm). Study populations which do not match very closely with the available HapMap references CEU, JPT, CHB, YRI (see below) were supposed to indicate the impact of imperfect reference panels on the target populations. This might be applicable for the following populations: Indian, Punjabis, Yugoslavians, East-EU, Portuguese and African-Americans. Target populations like Europeans are abbreviated by EU, East European populations (eastEU and Yugoslavia) by EEU, South Asians (Punjabi and India) as SASI, Japanese by Jap, Mexican by MEX, South European (Italian, Portuguese) by SEU and African Americans by AfAm.
Reference Panel
1000 Genomes datasets were based on low depth whole genome sequencing data and are generally considered to have lower accuracy than HapMap data. Thus we considered HapMap3 [24] reference panels (NCBI Build 36) to impute the above mentioned populations. Four different pre-formatted reference panels: CEU, YRI, MEX and JPT + CHB provided by the MaCH-developers [25] and IMPUTE-developers [26] were considered. In a full-factorial design, we imputed our target populations with these reference panels.
Strand verification of SNPs
Genomic assembly of the original POPRES data was identical to Affymetrix release 25 NSP25 and STY25 and the corresponding rs-IDs were identified by NCBI build “b36” with UCSC version “hg18”. Strand alignment between study sample and reference data was performed using fcGENE [27] and PLINK [28]. SNPs with ambiguous strands and SNPs which could not be found in the HapMap3 reference panel were removed. In total 1,014 SNPs could not be matched to HapMap3 reference panels and were excluded. 4,623 SNPs overlapping with HapMap reference panel remained for further analyses.
Selection of good-quality (GQ) SNPs
Good quality (GQ)-SNPs were selected with stringent filtering criteria of genotype quality. These GQ SNPs were then assumed to express true genotypes for our experimental study. In analogy to our previous research [29], we masked these SNPs and re-imputed them to evaluate the imputation accuracy as explained in the next section. More precisely, we compared the posterior genotype probability distributions produced by the imputation software with the corresponding true genotypes. To select GQ SNPs, we apply the following quality criteria: average call rate (CR averaged over all populations > =95 %), average minor allele frequency (MAF averaged over all populations > =0.1) and p-values of stratified Hardy Weinberg Equilibrium Test (p (HWE) > =10e-2). Since the samples were from multiple ethnic group, we used exact stratified test of HWE [30]. A total of 457 SNPs passed these quality criteria.
Masking Process
We performed the masking process in two phases. First, we masked all good quality SNPs and imputed them with the imputation frameworks: MaCH, MaCH-minimac and IMPUTE2. We also considered additional scenarios where we masked only 70 % and 50 % of the previously selected good quality SNPs. This type of masking was performed in such a way that all SNPs masked in the lower percentage of missingness were also masked in the higher percentages of missingness. The first type of masking (100 %) was used to investigate the relationships between G
ST
and F
ST
-related scores and corresponding imputation accuracy. The second type of masking was used to study the impact of different degrees of missingness on these relations. For this purpose, we only compared the 50 % GQ SNPs which were masked in all three missingness scenarios to avoid bias introduced by SNP selections.
Imputation
Imputations were performed separately for each of the previously mentioned sets of populations combined with any of the four reference panels. As suggested by MaCH developers, imputation with this software was performed in two steps: In the first step, imputation error rate and recombination rate were estimated. These two model parameters were determined by running the “greedy” algorithm for 100 iterations and were used in the second step to determine the transition probabilities of the underlying Hidden Markov Model [8]. In the second step, the most likely genotype probability distributions of each genotype at each individual and the imputation quality measured by the software specific Rsq score were determined. Commands used for MaCH-imputation are provided in Additional file 1. The relative performance of imputation methods differ greatly as a function of sample sizes, marker densities and parameters of the algorithm such as the number of EM iterations. Therefore, the same standard parameter settings were used for each imputation process.
Imputation with MaCH-minimac was also performed in two steps. In the first step, MaCH was used to predict the haplotypes of the study data sets, and then, minimac was used to calculate the posterior probabilities of the genotypes using these haplotypes.
As suggested by the software developer, imputation with IMPUTE2 was performed in a segmented way by defining different genomic intervals approximately of size 5 MB. An internal buffer region of 250 kb on both sides of the analysis interval was used to avoid the margin effects of chromosome segmentation.
After imputation, we compared the estimated posterior distribution with the measured genotypes as explained below. Considering four reference panels, 20 target populations, two missing scenarios and three software packages, a total of 480 imputations were performed.
Assessment of imputation accuracy
A common strategy for determining imputation performance is to compare true genotypes (genotypes measured by a technique with high confidence or consensus genotypes) with corresponding imputation results. Here, we directly compared the posterior distributions of our re-imputed GQ-SNPs with corresponding measured genotypes applying our recently proposed Hellinger and SEN score [31]. Both measures are platform independent. While Hellinger score measures the distance of imputed and measured genotype probability distribution, SEN score is maximal if their expectations are identical. Thus, SEN essentially compares gene doses. Cut-offs of 0.95 for SEN score and 0.45 for Hellinger score respectively were considered as indicators of good imputation accuracy (for motivation, see also Fig. 2 below). We also analysed imputation accuracy using the software specific measures MaCH-Rsq and IMPUTE-info determined during the imputation process. MaCH-Rsq measure is basically defined as the ratio of the empirically observed variance of the allele dosage to the expected binomial variance under Hardy-Weinberg equilibrium [32]. Similarly, IMPUTE-info score is the relative statistical information about the SNP allele frequency derived from the imputed data [33]. These two software-specific measures are defined at SNP-level and are useful to assess imputation quality of SNPs for which no measurements are available. These scores are widely applied to remove SNPs with low imputation accuracy during post-imputation quality control.
While Hellinger and SEN scores assess agreement of imputed and observed genotypes individually, the software specific measures assess imputation quality for entire SNPs, i.e. cannot be interpreted for single genotypes.
Estimation of G
ST
and other F
ST
-related measures
Nei’s G
ST
is defined as the ratio of average gene diversity within subpopulations and the gene diversity of the total pooled population:
$${G}_{ST}=\frac{{D^{\hbox{'}}}_{ST}}{H_T},$$
where H
T
is the heterozygosity expected under Hardy-Weinberg equilibrium for the total pooled population and D’
ST
is the average gene diversity between the subpopulations [13, 14, 19]. For two-allelic markers, Bhatia et al. [16] recommended the estimator of G
ST
at any particular k
th marker as:
$$\begin{array}{l}{G}_{ST}^{\left[k\right]}=\frac{D^k}{H_T^k},{D}^k={\left({p}_1^k-{p}_2^k\right)}^2, \\ {}\kern2.64em {H}_T^k=2{p}_{avg}^k\left(1-{p}_{avg}^k\right),\\ {}\kern3.48em {p}_{avg}^k=\frac{p_1^k+{p}_2^k}{2}\end{array}$$
where p
k1
and p
k2
are the allele frequencies of the reference allele at the k
th marker in the two populations. To calculate G
ST
between two population groups genotyped at N markers, one can use the formula:
$${G}_{ST}=\frac{{\displaystyle {\sum}_{k=1}^N}{D}^k}{{\displaystyle {\sum}_{k=1}^N}{H}_T^k}$$
Computation of pair-wise G
ST
between any two population is implemented in the most recent version of fcGENE [27]. Small values of G
ST
indicate that allele frequencies between the two populations are similar, i.e. the genetic distance between them is small.
Regarding F
ST
-related measures, we considered F
R
ST
described in the work of Reich et al. [17], and implemented in the program EIGENSOFT, in which a block-jack knife procedure is used to estimate the standard error of F
R
ST
. For any k
th Marker, F
R
ST
is calculated as
$$\begin{array}{l}\kern1.68em {F}_{ST}^{R\left[k\right]}=\frac{N^k}{D^k}\\ {}{N}^k={\left(\frac{a_1}{m_1}-\frac{a_2}{m_2}\right)}^2-\frac{h_1}{m_1}-\frac{h_2}{m_2}\\ {}\kern0.84em {D}^k={N}^k+{h}_1+{h}_2,\end{array}$$
where a1 and a2 are the specific allele counts and m1 and m2 are the total allele counts of the marker in two population. Heterozygosities of the markers are 2 h1 and 2 h2, with \({\mathrm{h}}_1=\frac{{\mathrm{a}}_1\left({\mathrm{n}}_1-{\mathrm{a}}_1\right)}{{\mathrm{n}}_1\left({\mathrm{n}}_1-1\right)}\), \({\mathrm{h}}_2=\frac{{\mathrm{a}}_2\left({\mathrm{n}}_2-2\right)}{{\mathrm{n}}_2\left({\mathrm{n}}_2-1\right)}\) respectively. Let n1 and n2 be the numbers of individuals genotyped in the two populations at the k
th marker. The allele counts a1 and a2 and the total allele counts m1 and m2 can be determined as a1 = 2u1 + v1, a2 = 2u2 + v2, m1 = 2n1, m2 = 2n2, where u1 and v1, and u2 and v2 are the counts of homozygotes and heterozygotes in the first, and in the second population respectively. Now if there are N markers genotyped in each population, an unbiased estimator of FST can be defined as
$${\mathrm{F}}_{\mathrm{ST}}^{\mathrm{R}}=\frac{{\displaystyle {\sum}_{\mathrm{k}=1}^{\mathrm{N}}}{\mathrm{N}}^{\mathrm{k}}}{{\displaystyle {\sum}_{\mathrm{k}=1}^{\mathrm{N}}}{\mathrm{D}}^{\mathrm{k}}}$$
In order to compare the relative performance of different F
ST
-related measures in predicting imputation accuracy, we also computed the original and modified estimators of F
ST
(denoted by F
WC
ST
and F
mWC
ST
) proposed by Weir and Cockerham [11]. F
WC
ST
between two population was calculated as follows
$${F}_{ST}^{WC}=\frac{{\displaystyle {\sum}_{k=1}^N}{N}^k}{{\displaystyle {\sum}_{k=1}^N}{D}^k}$$
where
$$\begin{array}{c}\hfill {N}^k={s}^2-\frac{1}{2\overline{n}-1}\left[\overline{p}\ \left(1-\overline{p}\right) - \frac{\ {s}^2}{2}-\frac{\overline{h}}{4}\right]\hfill \\ {}\hfill {D}^k=\overline{p}\ \left(1-\overline{p}\right) + \frac{\ {s}^2}{2}\hfill \\ {}\hfill \begin{array}{l}{s}^2=\frac{n_1\left({p}_1^k-\overline{p}\right)+{n}_2\left({p}_2^k-\overline{p}\right)}{\overline{n}},\overline{p}=\frac{\left({p}_1^k+{p}_2^k\right)}{2}\\ {}\\ {}\overline{h}=\frac{2{n}_1{p}_1^k\left(1-{p}_1^k\right)+2{n}_2{p}_2^k\left(1-{p}_2^k\right)}{2\overline{n}},\overline{n}=\frac{\left({n}_1+{n}_2\right)}{2}\end{array}\hfill \end{array}$$
A modified estimator F
mWC
ST
of Weir and Cockerham’s F
ST
is defined as follows [16]:
$$\begin{array}{c}\hfill {F}_{ST}^{mWC}=\frac{{\displaystyle {\sum}_{k=1}^N}{N}^k}{{\displaystyle {\sum}_{k=1}^N}{D}^k}\hfill \\ {}\hfill {N}^k={\left({p}_1^k-{p}_2^k\right)}^2\hfill \\ {}\hfill {D}^k=\left({p}_1^k-{p}_2^k\right)+\frac{2}{n_1+{n}_2}\left[{n}_1{p}_1^k\left(1-{p}_1^k\right)+{n}_2{p}_2^k\left(1-{p}_2^k\right)\right]\hfill \end{array}$$
These formula of F
WC
ST
and F
mWC
ST
are also implemented in fcGENE. While computing Weir and Cockerham F
ST
and Nei’s G
ST
, we used the same reference alleles throughout all populations.
In previous studies [20], F
ST
was computed for individual SNPs and then averaged across SNPs. However these F
ST
estimators does not account for haplotype diversity very well [34]. Therefore, in our formula all quantities were averaged over all SNPs first, and then, F
ST
is calculated. This estimate is more precise, i.e. results in smaller standard errors as pointed out in [17, 35].
Correlation statistics
Calculation of imputation quality scores is based on GQ SNPs masked prior to imputation. After calculation of G
ST
and other F
ST
related measures between POPRES populations and the four reference panels considered, we compared population distances with corresponding imputation accuracy scores. Different scatter plots were generated using the R-package ‘ggplot2’ [36] allowing to construct smoothed curves of non-linear relationships. To determine the correlation among G
ST
and other F
ST
-related measures, we used Kendall’s rank correlation coefficient (Kendall’s tau coefficient) [37], which measures the similarity of the ordering of the data to be compared.