On rare variants in principal component analysis of population stratification

Background Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. Principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, it is still unclear about the analysis performance when rare variants are used. Results We derive a mathematical expectation of the genetic relationship matrix. Variance and covariance elements of the expected matrix depend explicitly on allele frequencies of the genetic markers used in the PCA analysis. We show that inter-population variance is solely contained in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the samples. We propose FPC, ratio of the inter-population variance to the intra-population variance in the K population informative PCs, and d2, sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio FPC abates, the population distance d2 decreases, and portion of variance explained by the K PCs diminishes. The results are validated in the analysis of the 1000 Genomes Project data. The ratio FPC is 93.85, population distance d2 is 444.38, and variance explained by the largest five PCs is 17.09% when using with common variants with allele frequencies between 0.4 and 0.5. However, the ratio, distance and percentage decrease to 1.83, 17.83 and 0.74%, respectively, with rare variants of frequencies between 0.0001 and 0.01. Conclusions The PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to only the common variants when analyzing population stratification with sequencing data.


Background
Genome-wide association studies (GWAS) [1] have identified a considerable number of sequence variants, such as single nucleotide polymorphisms (SNPs), associated with human diseases or traits. Population stratificationallele frequencies of genetic markers of the studied samples having significant differences owing to systematic ancestry differences-can cause false positive results in case-control as well as cohort studies [2,3]. Association mapping based on rare variants are much more susceptible to subtle effects of population stratification and therefore, more likely to yield false positive results [4]. From a population genetics point of view, exploring population structure is important for understanding the evolutionary history of populations. Many methods and software are available to study the population stratification, such as the principal component analysis (PCA) implemented in EIGENSOFT [5,6], the multidimensional scaling analysis in PLINK [7], the clustering analysis in STRUCTURE [8,9], and fastSTRUCTURE [10].
Recently, controlling population stratification in the association analysis using linear mixed models [11][12][13][14] was also suggested. Based on a large number of common variants whose minor allele frequencies (MAFs) are larger than 5%, the PCA of population structure is widely applied in GWAS.
With the advance of high-throughput sequencing technology, as well as the enormous reduction of the cost, it is capable and affordable in genetic studies to detect additional low-frequency and rare variants (MAF < 1%) [15]. Furthermore, existing sequencing data suggest that the vast majority of rare variants are populationspecific. In the 1000 Genomes Project data [16,17], there are a total of 77 million biallelic SNPs, among which 65 million are rare and 52 million are polymorphic in one of the five continental ancestry populations: East Asian (EAS), South Asian (SAS), African (AFR), European (EUR), American (AMR). It seems that rare variants are more informative in distinguishing population structure than common ones. However, the efficacy of using rare variants in population stratification analysis remains controversial [18][19][20][21].
A number of efforts have been made concerning the use of low-frequency and rare variants in population stratification analysis. Baye et al. illustrated that more fine substructure can be detected using rare variants and suggested that more SNPs are required to account for a similar level of population structure using rare variants compared to common ones [18]. Siu et al. showed that rare variants have a much higher power to identify population substructure than common variants [19]. In contrast, Zhang et al. demonstrated that PCAs based on common and less-frequency SNPs perform better than those based on rare ones in separating European and African samples [20]. The authors further concluded that there is little added value for PCA of population stratification with rare variants only [21]. All existing work was based on analysis of genotype data from the 1000 Genomes Project with known population structure.
In this work, we investigate how rare variants affect PCA of population stratification from a theoretical perspective. We derive mathematical expectation of the genetic relationship matrix (GRM) [22]. The GRM is commonly computed from the observed genotypes and eigen-decomposed in the analysis of population stratification. Elements of the expected genetic relationship matrix (EGRM), however, depend explicitly on the allele frequencies of the markers used. We show that interpopulation variance is solely contained in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the sample. We propose F PC , ratio of the inter-population variance to the intra-population variance in the K population informative PCs, and d 2 , sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio F PC abates, the population distance d 2 decreases, and portion of variance explained by the K PCs diminishes. Therefore, the PCA of population stratification performs worse with rare variants than with common ones. The results are further validated in the analysis of the 1000 Genomes Project data with 2504 individuals from five continental populations.

Genetic relationship matrix
In the scenario where genotype data of individuals is sampled from K populations, there are N k individuals in population k and the number of individuals in the total population is N = N 1 + N 2 + ⋯ + N K . We have M SNPs, whose frequencies of their coded alleles in population k are [ f k1 , f k2 , ⋯, f kM ]. Let X be the genotype matrix of dimension N × M. The genotypic value X(n,m) is the number of the coded allele of SNP m for individual n, where n = 1, 2, ⋯, N and m = 1, 2, ⋯, M. Typically, the number of individuals is less than the number of markers, i.e. N < M. We assume that all SNPs are under the Hardy-Weinberg equilibrium in each population. The GRM can be calculated as where each entry of Y is a normalized version of the coded genotype in X, i.e.
for n = 1, 2, ⋯, N and m = 1, 2, ⋯, M. Here, μ m and σ m denote the genotypic mean and standard deviation of SNP m in the total population, respectively. It can be shown that μ m and σ m relate to the population structure and allele frequencies as follows (Supplemental Text S1) The coded-allele frequency of SNP m in the total population can be found as f m = μ m /2, where m = 1, 2, ⋯, M. The GRM is of dimension N × N, whose diagonal elements are genotypic variance of individuals and offdiagonal elements are genotypic covariance between two individuals. It should be noted that genotypes follow mixed binomial distributions, and elements of Z are sample variances and covariance computed from the genotype data. The PCA analysis of population stratification is based on the eigen-analysis of the observed GRM Z.
In practice, μ m and σ m are unknown, and therefore sample meanμ m and sample standard deviationσ m or some other quantities similar are used instead. Usually, μ m ¼ 2f m is used for the centralization in (2). In EIGENSOFT, is adopted for the normalization in (2), while is employed in GCTA [22]. Different estimates of the allele frequency f m were suggested as well [5,6]. In the following theoretical derivations, we assume that μ m and σ m are known for the sake of simplicity. This will bring about clear connections between variance and covariance elements of the EGRM and allele frequencies of the SNPs used in the analysis. The connections further provide clues and insights for understanding the effect of rare variants on the PCA of population stratification.

Expected genetic relationship matrix
We assume that all individuals are unrelated. When the number of markers M goes large, the sample variance and covariance elements in the GRM will converge to their mathematical expectations in probability due to the law of large numbers. We denote the EGRM as Z, which is the expectation of the GRM Z. Without loss of generality, we assume that individuals are ordered according to their population memberships. As such, the first N 1 rows and columns correspond to individuals from population 1, the next N 2 rows and columns are from population 2, and so on. Thus, the EGRM can be partitioned into a block matrix consisting of K × K submatrices Diagonal sub-matrices of the EGRM Z have the following structure Here, diagonal elements of the submatrix Z kk are of the mathematical form which is the genotypic variance of individuals from population k. All off-diagonal elements share the form which is the genotypic covariance between two individuals from population k.
The off-diagonal sub-matrices of the EGRM Z are structured as follows Elements of Z kl share the value which is the genotypic covariance between one individual from population k and one from population l. Details of the derivations are presented in Supplemental Text S2.
The EGRM Z, the mathematical expectation of GRM Z, depends only on the population sizes N 1 , N 2 , ⋯, N K and allele frequencies of the M SNPs in K populations [ f k1 , f k2 , ⋯, f kM ], k = 1, 2, ⋯, K. Here, we treat the SNP number M and the allele frequencies as fixed numbers. A theoretical formulation of the PCA that considers genotypic values as a random vector and allele frequencies in different populations being random was proposed in Ma and Amos, 2010 [23]. Based on different assumptions, a genotypic variance-covariance matrix of the same structure was attained; nevertheless, elements of the EGRM are different from those of the variancecovariance matrix in [23].

Rare variants on the eigenvalues
Carrying out eigen-decomposition on the EGRM, it can be shown that there are N k − 1 eigenvalues of value z k − z kk , for k = 1, 2, ⋯, K. Here, The sum of the N − K eigenvalues is Clearly, variations in the N − K PCs are entirely intrapopulation variance of the K populations. The sum of the other K eigenvalues is where represents the inter-population variance component and stands for the intra-population variance component.
Here, the intra-population covariance z kk of the EGRM factor in the K PCs as the inter-population variance after the eigen-decomposition. Note that all inter-population variance is solely contained in the K PCs. Hence, it is sufficient to conduct the population stratification analysis based on the K PCs alone. Given a set of SNPs, the divergence among the K populations can be measured by the ratio of the two variance components, i.e.
Its normalized version can be defined as which measures the portion of inter-population variance in the K population informative PCs and takes a value between 0 and 1. The larger the F PC and F Ã PC are, the more divergence among the populations.
are quadratic functions of f km , k = 1, 2, ⋯, K, m = 1, 2, ⋯, M. However, terms in σ 2 W are linear and quadratic functions of the frequencies. Therefore, F PC will decrease when frequencies of the coded alleles become smaller, see Supplemental Text S3 for more details. As a result, instead of improving the population stratification analysis, using rare variants will deteriorate the analysis performance. Meanwhile, since σ 2 B decreases faster than σ 2 W , the K eigenvalues will be closer to the other N − K eigenvalues when frequencies of the coded alleles become smaller. Thus, the percent of variance explained by the K PCs will be smaller. It can be shown that among the K eigenvalues, K − 1 are of large values and one small. When intrapopulation variance z k − z kk of the K populations are equal, all inter-population variance is contained in the largest K − 1 eigenvalues. In addition, when the sample size is large and the portions of populations remain, inter-population variance contained in the small eigenvalue is negligible, almost all information on the population structure is contained in the largest K − 1 PCs.
For cases with two populations, it can be shown that the two eigenvalues are When inter-population variance of the two populations are equal, i.e. z 1 −z 11 That is, all information on the population structure is contained in the largest PC. All proofs are presented in Supplemental Text S4.

Rare variants on the population distance
Suppose that x k , k = 1, 2, ⋯, K are the eigenvectors associated with the K eigenvalues containing interpopulation variance. We can represent each individual as a point in the K-dimension space. Vector ffiffiffiffi ffi λ k p x k consists of coordinates of N individuals in the k-th dimension. Average value ffiffiffiffi ffi λ k p x T k 1 N =N represents center of the total population in the k-th dimension, where 1 N is a column vector of dimension N and with each element as 1. Due to the structure of Z, individuals from the same population share the same coordinates in the K-dimension space, and the common points denote the representative points of the populations, or centers of the populations [23]. We define which measures the population divergence as the sum of squared distances among populations. The proof is shown in Supplemental Text S5. Here, the second term in (15) is an average intra-population variance. As explained earlier that when allele frequencies become smaller, the K eigenvalues decrease. Hence, the population distance d 2 is smaller when using rare SNPs compared to common ones.  [24]. The SNPs were divided into six frequency bins according to their MAFs, as shown in Table 1. For each bin, we randomly sampled approximately 100,000 SNPs using PLINK for the population stratification analyses. Here, MAFs of the SNPs in the total population were used, hence their frequencies in the five populations may be different and may not be in the same bins as in the total population. For SNPs in bin 5 and 6, they are polymorphic in the total population and may not be polymorphic in all of the five populations. PCAs were carried out, with GRMs computed by EIGENSOFT and PCAs on EGRMs conducted using GCTA. Default parameters were used when analyzing with EIGENSOFT, which excluded 68 and 116 outliers in the analyses of the data from frequency bin 5 and 6, respectively.

Theoretical and empirical EGRMs
To calculate the theoretical results (5)-(10), we computed MAFs of the SNPs with PLINK. Values of variance z k and covariance z kk , z kl , k, l = 1, 2, ⋯K, were calculated as in (7), (8), and (10), respectively, in which μ m was computed with (3) and σ 2 m ¼ 2 f m ð1−f m Þ, m = 1, 2, ⋯M. Values of z k and z kk for the five populations with SNPs from the six frequency bins are presented in Table 2. Absolute values of the inter-population covariance z kl are much smaller and the results are shown in Supplemental Tables S1-6.
To obtain the empirical values of variance z k , as well as covariance z kk and z kl , we first computed GRMs with SNPs from the six bins using EIGEN-SOFT. Each GRM included N(N + 1)/2 variance and covariance terms of N individuals based on the observed genotype data. Empirical value of z k was computed as the average variance of the N k individuals from population k. The empirical value of z kk is the average covariance of N k (N k − 1)/2 pairs of individuals from population k. Lastly, the value of z kl is the average covariance of N k N l pairs of individuals, one from population k and one from population l. The results of z k and z kk are shown in Table 2, and those of z kl are presented in Supplemental Tables S1-6.
We can see that across the six frequency bins, theoretical values of z k , z kk , and z kl predicted by (7), (8), and (10), respectively, are close to their empirical values. When MAFs of the SNPs become smaller, intra-population covariance z kk decreases. For example, z kk was 0.2 for EAS with SNPs whose MAFs are between 0.4 and 0.5, which reduced to 0.003 in where empirical values of z k and z kk were used. The F PC decreases from 93.85 in bin 1 to 55.01 in bin 5, and further to 1.83 in bin 6. Thus the divergence among the populations is much larger when measured by common SNPs than by rare ones.

PCAs of the 1000 genomes project data
With genotypes of SNPs from each frequency bin, we carried out PCAs of population stratification by EIGEN-SOFT, which was essentially based on the eigen-analysis of the observed GRMs. Scatter plots of the largest three PCs are shown in Figs. 1 and 2, where eigenvectors were scaled by square roots of their corresponding eigenvalues. From Figs. 1 and 2, we can see patterns of population structure computed with common and lessfrequency SNPs. For example, Figs. 1a-e and 2a-e displayed similar patterns, whereas the scatter plots based on rare SNPs differed significantly. For example, AMR and SAS are separated mostly by the third PC with common SNPs, while they are distinguished by the second PC with rare ones. The third PC from rare SNPs reveals mostly substructure of AFR, likely because more rare SNPs are polymorphic in AFR than in other populations. Portions of variance explained by the largest five PCs decrease from 17.09% in bin 1 to 10.41% in bin 5, and it falls dramatically to 0.74% with rare SNPs only. As a result, the five populations are more closely distributed around the origin in Figs. 1f and 2f, compared with those in Figs. 1a-e and 2a-e. Clearly, common variants show much better performance in dissecting the population structure than rare variants do.

PCAs of EGRMs
For each frequency bin, we also constructed a EGRM with structure as described in (5), (6), and (9), whose variance and covariance elements were their theoretical values calculated by (7), (8), and (10), respectively. We conducted PCAs of the EGRMs using GCTA, and scatter plots of the largest three PCs shown in Figs. 1 and 2. Large symbols in black are representative points or centers of the five continental populations from eigenanalyses of the EGRMs. Similarly, coordinates were scaled by square roots of their eigenvalues.
Upon comparing the representative points in Figs. 1 and 2, we can see that distances between populations decrease as the SNPs change from common to rare. Sum of the squared distance d 2 was calculated for the six frequency bins by (15), where λ k , k = 1, 2, ⋯K were the eigenvalues of the EGRM Z and z k , z kk , k = 1, 2, ⋯K were their theoretical values. The d 2 decreases from 444.38 in bin 1 to 254.10 in bin 5, and further to 17.83 in bin 6.
In addition, when portions of variance explained by the PCs become small, deviations between the representative points of the populations and true centers of the populations can be observed. This is particularly evident in the scatter plots with rare SNPs. In the PCAs of a single population, such deviations are more obvious when percents of variance explained by the largest PCs are much smaller.

Discussion
We showed that all information about the population structure is contained in K PCs. Genotypic variance explained by the K PCs can be further decomposed into the intra-population variance σ 2 W and interpopulation variance σ 2 B . Using more SNPs will improve convergence of the realized GRM to its mathematical expectation, i.e. the EGRM. As a result, individuals belonging to the same population will be more closely distributed around the representative point or center of the population on the PC-PC plots. On the other hand, note that σ 2 B is the average interpopulation variance contributed from M SNPs. When rare variants are used, adding more SNPs will not increase the average level of σ 2 B , hence neither the ratio F PC nor the sum of squared distances d 2 will improve. For same reason, using a combination of common and rare SNPs will result in lower F PC and d 2 compared with using common SNPs only and therefore result in worse performance.
In the case where there is one SNP, our F PC and F Ã PC can be further reduced to where f k is the allele frequency in the population k, and f the frequency in the total population. The classical Wright's fixation index F ST is widely used to gauge population stratification [25], which measures the deviation from Hardy-Weinberg equilibrium at the total population level. In this case, it can be shown that Therefore, our F Ã PC is much larger than F ST. It is worth pointing out that we assign numeric values to genotypes as numbers of the coded alleles, hence our results are dependent on such coding scheme. F ST , however, does not involve in the numeric coding of genotypes. Note also that F Ã PC measures the portion of inter-population variance in the K population informative PCs. After the eigen-decomposition, most of the intra-population variance associated with the other N-K PCs was excluded. If our F PC were defined as the ratio of the inter-population variance to the intra-population variance in the N PCs, it would be related to F ST as F PC = 2F ST /(1 − F ST ).
In GWAS, it is a common practice to conduct population stratification analyses using a large number of random markers [26], which usually yields satisfactory results. As shown in this work, the capability of dissecting population structure depends on the allele frequencies of markers used in the analyses, and common variants perform much better than rare ones. This is not much of a concern for GWAS because SNP panels implemented in the genotyping platforms are mostly of common ones. In sequencing studies, however, the majority of the called variants are rare, and selecting SNPs randomly will yield a large portion of rare SNPs, which will deteriorate the analysis performance. Therefore, it is necessary to restrict the selection to only the common SNPs when analyzing population stratification with sequencing data. This would also be true for controlling population stratification based on the linear mixed models [11][12][13][14].
In this work, we assumed that μ m and σ m are known constants in (2) in order to simplify the theoretical derivations. Our results are approximations of those when estimates of the two quantities are used. When sample size N is large, variations associated withμ m andσ m are much smaller than those with the genotype data. Therefore, the mathematical expectations are largely taken with respect to the genotypes and difference between the two sets of results would be small. As shown in Table 2, the predicted values of the EGRM are close to their empirical values in the 1000 Genome Project data. We carried out additional simulation studies to evaluate the effect of lacking knowledge on μ m and σ m . We randomly chose one SNP from each of the six frequency bins (Supplemental Table S7). Based on their MAFs observed in the five populations of the 1000 Genomes Project, we simulated genotypes of five populations each with 500 individuals. Values of μ m and σ m were computed with the assumption of known population structure and MAF information, and theoretical values of z k and z kk were then calculated. For comparison, we first estimatedμ m andσ m from the simulated genotype data. Y(n, m) were normalized withμ m andσ m , and z k and z kk were obtained as averages of sample variance and covariance from 1000 replicates. The two sets of results are presented in Supplemental Table S8 and the differences between the two sets of results are negligible except for small differences in the results with the rare SNP.
Inferring population structure based on a large number of genome-wide markers are likely to include markers in linkage disequilibrium (LD). Practical concerns on the LD and choice of markers were extensively discussed in [5]. It is worth noting that each marker contributes to the elements in GRM additively, see eqs. S1-3 in Supplemental Text S2. Because of the linearity of expectation, our EGRM formulae as well as the eigen-analysis on the EGRM still hold when LD exists among markers. When the number of markers goes large, convergence of the GRM to EGRM will be slower with LD among markers, compared with the case that independent markers are used. Since there are always limited number of markers in the PCA practice, our EGRM and the eigen-analysis on it represent asymptotic results of the real PCA analysis.
Despite the fact that the vast majority of rare variants are population-specific, we showed that performance of the PCA of population stratification is better when based on common SNPs rather than rare ones. On the other hand, the PCA results with rare SNPs do reveal a population structure that differs from that of common SNPs. Existing methods may not exploit ancestry information embedded in the rare variants efficiently, and different approaches from those applied to common variants should be developed [26].

Conclusions
To quantify population divergence as a function of allele frequencies of genetic markers used in the PCA analysis, we derived the expected genetic relationship matrix. We proposed F PC , ratio of the inter-population variance to the intra-population variance, and population distance d 2 as measures of population divergence. Our theoretical results as well as the analyses of the 1000 Genomes Project data showed that employing rare variants yielded smaller F PC in the K population informative PCs, smaller d 2 , and smaller portion of variance explained by the K PCs than those using common variants. Therefore, the PCA of population stratification performs worse with rare variants than with common ones. When analyzing population stratification with sequencing data, it is necessary to restrict the selection to only the common variants.
Additional file 1 Text S1. Proof of eq. (4). Text S2. Proofs of eqs. (5)(6)(7)(8)(9)(10). Text S3. F PC as a function of allele frequencies. Text S4. Proofs of rare variants on the eigenvalues. Text S5. Proof of eq. (15). Table S1. Theoretical and empirical values of the inter-population covariance of EGRM (0.4 < MAF ≤ 0.5). Table S2. Theoretical and empirical values of the inter-population covariance of EGRM (0.3 < MAF ≤ 0.4). Table S3. Theoretical and empirical values of the inter-population covariance of EGRM (0.2 < MAF ≤ 0.3). Table S4. Theoretical and empirical values of the interpopulation covariance of EGRM (0.1 < MAF ≤ 0.2). Table S5. Theoretical and empirical values of the inter-population covariance of EGRM (0.01 < MAF ≤ 0.1). Table S6. Theoretical and empirical values of the interpopulation covariance of EGRM (0.0001 < MAF ≤ 0.01). Table S7. MAFs of the six SNPs used in the simulations. Table S8. Expected variance and covariance with and without the knowledge of μ m and σ m . Authors' contributions SM: conceived the concept, conducted the analyses, and drafted the manuscript. GS: conceived the concept, supervised the work, reviewed and revised the manuscript. All authors have read and approved the manuscript.

Funding
This work was supported by the national Thousand Youth Talents Plan. The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.

Availability of data and materials
The datasets analyzed during the current study are available at https://www. internationalgenome.org. The accession number at https://www.ebi.ac.uk/ ena is PRJNA262923.
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.