  • Research article
  • Open access

Population size influences the type of nucleotide variations in humans

Abstract

Background

It is well known that the effective size of a population (Ne) is one of the major determinants of the amount of genetic variation within the population. However, it is unclear whether the types of genetic variations are also dictated by the effective population size. To examine this, we obtained whole genome data from over 100 populations of the world and investigated the patterns of mutational changes.

Results

Our results revealed that for low frequency variants, the ratio of AT→GC to GC→AT variants (β) was similar across populations, suggesting that the pattern of mutation is similar in various populations. However, for high frequency variants, β showed a positive correlation with the effective population size of the populations. This suggests a much higher proportion of high frequency AT→GC variants in large populations (e.g. Africans) compared to those with small population sizes (e.g. Asians). These results imply that the substitution patterns vary significantly between populations. These findings could be explained by the effect of GC-biased gene conversion (gBGC), which favors the fixation of G/C over A/T variants in populations. In large populations, gBGC causes high β. However, in small populations, genetic drift reduces the effect of gBGC, resulting in reduced β. This was further confirmed by a positive relationship between Ne and β for homozygous variants.

Conclusions

Our results highlight the huge variation in the types of homozygous and high frequency polymorphisms between world populations. We observed the same pattern for deleterious variants, implying that the homozygous polymorphisms associated with recessive genetic diseases will be more enriched with G or C in populations with large Ne (e.g. Africans) than in populations with small Ne (e.g. Europeans).

Background

The out of Africa hypothesis predicts that the ancestors of the human populations around the world originated in Africa, migrated out of the continent and eventually colonized different parts of the world [1]. During this process, the ancestral populations underwent a series of population bottlenecks along the migratory routes. Due to this founder effect, the ancestral population size is expected to decline with increasing distance from Africa. Previous empirical studies confirmed this prediction and showed that populations in Africa are the most genetically diverse and that the diversity declined with increasing geographic distance from Africa particularly along the colonization routes [2,3,4,5,6]. These observations clearly suggest significant variation in the nucleotide diversity among global populations.

Nucleotide diversity (π = 4Neμ) is a measure of genetic variation, which is determined by mutation rate (μ) and effective population size (Ne). Since the mutation rate is similar across human populations, the observed difference in the diversity of world populations is largely due to the variations in effective population sizes. Although a recent study suggested a higher rate of mutation in non-Africans, the magnitude of this effect was very small (~ 5%) [7]. Recent population genomic studies showed large variation in the number of polymorphisms observed between world populations [8]. Populations in Africa have ~ 5 million Single Nucleotide Variations (SNVs) whereas those in East Asia have ~ 4.1 million, which is ~ 20% less. Although the variation in the number of polymorphisms is well known, it is unclear if there are differences in the types of polymorphisms between world populations. Are the frequencies of different types of nucleotide changes (e.g. A→G or T→C) similar across populations? This question arises from our understanding of the phenomenon of GC-biased gene conversion (gBGC).

gBGC is a recombination-associated process that favors G/C over A/T nucleotides during the repair of mismatches that occur in heteroduplex DNA during meiosis [9,10,11,12,13]. Although this process is not associated with natural selection, its efficiency, like that of selection, can be reduced by genetic drift. Therefore, the effect of gBGC is expected to be much weaker in small populations than in large populations [12, 13] and consequently, the frequencies of AT→GC polymorphisms are expected to differ among world populations. It is important to characterize the population-specific patterns of genetic variants, as these patterns may have important implications for human health.

For instance, previous studies have shown a positive correlation between the number of deleterious homozygous SNVs present in human populations and their distance from East Africa [14]. Furthermore, non-Africans were found to have a much higher proportion of high frequency or homozygous deleterious variants than Africans [15, 16]. These observations suggest that due to the effects of genetic drift, small populations (typically located away from Africa) have a higher fraction of high frequency (and homozygous) deleterious mutations than large populations. On the other hand, gBGC-mediated skews in the frequencies of deleterious AT→GC (relative to GC→AT) polymorphisms were also reported [17, 18]. However, it is unclear whether the extent of such skews is influenced by the effective population sizes of various global populations. Therefore, using data from the 1000 Genomes Project, we investigated the pattern of nucleotide changes observed in the SNVs segregating at different allele frequencies [8]. Furthermore, we also analyzed homozygous and heterozygous variant data for 126 distinct populations from around the world, obtained from the Simons Genome Diversity Project [7].

Results

Allele frequencies and types of nucleotide changes in human populations

To quantify the difference in the patterns of observed AT→GC and GC→AT changes we derived a measure β, based on the Watterson estimator (θW) as described in the methods (Eq. 1). The measure β is the ratio of the AT→GC (μAT→GC) and GC→AT (μGC→AT) mutation rates, which captures the mutational equilibrium between AT and GC nucleotides. The ratio β is expected to be 1 if the observed AT→GC and GC→AT changes are due to the forward and reverse mutation rates alone. Any deviation from this ratio (β = 1) suggests a bias in the substitution of one type over the other. To examine the variation in the patterns of nucleotide changes we obtained the 1000 Genomes phase II data for 26 distinct populations of the world. Since most of the genomes from Latin America were admixed with Europeans/Africans, we separated 20 Peruvian genomes that had < 0.5% admixture. We used these to represent an un-admixed Native American population and hence the total number of populations analyzed was 27. The SNVs of these populations were grouped into eleven categories based on their Derived Allele Frequencies (DAF). The ratio β was estimated for each category of SNVs belonging to each population. We estimated the effective size (Ne) of each population based on the mutation rate (μ) and nucleotide diversity (π) (see methods).

Figure 1 shows the relationship between Ne and β for SNVs belonging to the two extreme allele frequency categories. For DAF < 0.025, the estimates of β were almost equal to 1 for all populations. In contrast, we observed a significant positive correlation (P < 10⁻⁶) for SNVs with very high DAF (> 0.9). Another major pattern was that the β estimates were similar among populations belonging to the same geographical location and very different among populations from distinct locations. However, this was not true for admixed Americans (black dots). We then examined this relationship for SNVs with different derived allele frequencies. We did not find any significant relationship between Ne and β for low frequency SNVs with DAF ≤ 0.2 (P > 0.10) (Fig. 2a). However, significant positive correlations (at least P < 10⁻²) were observed between Ne and β for SNVs with DAF > 0.2. The magnitude of the correlation increased with the increase in DAF, which is evident from the rise in the slopes of the regression lines. This increase is clearer in Fig. 2b, which shows the positive relationship between DAF and the slopes of the regression lines shown in Fig. 2a. The slope observed for SNVs with DAF > 0.9 was 6 × 10⁻⁵, which is almost 15 times higher than for SNVs with DAF = 0.2–0.3 (4 × 10⁻⁶). For high frequency SNVs (DAF > 0.9), the mean β estimated for African genomes (1.72) was 27% higher than that observed for Peruvian genomes (1.35).
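As an illustration of how correlations and regression slopes of this kind can be computed, the following Python sketch implements a Spearman rank correlation and an ordinary least-squares slope. The function names and the toy values are hypothetical and are not taken from the study's data or scripts; P values are not computed here.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ols_slope(x, y):
    """Slope of the least-squares regression line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical (Ne, beta) pairs, for illustration only.
toy_ne = [8000, 10000, 12000, 16000]
toy_beta = [1.35, 1.45, 1.50, 1.72]
rho = spearman_rho(toy_ne, toy_beta)
slope = ols_slope(toy_ne, toy_beta)
```

In this sketch the slope is expressed in units of β per unit of Ne, the same units as the regression slopes quoted above.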

Fig. 1

Relationship between the effective population size (Ne) and the ratio of AT→GC to GC→AT (β) changes. The genotype data was obtained for 27 populations of the world. The colors indicate different geographical locations of the populations. The relationship was not significant for SNVs with DAF < 0.025 (P = 0.11, using the Spearman rank correlation) but highly significant for those with DAF > 0.9 (P < 10⁻⁶). Nucleotide substitutions (SNVs with DAF = 1) were excluded.

Fig. 2

(a) The correlation between Ne and β observed for SNVs belonging to eleven allele frequency categories with DAF: < 0.025, 0.025–0.1, 0.1–0.2, 0.2–0.3, 0.3–0.4, 0.4–0.5, 0.5–0.6, 0.6–0.7, 0.7–0.8, 0.8–0.9 and > 0.9. Nucleotide substitutions (SNVs with DAF = 1) were excluded. Only the upper values of the ranges are shown on the X-axis. The correlations observed for SNVs with DAF ≤ 0.2 were not statistically significant (P > 0.10) and the remaining relationships were highly significant (at least P < 10⁻²). (b) Scatter plot showing the magnitude of the relationship shown in Fig. 2a for different allele frequency categories. The X-axis shows the mid-values of the derived allele frequency categories and the slopes of the regression lines of Fig. 2a are shown on the Y-axis. The slopes of the lines for SNVs with DAF ≤ 0.2 are not included as those relationships were not statistically significant.

We also plotted β against DAF to show how this estimate changes with increasing DAF. For this purpose, we selected five representative populations with significantly different Ne. As shown in Fig. 3, positive relationships between DAF and β were observed for all populations and β increased with increasing DAF. However, based on the slopes of the regression lines (see Fig. 3 legend), the rate of increase correlates with Ne. The slope was the highest for Africans (0.55), intermediate for Eurasians (0.34–0.43) and the lowest for Peruvians (0.28).

Fig. 3

The magnitude of β with respect to the derived allele frequencies. The correlations between DAF and β were highly significant (at least P < 0.0017). Although all populations show an increasing trend, the rate of increase varies between them, which is reflected in the slopes of the regression lines (shown in the figure legend).

Patterns of homozygous and heterozygous variations

To further examine the patterns of substitutions in a much larger number of populations we obtained the genotype data of the Simons Genome Diversity Project for genomes representing 126 distinct ethnic populations. We examined the patterns of nucleotide changes for homozygous and heterozygous SNVs. For each genome, we estimated β for homozygous and heterozygous SNVs. The nucleotide diversity was estimated by comparing the pairs of chromosomes of each genome and Ne was calculated using the mutation rate obtained from previous studies (see methods). We then plotted β against Ne. For homozygous SNVs, our regression analysis produced a significant positive correlation (P < 10⁻⁶) between the two variables (Fig. 4a). Although the effective population sizes vary widely between world populations, they are roughly similar for the populations in a geographical location. Interestingly, the β values for homozygous SNVs also varied considerably between regions but were similar for populations within a geographical region. This is clear from the mean estimates shown in the inset of Fig. 4a. For consistency, we also examined this relationship using the 1000 Genomes phase II data and found a similar highly significant relationship between Ne and β for homozygous SNVs (P < 10⁻⁶) (Additional file 1: Figure S1). On the other hand, β for heterozygous SNVs did not show any significant relationship (P = 0.36) with Ne (Fig. 4b).

Fig. 4

The relationship between effective population size (Ne) and the ratio of AT→GC to GC→AT (β) estimated for homozygous (a) and heterozygous (b) SNVs present in individual genomes belonging to 126 distinct populations of the world. The data was obtained from the Simons Genome Diversity Project. The correlation for homozygous SNVs was highly significant (P < 10⁻⁶) but not statistically significant for heterozygous SNVs (P = 0.36). The colors represent populations from distinct geographical locations. The relationship shown in the inset was based on the estimates of Ne and β averaged for the populations belonging to a specific geographical location. The relationship shown in the inset was also significant (P < 0.026).

Proportion of deleterious AT→GC and GC → AT changes in global populations

To understand the implications of the observed patterns for human health, we examined the patterns of nucleotide changes for deleterious SNVs. To determine the deleteriousness of the SNVs we used Combined Annotation-Dependent Depletion (CADD). This method integrates over 60 diverse annotations (to measure the extent of deleteriousness of a variant) into a single measure (C score) [19]. We designated SNVs with a C score of > 15 as deleterious. To distinguish the frequency of different types of nucleotide changes we estimated the proportion of deleterious AT→GC (PGC) and GC→AT (PAT) changes using eqs. 3 and 4 (methods). These estimates were plotted against Ne. For deleterious SNVs with DAF > 0.9 we found highly significant relationships between Ne and PGC (P < 10⁻⁶) and between Ne and PAT (P < 10⁻⁶) (Fig. 5a). Importantly, for large populations (Africans) the proportion of deleterious AT→GC mutations was 54% higher than the proportion of GC→AT mutations. For small populations (Peruvians) this difference was only 26%. We performed a similar analysis using the homozygous SNVs from 126 world populations and obtained similar results (Fig. 5b). The difference between PGC and PAT was highest for Africans (51%) and lowest (22%) for Native Americans.

Fig. 5

Variations in the proportions of deleterious AT→GC and GC→AT SNVs among world populations. These proportions correlate significantly with the effective population sizes of the populations. The trends were positive for AT→GC SNVs and negative for GC→AT SNVs. The relationships were based on (a) high frequency SNVs with DAF > 0.9 and (b) homozygous SNVs. All relationships were highly significant (at least P < 10⁻⁶).

Discussion

In this study, we showed that the types of nucleotide changes in world populations are shaped by their effective population sizes. Our results revealed a much higher proportion of AT→GC variations in populations with large effective sizes (e.g. Africans) compared to those with small sizes (e.g. Native Americans). These observations could be explained by the well-known recombination-associated GC-biased gene conversion (gBGC) [9,10,11,12,13]. The two strands of DNA are connected by two hydrogen bonds between A and T bases (weak) and three hydrogen bonds between G and C bases (strong). It has been shown that gBGC favors the changes involving AT→GC (or weak → strong) compared to GC→AT (or strong → weak) during the process of fixation. We developed a measure β to capture this bias in substitutions. For rare SNVs, β estimates were close to 1 (β ≈ 1) for all populations (Fig. 1). This suggests that the rate of AT→GC mutations per (A or T) base is equal to the rate of GC→AT mutations per (G or C) base. Therefore, the observed number of rare SNVs reflects the mutation patterns alone.

In contrast, for high frequency variations β estimates were significantly higher than 1 (β > 1). Importantly, β increased with the increase in DAF as shown clearly in Fig. 3. These results suggest a preferential fixation of AT→GC over GC→AT mutations over time. However, based on the slopes of the regression lines, the results also suggest that the rate of preferential fixation of AT→GC mutations in small populations is low compared to that in large populations, because genetic drift reduces the efficiency of gBGC. This result is supported by a previous study using human population genetic data, which suggested that on average gBGC is stronger in African than in non-African populations [13].

To further support our claim, we examined the changes between A↔T (weak bond) and between G↔C (strong bond) nucleotides using high frequency SNVs (DAF > 0.9). Our results showed that the β estimates for A→T/T→A or C→G/G→C were close to 1 for all populations and there was no significant relationship with Ne (Additional file 1: Figure S2). We also examined this using homozygous and heterozygous SNVs from the 126 populations and observed no significant relationship between β and Ne (Additional file 1: Figure S3A and S3B). Since the changes are between the same types of nucleotides (with respect to weak or strong hydrogen bonds) there was no effect of gBGC on the fixation of one type of nucleotide over the other. This provides further evidence that the results of our study are not due to methodological artifacts.

Since gBGC does not affect the changes between A↔T (weak bond) and between G↔C (strong bond), previous studies have used the rate of these changes as a normalizing factor to assess the magnitude of gBGC on GC↔AT changes (Lachance and Tishkoff 2014; Glemin et al. 2015; Xue and Chen 2016). Following this, we normalized AT→GC with A↔T changes and GC→AT with G↔C changes, respectively, and developed a normalized ratio, β’ (eq. 2 - methods). The relationship between Ne and β’ was also highly significant and comparable to the previous results obtained for high frequency (Additional file 1: Figure S4A) and homozygous SNVs from the 1000 Genomes Project (Additional file 1: Figure S4B) and the Simons Genome Diversity Project (Additional file 1: Figure S4C). This further supports our results as the normalization eliminates any variation in the mutation rates between populations.

The results based on homozygous and heterozygous SNVs shown in Fig. 4 further support the results based on DAF presented in Figs. 1 and 2. For instance, Figs. 1 and 2a showed that there was no significant relationship between β and Ne for low frequency SNVs. Since low frequency SNVs predominantly exist as heterozygous SNVs in a genome, results based on the former are expected to be similar to those based on the latter (Figs. 1 and 4b). Similarly, as high frequency SNVs are more likely to be present as homozygous SNVs in genomes, the results for these two types of SNVs are alike. This is evident from the results shown in Figs. 1 and 4a.

Since gBGC is mediated by recombination, its effects were found to be strong for highly recombining regions. To examine this, we obtained SNVs from regions with low (< 2 cM), medium (2–20 cM) and high (> 20 cM) rates of recombination. The results showed a much higher β for the variants in high recombination regions (Fig. 6). However, the magnitude of the relationship between β and Ne was similar for all three regions; the slopes of the regression lines were 0.000039 (low), 0.000042 (medium) and 0.000041 (high). This suggests that the influence of population size on gBGC is relatively similar across chromosomal regions with varying degrees of recombination.

Fig. 6

Effects of recombination on the rates of AT→GC and GC→AT changes. The relationship between Ne and β for the SNVs present in low (< 2 cM), medium (2–20 cM) and high (> 20 cM) recombining regions. All three relationships were highly significant (P < 0.0002) and the magnitudes of the relationships were similar as revealed by the slopes of the regression lines (0.000039, 0.000042 and 0.000041 respectively).

Previous studies have shown a significant negative correlation between heterozygosity and the distance (of the location of the populations) from Africa. This correlation is expected based on the prediction that during migration out of Africa human populations underwent a series of population bottlenecks or founder effects along the migratory route [2,3,4,5]. This is because only a subset of people migrated from the original location to new sites and hence the size of the populations reduced with distance from Africa. From previous studies, we obtained the geographic distance of 41 non-African populations from Eastern Africa (Addis Ababa) [2, 6] and we plotted the estimates of β against them. We obtained a highly significant negative correlation for homozygous variants (P < 10⁻⁶) (Fig. 7a) but not for heterozygous variants (P = 0.2) (Fig. 7b). This is very similar to the results shown in Fig. 4. In this analysis, the geographic distance from Africa was used as a proxy for Ne. Hence this result independently confirms our findings and also justifies the method used in this study to estimate Ne from nucleotide diversity.

Fig. 7

Correlation between the geographical distance of non-African populations from East Africa (Addis Ababa) and the ratio AT→GC/GC→AT (β) using (a) homozygous and (b) heterozygous SNVs. The relationship was highly significant for homozygous SNVs (P < 10⁻⁶) but not for heterozygous SNVs (P = 0.2). The distances between the locations of the non-African populations and Addis Ababa were obtained from previous studies [2, 6].

We estimated Ne under the assumption that the mutation rate (μ) and the rate of accumulation of mutations are both similar between populations. However, a recent study suggested a slightly (~ 5%) higher rate of mutation accumulation in non-Africans compared to Africans [7]. To accommodate the elevated diversity, we subtracted 5% of the observed diversity for non-Africans while estimating Ne and re-analyzed the data. This produced almost identical patterns and similar strengths of correlation to those reported in Fig. 4 (Additional file 1: Figure S5).

Conclusions

We have shown that the types of SNVs observed in different human populations are very likely to be modulated by their effective population sizes. This pattern held for genome-wide variations, and we showed that deleterious SNVs also follow it. Our results showed that populations with large effective sizes (e.g. Africans) displayed the greatest difference between the proportions of high frequency deleterious AT→GC and GC→AT SNVs. This difference was much lower in populations with small effective sizes (e.g. Native Americans). This has significant implications for human health as it implies that high frequency disease-associated mutations in Africans will be more enriched with AT→GC SNVs than those in Native Americans. Furthermore, we showed that deleterious homozygous SNVs are also predominantly AT→GC in Africans, and to a greater extent than in non-Africans. This suggests the possibility that recessive genetic disorders in Africans are more likely to be caused by AT→GC variants than those in non-Africans. Therefore, we recommend that genome-wide association studies consider the frequency of population-specific nucleotide changes.

Methods

Genome data

We obtained genotype data from the 1000 Genomes Project – Phase II [8]. The data comprised genome-wide variations from 26 populations, including Africans (seven populations: 661 individuals), South Asians (five: 489), Europeans (five: 503), East Asians (five: 504) and South Americans (four: 347). Although there were 85 Peruvian genomes available, most of these were admixed with Europeans and Africans. Hence, we used the likelihood-based clustering algorithm ADMIXTURE [20] and examined the proportion of admixture in each Peruvian genome. Our results showed that only 20 genomes (40 chromosomes) had < 0.5% admixture from other populations and we included these un-admixed Peruvians as the 27th population in our analyses. To identify derived alleles, orientations of SNVs were determined using the ancestral state of the nucleotides, which was inferred from six primate EPO alignments [21]. The SNVs were divided into eleven categories based on their derived allele frequencies (see Fig. 2 legend). For the SNVs in each category we computed the counts of six types of changes: A/T→G/C, G/C→A/T, C→G, G→C, A→T and T→A (see below).
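The classification described above can be sketched in Python as follows. This is only an illustrative sketch: the function names are hypothetical, and the treatment of values lying exactly on a category boundary (assigned here to the higher category) is an assumption, since the text does not specify it.

```python
from bisect import bisect_right

WEAK = {"A", "T"}    # bases forming weak (A:T) pairs
STRONG = {"G", "C"}  # bases forming strong (G:C) pairs

# Upper bounds of the first ten DAF categories in the Fig. 2 legend.
# SNVs with 0.9 < DAF < 1 fall into the eleventh category (index 10).
DAF_UPPER_BOUNDS = [0.025, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def change_type(ancestral, derived):
    """Return one of the six change types: AT>GC, GC>AT, A>T, T>A, C>G, G>C."""
    if ancestral == derived or {ancestral, derived} - (WEAK | STRONG):
        raise ValueError("expected two different bases from A, C, G, T")
    if ancestral in WEAK and derived in STRONG:
        return "AT>GC"
    if ancestral in STRONG and derived in WEAK:
        return "GC>AT"
    return f"{ancestral}>{derived}"


def daf_category(daf):
    """Return the category index (0-10); fixed differences (DAF = 1) are excluded."""
    if not 0.0 < daf < 1.0:
        raise ValueError("segregating SNVs need 0 < DAF < 1")
    # Assumption: a DAF equal to a boundary goes to the higher category.
    return bisect_right(DAF_UPPER_BOUNDS, daf)
```

Counting the return values of `change_type` within each `daf_category` index would give the per-category counts of the six change types.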

We also obtained the genotype data from the Simons Genome Diversity Project [7]. To examine the patterns of nucleotide changes we used the homozygous and heterozygous SNVs present in a single representative genome from each of the 126 populations. We excluded four African hunter-gatherer populations from our analysis as it was difficult to ascertain the correct orientation of the nucleotide changes in these genomes. For each genome, we estimated the number of homozygous and heterozygous changes belonging to the six types described above.

Deleterious mutations

To determine the deleterious nature of an SNV we used Combined Annotation-Dependent Depletion (CADD), a method that integrates diverse annotations into a single measure (C score) [19]. The extent of deleteriousness was further determined by estimating the corresponding selection coefficients for these scores [22]. For instance, SNVs with a CADD score of 15–20 have a selection coefficient (s) of 0.0001 and this is considered to significantly affect the fitness of humans, as most of the nonsynonymous polymorphisms have a score above this threshold. The C scores for each SNV in the 1000 Genomes Project data were publicly available (http://cadd.gs.washington.edu/download). Using an in-house Perl script, we combined this score with the genome data by using the chromosomal co-ordinates of the SNVs. For the deleterious variant analysis, we included only the SNVs for which the C score was available. We used a C score of ≥ 15 to designate a mutation as deleterious, following previous studies [19, 23]. However, using a different threshold produced almost identical results.
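The coordinate-based merge described above was done with an in-house Perl script that is not reproduced here. The following Python sketch only illustrates the idea under stated assumptions: the function name, the tuple layout and the use of (chromosome, position, alternate allele) as the lookup key are all hypothetical.

```python
CADD_THRESHOLD = 15.0  # C score cut-off stated in the text (C >= 15)


def annotate_deleterious(snvs, c_scores, threshold=CADD_THRESHOLD):
    """Attach a C score to each SNV and flag it as deleterious.

    snvs     : iterable of (chrom, pos, ref, alt) tuples (hypothetical layout)
    c_scores : dict mapping (chrom, pos, alt) -> C score (hypothetical key)
    SNVs without an available C score are skipped, as in the text.
    """
    annotated = []
    for chrom, pos, ref, alt in snvs:
        score = c_scores.get((chrom, pos, alt))
        if score is None:
            continue  # no C score available for this SNV
        annotated.append((chrom, pos, ref, alt, score, score >= threshold))
    return annotated


# Made-up records for illustration only.
example_snvs = [("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")]
example_scores = {("1", 100, "G"): 22.1, ("1", 200, "T"): 3.4}
example_annotated = annotate_deleterious(example_snvs, example_scores)
```

Passing a different `threshold` corresponds to the robustness check with alternative cut-offs mentioned above.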

Estimating the ratio of mutation rates

The ratio of AT→GC and GC→AT mutation rates can be estimated based on the Watterson estimator [24], θ = 4Neμ = S/an, where Ne is the effective population size, S is the number of segregating sites per site and an = \( \sum \limits_{i=1}^{n-1}\frac{1}{i} \), with n being the number of chromosomes sampled. We can use this estimator considering only one type of mutation as:

$$ {\theta}_{A\to G}=4{N}_e{\mu}_{A\to G}=\frac{S_{A\to G}}{a_n} $$
$$ {\theta}_{G\to A}=4{N}_e{\mu}_{G\to A}=\frac{S_{G\to A}}{a_n} $$
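A minimal Python sketch of the estimator is given below; the function names are illustrative and not part of the original analysis. It also makes visible that the factor a_n is shared by both mutation types and therefore cancels when their ratio is taken.

```python
def harmonic_a_n(n):
    """a_n = sum_{i=1}^{n-1} 1/i for a sample of n chromosomes."""
    if n < 2:
        raise ValueError("need at least two chromosomes")
    return sum(1.0 / i for i in range(1, n))


def watterson_theta(seg_sites_per_site, n):
    """Watterson's theta per site: S / a_n."""
    return seg_sites_per_site / harmonic_a_n(n)


# Hypothetical per-site values for two mutation types in the same sample.
theta_forward = watterson_theta(0.003, 40)
theta_reverse = watterson_theta(0.002, 40)
theta_ratio = theta_forward / theta_reverse  # equals the ratio of the S values
```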

The ratio of forward and reverse nucleotide changes (β) could be obtained as:

$$ \beta =\frac{\mu_{A\to G}}{\mu_{G\to A}}=\frac{\theta_{A\to G}}{\theta_{G\to A}}=\frac{S_{A\to G}}{S_{G\to A}} $$

The number of segregating sites per site or SNVs per site of a genome can be estimated as:

$$ {S}_{A\to G}=\frac{M_{A\to G}}{N_A}\ and\ {S}_{G\to A}=\frac{M_{G\to A}}{N_G} $$

where \( M_{A\to G} \) and \( M_{G\to A} \) are the numbers of observed A→G and G→A mutations in a genome, respectively, and \( N_A \) and \( N_G \) are the numbers of ancestral A and G nucleotides. This formula can be extended to the combined AT→GC and GC→AT mutation rates because the patterns are mutually exclusive, and hence the ratio of nucleotide changes (β) is:

$$ \beta =\frac{\mu_{AT\to GC}}{\mu_{GC\to AT}}=\frac{\theta_{AT\to GC}}{\theta_{GC\to AT}}=\frac{S_{AT\to GC}}{S_{GC\to AT}} $$

The number of segregating sites or SNVs in a genome can be calculated as:

$$ {S}_{AT\to GC}=\frac{M_{A\to G}+{M}_{A\to C}}{N_A}+\frac{M_{T\to C}+{M}_{T\to G}}{N_T} $$
$$ {S}_{GC\to AT}=\frac{M_{G\to A}+{M}_{G\to T}}{N_G}+\frac{M_{C\to T}+{M}_{C\to A}}{N_C} $$

Since A and T as well as C and G are complementary to each other in a double-stranded DNA they are equal in number. Therefore, β can be expressed as,

$$ \beta =\frac{M_{A\to G}+{M}_{A\to C}+{M}_{T\to C}+{M}_{T\to G}}{M_{G\to A}+{M}_{G\to T}+{M}_{C\to T}+{M}_{C\to A}}\times \frac{N_{GC}}{N_{AT}} $$
$$ \beta =\frac{M_{AT\to GC}}{M_{GC\to AT}}\times \frac{N_{GC}}{N_{AT}}\to (1) $$
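The intermediate step can be written out explicitly. Assuming that \( N_{AT}=N_A+N_T \) and \( N_{GC}=N_G+N_C \) (our reading of the notation), complementarity gives \( N_A=N_T=N_{AT}/2 \) and \( N_G=N_C=N_{GC}/2 \), so that:

$$ {S}_{AT\to GC}=\frac{M_{A\to G}+{M}_{A\to C}}{N_{AT}/2}+\frac{M_{T\to C}+{M}_{T\to G}}{N_{AT}/2}=\frac{2{M}_{AT\to GC}}{N_{AT}} $$
$$ {S}_{GC\to AT}=\frac{M_{G\to A}+{M}_{G\to T}}{N_{GC}/2}+\frac{M_{C\to T}+{M}_{C\to A}}{N_{GC}/2}=\frac{2{M}_{GC\to AT}}{N_{GC}} $$

The factors of 2 cancel in the ratio \( {S}_{AT\to GC}/{S}_{GC\to AT} \), which leaves the expression in eq. 1.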

This derivation shows that the ratio of forward and reverse rates of change can be calculated by simply taking the ratio of the observed counts of AT→GC (\( M_{AT\to GC} \)) and GC→AT (\( M_{GC\to AT} \)) changes and multiplying it by the ratio of the number of GC (\( N_{GC} \)) to AT (\( N_{AT} \)) nucleotides in a genome. Since eq. 1 represents the ratio of mutation rates (\( \mu_{AT\to GC} \) and \( \mu_{GC\to AT} \)), this ratio is expected to be 1 (β = 1) if the observed nucleotide changes result solely from these mutation rates. Any deviation from this suggests a bias in the substitution process: β > 1 indicates an excess of AT→GC substitutions, whereas β < 1 implies an excess of GC→AT substitutions.
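Eq. 1 translates directly into code. The sketch below uses a hypothetical function name and made-up counts chosen so that the result is easy to check by hand; it is not the script used in this study.

```python
def beta(m_at_to_gc, m_gc_to_at, n_at, n_gc):
    """Eq. 1: (M_AT->GC / M_GC->AT) * (N_GC / N_AT)."""
    if m_gc_to_at <= 0 or n_at <= 0:
        raise ValueError("M_GC->AT and N_AT must be positive")
    return (m_at_to_gc / m_gc_to_at) * (n_gc / n_at)


# Made-up counts: an AT-rich sequence (N_AT = 6000, N_GC = 4000).
# 600 AT->GC and 400 GC->AT changes give equal per-base rates (beta close to 1).
beta_neutral = beta(600, 400, 6000, 4000)
# 720 AT->GC changes against the same background give beta close to 1.2.
beta_biased = beta(720, 400, 6000, 4000)
```

The second factor matters: the raw count ratio in the first example is 1.5, but after correcting for base composition the two per-base rates are equal.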

GC-biased gene conversion is known to affect only the changes involving weak (A or T) to strong (G or C) nucleotides but not the changes within weak (A↔T) or within strong (G↔C) nucleotides. Hence the latter is not expected to vary between populations with different Ne. Therefore, we used this as a normalization factor and developed a normalized ratio of AT→GC to GC → AT (β’).

The A→G mutation rate can be normalized using the A→T rate and the normalized rate (\( \tau_{A\to G} \)) can be expressed as:

$$ {\tau}_{A\to G}=\frac{\mu_{A\to G}}{\mu_{A\to T}}=\frac{\theta_{A\to G}}{\theta_{A\to T}}=\frac{S_{A\to G}}{S_{A\to T}}=\frac{\left({M}_{A\to G}/{N}_A\right)}{\left({M}_{A\to T}/{N}_A\right)}=\frac{M_{A\to G}}{M_{A\to T}} $$

Similarly, we can obtain this expression for AT→GC and GC → AT rates as:

$$ {\tau}_{AT\to GC}=\frac{\mu_{A\to G}}{\mu_{A\to T}}+\frac{\mu_{A\to C}}{\mu_{A\to T}}+\frac{\mu_{T\to C}}{\mu_{T\to A}}+\frac{\mu_{T\to G}}{\mu_{T\to A}} $$
$$ {\tau}_{AT\to GC}=\frac{M_{A\to G}+{M}_{A\to C}}{M_{A\to T}}+\frac{M_{T\to C}+{M}_{T\to G}}{M_{T\to A}} $$
$$ {\tau}_{GC\to AT}=\frac{\mu_{G\to A}}{\mu_{G\to C}}+\frac{\mu_{G\to T}}{\mu_{G\to C}}+\frac{\mu_{C\to T}}{\mu_{C\to G}}+\frac{\mu_{C\to A}}{\mu_{C\to G}} $$
$$ {\tau}_{GC\to AT}=\frac{M_{G\to A}+{M}_{G\to T}}{M_{G\to C}}+\frac{M_{C\to T}+{M}_{C\to A}}{M_{C\to G}} $$

Therefore, the normalized ratio of AT→GC to GC → AT (β’) is:

$$ {\beta}^{\prime }=\frac{\tau_{AT\to GC}}{\tau_{GC\to AT}}\to (2) $$
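The normalized ratio of eq. 2 can be sketched in Python as shown below. The function name and the dictionary keys (strings such as "A>G") are illustrative choices, and the counts are made up.

```python
def beta_prime(m):
    """Eq. 2: tau_AT->GC / tau_GC->AT from a dict of directed change counts.

    m maps strings such as "A>G" to the observed count of that change.
    """
    # AT->GC changes normalized by the weak-to-weak changes A>T and T>A.
    tau_at_gc = (m["A>G"] + m["A>C"]) / m["A>T"] + (m["T>C"] + m["T>G"]) / m["T>A"]
    # GC->AT changes normalized by the strong-to-strong changes G>C and C>G.
    tau_gc_at = (m["G>A"] + m["G>T"]) / m["G>C"] + (m["C>T"] + m["C>A"]) / m["C>G"]
    return tau_at_gc / tau_gc_at


# Made-up counts for illustration only.
example_counts = {
    "A>G": 30, "A>C": 10, "A>T": 10, "T>C": 30, "T>G": 10, "T>A": 10,
    "G>A": 40, "G>T": 10, "G>C": 10, "C>T": 40, "C>A": 10, "C>G": 10,
}
example_beta_prime = beta_prime(example_counts)  # (4 + 4) / (5 + 5)
```

Because each term is a ratio of counts with the same ancestral base, the base composition terms drop out and no genome-wide nucleotide counts are needed.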

The relationship between nucleotide diversity (π), mutation rate (μ) and effective population size (Ne) for diploid organisms is π = 4Neμ. Using this relationship, we calculated the effective population size as Ne = π/4μ. We used the observed nucleotide diversity of a population or of a diploid genome and a mutation rate of 1.2 × 10⁻⁸ substitutions per site per generation, following many studies based on human pedigree genome data [25, 26]. A recent study suggested that the rate of mutation accumulation in non-African genomes could be slightly (5%) higher than that of Africans [7]. To accommodate this difference, we subtracted 5% of the nucleotide diversity for non-African populations only while calculating the effective population sizes and obtained almost identical results (Additional file 1: Figure S5).
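The calculation can be sketched as follows. The function and parameter names are illustrative, and the diversity value in the example is hypothetical; only the mutation rate and the 5% adjustment come from the text.

```python
MU = 1.2e-8  # mutations per site per generation, as stated in the text


def effective_size(pi, mu=MU, diversity_correction=0.0):
    """Ne = pi / (4 * mu), optionally after removing a fraction of pi.

    diversity_correction = 0.05 mimics the 5% subtraction applied to
    non-African populations in the sensitivity analysis.
    """
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return pi * (1.0 - diversity_correction) / (4.0 * mu)


# Hypothetical diversity of 0.001 per site.
ne_uncorrected = effective_size(0.001)                          # about 20,833
ne_corrected = effective_size(0.001, diversity_correction=0.05)  # about 19,792
```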

Estimation of the proportion of AT→GC counts

We also estimated the proportions of AT→GC (\( P_{GC} \)) and GC→AT (\( P_{AT} \)) counts, which were calculated as:

$$ {P}_{GC}=\frac{N_{A\to G}+{N}_{A\to C}+{N}_{T\to C}+{N}_{T\to G}}{N}\to (3) $$
$$ {P}_{AT}=\frac{N_{G\to A}+{N}_{G\to T}+{N}_{C\to A}+{N}_{C\to T}}{N}\to (4) $$

where N is the number of all types of nucleotide changes. The standard errors of \( P_{GC} \) and \( P_{AT} \) were calculated using the binomial variance.
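A proportion and its binomial standard error can be computed as in the sketch below. The function name is illustrative, and the standard error is taken as the usual binomial form sqrt(p(1 − p)/N), which is our reading of "binomial variance" in the text.

```python
from math import sqrt


def proportion_with_se(count, total):
    """Return (p, SE) with p = count / total and SE = sqrt(p * (1 - p) / total)."""
    if total <= 0 or not 0 <= count <= total:
        raise ValueError("need 0 <= count <= total and total > 0")
    p = count / total
    return p, sqrt(p * (1.0 - p) / total)


# Made-up counts: 50 AT->GC changes among 100 changes of all types.
p_gc, se_gc = proportion_with_se(50, 100)  # p = 0.5, SE = 0.05
```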

Availability of data and materials

The whole genome datasets analyzed during the current study were obtained from the 1000 Genomes Project – Phase II (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp) and the Simons Genome Diversity Project (https://www.simonsfoundation.org/simons-genome-diversity-project/).

Abbreviations

CADD: Combined Annotation-Dependent Depletion

DAF: Derived Allele Frequencies

gBGC: GC-Biased Gene Conversion

SNV: Single Nucleotide Variation

References

  1. Stringer C. Human evolution: out of Ethiopia. Nature. 2003;423(6941):692–3, 695.

  2. DeGiorgio M, Jakobsson M, Rosenberg NA. Out of Africa: modern human origins special feature: explaining worldwide patterns of human genetic variation using a coalescent-based serial founder model of migration outward from Africa. Proc Natl Acad Sci U S A. 2009;106(38):16057–62.

  3. Handley LJ, Manica A, Goudet J, Balloux F. Going the distance: human population genetics in a clinal world. Trends Genet. 2007;23(9):432–9.

  4. Prugnolle F, Manica A, Balloux F. Geography predicts neutral genetic diversity of human populations. Curr Biol. 2005;15(5):R159–60.

  5. Ramachandran S, Deshpande O, Roseman CC, Rosenberg NA, Feldman MW, Cavalli-Sforza LL. Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. Proc Natl Acad Sci U S A. 2005;102(44):15942–7.

  6. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science. 2008;319(5866):1100–4.

  7. Mallick S, Li H, Lipson M, Mathieson I, Gymrek M, Racimo F, Zhao M, Chennagiri N, Nordenfelt S, Tandon A, et al. The Simons genome diversity project: 300 genomes from 142 diverse populations. Nature. 2016;538(7624):201–6.

  8. Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. A global reference for human genetic variation. Nature. 2015;526(7571):68–74.

  9. Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N. Vanishing GC-rich isochores in mammalian genomes. Genetics. 2002;162(4):1837–47.

  10. Marais G. Biased gene conversion: implications for genome and sex evolution. Trends Genet. 2003;19(6):330–8.

  11. Duret L, Galtier N. Biased gene conversion and the evolution of mammalian genomic landscapes. Annu Rev Genomics Hum Genet. 2009;10:285–311.

  12. Galtier N, Duret L, Glemin S, Ranwez V. GC-biased gene conversion promotes the fixation of deleterious amino acid changes in primates. Trends Genet. 2009;25(1):1–5.

  13. Glemin S, Arndt PF, Messer PW, Petrov D, Galtier N, Duret L. Quantification of GC-biased gene conversion in the human genome. Genome Res. 2015;25(8):1215–28.

  14. Henn BM, Botigue LR, Peischl S, Dupanloup I, Lipatov M, Maples BK, Martin AR, Musharoff S, Cann H, Snyder MP, et al. Distance from sub-Saharan Africa predicts mutational load in diverse human genomes. Proc Natl Acad Sci U S A. 2016;113(4):E440–9.

  15. Do R, Balick D, Li H, Adzhubei I, Sunyaev S, Reich D. No evidence that selection has been less effective at removing deleterious mutations in Europeans than in Africans. Nat Genet. 2015;47(2):126–31.

  16. Subramanian S. Europeans have a higher proportion of high-frequency deleterious variants than Africans. Hum Genet. 2016;135(1):1–7.

  17. Lachance J, Tishkoff SA. Biased gene conversion skews allele frequencies in human populations, increasing the disease burden of recessive alleles. Am J Hum Genet. 2014;95(4):408–20.

  18. Xue C, Chen H, Yu F. Base-biased evolution of disease-associated mutations in the human genome. Hum Mutat. 2016;37(11):1209–14.

  19. Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46(3):310–5.

  20. Alexander DH, Novembre J, Lange K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 2009;19(9):1655–64.

  21. Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, Gibbs RA, Hurles ME, McVean GA. A map of human genome variation from population-scale sequencing. Nature. 2010;467(7319):1061–73.

  22. Racimo F, Schraiber JG. Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms. PLoS Genet. 2014;10(11):e1004697.

  23. Subramanian S. Using the plurality of codon positions to identify deleterious variants in human exomes. Bioinformatics. 2015;31(3):301–5.

  24. Watterson GA. On the number of segregating sites in genetical models without recombination. Theor Popul Biol. 1975;7:256–76.

  25. Conrad DF, Keebler JE, DePristo MA, Lindsay SJ, Zhang Y, Casals F, Idaghdour Y, Hartl CL, Torroja C, Garimella KV, et al. Variation in genome-wide mutation rates within and between human families. Nat Genet. 2011;43(7):712–4.

  26. Roach JC, Glusman G, Smit AF, Huff CD, Hubley R, Shannon PT, Rowen L, Pant KP, Goodman N, Bamshad M, et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science. 2010;328(5978):636–9.

Acknowledgments

The author thanks Alex Quin for critical comments.

Funding

This study was supported by a grant from the Australian Research Council (LP160100594).

Author information

Contributions

Conceptualization: SS; Data Analysis: SS; Writing: SS.

Corresponding author

Correspondence to Sankar Subramanian.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1:

Figure S1. Correlation between effective population size (Ne) and the ratio of AT→GC to GC→AT (β) estimated for homozygous SNVs from whole genomes of the 1000 Genomes Project. The relationship was highly significant (P < 10⁻⁶).

Figure S2. Relationship between the effective population size (Ne) and the ratio of nucleotide changes within the same types (β): (A) within strong types, i.e. C→G/G→C; (B) within weak types, i.e. A→T/T→A. The ratios were estimated using the high frequency SNVs (DAF > 0.9) belonging to 27 populations obtained from the 1000 Genomes Project.

Figure S3. Relationship between the effective population size (Ne) and the ratio of nucleotide changes within the same types (β): (A) within strong types, i.e. C→G/G→C; (B) within weak types, i.e. A→T/T→A. The ratios were estimated using the homozygous SNVs belonging to 126 populations obtained from the Simons Genome Diversity Project.

Figure S4. Relationship between the effective population size (Ne) and the normalized ratio of AT→GC to GC→AT (β) changes using equation 2 (see Methods). We used A↔T and G↔C to normalize AT→GC and GC→AT changes, respectively. (A) High frequency SNVs (DAF > 0.9) and (B) homozygous SNVs of the 1000 Genomes Project; (C) homozygous SNVs from the Simons Genome Diversity Project. The relationships were highly significant (P < 10⁻⁶).

Figure S5. The relationship between effective population size (Ne) and the normalized ratio AT→GC/GC→AT (β) estimated for homozygous SNVs present in individual genomes belonging to 126 distinct populations of the world. This is very similar to Fig. 4A except that the nucleotide diversities of non-Africans were reduced by 5% while calculating Ne in order to neutralize the difference in mutation accumulation rates between Africans and non-Africans as reported recently. The correlation was highly significant (P < 10⁻⁶).

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Subramanian, S. Population size influences the type of nucleotide variations in humans. BMC Genet 20, 93 (2019). https://doi.org/10.1186/s12863-019-0798-9

  • DOI: https://doi.org/10.1186/s12863-019-0798-9

Keywords