The aim of our study was to identify stable population-specific mRNA markers, representing the largest differences in gene expression between two human populations: Caucasian and Chinese. Only males were analyzed, to avoid sex-related differences in the expression level. Based on high-throughput microarray analysis of B-lymphocyte cell lines representing the Chinese and Caucasian populations, we identified a set of 20 genes whose mean expression differed between the populations by at least 1.5-fold at FDR < 0.05. The fold change of these 20 genes ranged from 1.5 to 2.5.
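The selection rule described above (at least 1.5-fold change and FDR < 0.05) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes log2-scale mean expression values and the Benjamini-Hochberg procedure for FDR control, neither of which is specified in this section, and the gene names and numbers are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_markers(genes, min_fold=1.5, max_fdr=0.05):
    """genes: list of (name, mean_log2_pop_a, mean_log2_pop_b, p_value)."""
    qvalues = benjamini_hochberg([g[3] for g in genes])
    selected = []
    for (name, mean_a, mean_b, _), q in zip(genes, qvalues):
        fold = 2 ** abs(mean_a - mean_b)  # fold change from log2 means (assumed scale)
        if fold >= min_fold and q < max_fdr:
            selected.append((name, round(fold, 2), q))
    return selected


# Hypothetical values for illustration only (not data from the study).
example = [
    ("GENE_A", 8.0, 7.0, 0.001),  # large fold change, small p-value
    ("GENE_B", 6.2, 6.0, 0.002),  # small p-value but fold change below 1.5
    ("GENE_C", 9.0, 7.5, 0.20),   # large fold change but not significant
    ("GENE_D", 5.0, 5.7, 0.010),  # passes both thresholds
]
markers = select_markers(example)
```

Only genes that pass both thresholds are kept, so a gene with a large but non-significant difference, or a significant but small one, is excluded.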
Thirteen of the 20 transcripts identified in the microarray study, for which specific TLDA probes were available, were validated on 47 independent cell lines. The differentiating status was confirmed for three genes: UTS2, UGT2B17 and SLC7A7. The mean expression of UTS2 was higher in CHB (25.8-fold change compared to CEU), while the expression of UGT2B17 and SLC7A7 was higher in CEU (3.2- and 2.2-fold change, respectively).
The magnitude of the population fold change in UGT2B17 or UTS2 expression measured with dedicated TLDA cards was two to ten times higher than that revealed by the whole-transcriptome microarray screening. Since this validation step was performed in the same type of material (B-lymphocyte cell lines), these discrepancies were probably due to the use of different detection systems (microarrays, routinely used in transcriptome-wide screening experiments, versus TLDA cards, which target a few preselected transcripts).
It is commonly known that the lymphoblastoid cell line (LCL) model is not perfect for gene expression studies, due to certain technical and environmental factors that may bias the results. The impact of Epstein-Barr virus (EBV) transformation on the gene expression profile of LCLs is particularly important and widely discussed in the literature (e.g. [15,16,17]). It has been shown that a large number of genes are differentially expressed between primary cells and cultured cell lines; it has even been demonstrated that a subset of genes is expressed exclusively in EBV-transformed cells [17]. On the other hand, this effect matters mostly when transformed cells are compared with non-transformed cells; here, both populations analyzed by either microarrays or TLDA were represented by LCLs obtained from EBV-transformed B-lymphocytes.
To exclude the possibility that the differences in expression reflected specific conditions related to the maintenance of the CHB and CEU cell lines (for example, bias in the sample collection time: the CEU samples had been collected decades earlier than the CHB samples), the second validation step was carried out using primary biological material, i.e. peripheral blood samples obtained from Caucasian and Chinese males. Due to the limited availability of the blood samples, only the two best-differentiating genes were subjected to this validation step. The inter-population differences in expression were confirmed for both analyzed genes: the expression of UTS2 was 13 times higher in Chinese (p < 0.001), while that of UGT2B17 was six times higher in Caucasians (p < 0.001). The blood samples were subject neither to EBV transformation nor to the collection time bias; we therefore believe that the changes in UGT2B17 and UTS2 expression reflected true population-specific differences.
The discrepancies in the magnitude of the fold change between the first and the second step of validation require additional consideration. They could reflect differences in expression between the homogeneous B-cell lines cultured under specific laboratory conditions and the peripheral blood samples, which are composed of a mixture of different cells (B- and T-lymphocytes) and whose expression might, in addition, have been affected by the different environmental conditions of the donors.
On the other hand, some of the discrepancies could stem from probe-related differences in transcript detection between the experiments using TLDA cards and those using qRT-PCR (which replaced the TLDA cards in the last phase of our study due to budget restrictions). This issue appears less important in the analysis of the UGT2B17 gene, which has only one transcript isoform. The UTS2 gene, however, has three transcript isoforms; all were targeted by the TaqMan probes, whereas only two isoforms were covered by qRT-PCR. In addition, the TaqMan probe manufacturer (Life Technology database) has only recently announced that the Hs00922170_m1 probe used in the TLDA experiment might not be solely specific to UTS2 transcripts.
The differences in the magnitude of the fold changes notwithstanding, our results have confirmed that the population level of UGT2B17 and UTS2 expression differentiates the Chinese and Caucasian populations, both in B-lymphocyte cell lines and in whole peripheral blood samples. UGT2B17 encodes a member of the uridine diphosphoglucuronosyltransferase protein family. The encoded enzyme takes part in the metabolism of steroids (e.g. steroid hormones) and lipid-soluble drugs (GeneCards). UTS2 encodes a mature peptide that is an active cyclic heptapeptide and acts as a vasoconstrictor.
In the last step, we performed a statistical analysis to confirm the discriminating power of the two genes (UTS2 and UGT2B17). Three different classifiers were built and, after assessment of their sensitivity and specificity (ROC and AUC parameters), the sample population assignment was performed. In spite of the existing intra-population expression variation (see Figs. 2 and 4), our binary classifiers showed high specificity (> 90%) and sensitivity (> 76%) in sample population classification. The accuracy of assigning an unknown sample to one of the studied populations was nearly 90% regardless of the classification method.
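The performance measures named above (sensitivity, specificity, accuracy and AUC) can be illustrated with a minimal sketch. It is not one of the three classifiers built in the study: it assumes a hypothetical single-gene threshold rule, treats CHB as the positive class by convention, and uses made-up expression scores.

```python
def confusion_metrics(labels, predictions, positive="CHB"):
    """Return (sensitivity, specificity, accuracy) for a binary assignment."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


def auc(labels, scores, positive="CHB"):
    """AUC as the probability that a positive sample outscores a negative one."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical relative expression scores (illustration only, not study data).
labels = ["CHB"] * 4 + ["CEU"] * 4
scores = [9.0, 7.5, 6.0, 1.0, 0.5, 1.2, 0.8, 2.0]

THRESHOLD = 1.5  # hypothetical cut-off: higher score -> assigned to CHB
predictions = ["CHB" if s > THRESHOLD else "CEU" for s in scores]

sensitivity, specificity, accuracy = confusion_metrics(labels, predictions)
area = auc(labels, scores)
```

Sensitivity, specificity and accuracy depend on the chosen threshold, whereas the AUC summarizes the separation of the two groups over all possible thresholds.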
Gene expression differences among distinct human populations, especially in genes under positive selection such as UGT2B17, have been identified before [1, 3, 5, 9, 18]. These differences have been repeatedly shown to be heritable and linked to variation across the human genome, with potential mechanisms including INDELs or copy number variation (CNV), SNPs (e.g. [2, 3, 5, 19]) or alternative splicing [5, 9, 18]. Interestingly, we have noted that the differences in UGT2B17 and UTS2 expression between the studied groups were due to the complete lack of amplification in different numbers of individuals in the two populations, rather than to subtle population-specific fluctuations in the expression level. These observations suggested that the individuals in whom no transcription of a given gene was observed (Ct ≥ 40) could be homozygous for an expression-abolishing mutation.
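The distinction drawn above, between complete lack of amplification and a graded shift in expression, can be made concrete by counting non-amplifying samples per population. The sketch below assumes that a missing Ct or a Ct at or above 40 marks a sample as non-amplified, in line with the cut-off quoted in the text; the Ct values and group names are hypothetical.

```python
CT_CUTOFF = 40.0  # cut-off quoted in the text for "no transcription observed"


def non_amplified_fraction(ct_values, cutoff=CT_CUTOFF):
    """Fraction of samples with no detectable amplification.

    A sample counts as non-amplified when its Ct is missing (None)
    or at/above the cut-off.
    """
    if not ct_values:
        raise ValueError("at least one sample is required")
    missing = sum(1 for ct in ct_values if ct is None or ct >= cutoff)
    return missing / len(ct_values)


# Hypothetical Ct values for one gene (illustration only, not study data).
population_a = [27.1, None, 40.0, 29.3]
population_b = [26.5, 28.0, 41.2, 27.7]

fraction_a = non_amplified_fraction(population_a)
fraction_b = non_amplified_fraction(population_b)
```

A large gap between the two fractions, with similar Ct values among the amplifying samples, is the pattern expected when a population difference in mean expression is driven by presence or absence of the transcript rather than by its level.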
To shed light on the mechanisms underlying the population differences in the level of UGT2B17 and UTS2 expression, we examined SNPs with population-specific allele frequencies listed in the genome databases, as well as our earlier data from the Infinium Human OmniExpressExome array, obtained for the same cell lines as used here in the discovery phase (see Additional file 4 and [20]). No SNPs were found that would affect the expression of the UGT2B17 and UTS2 genes in an all-or-none (0–1) manner (e.g. by causing premature termination codons or obvious splice site alterations).
Twenty-five SNPs that correlated with population differences in UTS2 gene expression (see Additional file 5: Table S2) were located far from the gene (from 280,000 to 360,000 bp upstream and from 52,000 to 920,000 bp downstream). Further studies are required to investigate whether these SNPs have an impact on the regulation of UTS2 expression. For UGT2B17, no correlation between population differences in gene expression and SNPs was identified.
Another mechanism that may play a role in the regulation of gene expression is DNA methylation; for example, methylation of the gene promoter region leads to silencing of gene expression. Examination of our earlier data obtained with the Illumina Infinium Human Methylation 450 BeadChip microarray for the same set of Chinese and Caucasian cell lines clearly indicated the lack of methylation differences that would affect the expression level of the UTS2 and UGT2B17 genes ([21] and unpublished data).
Interestingly, it has been shown that the UGT2B17 gene lies in a genomic region where numerous copy number variants (CNVs) occur (see the ENSEMBL database and e.g. [7, 22,23,24]). Some of them, e.g. esv3600874, esv3600873 and esv3600875, are characterized by high inter-population variation in allele frequency, and UGT2B17 deletion alleles are more common in East Asians than in Africans and Europeans (e.g. [22, 24, 25]). Our results, in which the complete lack of UGT2B17 amplification was more frequent among Chinese than among Caucasian cell lines (56% vs. 23%), are in accord with a scenario in which a CNV deletion underlies the lower UGT2B17 expression in the Chinese group. In fact, the majority of the cell lines in which UGT2B17 transcripts were not amplified on the TLDA cards are listed in the ENSEMBL database as carrying the esv3600874, esv3600873 or esv3600875 deletions, which encompass the whole gene or a large part of it. Although no genotype information (i.e. whether the individual carries a heterozygous or a homozygous deletion) is available in that database, it is highly probable that, in the samples that did not amplify in our settings, the deletion was present on both alleles.
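If the non-amplifying cell lines are indeed homozygous for the deletion, the reported frequencies (56% and 23%) imply a deletion allele frequency that can be estimated with a short calculation. The sketch below is an illustrative back-of-the-envelope estimate only: it assumes Hardy-Weinberg equilibrium and that every non-amplifying line is a homozygous deletion carrier, neither of which was tested here, and it ignores the sampling error of a small cell line panel.

```python
import math


def deletion_allele_frequency(homozygous_fraction):
    """Estimate deletion allele frequency q from the homozygote fraction q**2.

    Assumes Hardy-Weinberg equilibrium (an assumption, not a study result).
    """
    if not 0.0 <= homozygous_fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    return math.sqrt(homozygous_fraction)


def expected_heterozygotes(q):
    """Expected fraction of heterozygous carriers, 2 * p * q with p = 1 - q."""
    return 2.0 * (1.0 - q) * q


# Fractions of cell lines without UGT2B17 amplification, as reported in the text.
q_chinese = deletion_allele_frequency(0.56)
q_caucasian = deletion_allele_frequency(0.23)

het_chinese = expected_heterozygotes(q_chinese)
het_caucasian = expected_heterozygotes(q_caucasian)
```

Under these assumptions the deletion allele would be the major allele in the Chinese group and close to equally frequent with the non-deleted allele in the Caucasian group, which is consistent in direction with the higher deletion frequency reported for East Asians.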
According to the ENSEMBL database, UTS2 also lies in a region rich in CNV polymorphisms. However, the only reported CNV (esv3585131) exhibiting an inter-population difference in allele frequency lies in the long intron 1. It is therefore unlikely that this CNV affects the expression profile in the way proposed for UGT2B17, although the possibility that it influences splicing and the regulation of gene expression cannot be excluded. Some genomic studies have identified CNVs lying at a greater distance from UTS2, but so far there is no proof of their role in the regulation of UTS2 expression (e.g. [23]). Another explanation may involve the so-called novel transcribed regions. Based on transcriptome sequencing of Chinese and Caucasian population samples, over 1600 putative ethnic-specific novel transcribed regions that may influence gene expression have recently been identified [19]; importantly, the UTS2 gene was among 20 genes reported both to exhibit a population-specific gene expression pattern and to encompass novel transcribed regions in the Chinese population [26].