SNP eQTL status and eQTL density in the adjacent region of the SNP are associated with its statistical significance in GWA studies

Background Over the relatively short history of Genome Wide Association Studies (GWASs), hundreds of GWASs have been published and thousands of disease risk-associated SNPs have been identified. Summary statistics from the conducted GWASs are often available and can be used to identify SNP features associated with the level of GWAS statistical significance. Those features could be used to select SNPs from gray zones (SNPs that are nominally significant but do not reach the genome-wide level of significance) for targeted analyses. Methods We used summary statistics from recently published breast and lung cancer and scleroderma GWASs to explore the association between the level of the GWAS statistical significance and the expression quantitative trait loci (eQTL) status of the SNP. Data from the Genotype-Tissue Expression Project (GTEx) were used to identify eQTL SNPs. Results We found that SNPs reported as eQTLs were more significant in GWAS (higher -log10p) regardless of the tissue specificity of the eQTL. Pan-tissue eQTLs (those reported as eQTLs in multiple tissues) tended to be more significant in the GWAS compared to those reported as eQTL in only one tissue type. eQTL density in the ±5 kb adjacent region of a given SNP was also positively associated with the level of GWAS statistical significance regardless of the eQTL status of the SNP. We found that SNPs located in the regions of high eQTL density were more likely to be located in regulatory elements (transcription factor or miRNA binding sites). When SNPs were stratified by the level of statistical significance, the proportion of eQTLs was positively associated with the mean level of statistical significance in the group. The association curve reaches a plateau around -log10p ≈ 5. The observed associations suggest that quasi-significant SNPs (10− 5 < p < 5 × 10− 8) and SNPs at the genome wide level of statistical significance (p < 5 × 10− 8) may have a similar proportions of risk associated SNPs. Conclusions The results of this study indicate that the SNP’s eQTL status, as well as eQTL density in the adjacent region are positively associated with the level of statistical significance of the SNP in GWAS.


Background
Genome wide association studies (GWASs) have identified thousands of single nucleotide polymorphisms (SNPs) associated with human diseases [1]. Nevertheless, many disease-associated SNPs remain to be identified, which is obvious from the fact that larger GWASs targeting the same phenotype as the original and smaller GWAS regularly identify additional SNPs [2][3][4]. Those additional SNPs are usually SNPs from the gray zone of the original GWAS: gray zone SNPs are SNPs that are nominally significant (p < 0.05) but do not reach the genome-wide level of statistical significance (p < 5 × 10 − 8 ) [5,6]. It is, therefore, important to have a tool for prioritizing gray zone SNPs based on intrinsic SNP characteristics. A number of SNP characteristics including variation in allele frequencies among populations [7], type of the linked gene(s) [8], or combination of different SNP characteristics [9] were proposed for SNP prioritization. No GWAS of diseases have reached a sample size at which an exhaustive evaluation of all the possible genes or SNPs associated with disease can be anticipated to uncover all of the variability influence disease development and only studies of selected phenotypes like height and smoking behavior have amassed sample sizes that can provide comprehensive analyses of genetic architecture.
Conducting large enough GWAS studies to identify all the disease-associated SNPs, especially those with small effect sizes may not be feasible. Combining small effect SNPs in polygenic score is a useful approach for risk prediction [10,11]. Interestingly, polygenic risk modeling performs better when the threshold for inclusion of a SNP is lower than genome-wide significance [12]. The accuracy of polygenic risk modeling will be reduced when a proportion of the variants being included are not associated with disease [13].
It is known that SNPs located in regulatory regions, e.g. transcription factor (TF) binding sites, are often eQTLs, as they modulate gene expression [14,15]. A number of studies report an association between eQTLs and GWAS detected SNPs [16][17][18][19]. A systematic review of SNP eQTL status in the context of GWAS statistical significance has not been conducted so far. The goal of this study was to use summary statistics from recently published large breast and lung cancer GWASs to analyze the associations between the level of statistical significance of the SNP and its eQTL status. We also studied the association between the level of statistical significance of the SNP and eQTL density in adjacent region of the SNP.

Retrieving data on the SNP' eQTL status
We used eQTLs reported by the Genotype-Tissue Expression (GTEx) project [20]. eQTL data were downloaded from the GTEx website (accessed October 12, 2018). Only cis eQTLs were used in the current analyses. To ensure robustness of the analysis, only eQTLs whose association with gene expression level remained significant after adjustment for multiple testing were used in the analysis. A total of 297,470 unique eQTLs detected in at least one out of 48 tissues analyzed by GTEx were used in the analysis. Additional file 1: Table S1 shows distribution of GTEx tested SNPs by tissue. More than 80% of eQTLs are tissue specific. Adjustment for multiple testing was done for each tissue separately based on the number of statistical tests. We used Bonferroni correction with significance level after adjustment 0.05.
All 48 tissue types available through GTEx were used in the analysis. Even though some tissues are certainly related, for example there are 13 tissues from different brain areas and 3 artery-derived tissues: artery aorta, artery coronary, and artery tibial all were analyzed separately as it was done by GTEx.

GWAS SNPs
We have used summary statistics from breast and lung cancer OncoArray GWASs [21,22]. Those two studies were selected because for both of them complete summary statistics were readily available. The breast GWAS summary statistics were downloaded from the Genome-Wide Repository of Associations Between SNPs and Phenotypes (GRASP) database [23]. Summary statistics for lung cancer GWAS is available for downloading from dbGaP: accession number phs001273.v1.p1. The sample size for the breast cancer GWAS was 122,977 cases and 105,974 controls. Lung OncoArray study included 29, 266 cases and 56,450 controls. The studies analyzed over 500 K SNPs directly genotyped by the OncoArray [24]. Directly genotyped SNPs include candidate SNPs for breast, colorectal, lung, ovarian and prostate cancers. The platform also includes~276 K backbone tag SNPs selected by OncoArray consortium to ensure reliable imputation of additional SNP [24]. Backbone SNPs are used as tag SNPs for imputation. Backbone SNPs are uniformly distributed across genome and generally show less linkage compared to all (directly genotyped plus imputed) OncoArray SNPs. We also used summary statistics from the scleroderma GWAS [25]. Scleroderma, or systemic sclerosis, is an autoimmune disease characterized by fibrosis of the skin and internal organs.
There was a substantial overlap between GTEx tested SNPs (Illumina OMNI 2.5 M SNP Array) and SNPs genotyped or imputed by breast and lung GWASs: 91% eQTLs were tested in breast and 92% in lung cancer GWAS. Only SNPs genotyped/imputed by both GTEx and the GWASs were used in the analysis. As a measure of statistical significance we have used -log 10 p where p is p-value.

Statistical analysis
Non-parametric Mann-Whitney U test was used to compare -log 10 p(s) between eQTL and non-eQTL GWAS SNPs. To illustrate the relationship between eQTLs and the level of statistical significance (−log 10 p) in stratified analyses (Figs. 2 and 4) we used means and standard error of mean (SE). For correlation analyses we have used Spearman rank order correlation tests. All statistical tests were implemented in Statistica (TIBCO Software Inc., 2017).

SNP's eQTL status and the level of statistical significance in GWAS
Nominally significant breast cancer GWAS SNPs were used in this analysis. Tables 1 and 2 as well as Additional  file 1: Tables S1-S3 show mean -log 10 p for the SNPs that are reported as eQTL versus SNPs that are not eQTLs in a given tissue. eQTL SNPs had higher -log 10 p regardless of the tissue specificity of the eQTL. We expected that breast tissue eQTLs will show the strongest -log 10 p inflation, based on the larger sample size of the original study to identify GWAS SNPs. We found, however, that breast eQTLs showed an average level of statistical significance compared to eQTLs for other tissue types. Similar to breast cancer, in lung cancer GWAS we also found that the SNPs reported as eQTLs tended to be more significant regardless of the tissue specificity of the eQTL ( Table 2). Lung tissue specific eQTLs were associated with an average (typical) inflation of -log 10 p compared to eQTLs specific for other tissue types.
SNPs showing eQTL activity in multiple tissues (pan-tissue eQTLs) exhibit a higher -log 10 p inflation eQTLs can be roughly divided into tissue specific (those reported as an eQTL in a single tissue type) and pantissue eQTLsthose showing eQTL activity across multiple tissues. Additional file 1: Table S1 shows the distribution of eQTL SNPs by the number of tissues where they are reported. Over 80% of eQTLs are tissue-specific while only a few SNPs show eQTL activity across all 48 tissues. GWAS SNPs were subdivided into 5 categories based on the number of tissues where a SNP is reported as eQTL: "0", "1", "2", "3" and "> 3" and mean -log 10 ps were computed in each category (Fig. 1). Figure 1a shows the result of the analyses of breast cancer GWAS SNPs and Fig. 1b -lung cancer GWAS SNPs. In both studies pantissue eQTLs show a higher inflation of -log 10 p compared to tissue specific eQTLs.

Statistical significance of a SNP in GWA studies is positively associated with the number of eQTLs in its adjacent region
We tested if the density of eQTLs in the adjacent ±5 kb region is associated with the level of statistical significance of the SNP in GWASs. The size of the adjacent region was selected because it is a typical size of haplotype blocks in the human genome [26]. SNPs were categorized by the number of eQTLs in the adjacent region and mean -log 10 p were estimated for each category (Fig. 2). We found that -log 10 p(s) for breast cancer (upper panel) SNPs were positively associated with the number of eQTLs in adjacent regions. There was a linear association in 0-6 eQTLs interval and after that the curve plateaued. For SNPs that themselves are not eQTLs (a) and eQTLs (b) the associations were similar. The results for lung cancer GWAS (c, d) were similar to the breast cancer GWAS results.
SNPs with a higher density of eQTLs in adjacent region are more likely to be located in regulatory regions High eQTL density may be indicative of high density of regulatory elements in the region. Based on the density of eQTLs in the adjacent region we subdivided GWAS SNPs into three categories: low density (no eQTLs detected in ±5 kb region), intermediate density (1-7 eQTLs); and high densityeight or more eQTLs in the adjacent region of the anchor SNP. The cut points for these categories were chosen to ensure similar sizes of the groups. Encyclopedia of DNA Elements (ENCODE) data [27] were used to identify transcription factor (TF) and miRNA binding sites [28]. Figure 3 shows the proportions of SNPs   Fig. 3) shows that eQTL SNPs are more likely to co-localize with TF binding sites (Fig. 3a) as well as miRNA binding sites (Fig. 3b).

Genome chopping
The detected positive association between eQTL density in adjacent regions of the SNPs and its level of statistical significance in GWAS can potentially be biased because of a non-uniform distribution of SNPs along chromosomes. If the SNP density is higher in a region of GWAS peaks, the 5 kb regions of the many SNPs located in the peak will overlap and, as a result, be overrepresented in the analysis. To assess the association between eQTL density and the level of statistical significance in non-overlapping regions we divided the human genome into consecutive (non-overlapping) 5 kb fragments starting from the first nucleotide of each chromosome. The total number of fragments was 558,455. About 70% of fragments (total 391,824) do not contain eQTLs. The highest number of eQTLs was detected in a fragment on chromosome 5, position 570,305,043-70,310,042 -117 eQTLs. Additional file 1: Figure S1a shows the distribution of the fragments by the number of eQTLs in them. Additional file 1: Figure S1b shows the distribution of fragments by the number of genotyped lung cancer GWAS SNPs in them. The mean and median numbers of SNPs per fragment are correspondingly 35.1 and 33. Similar results were obtained for breast cancer GWAS SNPs (Additional file 1: Figure S1c).
We observed a significant positive association between the number of eQTLs in non-overlapping fragments and the mean -log 10 p for the breast cancer GWAS SNPs from the corresponding fragments (Spearman Rank Order correlation R = 0.05, df = 558,455, p = 7.5 × 10 − 28 ). The correlation remained significant after the exclusion

Proportions of eQTLs in groups of SNPs categorized by the level of statistical significance in GWAS
The results of several studies suggest that eQTLs are more likely to be causal, risk-associated SNPs compared to non-eQTL SNPs [29][30][31][32]. If it is true, the proportion of eQTLs among more significant SNPs is expected to be higher. We subdivided GWAS SNPs into 4 categories based on the level of their statistical significance and estimated proportions of eQTLs in them: white noise -SNPs that do not reach the level of nominal significance p > 0.05; light gray SNPs -SNPs in the lower part of the gray zone those that are nominally significant but do not reach genome wide level of statistical significance 0.05 > p > 5 × 10 − 5 ; dark gray SNPs -SNPs in the upper part of the gray zone: 5 × 10 − 5 > p > 5 × 10 − 8 , and genome wide (GW) significant SNPsthose with p < 5 × 10 − 8 . We estimated the proportions of eQTLs in each category. The proportions of eQTLs were higher among more significant SNPs (Fig. 4a and c). We noted that eQTLs not only became more frequent as GWAS significance level went up, but they as well became more significant themselves: -log 10 q (where q is the p value for association between the number of variant alleles and gene expression adjusted for multiple testing) significantly increases from white noise to GW significant SNPs (Fig. 4b and d).
For a more granular analysis we categorized GWAS SNPs based on the level of statistical significance using 0.5 increments for -log 10 p (16 categories in total). Figure 5, left panel, shows results for breast cancer, and rightfor lung cancer GWAS. For nonsignificant SNPs (blue-shaded areas -those with -log 10 p between 0 and 1.3), the proportion of eQTLs was low and flat across all categories, with the average proportion of eQTLs 1.40 ± 0.01% in lung cancer and~2.11 ± 0.01 in breast cancer. For breast cancer GWAS gray zone SNPs (those with -log 10 p between 1.3 and 7.3), the average percentage of eQTLs was 2.14 ± 0.01%. For gray zone SNPs we observed a significant positive association between the proportion of eQTLs and -log 10 p (Spearman rank order correlation R = 0.95, N = 12, Fig. 2 The relationship between the number of eQTLs in ±5 kb adjacent region of the anchor SNP and the level of statistical significance. a Breast cancer gray zone SNPs that are not eQTLs. b Gray zone SNPs from breast cancer GWAS that are eQTLs. c Gray zone SNPs from lung cancer GWAS that are not eQTLs. d. Gray zone SNPs from lung cancer GWAS that are eQTLs. Vertical bars show standard error (SE) of the mean p = 2.1 × 10 − 6 ). Since the number of SNPs with genome wide level of statistical significance is relatively small, we combined them together. For lung cancer GWAS, the association between the level of GWAS significance and the proportion of eQTLs was similar to that in the breast cancer GWAS.

Analysis of backbone SNPs
The results generated by analyses of backbone SNPs were similar to those generated by the analyses of all OncoArray SNPs. Regardless of their tissue specificity backbone eQTL SNPs tended to be more significant than non-eQTL SNPs in both breast and lung GWASs (Additional file 1: Tables S2 and S3). Densities of eQTLs in the 5 kb adjacent region were positively associated with -log 10 p in both breast and lung backbone SNPs (Additional file 1: Figure S2).

Scleroderma GWAS
We analyzed summary statistics from scleroderma GWAS to check if findings from cancer GWASs hold for noncancerous disease. The results of the analysis of association between a SNP's eQTL status and the level of statistical significance in scleroderma GWAS were similar to the results for breast and lung cancer GWASs. SNPs reported as eQTLs tended to be more significant in scleroderma GWAS than non-eQTL SNPs regardless of the tissue specificity (Additional file 1: Table S4).
Similar to the analyses of breast and lung GWASs we found that eQTL density in the adjacent ±5 kb region was positively associated with the level of statistical significance (Additional file 1: Figure S3).

Discussion
We found that GWAS eQTL SNPs tended to be more significant compared to non-eQTL SNPs. Tissue-specific eQTLs (breast and lung eQTLs in this analysis) did not show a higher level of inflation in significance level compared to other tissues. The likely reason for the lack of tissue specificity may be that eQTLs often show multiple-tissue effects. Almost 20% of eQTLs have more than one target tissue. An overlap across different tissue types is stronger when less stringent criteria to define eQTLs are used [33]. When a SNP acts as a eQTL in multiple tissue types, the direction of the effect is the same in more than 97% cases [33]. Based on this observation one can suggest that eQTLs with a significant effect on gene expression in one tissue type often have a similar effect in other tissue types. eQTLs with pan-tissue effects are not currently very common because they may not have been all identified due to the small sample size of GTEx [20].
We found that the level of statistical significance of a SNP in GWAS is positively associated with the eQTL density in its adjacent region regardless of its eQTL status. We think that the reason for these associations can be that some SNPs that are not reported as eQTLs are, in fact, eQTLs (false negatives). This suggestion is We also noted that the proportion of eQTLs increases with the increasing level of statistical significance in GWAS and reaches plateau at the level of1 0 − 5 -10 − 6 . The simplest explanation to this can be that eQTLs have a higher probability to be causal, risk-associated SNPs and as a result categories with a high level of statistical significance have a higher proportion of eQTL SNPs. Proportions of eQTLs in the group may reflect the proportion of true positives. This analysis found that the proportion of eQTLs plateaued at the level of statistical significance about 10 − 5 , suggesting that the proportions of causal SNPs may be similar among dark gray SNPs and SNPs at the genome-wide level of statistical significance.
It is likely that the associations found between eQTL status/density and the level of statistical significance in cancer GWASs also hold for other phenotypes. This is supported by analysis of summary statistics for scleroderma GWAS. F.
The effect size of the association between eQTLs and the level of statistical significance was relatively small. This suggests that although the eQTL status of the SNP as well as eQTL density in the surrounding region can be useful in SNP prioritizing it would be better to use them in combination with other SNP characteristics associated with functionality, e.g. the level of evolutionary conservation of the site [34]. The limitation of our analysis is that -log 10 p is study specific (GWASs with a larger sample size are likely to have a larger for -log 10 ps) which makes it difficult to generalize exact shapes of SNP/eQTL relationships. The major findings of this study are: 1. eQTL SNPs are more significant in GWASs regardless of their tissue specificity; 2. Pan-tissue eQTLs are associated with a higher inflation of -log 10 p compared to tissue specific eQTLs; 3. SNPs located in regions of high eQTL density are more significant in GWAS regardless of their own eQTL status; 4. The probability of a SNP to be an eQTL is positively associated with the level of statistical significance in a GWAS. The association curve plateaued after -log 10 p~5 suggesting that SNPs from the dark gray zone (10 − 5 > p > 5 × 10 − 8 ) and SNPs at the genome wide level of statistical significance have a similar proportion of causal SNPs.

Conclusions
Our results suggest that a substantial subset of SNPs in the dark grey zone are eQTLs and therefore likely to be causally associated with disease development. Causal risk-associated SNPs from dark gray zone may not be detected by GWAS because of their smaller effect size and the limited sample sizes available from most GWAS studies. Nevertheless, SNPs that are associated with an increased risk for disease development should be included as a part of the polygenic risk score modeling process. Results that we have obtained suggest prioritizing SNPs for polygenic risk score modeling that are strongly or moderately associated with disease risk and act as eQTLs, particularly in multiple tissues.
Additional file 1: Table S1. Distribution of eQTL SNPs by the number of tissues where they are reported as eQTLs. eQTL SNPs with "Number of tissues" equal to one are tissue specific; others are pan-tissue. Table S2.
Mean -log 10 p for non-eQTL and eQTL backbone breast cancer OncoArray SNPs. The eQTLs are stratified by tissue types. Table S3. Mean -log 10 p for non-eQTL and eQTL backbone lung cancer OncoArray SNPs. The eQTLs are stratified by tissue types. Table S4. Mean -log 10 p for non-eQTL and eQTL scleroderma GWAS SNPs. The eQTLs are stratified by tissue types. Figure S1. a) The distribution of non-overlapping 5 kb fragments by the number of eQTLs. b) The distribution of 5 kb non-overlapping chromosomal fragments by the number of lung GWAS SNPs. c) Distribution of 5 kb nonoverlapping chromosomal fragments by the number of breast cancer GWAS SNPs. Figure S2. The relationship between the number of eQTLs in the ±5 kb adjacent region and the level of statistical significance of the backbone SNP. a. Breast cancer GWAS SNPs. b. Lung cancer GWAS SNPs. Shaded circle indicate SNPs with > 18 eQTLs in ±5 kb adjacent region. Vertical bars show standard error (SE) of the mean. Figure S3. The relationship between the number of eQTLs in ±5 kb adjacent region and the level of statistical significance of scleroderma SNP. Shaded circle indicate SNPs with > 18 eQTLs in ±5 kb adjacent region. Vertical bars show standard error (SE) of the mean. This work was supported in part by the National Institutes of Health P01 CA206980-01A1, U19 CA148127 and 1R56LM12371-01A1. This work is also supported by RR170048 grant from the Cancer Prevention Research Institute of Texas. The funders have no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.