Several GWI methods have been developed [5–12], and previous studies have well documented the accuracy of these methods with respect to imputing missing genotypes, as well as untyped SNPs. We find BEAGLE [11] and MACH [7] algorithms are able to impute missing genotypes with high accuracy. On our dataset, MACH slightly outperformed BEAGLE in imputing missing genotypes (Additional file 1). These results are consistent with intensive comparison on existing imputation methods [11, 13], including MACH, BEAGLE, IMPUTE, fastPHASE and PLINK. These methods perform quite similarly although MACH and IMPUTE are the best in term of accuracy. We adopted the MACH algorithm for all subsequent analyses, and mainly focus on imputing markers that are totally untyped.
Imputation yield and accuracy
We employed the CATIE (Clinical Antipsychotic Trials of Intervention Effectiveness) and an expanded version of the deLiver study datasets [14, 15], because these cohorts have several appealing characteristics. First, both studies contain Caucasian and African American subjects. Second, both studies utilized advanced microarray products: samples in CATIE were genotyped using the Affymetrix 500 K + 164 K custom array and deLiver samples were genotyped using the Illumina HumanHap 650Y (Ilmn650Y). From Ilmn650Y, Illumina HumanHap 317 K (Ilmn317K) data can be derived (see materials and methods). Finally, approximately 40,000 mRNA expression traits were measured on the deLiver cohort, providing a unique opportunity to assess statistical power empirically for GWAS when incorporating WGI data.
We conducted GWI on 384 deLiver study Caucasian American subjects (see materials and methods), where untyped HapMap markers were imputed based on the HapMap CEU reference [6, 7, 16]. For each untyped SNP in the HapMap marker set, MACH output both predicted genotypes and a quality score (QS). The majority of the imputed SNPs had a high QS. For example, 88% of the untyped SNPs had a QS ≥ 0.8 (Figure 1, left panel, red dashed line). When we conducted imputation only with SNPs on the Ilmn317K array, considerably lower quality scores obtained, suggesting that GWI benefits from the increased coverage achieved with the higher density arrays (more assayed SNPs to begin with). The Ilmn317K array is a tag SNP array developed to maximize genetic coverage. To construct the Ilmn650Y array, additional tag SNPs were appended to the Ilmn317K set. By design, these additional tag SNPs have weaker average correlations (LD) with SNPs in the Ilmn317K set. Consequently, imputations based on the Ilmn317K were not as accurate (Figure 1, left panel, black dotted line).
To assess the accuracy of the imputed genotypes, we randomly selected 1% SNPs from the Ilmn317K panel and set all patients' genotypes on these 1% SNPs as "untyped" (in other words, we masked out these 1% SNPs). Afterwards we imputed the corresponding genotypes for these SNPs using the remaining SNPs in the Ilmn317K or Ilmn650Y sets, respectively (Figure 1, middle panel). We found the 1% random masking had minor impact on imputation performance (Additional file 2). Comparing the imputed and observed genotypes for the 1% masked SNPs, we found the GWI accuracy was very high. At a QS = 0.8, imputations based on the Ilmn650Y genotypes achieved an accuracy of 97.7%. Interestingly, even at the same QS threshold, imputations based on the Ilmn317K genotypes were less accurate than those based on the Ilmn650Y genotypes. Using the masked SNPs, we estimated that 89% of the untyped HapMap SNPs could be imputed with 98% accuracy with the Ilmn650Y array (Figure 1, right panel). Because the accuracy estimation is based on a large number of comparisons (N = 384 subjects × 3,000 SNP ≈ 1E6), the estimation is very stable and exhibits little variance.
In addition to the deLiver cohort, we ran GWI on 400 CATIE subjects. The HapMap CEU and YRI were used as the reference sets for the Caucasian and African American individuals, respectively. For the Caucasian Americans in this cohort, imputations based on the Affx500K + custom array genotypes gave performance better than Ilmn317K, but worse than Ilmn650Y in the deLiver cohort (Figure 2, left panel). Given the relatively weak inter-marker LD in African Americans (compared to Caucasian Americans) as well as the potential population admixture (Additional file 3), GWI on African American samples resulted in a considerable reduction in accuracy (compared to results achieved in the Caucasian American samples). The Eigenstrat method [17] detected an admixture of European and African genetic components in the African American subjects (see materials and methods). Given this type of admixture, it is natural to use a pooled reference (HapMap CEU + YRI) for the GWI. We found the pooled reference greatly boosted GWI performance in African Americans, although still below the Caucasian American samples (Figure 2, middle and right panels). This finding is consistent with the report of Guan et al, who pointed out that the GWI accuracy could be relatively robust as long as the reference panel contained at least some individuals with genetic variation representative of the study cohort [12]. Among the African Americans in the CATIE set, 75% of the untyped SNPs in the HapMap set were imputed with 96% accuracy using the pooled reference. In contrast, all HapMap SNPs can be imputed at this accuracy level in Caucasian Americans.
The statistical power of imputation-based association tests
Imputed genotypes may be useful in a variety of downstream analyses. One notable application is in testing for association with phenotypes of interest. Perhaps the most important question in this context is whether incorporating GWI genotypes enhances the statistical power to detect associations compared to the power achieved using only the assayed SNPs. Previous studies have used theoretical arguments and simulation studies to address this issue [5, 18]. Conditioning on a number of assumptions, GWI was found to increase power moderately over that achieved by the Affx500K set. With the ~40,000 gene expression traits scored in the deLiver cohort, we attempted to quantify the power to detect expression quantitative trait loci (eQTL) using GWI. The large number of phenotypes scored in this cohort provides a path to estimate the power empirically, reducing the dependency on theoretical arguments and simulations where underlying assumptions may not precisely hold.
For the power comparisons, we focused only on cis eQTLs [19](i.e., associations in which the structural gene corresponding to the expression trait and the associated SNP are within 1 million base pairs), given the sample size was too small to capture a significant proportion of the trans eQTLs (i.e., structural gene corresponding to the expression trait and the associated SNP are more than 1 million base pairs away or are located on different chromosomes)[15]. This strategy has been previously applied to benchmark statistical power of SNP arrays in GWAS [3]. In brief, at a fixed false discovery rate (FDR), the number of cis eQTLs detected by a given method can be interpreted as the relative statistical power in the GWAS for the method. The underlying rationale is straightforward. In real data, although we could not determine whether a particular discovery was true or false, at a give FDR (e.g. 10%) we know the proportion (e.g. 90%) of discoveries that are true. Therefore, at a fixed FDR, when two methods result in a different number of discoveries (termed as N1 and N2) there would be (1-FDR)*N1 and (1-FDR)*N2 true findings and N1/N2 is proportional to the relative power of the two methods. Towards that end, single-marker Kruskal-Wallis association tests were conducted to detect the cis eQTL for each of the ~40,000 gene expression traits profiled in the deLiver cohort. To empirically estimate the FDR we repeated these tests on permuted gene expression data sets. In each permutation run, we first randomized the patient IDs in the expression file, breaking any association between expression traits and genotypes while leaving the respective correlation structures among gene expression traits and SNP genotypes intact. Then we repeated the association tests for every expression trait and genotype pair in the permuted sets, leading to a set of null statistics for each permutation. A standard FDR estimator was then applied to the resulting association statistics, as previously carried out on observed and permutation null statistics [20].
We compared four different strategies in GWAS of gene expression phenotypes: (1) directly testing for associations using the Ilmn317K SNPs, (2) testing for associations using the entire imputed HapMap SNP set based on the Ilmn317K genotype data; (3) directly testing for associations using the Ilmn650Y SNPs; and (4) testing for associations using the entire imputed HapMap SNP set based on Ilmn650Y genotype data. After the imputation step we filtered out SNPs with a low QS (i.e., QS < QScutoff). We found that the statistical power of the GWAS on the gene expression traits was insensitive to the QScutoff, where QScuttoff ∈ [0.1, 0.9] gave similar results (Additional file 4). Figure 3 highlights the number of cis eQTLs (interpreted as relative power) at a QScuttoff = 0.3 according to the suggestion of the Mach's authors. Interestingly, GWI only modestly improved power (by 5.5% and 3.3% for Ilmn317K and Ilmn650Y, respectively) comparing to the analyses with assayed SNPs only. It should be noted that the Ilmn317K + GWI offer less power than the Ilmn650Y itself, suggesting higher density arrays cannot be totally replaced by imputation.
Resolution of imputation-based association peaks
Because imputation improves the power to detect eQTL, we explored whether the significance levels for cis eQTL p values were significantly improved and whether such improvements enhanced the resolution of the association peaks. In the deLiver cohort genotyped on the Ilmn650Y array, there is only a modest improvement in power (3.3%) to detect cis eQTL. That is, almost all of the cis eQTL identified in the GWI data set were detected when only assayed SNPs were looked at. However, we found that 43.2% of the cis eQTL got smaller p-values when incorporating imputation compared to the results using assayed SNPs only. The significance levels for roughly 10% and 4% of the cis eQTLs improved by more than one and two orders of magnitude, respectively. If in these cases the GWI data were providing SNPs that were in stronger LD with the causal variants, we would expect that for the cis eQTL the most significantly associated SNPs would be found closer to the structural gene. This in fact was exactly what we found.
We classified cis eQTL detected by Ilmn650Y assayed SNPs into two categories: (1) transcriptional start site (TSS) eQTL in which the QTL peak was closer to the TSS of the gene than to the transcriptional end site (TES); and (2) TES eQTL in which the eQTL peak was closer to the TES of the gene than to the TSS. For TSS eQTL the overall median distance of the eQTL peak to the TSS moved from -13.8 Kb to -9.9 Kb by incorporating imputation, where the negative distance implies the eQTLs clustered just upstream of the TSS. More striking were the TSS eQTLs whose significance level gained at least one order of magnitude. In this case the median distance to the TSS shifted from -19.4 Kb to +1.6 Kb. For TSS QTLs whose significance level gained at least two orders of magnitude, the median distance to the TSS shifted from -19.8 Kb to +1.0 Kb. For TES eQTLs, the overall median distance of the eQTL peak to the TES moved from +15.0 Kb to +7.3 Kb by incorporating imputation, where the positive sign implies the eQTLs clustered just downstream of the TES. For TES QTLs whose significance level gained at least one order of magnitude, the median distance to the TES moved from +17.8 Kb to -0.3 Kb. For TES QTLs whose significance level gained at least two orders of magnitude, the median distance to the TES moved from +25.9 Kb to -0.8 Kb. Figure 4 makes it clear that the location of the cis eQTL peaks form a bimodal distribution around the TSS and TES sites, with the middle regions of the genes having fewer cis eQTL than the regions surrounding the TSS and TES sites. From another viewpoint, we saw the strong association hits resided close to the structural gene (Additional file 5).