Genome-wide expression studies are making significant contributions to the identification of risk genes for complex traits. Expression studies can help identify genes in linked and associated regions that are appropriate for follow-up with functional studies [1]. Most often, it is knowledge of gene expression in a relevant tissue that helps. That is, if a chromosomal region is linked to a phenotype such as an eye disorder, genes that are expressed in the eye become the prime candidates for further study. Genes that are overexpressed in the eyes of affected individuals compared to controls are also excellent candidates, as are genes that fall within the same biological network as a candidate gene. There are also studies where gene expression is the primary genetic data (that is, no markers have been genotyped for linkage or association studies) and expression is assessed genome-wide to identify patterns of expression. Studies of all these types usually do not include biological relatives, so that all observations are independent. The Genetic Analysis Workshop (GAW) 19 data, however, provide expression levels in multiple members of large pedigrees, giving an excellent opportunity to learn about the familiality of expression and to develop methods to adjust for it or capitalize on it in statistical analyses.
Early microarray analyses
Genetic epidemiologists currently have the ability to successfully quantify transcript abundance of messenger RNA (mRNA), genome-wide, using microarray technologies [2]. For a given gene, and among all genes, mRNA abundance is quite variable, with substantial differences among individuals, tissues, and time periods over a life span [3]. The wealth of data generated over multiple tissues and time points by the recently developed technologies permits investigators to design and conduct studies that promise to substantially improve our understanding of factors that influence mRNA levels. This should then, among other goals, lead to the identification of the elements responsible for their regulation. We anticipate that this growing insight and information will ultimately lead to more precise predictions of gene expression levels by revealing the genetic contributors to regulation, and by providing clarification of how genetic factors act through gene expression to contribute to protein levels and human phenotypes. That is, identifying the ways in which transcript variation is regulated and quantifying the interrelationships of mRNA abundance among genes is expected to help us understand how gene expression contributes to variation in complex human traits. The development of this information is likely to involve a long and intense process, and we are currently in its early stages. However, a great deal of experimentation and analytic work has already been done, and it will facilitate the accuracy, speed and breadth of analyses that contribute to this overarching research aim.
The gene expression group of 9 GAW19 papers tackled some timely and important questions that should contribute to this aim by using established tools developed by others, developing newer tools, and developing extensions to these tools. In addition to this work, the papers we summarize here also evaluated the type 1 and type 2 errors of analytic methods used, provided analytic tools for the research community, and conducted analyses to better understand biological aspects of gene expression. Here we only present a summary of the papers that is designed to place the GAW19 gene expression studies within the context of this broad and evolving field. We hope to help the reader interpret GAW19 investigations and their results. Because this paper is a summary, we encourage those who are interested to read the individual GAW19 gene expression papers for relevant details and to assess the motivations for these works. To place the GAW19 gene expression papers within a larger context, we begin by summarizing some of the methods developed prior to GAW19 that have been applied to gene expression levels, historically, and follow this with a more in-depth presentation of some of the analytic methods used in the GAW19 papers from this group. We then provide a discussion of how the findings reported in these papers can impact the field.
Analytic methods for research using expression data derived from microarray technologies have been under development since the inception of these technologies in the early 2000s and throughout their ongoing refinement. The arrays have almost always been used on samples of independent individuals, so most of the methods developed prior to those of our GAW19 gene expression group did not capitalize on the possibilities offered by large numbers of nonindependent samples, such as those from pedigrees. An exception is the early studies of expression quantitative trait loci (eQTL), which used pedigree linkage analyses to map regulatory elements, although most of the later investigations used association studies with single nucleotide polymorphism (SNP) arrays in samples of independent individuals. eQTL studies are addressed in greater detail below.
Several of the initial statistical methods we mention for gene expression analyses were already available and obvious choices, and others were developed or adapted specifically to address gene expression questions. A salient feature is that expression arrays allow us to query expression levels across the entire genome, simultaneously, giving a much broader view than was previously feasible. Unfortunately, along with this ability, multiple testing, which is tied to the number of probes measured in each study sample, becomes a challenge. For GAW19, approximately 22,000 probes were used; more current array data would include expression measures from 450,000 probes.
Identifying gene expression differences using false discovery rates
Initially, because of cost, the sample sizes of the studies conducted were relatively small, sometimes as small as 30 individuals, and the extensive multiple testing made low statistical power a prohibitive problem. However, the false discovery rate (FDR), which is less stringent than the family-wise error rate addressed by a Bonferroni correction, was subsequently applied to analyses of the genome-wide expression data. The FDR is set in advance by the investigator to allow for a particular expected proportion of false positives among the reported positive results. For example, the FDR may be set to 0.05, which would allow 5 % of the tests reported as positive to be incorrect. This statistical criterion was formally described by Benjamini and Hochberg [4], was applied to gene expression studies by Storey and Tibshirani [5], and was the first alternative to the family-wise error rate to gain broad acceptance.
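To make the contrast between the two error criteria concrete, the following minimal Python sketch applies the Benjamini-Hochberg step-up rule and a Bonferroni correction to the same set of p-values. The function names and the ten toy p-values are illustrative assumptions and are not drawn from any GAW19 analysis.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag tests declared positive by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared positive.
    positive = [False] * m
    for idx in order[:cutoff_rank]:
        positive[idx] = True
    return positive


def bonferroni(p_values, alpha=0.05):
    """Flag tests that pass the family-wise Bonferroni threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Hypothetical p-values for ten probes (illustration only).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
bh_positive = benjamini_hochberg(toy_p)
bonferroni_positive = bonferroni(toy_p)
print(sum(bh_positive), "probes pass the FDR criterion")
print(sum(bonferroni_positive), "probes pass the Bonferroni criterion")
```

With these toy values, two probes pass the FDR criterion and only one passes the Bonferroni criterion, which illustrates why the FDR is described as the less stringent of the two.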
Cleaning microarray data
In addition to this fundamental change in the statistical approach to the control of errors, there was a significant focus on identifying the best methods to clean the array-based expression data. Cleaning involved the identification and removal of outliers resulting from systematic errors in the application of arrays, such as placing cases on separate arrays from controls, and from individual errors, such as the poor preparation of RNA. Storey first described expression heterogeneity in his surrogate variable analysis paper [6], referencing the importance of identifying the sources of batch effects. The cleaning methods are now well developed, although one must always be cognizant of where possible bias could be introduced, and address whether there are factors leading to batch effects in generating expression levels that could affect the results of a study.
Comparing gene expression in different contexts
Early array-based gene expression studies primarily compared expression levels within different contexts, such as the presence or absence of a disease or the presence or absence of a treatment to cells from individuals in the same disease state [7]. This work is done in case–control studies. T-tests, analyses of variance, and nonparametric versions of these tests were used to identify significant differences in the expression of genes between the two states. Significant differences in gene expression could then be used to identify genes and pathways that are involved in the disease state or the response to treatment [8].
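As an illustration of the per-gene comparison described above, the following minimal Python sketch computes a Welch two-sample t statistic for each gene and ranks genes by the size of the difference. The gene names and expression values are hypothetical, and Welch's unequal-variance form is only one of several t-test variants that could be used. Converting each statistic to a p-value would additionally require the t distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch two-sample t statistic (unequal variances) for one gene."""
    standard_error = sqrt(
        variance(cases) / len(cases) + variance(controls) / len(controls)
    )
    return (mean(cases) - mean(controls)) / standard_error


# Hypothetical log-scale expression levels: (cases, controls) for each gene.
expression = {
    "GENE_A": ([8.1, 8.4, 7.9, 8.6], [6.0, 6.3, 5.8, 6.1]),
    "GENE_B": ([5.0, 5.2, 4.9, 5.1], [5.1, 4.9, 5.0, 5.2]),
}

t_stats = {
    gene: welch_t(cases, controls)
    for gene, (cases, controls) in expression.items()
}

# Rank genes by the absolute size of the statistic, largest first.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)
for gene in ranked:
    print(gene, round(t_stats[gene], 2))
```

In a genome-wide analysis the same statistic would be computed for every probe, and the resulting p-values would then be subjected to a multiple-testing correction.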
Finding clusters of genes with similar expression patterns in different states
Early analytic approaches also included methods to cluster genes based on similarities or correlations in their expression levels. The goal was to reveal similarities or differences in coordinated gene expression under different states. Clustering allowed the subdivision of the whole set of genes based on which ones were expressed at higher levels and which ones at lower levels. These similarities are likely to reflect similarities in gene function. For example, in a person who has an infection, cluster analyses of gene expression are likely to show that the expression of certain immune response genes is elevated in a similar fashion. One could apply a treatment and observe whether the changes in gene expression revealed anything about the biology of the infectious agent or the response to treatment. That is, clustering applied established analytic techniques to the quantitative gene expression levels to assess their similarity and to group the genes accordingly. A set of genes that are all highly expressed when compared to the other genes on the array will be in the same cluster, and this might derive from the impact of a similar genetic or environmental factor, or both. This approach clearly answers a different question than asking whether the genes are differentially expressed within differing contexts. It is also different from clustering subjects based on their gene expression levels, which is what is done in evolutionary studies to find similarity among species.
The 2 standard approaches used to find genes that are related through similarity in expression levels are hierarchical and K-means clustering [9], [10]. Key questions should be addressed prior to conducting cluster analyses. These include whether to analyze all genes measured on the arrays or only a selection, as there are genes that will add noise to the analysis because they do not contribute to the clusters. A prior understanding of the biological process under analysis can help identify the genes to select. In addition, nonindependence of replicate samples can lead to biased results and individual study designs may have to be developed to achieve a sufficient sample size such that the results will not be vulnerable to this factor. This is particularly true in pedigree data, which is a key feature of the GAW19 expression data.
Hierarchical clustering lends itself to an easily interpreted visual display, the dendrogram. Here the individuals in the study are used to generate the gene expression data, and the analyses are conducted to cluster genes with similar expression patterns. The clusters generated by this method, however, are fairly imprecise, in that small changes in expression levels can result in different dendrograms. There are 2 methodological approaches. The first, agglomerative hierarchical clustering, is a “bottom up” approach in which each observation starts in its own cluster and pairs of clusters are merged as one moves up the hierarchy, based on applying an algorithm to the gene expression levels. The second, divisive hierarchical clustering, is a “top down” approach in which all observations start within 1 cluster and the clusters are split recursively as one travels down the developing hierarchy. An alternative approach, K-means clustering [11], requires the investigator to set in advance the number of clusters into which the genes will fall. One begins with an initial partition, and the assignments are updated iteratively until a final criterion is reached.
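The two clustering strategies described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any GAW19 paper: the agglomerative version uses average linkage on a distance of 1 minus the Pearson correlation, the K-means version uses squared Euclidean distance with the starting centroids chosen by hand, and the four gene profiles are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def agglomerative(profiles, n_clusters):
    """Bottom-up clustering: start with one cluster per gene, merge closest pairs."""
    clusters = [[gene] for gene in profiles]

    def distance(c1, c2):
        # Average linkage on the distance 1 - correlation.
        return mean(1 - pearson(profiles[a], profiles[b]) for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def kmeans(profiles, initial_genes, max_iter=100):
    """K-means: the number of clusters is fixed in advance by the initial centroids."""
    centroids = [list(profiles[gene]) for gene in initial_genes]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {}
        for gene, values in profiles.items():
            dists = [sum((v - c) ** 2 for v, c in zip(values, centroid))
                     for centroid in centroids]
            new_assignment[gene] = dists.index(min(dists))
        if new_assignment == assignment:
            break  # assignments stopped changing: the final criterion is met
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [profiles[g] for g in assignment if assignment[g] == k]
            if members:
                centroids[k] = [mean(column) for column in zip(*members)]
    return assignment


# Hypothetical expression of four genes measured in five individuals.
profiles = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [2, 4, 6, 8, 11],
    "G3": [5, 4, 3, 2, 1],
    "G4": [9, 8, 6, 4, 3],
}
hier_clusters = agglomerative(profiles, n_clusters=2)
kmeans_assignment = kmeans(profiles, initial_genes=["G1", "G3"])
print(hier_clusters)
print(kmeans_assignment)
```

In this toy example both methods group G1 with G2 and G3 with G4. On real data the two methods need not agree, and the result of K-means depends on the chosen number of clusters and on the initial partition.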
Weighted gene coexpression network analysis: reducing the dimensionality of gene expression data
As time passed and more expression data were generated, gene expression microarray technology matured, opportunities to develop novel analytic approaches presented themselves, and more sophisticated analytic methods were developed. Weighted gene coexpression network analysis (WGCNA) is one such widely used method, and it was employed by a number of the contributors in our GAW19 group. The method is designed to construct gene networks from the pairwise correlations of expression data [12]. WGCNA allows for the incorporation of context differences and trait values with the gene expression summary measures. WGCNA is presented in much greater detail in the “Methods” section below. More recently, other molecular biology approaches to measure gene expression have become available. RNA sequence data allow for the integration of expression and genotype information measured simultaneously. However, the GAW19 data were array based.
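The step of building a weighted network from pairwise correlations can be illustrated with a minimal Python sketch. It assumes the commonly used unsigned soft-threshold adjacency, in which the absolute Pearson correlation between two genes is raised to a power beta. The value beta = 6, the function names, and the toy profiles are assumptions made for illustration, and the sketch omits the later WGCNA steps such as module detection.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency: |correlation| ** beta for each pair of genes."""
    genes = list(profiles)
    adjacency = {gene: {} for gene in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            weight = abs(pearson(profiles[gi], profiles[gj])) ** beta
            adjacency[gi][gj] = weight
            adjacency[gj][gi] = weight  # the network is undirected
    return adjacency


def connectivity(adjacency):
    """Sum of connection weights between each gene and all other genes."""
    return {gene: sum(weights.values()) for gene, weights in adjacency.items()}


# Hypothetical expression of four genes measured in five individuals.
profiles = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [2, 4, 6, 8, 11],
    "G3": [5, 4, 3, 2, 1],
    "G5": [3, 1, 4, 1, 5],
}
adjacency = soft_threshold_adjacency(profiles)
k = connectivity(adjacency)
for gene in profiles:
    print(gene, round(k[gene], 3))
```

Raising the correlation to a power keeps strong correlations close to 1 while pushing weak ones toward 0, so that, in this toy example, the weakly correlated gene G5 ends up with very low connectivity. Because the adjacency here is unsigned, the strongly negatively correlated pair G1 and G3 is treated as strongly connected.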
Identifying genetic contributors to gene regulation: expression quantitative trait loci
eQTL have been studied extensively [13]. Their identification is essential to the search for genetic contributors to gene regulation [14]. eQTL analyses treat quantified gene expression like any other phenotypic trait, so that genetic markers can be used for gene mapping through linkage and/or association. The key difference is that microarrays generate many such traits, for many genes throughout the genome. Analytically, each of these traits is analyzed in the same way as other quantitative phenotypes such as height and weight. Thus, expression traits can be adjusted for covariates and transformed to achieve a normal distribution for quantitative trait loci (QTL) analysis. A consequence of the number of traits is that a substantial correction for multiple testing is needed to identify eQTL: the expression levels are usually all tested for linkage or association with all of the SNPs available in the same study sample, so that the number of tests is the product of the number of gene probes and the number of SNPs. eQTL, whether discovered by linkage or association, identify loci that harbor genetic elements that regulate the expression of the gene under analysis. Those that are close to the gene tested (usually within 50,000 base pairs to 1 megabase, depending upon the preference of the investigator) are classified as cis loci, whereas those anywhere else in the genome are classified as trans loci [15].
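The size of the multiple-testing burden and the cis/trans distinction can be made concrete with a short Python sketch. The probe count is the approximate figure cited for GAW19. The number of SNPs, the genomic coordinates, the function name, and the choice to measure distance from the gene boundaries are hypothetical assumptions made for illustration.

```python
def classify_eqtl(snp_chrom, snp_pos, gene_chrom, gene_start, gene_end,
                  window=1_000_000):
    """Label a SNP-gene pair as cis if the SNP lies within `window` bp of the gene."""
    if snp_chrom != gene_chrom:
        return "trans"
    if gene_start - window <= snp_pos <= gene_end + window:
        return "cis"
    return "trans"


# The testing burden is the product of the number of probes and the number of SNPs.
n_probes = 22_000        # approximate number of expression probes cited for GAW19
n_snps = 1_000_000       # hypothetical number of SNPs tested
n_tests = n_probes * n_snps
bonferroni_threshold = 0.05 / n_tests
print(f"{n_tests:,} tests; Bonferroni threshold = {bonferroni_threshold:.2e}")

# Hypothetical gene on chromosome 1 spanning 2,000,000 to 2,050,000 bp.
print(classify_eqtl("1", 1_500_000, "1", 2_000_000, 2_050_000))
print(classify_eqtl("1", 1_500_000, "1", 2_000_000, 2_050_000, window=50_000))
print(classify_eqtl("7", 1_500_000, "1", 2_000_000, 2_050_000))
```

The same SNP is labeled cis under a 1 megabase window and trans under a 50,000 base pair window, which shows how the classification depends on the window preferred by the investigator.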
The GAW19 data provided by the workshop organizers included gene expression levels measured in individuals from 20 pedigrees that were ascertained for individuals with type 2 diabetes. Additional data included both simulated and real longitudinal measures of systolic blood pressure (SBP) and diastolic blood pressure (DBP), and whole genome sequence data that have been imputed within the 20 pedigrees. The data are described in detail in the accompanying summary publication [16]. A prior manuscript [17] analyzing a larger sample of the data derived from the San Antonio Family Heart Study reports that 85 % of lymphocyte expression levels were significantly heritable, making them appropriate candidate traits for eQTL analyses in the GAW19 pedigrees. In that manuscript, heritability varied substantially among the transcript levels, and the median was 22.5 %. In the published analysis, eQTL were identified by mapping the transcript levels using the SOLAR (Sequential Oligogenic Linkage Analysis Routines) software [18] to conduct linkage analyses.
GAW19 gene expression group analyses
The 9 papers contributed to GAW19 by our gene expression group explored 3 aspects of gene expression. The first group of papers considered the expression values in the pedigree members without incorporating genotype or trait data into the analysis. The questions explored involved identifying aspects of the correlation structure of the expression levels of the thousands of genes measured. Analytic approaches to accomplish this included principal components analysis [19], [20], WGCNA [20], meta-analyses [20], gene enrichment analyses [20], and linear mixed models [19], [20].
The second group of papers explored the genetics of gene expression by incorporating SNPs and rare-variant genotypes into the gene expression analyses to better allow us to identify contributors to gene regulation. Factors addressed included eQTL complexity [21], the feasibility of applying allele-specific binding (ASB) to filter potential regulatory SNPs [22], and epistatic interactions of eQTL [23]. Analytic approaches to conduct these investigations included linear mixed models [21], measured genotypes in pedigrees [21], permutation tests [23], and covariance kernels [22].
The third group of papers incorporated both genotype and phenotype data into the gene expression analyses to understand the effects of gene expression and/or genetic variation on phenotypic traits. Genome-wide gene expression was used (a) to predict blood pressure phenotypes via its associations with the SNP genotypes [24], (b) to predict hypertension [25], (c) in the joint analysis of blood pressure traits with sequence data [26], and (d) to identify causal models that include blood pressure traits and genotypes with the expression levels. Analytic methods employed in this work included linear mixed models [24], [27], nonparametric weighted U statistics [26], structural equation modeling [27], Bayesian unified frameworks [27], and multiple regression [25].