A quantitative linkage score for an association study following a linkage analysis

Background: Currently, a commonly used strategy for mapping complex quantitative traits is to use a genome-wide linkage analysis to narrow suspected genes to regions on a scale of centiMorgans (cM), followed by an association analysis to fine map the genetic variation in regions showing linkage. Two important questions arise in the design and the resulting inference at the association stage of this sequential procedure: (1) how should we design an efficient association study given the information provided by the previous linkage study? and (2) can an association in a linkage region explain, in part, the detected linkage signal? Results: We derive a quantitative linkage score (QLS) based on Haseman-Elston regression (Haseman and Elston 1972) and make use of this score to address both questions. In designing an association study, the selection of a subsample from the linkage study sample can be guided by the linkage information summarized in the QLS. When heterogeneity exists, we show that selection based on the QLS can increase the proportion of sample individuals from the subpopulation affected by a disease allele and therefore greatly improve the power of the association study. For the resulting inference, we frame as a hypothesis test the question of whether a linkage signal in a region can be in part explained by a marker allele. A simple one-sided paired t-statistic is defined by comparing the two sets of QLSs obtained with and without modeling a marker association: a significant difference indicates that the marker can at least partly account for the detected linkage. We also show that this statistic can be used to detect a spurious association. Conclusion: All our results suggest that a careful examination of QLSs should be helpful for understanding the results of both association and linkage studies.


Background
Identifying genes underlying complex quantitative traits, which are often heterogeneous and multifactorial, is still a great challenge in genetic epidemiology studies. Currently, a commonly used strategy for mapping complex traits is to use a genome-wide linkage analysis to narrow suspected genes to regions on a scale of centiMorgans (cM), followed by an association analysis to fine map the genetic variation in regions showing linkage. At the association stage of this sequential process, we are often interested in two questions: (1) how should we design a powerful and efficient association study given the information provided by the previous linkage study? and (2) can an association in a linkage region explain, in part, the detected linkage signal? Although these questions that arise respectively at the design and inference stages are two quite different aspects of an association study, they are related because both questions essentially rely on the interdependence of linkage and association. Here, we derive a quantitative linkage score (QLS) from Haseman-Elston linkage regression [1] and make use of this score to address both questions in the scenario of analyzing a complex quantitative trait.
The loci predisposing to a complex quantitative trait are usually expected to have small effects. One important reason for this, among others, is heterogeneity of the phenotype, where an allele of interest may have no effect on some individuals because they have different genetic and environmental backgrounds. If these individuals are included in the sample used in the association study, the effect of the examined allele is "diluted" and this leads to great difficulty in detecting association. Careful selection of individuals from the sample to exclude such possible "dilution" should presumably provide greater power. Ideally, we should like to find a variable, such as age, sex or ethnicity, that indicates heterogeneous persons. Unfortunately, such an indicator variable is often unclear or unavailable for a complex trait. Nevertheless, if an association study follows a linkage study, selection of the sample for the association study may be guided by the linkage information already obtained, using the linkage signal as a natural heterogeneity indicator. This idea has long been recognized and implemented in practice [2][3][4]. Fingerlin et al. (2004) systematically examined the selection of cases for a case-control association study based on allele-sharing information provided by affected members of a family [5]. We focus here on sample selection for an association study of a quantitative trait and show the usefulness of the QLS when heterogeneity exists.
After an association has been detected between the trait and a marker allele in the region of linkage, the question of whether this association accounts, in part, for the previously found linkage signal is not trivial. If the allele statistically associated with the trait is partly responsible for the linkage, we may be more confident that this allele is itself functional or in linkage disequilibrium with the true functional variant, rather than a false discovery resulting from other causes. On the other hand, if the associated allele cannot explain any linkage signal, we may consider adding more association markers to the region in order to avoid missing a possible genetic variant affecting the trait of interest. In the case of affected sibs (or other affected relatives) used for linkage analysis, one approach is to examine the difference in the allele sharing identical by descent (IBD) between members of families selected on the basis of the associated marker [2,6]. We address this question for a quantitative trait by testing whether there is a significant difference between the QLS with and without including this marker in the model. We show that this test is essentially the same as examining the interaction between the linkage and association signals and therefore is related to the genotype-IBD sharing test (GIST) proposed by Li et al. (2004) for affected sibship data [6]. Fulker (1999) proposed a similar idea, in the context of a variance component model, simultaneously modeling the association and linkage in the mean and variance-covariance structure of a family [7]. 
That work focused on testing a similar, but different, hypothesis: whether the allele is the true candidate or is merely in disequilibrium with the trait locus. This is done by comparing a model with all the parameters freely estimated to a model in which the linked genetic variance of the quantitative trait locus (QTL) is set to zero, on the assumption that there is a single variant responsible for the linkage signal [8].
In this paper, we propose a linkage score derived from quantitative trait linkage analysis that has important applications when an association study follows a linkage analysis. Although the linkage score derived here can be easily extended to general families, to implement our approach we focus here on nuclear families. We first derive the linkage score in the Methods section. Then we perform computer simulations to examine the usefulness of this score for selecting a sample for an association study when heterogeneity exists, and for clarifying whether the association can, at least in part, explain the linkage signal.

Methods
Our goal is to derive a score that captures the linkage information for quantitative traits in a way that will be useful for a follow-up association study. For simplicity of presentation, we assume the quantitative trait value may be affected by the presence of an allele without any other covariates present, which is not a necessary limitation for our derivation. We suppose linkage markers have been genotyped for family members and therefore the proportion of alleles shared IBD at a particular location can be estimated for all pairs of relatives in a pedigree [9,10].

Quantitative linkage score (QLS)
We first derive the QLS. Suppose we have recruited N sibships. The trait value $y_{ik}$ of sib $i$ ($i = 1, \ldots, n_k$) in sibship $k$ ($k = 1, \ldots, N$) is modeled by

$$y_{ik} = \mu_k + x_{ik} b + e_{ik}, \qquad (1)$$

where $\mu_k$ is the sibship-specific mean, which absorbs family-level effects such as polygenic and common environmental effects [11]; $b$ is the effect of the quantitative trait locus (QTL), which may include both additive and dominant effects; $x_{ik}$ is the corresponding vector of design variables indicating the genotype of the QTL; and $e_{ik}$ is an individual-level random effect. For simplicity of exposition only, we assume the QTL effect is additive and therefore $x_{ik}$ can be coded as one variable to indicate the number of copies of the allele of interest. Otherwise, it can be coded as a vector with two elements, for additive and dominant effects, respectively. Because in a linkage analysis the genotype of a QTL ($x_{ik}$) is not observed (or the marker cannot be assumed to be in linkage disequilibrium with the QTL), we are not able to estimate $b$ directly. However, we can model the QTL effect in the variance-covariance matrix at the family level. Under the trait model (1), the variance-covariance matrix of sibship $k$ is given by

$$E\left[(\mathbf{y}_k - \mu_k \mathbf{1})(\mathbf{y}_k - \mu_k \mathbf{1})^{T}\right] = \sigma_g^2\, \boldsymbol{\Pi}_k + \sigma_e^2\, \mathbf{I}, \qquad (2)$$

where $\mathbf{y}_k$ is the vector of trait values in sibship $k$, $\boldsymbol{\Pi}_k$ is the $n_k \times n_k$ matrix with ones on the diagonal and $IBD_{ijk}$ as its $(i,j)$th off-diagonal element, $\sigma_g^2$ is the variance of the QTL, $\sigma_e^2$ is the individual random effect variance and $IBD_{ijk}$ is the proportion of marker alleles shared IBD by sibs $i$ and $j$ in family $k$.
Because both matrices are symmetric and the diagonal elements do not include linkage information, we only consider the lower triangular elements. We rearrange these elements of the above matrices as vectors of length $n_k(n_k - 1)/2$ by stacking one column on top of the other and then have

$$E\left[(y_{ik} - \mu_k)(y_{jk} - \mu_k)\right] = \sigma_g^2\, IBD_{ijk}, \qquad i > j. \qquad (3)$$

We can treat the above equation as a version of Haseman-Elston (HE) regression. The sibship-specific mean $\mu_k$ is usually unknown and needs to be estimated; various estimates have been discussed and a shrinkage estimate has been recommended [11,12]. For the simulations performed in this paper, the sibship mean $\mu_k$ was estimated by the function lme in R (http://cran.us.r-project.org). In a HE regression, linkage is detected by testing whether the QTL variance $\sigma_g^2 > 0$, which is equivalent to testing the correlation between $IBD_{ijk}$ and the trait similarity between the two sibs, as measured by $(y_{ik} - \hat\mu_k)(y_{jk} - \hat\mu_k)$ in our case. From this perspective, the linkage information provided by a sibpair can be captured by the score

$$U_{ijk} = (y_{ik} - \hat\mu_k)(y_{jk} - \hat\mu_k)(IBD_{ijk} - 0.5).$$

From equation (3), we can see that for an additive trait model a positive score supports linkage and a negative score is evidence against linkage. When the inheritance model is unclear, we may take the "minmax" method to estimate the proportion of marker alleles shared IBD for a full sibpair, i.e.
where $f_{ijk1}$ and $f_{ijk2}$ are the probabilities of 1 and 2 alleles shared IBD, respectively [13]. We can simply sum the scores for all the pairs in a sibship to obtain a measure of linkage evidence for this sibship, because the sibship mean absorbs any residual correlation among the sibs. We may define the QLS more generally as $U_{ijk} = S_{ijk}(IBD_{ijk} - 0.5)$, where $S_{ijk}$ is a measure of trait similarity between sibs $i$ and $j$, for which other choices have been proposed [11,14-16]. In those cases we may need to consider, in order to sum the QLSs within a sibship, a weight function appropriate for the correlation between scores among sibpairs. Note there is no difficulty in extending the QLS to qualitative traits. For example, for affected sibpairs $S_{ijk}$ can be defined as 1 for all pairs and the linkage score is simply given by $U_{ijk} = (IBD_{ijk} - 0.5)$, which is related to the NPL score [17] and the statistic of the mean test [18].
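As a minimal sketch of the score described above, the per-pair QLS and its sum over a sibship could be computed as follows. It assumes the score takes the form of a centered trait cross-product multiplied by the centered IBD proportion, and that an estimate of the sibship mean is supplied from elsewhere (the shrinkage estimation step is not reproduced here). The function names `pair_qls` and `sibship_qls` and the input layout are illustrative assumptions, not part of the original analysis.

```python
from itertools import combinations


def pair_qls(y_i, y_j, mu_hat, ibd):
    """QLS of one sibpair: centered trait cross-product times centered IBD."""
    return (y_i - mu_hat) * (y_j - mu_hat) * (ibd - 0.5)


def sibship_qls(traits, mu_hat, ibd):
    """Sum of the pairwise scores within one sibship.

    traits : list of trait values, one per sib
    mu_hat : estimated sibship mean (assumed to be supplied by the caller)
    ibd    : dict mapping a pair (i, j) with i < j to its estimated
             proportion of alleles shared IBD
    """
    return sum(
        pair_qls(traits[i], traits[j], mu_hat, ibd[(i, j)])
        for i, j in combinations(range(len(traits)), 2)
    )


# Hypothetical sibship of size 3 with an estimated sibship mean of 0.
traits = [1.2, 0.8, -0.5]
ibd = {(0, 1): 1.0, (0, 2): 0.0, (1, 2): 0.5}
score = sibship_qls(traits, 0.0, ibd)
```

In this toy example the concordant pair sharing both alleles IBD and the discordant pair sharing none both contribute positively, while the pair with IBD proportion 0.5 contributes nothing.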

Application of the QLS in selecting a sample for an association study
We consider selecting a set of unrelated individuals from sibships previously used for a QTL linkage analysis. In the case of a complex quantitative trait where heterogeneity exists, the goal of an association study is to detect a variant with maximum power. We emphasize that such a study would not be a classic epidemiological study done to determine the attributable risk, for which subjects should be drawn randomly from a population. Rather, the study we discuss here is done for gene finding and therefore the selection of the sample should be done to provide maximum power rather than to represent the whole population.
Suppose that a population consists of two subpopulations (P1 and P2) with proportions $q_1$ and $q_2$ respectively ($q_1 + q_2 = 1$), where the gene variant has an effect in only one subpopulation (P1). To examine the usefulness of the QLS in selecting a sample for an association study, we theoretically compare the proportions of individuals from the homogeneous subpopulation affected by the disease allele (P1) in two selected samples: one sample is obtained by randomly selecting sibships (proportion $q_r$) and the other is obtained by selecting sibships with QLS > 0 (proportion $q_{qls}$). To simplify the theoretical derivation, we assume known IBD sharing and sibships of size 2 (independent sibpairs). To further simplify the presentation, we standardize $T_k$ as $Z_k$, so that the correlation matrix of $Z_k$ is

$$\begin{pmatrix} 1 & \rho_k \\ \rho_k & 1 \end{pmatrix},$$

where $\rho_k = 0$, $0.5\,\sigma_g^2/(\sigma_g^2 + \sigma_e^2)$ and $\sigma_g^2/(\sigma_g^2 + \sigma_e^2)$, respectively, for proportions 0, 0.5 and 1 of alleles shared IBD, with $\sigma_g^2$ the QTL variance and $\sigma_e^2$ the individual random effect variance. With the assumption that a random sample of sibpairs is used for the linkage analysis, we have $q_r = q_1$, while $q_{qls}$ is a function of $q_1$ and $\rho_{IBD=1}$, the correlation between the two sibs of a pair with proportion 1 IBD sharing (see Appendix 1). It follows that $q_{qls}$ is always ≥ $q_r$. From the expression for $q_{qls}$, we can also see that the difference between $q_{qls}$ and $q_r$ depends on (1) the proportion of P1: when $q_1 = 0.5$, the difference is maximum; and (2) the correlation $\rho_{IBD=1}$, which depends on the size of the effect and the allele frequencies of the QTL. The difference between $q_{qls}$ and $q_r$ is presented in Figure 1, which shows that selection based on the QLS can increase the proportion of individuals from subpopulation 1 by at most 10%. Nevertheless, a slight difference in this proportion is not trivial, because it may greatly improve the power of an association study (see results).
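The comparison of $q_{qls}$ and $q_r$ can be illustrated with a short sketch under the simplifying assumptions stated above (sibpairs only, known IBD proportions of 0, 0.5 and 1 with probabilities 1/4, 1/2 and 1/4, standardized bivariate normal traits, and no QTL effect in P2). Under these assumptions a pair has QLS > 0 when IBD = 1 and the two standardized traits have the same sign, or when IBD = 0 and they have opposite signs. The standard orthant probability for a bivariate normal then gives $\Pr(QLS > 0 \mid P1) = 1/4 + \arcsin(\rho)/(4\pi)$ and $\Pr(QLS > 0 \mid P2) = 1/4$, with $\rho = \rho_{IBD=1}$. The closed form in `q_qls_closed_form` is our own working from these assumptions and may differ in form from the expression in Appendix 1; all function names are hypothetical.

```python
import math
import random


def q_qls_closed_form(q1, rho):
    """Proportion of P1 pairs among pairs with QLS > 0 (sketch derivation).

    q1  : proportion of subpopulation P1 among the linkage sibpairs
    rho : sib-sib trait correlation in P1 for pairs sharing both alleles IBD
    """
    a = math.asin(rho) / math.pi
    return q1 * (1.0 + a) / (1.0 + q1 * a)


def q_qls_simulated(q1, rho, n_pairs=200_000, seed=1):
    """Monte Carlo check of the closed form under the same assumptions."""
    rng = random.Random(seed)
    selected = 0
    selected_p1 = 0
    for _ in range(n_pairs):
        in_p1 = rng.random() < q1
        u = rng.random()
        ibd = 0.0 if u < 0.25 else (1.0 if u < 0.5 else 0.5)
        # Correlation is 0, 0.5*rho or rho in P1, and 0 in P2 (no QTL effect).
        corr = rho * ibd if in_p1 else 0.0
        z1 = rng.gauss(0.0, 1.0)
        z2 = corr * z1 + math.sqrt(1.0 - corr * corr) * rng.gauss(0.0, 1.0)
        if z1 * z2 * (ibd - 0.5) > 0:
            selected += 1
            selected_p1 += in_p1
    return selected_p1 / selected


exact = q_qls_closed_form(0.5, 0.4)
approx = q_qls_simulated(0.5, 0.4)
```

With $q_1 = 0.5$ and the extreme value $\rho = 1$, this sketch gives $q_{qls} = 0.6$, an increase of 0.10 over random selection, which agrees with the upper bound quoted in the text.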

Application of the QLS to assess the correlation of association with previous linkage
To answer the question of whether a linkage signal in a region can be in part explained by a marker allele used in an association study, we compare the QLSs obtained with and without incorporating this marker into the trait model (equation 1), which we call the first (or individual) level regression, to distinguish it from the second (or family) level regression (equation 2). We frame this problem as a hypothesis test. When a marker is included in the model at the individual level, the variance-covariance matrix of sibship $k$ is given by

$$E\left[(\mathbf{y}_k - \mu_k\mathbf{1} - \mathbf{x}_k b)(\mathbf{y}_k - \mu_k\mathbf{1} - \mathbf{x}_k b)^{T}\right] = \sigma_g^2\, \boldsymbol{\Pi}_k + \sigma_e^2\, \mathbf{I},$$

where $\mathbf{y}_k$ and $\mathbf{x}_k$ are the vectors of trait values and marker genotype codes in sibship $k$, $\boldsymbol{\Pi}_k$ is the matrix of IBD proportions $IBD_{ijk}$ (with ones on its diagonal), $\sigma_g^2$ and $\sigma_e^2$ are the QTL and individual random effect variances, $x_{ik}$ is a genotype code for the marker and $b$ is its effect on the trait, which may arise from a "true" association (the marker is the QTL itself or is in linkage disequilibrium with the QTL), or from a "spurious" association (e.g. due to population stratification). Based on the above equation, we can obtain the corresponding QLS with the marker included in the above regression model, which is given by

$$U^{a}_{ijk} = (y_{ik} - \hat\mu_k - x_{ik}\hat b)(y_{jk} - \hat\mu_k - x_{jk}\hat b)(IBD_{ijk} - 0.5),$$

where $\hat b$ and $\hat\mu_k$ are the estimates of $b$ and $\mu_k$, respectively. In the following presentation, we denote the QLS obtained with and without modeling an association marker by $U^{a}_{ijk}$ and $U_{ijk}$, respectively. Given these two sets of QLSs, we expect the mean of $U_{ijk}$ to be larger than the mean of $U^{a}_{ijk}$ when the associated marker is the QTL, or is in linkage disequilibrium with it. To compare the two means, we may apply a one-sided paired t-test. Let $d_{ijk} = U_{ijk} - U^{a}_{ijk}$, with sample mean $\bar d$ and sample standard deviation $s_d$, and let $n$ be the total number of sibpairs. The statistic is then defined by

$$T = \frac{\bar d}{s_d/\sqrt{n}},$$

and under the null hypothesis follows a t distribution with $n - 1$ degrees of freedom. The one-sided p-value is given by $P(t_{n-1} > T)$.

Figure 1
Difference in the proportion of individuals from subpopulation 1 between random sampling and QLS sampling.
It is useful to examine this statistic under various situations. When the marker modeled is not associated with the phenotype, the allelic effect $b$ is expected to be small and therefore the statistic is likely to be close to zero. However, when there is an association between the marker and the quantitative trait in a statistical sense, but it is not related to the detected linkage (for example it is due to the well-known bias from population stratification), we may not expect the allelic effect $b$ to be small. In this scenario, we may look upon the marker as a covariate representing to some extent population stratification, and therefore modeling this marker would reduce the residual variance of the trait similarity measure coming from population stratification, and hence strengthen the linkage signal. So we can expect the statistic $T$ to be more likely to be negative, and our test statistic would maintain the type I error rate in a conservative fashion in the case of population stratification. Our simulation results agree with this line of reasoning (see results). In this sense, a small lower-sided p-value, i.e. $P(t_{n-1} < T)$, indicates a spurious association, which is also seen in the simulations.
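The paired comparison of the two sets of QLSs can be sketched as below, taking the difference as the score without the marker minus the score with the marker, so that a large positive statistic supports the marker explaining part of the linkage and a large negative one points towards a spurious association. The function names and the toy scores are illustrative assumptions. Because the Python standard library has no t distribution, `normal_approx_p` substitutes a standard normal reference distribution, which is only a large-sample stand-in for the $t_{n-1}$ reference described in the text.

```python
import math
import statistics


def paired_qls_t(qls_without, qls_with):
    """Paired t statistic for d = (QLS without marker) - (QLS with marker).

    Returns the statistic and its degrees of freedom (n - 1).
    """
    d = [u - ua for u, ua in zip(qls_without, qls_with)]
    n = len(d)
    t_stat = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))
    return t_stat, n - 1


def normal_approx_p(t_stat, lower=False):
    """One-sided p-value from a standard normal approximation.

    lower=False gives the upper-sided p-value (marker explains linkage);
    lower=True gives the lower-sided p-value (possible spurious association).
    """
    z = statistics.NormalDist()
    return z.cdf(t_stat) if lower else 1.0 - z.cdf(t_stat)


# Hypothetical scores for four sibpairs, for illustration only.
qls_without = [0.5, 0.2, 0.9, 0.4]
qls_with = [0.3, 0.1, 0.5, 0.4]
t_stat, df = paired_qls_t(qls_without, qls_with)
p_upper = normal_approx_p(t_stat)
p_lower = normal_approx_p(t_stat, lower=True)
```

With only four pairs the normal approximation is of course poor; for the hundreds of sibpairs considered in the simulations the two reference distributions are close.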
For simplicity, assume the allelic effect $b$ and the sibship mean $\mu_k$ are known and so can be specified correctly; it can then be easily shown that for sibpair $(i,j)$ in family $k$, the expected difference between the QLSs obtained without and with the marker ($U_{ijk}$ and $U^{a}_{ijk}$, respectively), given the marker genotypes and the IBD sharing, is

$$E\left[U_{ijk} - U^{a}_{ijk}\right] = b^2 x_{ik} x_{jk}\,(IBD_{ijk} - 0.5)$$

(see Appendix 2). This equation indicates that the proposed statistic essentially tests the correlation (or interaction) between the similarity of an associated marker effect, which is measured by a cross-product, and the IBD sharing between two sibs in a pair. Compared to a usual quantitative linkage analysis that detects linkage by testing the correlation between the IBD sharing and trait similarity, which may also be described as a cross-product (e.g. as in HE regressions and the variance component model), we can expect the proposed statistic to be much more powerful for detecting linkage because the noise (residual variances) from polygenic and common environmental effects is eliminated as well as the individual random effects. So, even if a usual linkage analysis fails to show signals in a region, the proposed statistic can still be useful to detect linkage when we have a candidate locus in a region.

Results
Sample selection
Because in practice the number of alleles shared IBD is generally not known with certainty, owing to partially informative markers and missing parental genotypes, we also performed computer simulations to examine the usefulness of the QLS in sample selection for an association study by comparing, in various situations, the statistics from random samples of unrelated individuals and from samples based on the rank order of the QLS. The statistic used to make the comparison is the score statistic proposed by Schaid et al. [19], which follows a $\chi^2$ distribution with one degree of freedom for an additive model.
In our simulations, we generate 1000 sibships of size 2 from different subpopulations. A total of 6 markers, evenly spaced at a 2 cM density in a 10 cM range and each with 4 equally frequent alleles, are used for the linkage analysis. A QTL with 2 equally frequent alleles is located midway between marker 3 and marker 4. We assume Hardy-Weinberg equilibrium at each marker, linkage equilibrium among the markers and a Haldane no-interference map function. Trait values are constructed as the sum of a major-gene effect generated by the QTL, normal random individual effects, polygenic effects and common environmental effects. We calculate the probabilities of the number of alleles shared IBD using the program GENIBD in the S.A.G.E. package [20], removing the QTL genotype for this calculation.
We first compare random selection and the QLS selection with different sample sizes for the association study. We assume the population consists of two subpopulations, in equal proportions, from which 1,000 sibpairs have been used for the linkage analysis. In subpopulation 1, 20% of the total variance is explained by the QTL, 30% by the polygenic and common environmental effects and the rest by a random individual effect. In subpopulation 2, there is no QTL effect but the same other effects are simulated.
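A heavily simplified sketch of how one sibpair from such a subpopulation might be generated is given below. It departs from the simulation described here in several labeled ways: the IBD proportion is the true value at the QTL rather than an estimate obtained from the six linkage markers with GENIBD, the polygenic and common environmental effects are lumped into a single effect shared by both sibs, and the total variance is fixed at 1 when the QTL explains 20% of it. The function name `simulate_sibpair` and its parameters are hypothetical.

```python
import math
import random


def simulate_sibpair(rng, qtl_var=0.2, family_var=0.3,
                     individual_var=0.5, allele_freq=0.5):
    """Return (y1, y2, ibd) for one sibpair with an additive biallelic QTL."""
    p = allele_freq
    # Additive effect per allele copy so that the QTL variance is qtl_var
    # (the copy number has variance 2p(1-p) under Hardy-Weinberg equilibrium).
    b = math.sqrt(qtl_var / (2.0 * p * (1.0 - p)))
    # Two alleles per parent; 1 marks the trait-increasing allele.
    father = [int(rng.random() < p) for _ in range(2)]
    mother = [int(rng.random() < p) for _ in range(2)]
    # Each sib receives one randomly chosen allele from each parent.
    sibs = [(rng.randrange(2), rng.randrange(2)) for _ in range(2)]
    # True proportion of alleles shared IBD at the QTL: 0, 0.5 or 1.
    ibd = ((sibs[0][0] == sibs[1][0]) + (sibs[0][1] == sibs[1][1])) / 2.0
    # Simplification: one shared effect stands in for the polygenic and
    # common environmental components.
    family_effect = rng.gauss(0.0, math.sqrt(family_var))
    traits = []
    for pat, mat in sibs:
        copies = father[pat] + mother[mat]
        noise = rng.gauss(0.0, math.sqrt(individual_var))
        traits.append(b * (copies - 2.0 * p) + family_effect + noise)
    return traits[0], traits[1], ibd


rng = random.Random(2024)
# Subpopulation 1: the QTL explains 20% of the variance.
pairs_p1 = [simulate_sibpair(rng) for _ in range(20_000)]
# Subpopulation 2: no QTL effect, same other effects.
pairs_p2 = [simulate_sibpair(rng, qtl_var=0.0) for _ in range(20_000)]
```

Setting `qtl_var=0.0` reproduces the second subpopulation, in which the other variance components are kept unchanged.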
We separately sample 50, 100, 300, 500 and 800 unrelated individuals from the 1000 sibpairs by the two selection approaches and compare their score statistics. In QLS selection, we first select the sibships with the largest QLS and then randomly select one sib from each of these pairs, while in random selection the sibpair is selected randomly. The average $\chi^2$ is shown in Figure 2(A). In real data, the situation may be more complex in that a population may consist of more than two subpopulations and the QTL effect could vary among subpopulations. We therefore also simulated four subpopulations with equal proportions having different QTL effects (0%, 5%, 10%, 20%) and compared the association statistics for different sample sizes. The results are shown in Figure 2(B). In both scenarios (A and B), QLS selection can greatly increase the average value of the statistic to detect association, and this increase is larger when fewer unrelated individuals are selected.
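The two selection schemes compared above can be sketched as follows. The data layout (a list of `(qls, sib_ids)` tuples) and the function names are assumptions made for illustration.

```python
import random


def select_by_qls(sibships, m, rng):
    """Take the m sibships with the largest QLS and draw one sib from each.

    sibships : list of (qls, sib_ids) tuples, one per sibship
    """
    ranked = sorted(sibships, key=lambda s: s[0], reverse=True)
    return [rng.choice(sibs) for _, sibs in ranked[:m]]


def select_at_random(sibships, m, rng):
    """Take m sibships at random and draw one sib from each."""
    return [rng.choice(sibs) for _, sibs in rng.sample(sibships, m)]


# Hypothetical linkage sample of four sibpairs.
sibships = [
    (0.3, ["a1", "a2"]),
    (-0.1, ["b1", "b2"]),
    (0.9, ["c1", "c2"]),
    (0.0, ["d1", "d2"]),
]
rng = random.Random(0)
by_qls = select_by_qls(sibships, 2, rng)
at_random = select_at_random(sibships, 2, rng)
```

Drawing a single sib per selected sibship keeps the individuals in the association sample unrelated, as in the design described in the text.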
To examine different ways of summarizing the several QLSs for a sibship, we also simulated sibships of different sizes, ranging from 2 to 4. The traits for the population with two subpopulations were simulated as before. We sampled 100 unrelated individuals from the 1000 sibships at random, or according to the rank order of the mean QLS, the minimum QLS and the maximum QLS of each sibship, respectively. Our results showed that the average $\chi^2$ values obtained based on any of the QLSs are greater than those from random selection and that they differ only slightly from one another (data not shown).
Although in this paper we focus on the usefulness of the QLS in the situation where a significant linkage region has already been identified, we are also interested in the situation where the linkage signal is not so clear, because in the case of a complex quantitative trait we expect only weak linkage signals when using customary sample sizes.
To show the usefulness of QLS selection in this scenario, we also simulated 500 sibpairs from two subpopulations in which different proportions of the variance (5%, 10%, 15%, and 20%) are explained by the QTL in just one subpopulation. In this simulation, linkage signals are quite small and may not even be detected. We sampled 100 unrelated individuals for an association study. The results show that the sampling based on the QLS still improves the power of an association study, even in the case that the power to detect linkage is negligible (see Figure 3).

Figure 2
Comparison of average $\chi^2$ values between random sampling and QLS sampling for various sample sizes.
At the stage of the association study, a family sample is also often used and a joint linkage/association analysis can then be applied. One advantage of the joint linkage/association model is that, when it detects association, this method can simultaneously take account of the linkage information. We also performed a simulation study to examine the usefulness of QLS-based sample selection in this case. A total of 500 nuclear families of size 4 from two subpopulations, in equal proportion, were generated for the previous linkage study. We further sampled 50, 100 and 150 families for fine mapping. Different QTL effects were simulated in subpopulation 1 (0%, 10%, 20%, 30% and 40%) and subpopulation 2 (0%). We compared the statistics of a commonly used joint linkage/association method (awbw) for a random sample and a QLS-based sample of families [21]. The results show the power of this joint analysis can also be greatly improved by the QLS selection approach (see Figure 4).

Testing the correlation between association and a previous linkage
To assess the properties of our tests to determine whether an association is responsible in part for the linkage of a complex quantitative trait, we carried out a limited simulation study. We examined the type I error rate of the proposed test under two scenarios: (1) no trait-marker association and (2) trait-marker association due to population stratification. Under no trait-marker association, we simulated 10,000 replicate data sets of 500 sibpairs or 500 sibships (200, 200 and 100 sibships of sizes 2, 3 and 4, respectively). Trait values were constructed as the sum of a major-gene effect generated by the QTL that explains 10% of the variance, and various proportions of random individual, polygenic and common environmental effects. An association marker with two equally frequent alleles was simulated to be in complete linkage equilibrium with the QTL and a fully informative linkage marker (with 100 equally frequent alleles) was also simulated at the same location. For the case of linkage but no trait-marker association, the results show that the type I error rate of the proposed statistic is generally good for a complex quantitative trait for both sibpair data and sibship data (see Table 1). Under a spurious association, we generated 10,000 replicate datasets of 500 sibpairs from two subpopulations. The trait mean and the frequencies of the marker alleles were different in the two subpopulations. The results (see Table 2) show that the power to detect linkage is consistent in various situations, suggesting that the linkage test is quite robust to population stratification (the "linkage" column). For the proposed test examining the linkage-association correlation, the type I error rate is controlled conservatively (the "association-linkage" column). When the effect of population stratification is small, the empirical type I error rate is close to the correct level (0.05). We also examined the usefulness of the proposed statistic for detecting spurious association due to population stratification by using the lower-sided t-test (shown in the "population stratification" column of Table 2). The results suggest that in practice when the association cannot explain any of the linkage, this statistic may nevertheless be useful to determine whether the association is "false".

Figure 3
Comparison of average $\chi^2$ values between random sampling and QLS sampling when the power to detect linkage is small.
We also performed simulations to assess the power of the proposed statistic to detect the correlation between the gene effect and IBD sharing. We compared this statistic with a revised HE regression that we have shown is one of the most powerful versions of HE [11]. An associated marker with two equally frequent alleles was simulated as the QTL itself. We generated trait values with various QTL effects, keeping the polygenic, common environmental and individual random effects fixed. We considered two sets of linkage markers: fully informative and partially informative (six markers were used for the linkage analysis, evenly spaced at a 2 cM density in a 10 cM range around the QTL). For each situation, we generated 1,000 replicate samples of data on 500 sibpairs. Table 3 shows that by incorporating information on the candidate marker the proposed test is much more powerful than quantitative linkage analysis. In general, for a complex quantitative trait, usual linkage analysis may lack power and therefore miss an important region, because the noise from other genetic and environmental effects masks the linked gene effect. When no linkage is detected in a region where an important candidate gene is located, it is not wise to discard this region from further study. We may use the proposed statistic to assess whether the "negative" linkage result is true.

Figure 4
Comparison of average $\chi^2$ values between random sampling and QLS sampling for a joint linkage/association method. Two subpopulations: 0%, 10%, 20%, 30% and 40% of total variance is from an additive QTL in subpopulation 1, and no QTL effect exists in subpopulation 2.

Discussion
There is great interest in QTL mapping because many important diseases themselves, or intermediate phenotypes, are measured on a continuous scale. Although trait-marker association studies are expected soon to be conducted genome-wide, because of cost considerations an association study currently often focuses on candidate regions determined by a previous linkage study. For such an association study, we should utilize the information available in the previous linkage study to optimize its design and to facilitate its interpretation. We have proposed a quantitative linkage score, based on the widely used HE regression, to provide quantitative linkage information useful for a follow-up association study. This score is not limited to continuous traits, but can also be used for binary (affected/unaffected) traits. We illustrated the usefulness of this score to answer two different questions posed by an association study: (1) how to select samples at the design stage when heterogeneity exists; and (2) how to test at the inference stage whether an observed association can explain in part a previous linkage signal. In this paper, we are not necessarily advocating a two-stage approach to analyze family data on which we have information on both linkage markers and association markers. For such data a joint linkage and association framework could be of more interest than a two-stage analysis approach. Recent work on this kind of joint analysis has included work on both regression-based methods [22] and variance-component methods [23,24]. However, in the presence of heterogeneity any advantage such a joint analysis may have when performed using all the data available may be lost, because those families that are not affected because of segregation at a linked locus will "dilute" the effect and result in loss of power. Therefore, even for analyzing data with information from both linkage markers and association markers, we may consider first selecting families based on the QLS to exclude such "dilution" as much as possible.
The idea of selecting families with linkage evidence for further genotyping in a follow-up association study is not new and has been successfully implemented in practice.
In the context of quantitative traits, the proposed score can conveniently be used to summarize quantitative linkage information from a sibpair (or sibship). We have shown that in a heterogeneous population, which is expected to commonly occur for a complex trait, selecting a sample of unrelated persons based on the order of the QLS magnitude results in a more homogeneous sample for an association study than does a random sample, and therefore can improve power for a given sample size.
Other approaches to identifying sibpairs with linkage are available, for example using a regression diagnostic [25]. Careful comparison of these methods would merit further study.
Another use of the QLS investigated in this paper is to test whether association can account in part for a detected linkage. To address this question, we simply compare two sets of QLSs, before and after incorporating an association marker into the individual level regression model. Essentially, the proposed test evaluates the interaction of the allele effect of an associated marker and IBD sharing. In this sense it may be likened to other methods, for example the regression model proposed by Cardon [26], though our statistic emphasizes more whether an association is correlated with a previous linkage finding. This test may also be used as a substitute for the usual quantitative trait linkage analysis test when the latter fails to detect linkage. The gain in power to detect linkage by using the proposed test arises from eliminating possible environmental or other genetic noise. However, this gain is not automatic, but depends on the relationship of the associated marker to the true variant. If there is only weak linkage disequilibrium between an associated marker and the true variant, the test will be less powerful. We also showed that this statistic may be applied to detect spurious association, although that was not our primary aim. The ways commonly used in practice to detect population stratification are to use genomic control [27] or test for Hardy-Weinberg equilibrium [28]. Using IBD sharing information to test and control for population stratification provides a new approach and further study of this approach will be conducted in our future work.

Conclusion
In conclusion, as shown by our simulations, the QLS is useful for the design of, and resulting inference from, an association study following a linkage study. We suggest that careful examination of the QLS should be helpful for understanding the results of both association and linkage studies.

Appendix 1
which is an increasing function of $q_1$ and $\rho$. We note that $\rho$ depends on the size of the effect and allelic frequencies of the QTL. On the other hand, so that Pr(QLS > 0, P2) = Pr(QLS > 0|P2)Pr(P2) = . Thus