Identifying rare variants for quantitative traits in extreme samples of population via Kullback-Leibler distance

Background The rapid development of sequencing technology and simultaneously the availability of large quantities of sequence data has facilitated the identification of rare variant associated with quantitative traits. However, existing statistical methods depend on certain assumptions and thus lacking uniform power. The present study focuses on mapping rare variant associated with quantitative traits. Results In the present study, we proposed a two-stage strategy to identify rare variant of quantitative traits using phenotype extreme selection design and Kullback-Leibler distance, where the first stage was association analysis and the second stage was fine mapping. We presented a statistic and a linkage disequilibrium measure for the first stage and the second stage, respectively. Theory analysis and simulation study showed that (1) the power of the proposed statistic for association analysis increased with the stringency of the sample selection and was affected slightly by non-causal variants and opposite effect variants, (2) the statistic here achieved higher power than three commonly used methods, and (3) the linkage disequilibrium measure for fine mapping was independent of the frequencies of non-causal variants and simply dependent on the frequencies of causal variants. Conclusions We conclude that the two-stage strategy here can be used effectively to mapping rare variant associated with quantitative traits.


Background
Thanks to the rapid development of sequencing technology and the lowering of sequencing costs in the last decade, the availability of large quantities of sequence data provides an unprecedented opportunity for researchers to investigate the role of rare variants in complex traits [1][2][3][4]. But due to the low minor allele frequency (MAF < 5%) and thus resulting in weak linkage disequilibrium (LD) with nearby markers, detecting rare variant (RV) association with complex traits faces great challenges [5][6][7][8]. One challenge is that detection of rare causal variants with traditional designs usually requires a large sample, which will be the high cost [3,6]. Thus costeffective design should be considered to reduce sample size. Another challenge is that the statistical power with test statistics of single-marker tests is generally low in genetic association studies of rare variants with more moderate or weak genetic effects [8][9][10]. To date many statistical methods have been developed for rare variant association analysis, including burden tests [11][12][13], variance-component tests [14,15], series of sequence kernel association tests [10,16,17]. Any of these methods has relative perfect performance in special scenario, but none of them can overwhelm others in all scenarios [8,9], especially for quantitative traits.
In fact, rare variant association analysis in the past several years mainly focused on the qualitative trait. Only a few statistical methods have been developed for the quantitative trait [13,[18][19][20][21]. One approach for rare variant association analysis of quantitative traits is the linear regression model. However, most regression-based methods rely on the normality assumption of the phenotype [8,21]. Another commonly used approach adopts phenotype extreme selection design where one can transform the quantitative trait association study into case-control association study of qualitative traits by treating the upper extreme as cases and the lower extreme as controls in a strategy using extreme phenotype [22][23][24][25]. Extreme phenotypes of a quantitative trait are generally considered to be more informative. Moreover, a smaller sample size for extreme-phenotype sampling than that for random sampling is needed to achieve similar power [23,24].
In this report, we use phenotype extreme selection design and Kullback-Leibler distance (KL-distance) [26] to propose a simple statistic method to identify rare variants for quantitative traits. Two stages strategies are adopted in our analysis where association analysis and fine mapping will be done in the first stage and the second stage, respectively. This method will compare the frequency distributions of rare variant in two extreme phenotypes based on KL-distance. Our method has three features: (1) it has increasing power with the stringency of the sample selection, (2) it is affected slightly by noncausal variants and the opposite effect variants.in the first stage for association analysis, and (3) it is not depend on the frequencies of non-causal variants and just dependent on the frequencies of causal variants in the second stage for fine mapping. Through simulation studies, we investigate the performance of the proposed method and compare it with three commonly used methods of the burden test [12], the sequence kernel association test (SKAT) [17], and the optimal test that combines SKAT and the burden test (SKAT-O) [10].

Results
Type I error rate and power for association analysis Table 1 exhibits the estimated type I error rates of the statistic T KL for the extreme sample with sample-selection threshold value of 20, 10, and 5% and with sample size of 1000 and 1500. It can be seen that, under various genetic parameters, type I error rates of T KL are not appreciably different from the nominal alpha levels, which indicates the validity of the statistic T KL . Figure 1 shows the results of power for 9 scenarios when sample sizes are 1000 and 1500. It is found that the power of the statistic T KL with the sample size of 1500 is nearly 0.20 larger than that with the sample size of 1000, indicating that the power of the statistic T KL significantly increase with the increasing of the sample size. It can be seen that, under the same sample size, the powers of the statistic T KL with the low 5% samples and the up 5% samples are highest and the powers with the low 20% samples and the up 20% samples are lowest, which indicates that the powers of the statistic T KL increase with the stringency of the sample selection. It is observed from scenarios {1, 2, 3} that, when rare variant effects are in the same direction, the powers of the statistic T KL increase with the increasing of the number of causal variants. The same above conclusions are observed when 80% causal variants have positive effect and 20% causal variants have negative effect (scenarios {4, 5, 6}) and when there is the same number of causal variants with positive effects and negative effects (scenarios {7, 8, 9}). By comparing the powers under scenarios {1, 4, 7} with 10 causal variants, the powers under scenarios {2, 5, 8} with 20 causal variants, and the powers under scenarios {3, 6, 9} with 50 causal variants, we found that, when the number of causal variants with negative effect increases, the power of the statistic T KL decreases slightly. From Fig. 1, we can observe that, among four statistics of the T KL , the burden test, the SKAT, and the SKAT-O, the power of T KL is higher than that of other three statistics. The burden test, the SKAT, and the SKAT-O are severely affected by the number of non-causal variants and the opposite effect variants, especially when there are the same number of opposite effect variants. Although non-causal variants and the opposite effect variants affect the power of the T KL , the impact is slight. For example, when the sample size is 1500 and the number of causal variants is 50 for 10% sampleselection threshold value (B2), as the number of variants with negative effect increases from zero to 25, the powers of the burden test, the SKAT, and the SKAT-O decrease from~0.80,~0.79, and~0.84 tõ 0.23,~0.63, and~0.74, respectively, with the decline rate of 71.2, 20.2, and 12.0%. Nevertheless, when the number of variants with negative effect is 25, the T KL still achieves~0.83 power, with the decline rate of just 7% comparing to~0.90 when the number of variants with negative effect is zero. Power for fine mapping In fine mapping study, the QTL can be located by the maximum value of the measure l KL . So we sample 10 times from each of 100 simulation populations where each sample includes 750 individuals with the up-extreme phenotype of Y > U and 750 individuals with the lowerextreme phenotype with Y < L. For each sample, we calculate the value of the measure l KL for each variant. In order to guard against noisy distributions of the measure l KL , we adopt the 5-point moving-average method to determine the maximum value. We count the number (here, we denote it B) of the maximum values that locate at variant 10 or variant 11. Then the probability that the maximum values of l KL locate at variant 10 or variant 11 is B/1000. We refer this value as the power of l KL , which measure the likelihood of fine mapping the QTL. Table 2 shows the results of the power for l KL . It can be seen that the power of l KL for fine mapping under dominant model is highest and the power of l KL for fine mapping under recessive model is lowest. The power of l KL increases with increasing of the heritability h 2 of the causal variant and the stringency of the sample selection. For example, power of l KL under We also investigate the effect of different sample sizes (e.g., 2n =1000, 1500, and 2000). As expected, power of l KL increases with the increasing sample size (data not shown). In order to assess the performance of l KL , we compare it with two LD measures l [27] and p excess [28] with case-control design using extreme samples. Table 2 also lists the powers for l and p excess . We found that the powers of l KL and l are nearly the same and higher than those of p excess .

Discussion
In this report, we present a robust approach to identify rare variant of quantitative traits. The proposed approach adopts phenotype extreme selection design and KL-distance method. We use a two-stage strategy in our analysis where the first stage is association analysis and the second stage is fine mapping of QTL if the first stage is positive result. We propose a statistic T KL for association analysis and a LD measure l KL for fine mapping. Simulation studies present the performance of the proposed method. We found that the power of the T KL increases with the stringency of the sample selection and the increasing of the number of causal variants. The T KL here has higher power for association analysis than three existing statistics. Meanwhile, the impact of non-causal variants and the opposite effect variants on the T KL is slight. The LD measure l KL for fine mapping in the second stage has a good feature of not dependence on the frequencies of non-causal variants and just dependence on the frequencies of causal variants. These results show that our method can be used to detect rare variant associated with quantitative traits. At the same time, we found that the proposed method can be easily extended to case-control study by treating cases and controls as samples with upper extreme phenotype and lower extreme phenotype, respectively. In rare variant association analysis, in order to achieve high statistical power of tests, usually a large sample with high sequencing costs is needed. Thus less costly sequencing design is preferred in rare variant association study. For quantitative traits, extreme phenotypes are generally considered to be more informative because of rare causal variants enriched among them. One can use a smaller sample size for extreme-phenotype sampling to achieve similar power as that for random sampling [23,24]. Moreover, because extreme phenotypes of quantitative traits relative to human health are of primary clinical significance and thus data set can be obtained easily for subjects with extreme phenotypic values, using extreme phenotype samples in association analysis will make our study useful and practical. Here we use KL-distance to construct the statistics T KL to measure the difference between two probability distributions of rare variants in two extreme populations. Based on the principle that the larger T KL value is, the more dissimilar two probability distributions of rare variants, the statistics T KL can be used as a test statistic to quantify the magnitude of association between the variants and the quantitative trait in the first stage of association analysis. We found that the statistic T KL here for association analysis has higher power than three existing statistics of the burden test, the SKAT, and the SKAT-O. Moreover, whereas increasing the number of non-causal variants and the opposite effect variants result in decreasing severely the powers of the burden test, the SKAT, and the SKAT-O, non-causal variants and the opposite effect variants affect slightly on the T KL . The T KL has relatively stable power with small change range under various parameters set.
In the second stage of fine mapping, l KL is essentially a measure of LD between the variant and the QTL. Although LD between rare variant and QTL maybe weak [24], the maximum value across all rare variants can be usually found to identify the causal variant (QTL). The measure l KL here has a good performance of just dependence on the frequency of the causal variant. In practice, not dependence on the frequency of the non-causal variant can eliminate"noise" and even bias introduced by varying frequencies of non-causal variants. In our early works, we proposed the LD measure l for mapping common variant of the QTL [27]. The performance of the measure l for mapping rare variant is unknown. We found from theory analysis that the two LD measures l KL and l are parallel and have the same performance, that is, both of them can quantify LD between the variant and the QTL and do not depend on frequencies of non-causal variants. The difference between them is that the measure l KL here is based on KL-distance and the measure l is based on entropy theory. Another LD measure for fine mapping is p excess [28]. The p excess is originally developed for fine mapping common variant of qualitative trait. We compare the performance of these three LD measures for fine mapping rare variant of quantitative traits using extreme samples. We found from theory analysis and simulation study that l KL is superior to p excess .
It is noted that, in practice, we do not know how many causal variants there are in the region established through association analysis at first stage. Although we considered a region having only a single causal variant, our method works for the general case with a region consisting of multiple causal variants. In fact, when there is a region linked to a quantitative trait has multiple causal variants, we can detect all causal variants using following steps: (1) l KL is used to mapping a causal variant with the maximum value of l KL ; (2) T KL is used to do association analysis for all variants except the causal variant detected in (1). If the association analysis result is positive, then return to (1). All causal variants will be found when the association analysis result is negative. It should be noted that we use the permutation procedure to assess the statistical significance of the statistic T KL for association analysis. Permutation procedure may need more computing time to conduct simulation. But with the development of high-performance computing, computing time may not be a problem in our study. In addition, it can be seen that our method involves only rare variants. A phenotype may affected by common variants or both common variants and rare variants. So our further work will involve extensive field for common variants or both common variants and rare variants.

Conclusions
The statistic T KL is affected slightly by non-causal variants and the opposite effect variants. The power of the T KL for association analysis of rare variants increases with the stringency of the sample selection for quantitative traits. Extreme phenotypes allow T KL to achieve higher power than three commonly used methods. The LD measure l KL for fine mapping is independent of the frequencies of non-causal variants and just dependent on the frequencies of causal variants.

Methods
In this study, all datasets were publically available and no research requiring ethics approval was conducted.
We consider an interesting gene region with k biallelic variants and assume that each variant has a minor allele m with the MAF P m and a normal allele M with the allele frequency P M (P m + P M = 1). The variants are indexed by i (i = 1, ..., k). The index i may or may not correspond to the variant orders. Let X i be minor allele count at ith variant carried by a subject. Assume that there is a quantitative trait Y: β i X i , β 0 is the mean baseline value, and ε is residual due to random environmental effects. Without loss of generality, we assume β 0 =0 and ε~N(0, σ 2 ). To simplify our presentation, we use a measure with a superscript "U" to indicate a measure in the upper extreme population that has phenotypic values of the quantitative trait Y > U (U is an upper-threshold value, chosen from the continuous distribution of the study quantitative trait). We also use a measure with a superscript "L" to indicate a measure in the low extreme population that has phenotypic values of the quantitative trait Y < L (L is an low-threshold value, chosen from the continuous distribution of the study quantitative trait). Assume N U and N L subjects are sequenced with k variants in the upper extreme population and in the low extreme population, respectively, which are indexed by j (j = 1,..., N U / N L ). Denote X U ij and X L ij as the number of copies "m" for jth subject at ith variant in the upper extreme population and in the low extreme population, respectively. Then the frequencies of P m and P M at ith variant in the upper extreme population and in the low extreme population, denoted as p U mi , p U Mi , and p L mi , p L Mi , respectively, are estimated as follows:

A statistic test for association analysis in the first stage
In the first stage, we propose a statistic test for association analysis. We define a k-dimensional random vectorp m ¼ ðp m1 ; ⋯;p mk Þ T as the proportion of the minor allele m among all k variants, wherep mi and X ij is the number of copies "m" for jth subject at ith variant. In the upper extreme population and in the low extreme population, the kdimensional random vector of the proportion of the . We compare the two probability distributionsp U m andp L m using the KL-distance which is defined as in Kullback & Leibler [26], here, we denote it the statistic T KL : It is easy to find the relationship between T KL and the frequencies of P m and P M as follows: T KL is the mean between two KL-distances where one is the KL-distance betweenp U m andp L m and the other is the KL-distance betweenp L m andp U m . KL-distance provides a non-symmetric measure of how big of the difference between two probability distributions are. The KL-distance is always non-negative and equal to 0 only if two distributions are identical. It can be seen that T KL is a non-negative and symmetric measure of the two probability distributionsp U m andp L m . So, T KL can be used as a statistic to quantify the magnitude of association between the variants and the quantitative trait: a much larger T KL value will be observed under the alternative hypothesis of association compared to that under the null hypothesis of no association.
A KL-distance index for fine mapping of QTL in the second stage Assume that a region linked to a quantitative trait has already been established through association analysis at first stage. In order to simplify our presentation, we assume that this region contains only a causal variant with a minor allele a (with frequency p a ) and a normal allele A (with frequency p A = 1 − p a ), here, we call it the quantitative trait locus (QTL). We consider the quantitative trait Y = G Q + ε, where G Q is the genotypic value at the QTL and ε~N(0, σ 2 ). We hope to fine map this region by calculating the linkage disequilibrium (LD) measure between the QTL and a variant. We still use KL-distance to construct this measure through comparing the probability distributions of allele m and M at a variant in the upper extreme population and in the low extreme population. Following the previous symbols, let P m and P M be the frequencies of allele m and M at a variant. From Eq. (1), we have From Appendix, Hðfp U m ; p U M g; fp L m ; p L M gÞ can be asymptotically expressed as a function of LD (δ am ) between the QTL and the variant: Assume that there is an initial complete association between the variant allele m and the QTL allele a, at the 0th generation when the allele a is initially introduced into the study population. Let δ ð0Þ ma be the initial complete LD between the allele a and m at the 0th generation, δ ð0Þ ma ¼ p M Á p a . After n generations, the LD between the allele m and a is δ ðnÞ ma ¼ ð1 − θÞ n δ ð0Þ ma ¼ ð1 − θÞ n Á p M Á p a [29], where, θ is the recombination between the QTL and the variant.
Now we define a LD measure, here, we denote it as l KL , as follows: It can be seen that l KL is a decreasing function of the recombination θ and reaches its maximum at θ = 0. So we can use l KL to find the variant closest to the QTL and thus fine map the QTL. Notice from Eq. (6) that l KL is independent of the frequency of the variant, just only dependent on the frequency of the QTL.

Simulation for association analysis
To evaluate the performance of the test statistic T KL , we perform a series of simulation studies. We consider k = 100 variants with MAF values of causal variants determined by a uniform distribution U (0.001, 0.01) and MAF values of non-causal variants determined by a uniform distribution U (0.001, 0.05). The genotype data are simulated similar to those in Wang and Elston [30]. We first generate haplotypes for k variants based on a latent variable Z = (Z 1 , ⋯, Z k ) from a multivariate normal distribution with covariance structure cov(Z i , Z j ) = 0.4 |i − j| between any two latent components. Then we combine two haplotypes to obtain the genotype value for each individual X i = (X i1 , ⋯, X ik ). A phenotype Y under the null hypothesis of no association is generated using the model Y = ε with ε~N(0, 1) (β 1 = ⋯ = β k = 0). Under the alternative hypothesis of association, we randomly chose s variants as causal variants while other k-s variants as noncausal variants having β j = 0. Here, we let s = 10, 20, 50 in which 10, 20%, or 50% of rare variants were causal. For causal variants under the alternative hypothesis, we set β j = c ⋅ log 10 (p mi ) as used in Lee et al. [10], where c is 0.6, 0.3, and 0.2 for different values of s and different direction of the effects of causal variants. We consider 9 scenarios in the simulation study with the parameter values detailed in Table 3. We conduct 1000 simulations for each scenario. In each simulation, we select three extreme sample strategies, the low 20% and the up 20%, the low 10% and the up 10%, and, the low 5% and the up 5%, each of which consists of 2 N individuals including N individuals in an upper sample and N individuals in a lower sample. The statistical significance is assessed by a permutation procedure. We first calculate the value of the data-based statistic T KL for each simulation. Then we permute the "upper sample" and "lower sample" labels with equal probability and recalculate the statistic T KL for 1000 times. The estimated P value is then the proportion of permutation-based statistics that are larger than the databased statistic in 1000 permutations for each simulation. For a given significance level α, the power/type I error rate is estimated as the proportion of rejecting the null hypothesis when p-value ≤ α in 1000 simulations. In order to compare the performance of the test statistic T KL with the existing methods, we also obtain the power for the burden, SKAT, and SKAT-O tests using case-control design with the same samples as for the test statistic T KL .

Simulation for fine mapping
To assess the performance of the LD measure l KL in fine mapping rare causal variants of quantitative traits, we conduct a simulation study using the method similar to those described in our early work [27,28]. We consider a genetic region that has 21 variants, where only a variant locating at the middle of variant 10 and variant 11 is causal variant (that is, the QTL). The MAF of the causal variant is set to be 0.01(p a =0.01) and the MAFs for 20 other variants are uniformly determined with values ranging from 0.001 to 0.05. Other parameters in simulation include the ratio d/v (here, v and d are the genotypic values for individuals with genotypes aa and Aa, respectively), the thresholds L and U, the heritability (h 2 ) of the causal variant, and the sample size (2 N) [31]. We let the ratio d/v be − 1, 0, and 1 which correspond to recessive, additive, and dominant models, respectively. Once the parameter values are chosen, a population with the effective size of 15,000 is simulated starting from the 0th generation, with an initial complete association between the minor allele a at the causal variant and m at other variants [P(m|a) = 1]. The population then evolved for 50 generations under random mating and genetic drift. A hundred populations are simulated for analyses.