 Research article
 Open access
An efficient weighted tag SNP-set analytical method in genome-wide association studies
BMC Genetics volume 16, Article number: 25 (2015)
Abstract
Background
Single-nucleotide polymorphism (SNP)-set analysis in genome-wide association studies (GWAS) has emerged as a research hotspot for identifying genetic variants associated with disease susceptibility. However, most existing SNP-set methods are sensitive to the quality of the SNP-set, and a poor-quality SNP-set can lead to low power in GWAS.
Results
In this research, we propose an efficient weighted tag SNP-set analytical method to detect disease associations. We first design a fast algorithm that selects a subset of SNPs (called the tag SNP-set) from a given original SNP-set based on the linkage disequilibrium (LD) between SNPs, then assign a weight to each selected tag SNP and test the joint effect of the weighted tag SNPs. Intensive simulations show that the power of the weighted tag SNP-set-based test is much higher than that of the weighted original SNP-set-based test and that of the unweighted tag SNP-set-based test. We also compare the power of the weighted tag SNP-set-based test across four types of tag SNP-sets. The results indicate that the method of selecting the tag SNP-set strongly affects power, and that our proposed selection method yields the highest power.
Conclusions
From the analysis of simulated replicated data sets, we conclude that the weighted tag SNP-set-based test is a powerful SNP-set test in GWAS. We also designed a faster algorithm for selecting tag SNPs that retain most of the information in the original SNP-set, and a better weight function that describes the status of each tag SNP in GWAS.
Background
With the development of high-throughput genotyping technology, more and more biologists use GWAS to analyze the associations between disease susceptibility and genetic variants [1–3]. Although standard analysis of a case–control GWAS has identified many SNPs and genes associated with disease susceptibility [4–6], it has difficulty detecting epistatic effects and reaching genome-wide significance [7,8]. As an alternative analytical strategy, some researchers have put forward association approaches based on SNP-sets [8–14], which have clear advantages over individual-SNP approaches in improving test power and reducing the number of multiple comparisons.
Maxsingle is the simplest method; it uses the maximum \( \chi^2 \) statistic over all SNPs to compute the p-value of the SNP-set [9]. However, this method might not be optimal because it does not utilize the LD structure among the genotyped SNPs, especially when the SNP-set contains more than one disease locus. Fan and Knapp [10] used a numerical dosage scheme to score each marker genotype and compared the mean genotype score vectors between cases and controls using Hotelling's \( T^2 \) statistic. Compared with the former, the latter makes full use of the LD information, but the degrees of freedom of Hotelling's \( T^2 \) increase greatly. Mukhopadhyay [11] constructed the kernel-based association test (KBAT) statistic, which compares the similarity scores within groups (case and control) and between groups. Simulation results indicated that KBAT has greater power than multivariate distance matrix regression (MDMR) by Wessel [12] and Zglobal by Schaid [9]. Principal component analysis (PCA) was first applied to the association between disease susceptibility and SNPs by Gauderman [14], who extracted linearly independent principal components (PCs) from the expression vectors of all SNPs in the SNP-set and tested the association between the qualitative trait and the PCs under a logistic model. Compared with the above methods, PCA is favoured for its improved power, because the large reduction in degrees of freedom compensates for the information loss. More recently, Wu [8] proposed the sequence kernel association test (SKAT) based on a logistic kernel-machine model, which allows complex relationships between the dependent and independent variables [15]. Simulation results showed that SKAT gains higher power than individual-SNP analysis.
All of the above methods involve the selection of SNP-sets, and the quality of the SNP-set can greatly affect test power. As an alternative solution, we propose selecting some representative SNPs (called the tag SNP-set) from the original SNP-set [16–18] and then designing a proper weight function for the association test to remedy the information loss incurred in forming the tag SNP-set. Existing algorithms for selecting tag SNPs, such as the pattern recognition methods proposed by Zhang [16] and Ke [17], the statistical method put forward by Stram [18] and the software tagsnpsv2 [19] written by Stram, have high time complexity. Therefore, we first propose a novel fast algorithm for selecting tag SNPs based on the LD structure among the genotyped SNPs, and then design a weight function for constructing the tag SNP-set-based test (called the weighted tag SNP-set-based test). Intensive simulation results indicate that our method has much higher power than tests based on the original SNP-set, the tag SNP-set and the weighted original SNP-set.
The remainder of this paper is organized as follows. In the next section, we introduce the proposed fast algorithm for selecting the tag SNP-set, the weight function, and the KBAT and SKAT statistics used in this paper. We then list the simulation scenarios and the simulation results comparing the weighted tag SNP-set-based test with the weighted original SNP-set-based test. The analysis and discussion of the results are given at the end of the paper.
Methods
Notations
Assume that there are p SNP loci to be tested in the original SNP-set and n independent subjects in a case–control GWAS. Randomly select m subjects \( i_1, i_2, \cdots, i_m \) from the n subjects, where \( i_j \in \{1, 2, \cdots, n\} \), \( j = 1, 2, \cdots, m \), and \( m \ll n \). We intend to test the haplotypes of these m subjects at all p SNP loci. This gives 2m haplotypes, in which the allele at each locus takes one of two values, 0 or 1, representing the major allele and the minor allele respectively. Let \( Z_i = (z_{i1}, z_{i2}, \dots, z_{ip}) \) denote the alleles of the i-th haplotype at the p SNP loci, where \( z_{ij} \in \{0, 1\} \), \( i = 1, 2, \cdots, 2m \), \( j = 1, 2, \cdots, p \). For the remaining \( n - m \) subjects \( i_1', i_2', \cdots, i_{n-m}' \), with \( i_j' \in \{1, 2, \cdots, n\} \), \( j = 1, 2, \cdots, n - m \), we only need to consider the genotypes at the s tag SNP loci \( l_1, l_2, \cdots, l_s \), \( s \ll p \). This greatly reduces the cost of genotyping. Let \( G_k = (g_{k l_1}, g_{k l_2}, \dots, g_{k l_s}) \) denote the genotype value vector of the k-th subject at the s tag SNP loci (\( k = 1, 2, \cdots, n \)), where the genotype value \( g_{kj} \in \{0, 1, 2\} \) corresponds to homozygotes for the major allele, heterozygotes and homozygotes for the minor allele, respectively, under the additive model (\( j = l_1, l_2, \cdots, l_s \)). Let \( y_i \) denote the qualitative trait of the i-th subject, with \( y_i = 1 \) for a case and \( y_i = 0 \) for a control, \( i = 1, 2, \cdots, n \).
Fast algorithm of selecting tag SNPs
Many approaches to grouping SNPs into original SNP-sets have been proposed, such as gene-, LD structure-, biological pathway- and complex network clustering-based approaches [8]. In our study, we employ the gene-based approach, that is, we treat all the SNPs in a gene as an original SNP-set. We select a subset of SNPs from the original SNP-set in which each SNP represents a group of SNPs with high expression correlation. This subset retains most of the information in the original SNP-set, and we define it as the tag SNP-set of the original SNP-set (tag SNP-set for short). We divide the original SNP-set into subsets such that SNPs in the same subset have high expression correlations among individuals and SNPs in different subsets have low correlations, then choose one SNP from each subset (regarded as a tag SNP) as the representative of that subset. All the tag SNPs form the tag SNP-set. The detailed algorithm is as follows.
Input haplotypes \( z_{ij} \) of all the p loci of the m subjects, \( i = 1, 2, \cdots, 2m \), \( j = 1, 2, \cdots, p \).
Step 1 compute the coefficient \( R_{ij} \) of LD describing the correlation between SNP i and SNP j [20],
where \( \overline{z}_i \) and \( S_i \) denote the mean and the variance of \( z_{\cdot i} \) respectively. t is a threshold in the interval [0, 1]. We set t = 0.9 based on a series of experiments. If \( R_{ij} > t \) or i = j, let \( N_{ij} = 1 \), otherwise \( N_{ij} = 0 \), \( i, j = 1, 2, \cdots, p \), \( i \ge j \). Let \( S = \emptyset \), \( B = \{1, 2, \dots, p\} \).
Step 2 choose an element k from B randomly. Let \( Q = \{k\} \), \( B = B - \{k\} \).
Step 3 if there exists \( N_{mn} = 1 \) with \( m \in Q \), \( n \in B \), then let \( Q = Q + \{n\} \), \( B = B - \{n\} \), and go to Step 3; otherwise go to Step 4.
Step 4 determine the tag SNP of the subset Q grouped in Step 3. Namely, let
Step 5 if \( B \ne \emptyset \), go to Step 2; otherwise stop.
Output tag SNP-set S
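The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LD coefficient is taken to be the squared Pearson correlation between haplotype columns, and the Step 4 rule (pick the SNP with the largest total LD to the other members of its subset) is a placeholder chosen here. The function name `select_tag_snps` and its parameters are hypothetical.

```python
import numpy as np

def select_tag_snps(Z, t=0.9, seed=0):
    """Z: (2m, p) array of 0/1 haplotype alleles. Returns sorted tag SNP indices."""
    Z = np.asarray(Z, dtype=float)
    p = Z.shape[1]
    # Step 1: pairwise LD. ASSUMPTION: squared Pearson correlation (r^2).
    with np.errstate(divide="ignore", invalid="ignore"):
        R = np.atleast_2d(np.corrcoef(Z, rowvar=False)) ** 2
    R = np.nan_to_num(R)  # monomorphic SNPs have undefined correlation; use 0
    N = (R > t) | np.eye(p, dtype=bool)

    rng = np.random.default_rng(seed)
    B = set(range(p))  # SNPs not yet assigned to a subset
    S = []             # selected tag SNPs
    while B:                                # Step 5: repeat until B is empty
        k = int(rng.choice(sorted(B)))      # Step 2: random start SNP
        B.remove(k)
        Q, frontier = [k], [k]
        while frontier:                     # Step 3: grow Q through LD links
            i = frontier.pop()
            linked = [j for j in B if N[i, j]]
            for j in linked:
                B.remove(j)
                Q.append(j)
                frontier.append(j)
        # Step 4. ASSUMPTION: tag = SNP with the largest total LD within Q.
        sub = R[np.ix_(Q, Q)]
        S.append(Q[int(np.argmax(sub.sum(axis=1)))])
    return sorted(S)
```

Each pass through the outer loop collects one connected group of SNPs in the thresholded LD graph, so the number of tag SNPs equals the number of such groups.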
We compare the time complexity of the above algorithm with that of the software tagsnpsv2 [19] in Table 1. Table 1 shows that our algorithm for selecting tag SNPs has a clear advantage over tagsnpsv2 in terms of time complexity.
Weighted function
Among the analytical methods based on SNP-sets, weighted analysis tends to increase the power [8]. The square of the \( \chi^2 \) statistic of a single SNP is used to weight the corresponding SNP in our research. The detailed formula [21] for computing the weight \( w_i \) corresponding to the i-th SNP is
where a, b, c, d are the observed data of the i-th SNP in cases and controls.
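As an illustration of a single-SNP weight of this kind, the sketch below computes the standard Pearson \( \chi^2 \) statistic of a 2×2 table. It assumes that a, b, c, d are the four cell counts of such a table (for example, the two alleles in cases and in controls) and that no continuity correction is applied; the exact weight formula used in the study may differ, and the function name is hypothetical.

```python
def chi_square_weight(a, b, c, d):
    """Pearson chi-square of the 2x2 table [[a, b], [c, d]].

    ASSUMPTION: a, b are the counts in cases and c, d the counts in controls.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # an empty margin carries no association signal
    return n * (a * d - b * c) ** 2 / denom
```

A SNP whose allele counts are identical in cases and controls gets weight 0, and the weight grows as the case and control distributions diverge.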
Kernel-based association test (KBAT)
Mukhopadhyay [11] proposed the KBAT statistic based on the U-statistic [22]. Let \( \overline{U}_l^k = \sum_{i<j} h_l^k\left(g_i^k, g_j^k\right) / m_l \) denote the U-statistic of the k-th SNP in the l-th group, where l = 1, 2 represent case and control respectively; \( m_l = C_{n_l}^2 \), where \( n_l \) is the number of subjects in the l-th group; and \( h_l^k(\cdot, \cdot) \) is the kernel, for which the allele match (AM) kernel function [11] is used in our study. Let \( W_k = \sum_{l=1}^{2} \sum_{i<j} \left[ h_l^k\left(g_i^k, g_j^k\right) - \overline{U}_l^k \right]^2 \) and \( B_k = \sum_{l=1}^{2} m_l \left( \overline{U}_l^k - \overline{U}_k \right)^2 \) represent the quadratic sums of the kernel score of the k-th SNP within groups and between groups, respectively, where \( \overline{U}_k = \left( \overline{U}_1^k + \overline{U}_2^k \right) / 2 \). Mukhopadhyay employed the KBAT statistic to test the association between the SNP-set and the phenotype. The statistic is
Although the KBAT statistic has the form of an F statistic, it does not follow the F distribution [11]. We compute the p-value by a permutation procedure under the null model, using the empirical quantiles of the KBAT statistic. The details of the KBAT method can be found in [11].
In our research, we perform the original SNP-set-based test and the tag SNP-set-based test using KBAT. For convenience, we denote the original SNP-set-based test as KBAT and the tag SNP-set-based test as KBATtag. In the weighted analysis, we compare the power of the tests based on weighted KBAT and weighted KBATtag.
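A permutation procedure of this kind can be sketched as follows. The sketch is generic: it accepts any test statistic as a function of genotypes and case/control labels. `mean_difference_statistic` is a hypothetical stand-in used only to make the example runnable and is not the KBAT statistic; the add-one correction in the p-value is a common convention assumed here, not a detail taken from [11].

```python
import numpy as np

def mean_difference_statistic(G, y):
    """Hypothetical stand-in (NOT KBAT): squared case-control difference
    in mean genotype value, summed over SNPs."""
    G = np.asarray(G, dtype=float)
    y = np.asarray(y)
    diff = G[y == 1].mean(axis=0) - G[y == 0].mean(axis=0)
    return float(np.sum(diff ** 2))

def permutation_p_value(statistic, G, y, n_perm=1000, seed=0):
    """Empirical p-value: share of label permutations whose statistic
    is at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = statistic(G, y)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling the labels breaks any genotype-phenotype association.
        if statistic(G, rng.permutation(y)) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

Because only the labels are shuffled, the LD structure among SNPs is preserved in every permuted data set.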
Sequence kernel association test (SKAT)
To further verify the effectiveness of our method, we also conduct similar comparisons using the sequence kernel association test (SKAT) statistic instead of KBAT. For the i-th subject, we use the following model (1) to describe the correlation between the phenotype and the genotypes:
where \( \alpha_0 \) is an intercept term, \( \alpha_1, \cdots, \alpha_m \) are regression coefficients and \( x_1, \cdots, x_m \) are the environmental and demographic covariates. The correlation is completely defined by the function \( h(\cdot) \), and \( h(Z_i) = \sum_{j=1}^{n} \gamma_j K(Z_i, Z_j) \) according to the Representer Theorem [23], where \( \gamma_1, \cdots, \gamma_n \) are the coefficients. The mean and variance of h(z) are 0 and \( \tau K \) respectively, as shown by Liu [24]. We can therefore test the null hypothesis h(z) = 0 by testing \( \tau = 0 \), and Wu [8] proposed to test \( \tau = 0 \) using the score statistic Q introduced by Zhang and Lin [25]. The Q-statistic is
where \( \mathrm{logit}\, \widehat{p}_{0i} = \widehat{\alpha}_0 + \widehat{\alpha}_1 x_{i1} + \cdots + \widehat{\alpha}_m x_{im} \). Q follows a \( \chi^2 \) distribution with scale parameter \( \kappa \) and degrees of freedom v. The details of the SKAT method can be found in [8]. We also use the notations SKAT and SKATtag, analogous to those for KBAT.
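To make the score statistic concrete, the sketch below computes a Q of the form \( (y - \widehat{p}_0)^{T} K (y - \widehat{p}_0) / 2 \) with a weighted linear kernel \( K = G\,\mathrm{diag}(w)\,G^{T} \). These choices are assumptions for illustration: the kernel, the factor 1/2 and the intercept-only null model (no covariates, so every \( \widehat{p}_{0i} \) equals the case fraction) should be checked against [8]. The function name `skat_q` is hypothetical, and the sketch does not compute the p-value.

```python
import numpy as np

def skat_q(G, y, weights=None):
    """Score-type statistic Q = r^T K r / 2 with K = G diag(w) G^T,
    where r = y - p0 are residuals under an intercept-only null model."""
    G = np.asarray(G, dtype=float)   # (n subjects, s SNPs), values 0/1/2
    y = np.asarray(y, dtype=float)   # 1 = case, 0 = control
    if weights is None:
        w = np.ones(G.shape[1])
    else:
        w = np.asarray(weights, dtype=float)
    resid = y - y.mean()             # ASSUMPTION: no covariates in the null model
    # r^T G diag(w) G^T r equals sum_j w_j * (G_j^T r)^2, so K is never formed.
    score = G.T @ resid
    return 0.5 * float(np.sum(w * score ** 2))
```

Writing Q as a weighted sum of squared per-SNP scores shows directly how a larger weight increases that SNP's contribution to the set-level statistic.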
Simulations
To evaluate the performance of the weighted tag SNP-set analytical method, we conduct extensive simulations. All causal SNPs used in our study are assumed to increase the disease risk, because KBAT is not affected by the direction of effect [11].
HTR2A, which is associated with schizophrenia and obsessive-compulsive disorder [26,27], is a 62.66-kb-long gene with 169 HapMap [28] SNPs, located at 13q14-q21. A total of 34 out of the 169 SNPs genotyped by the Illumina Human Hap 650v3 array [29] are used as the causal SNPs in simulations. We take the HTR2A gene as an example and use HAPGEN2 [30] to generate SNP data at each locus on the basis of the LD structure of the CEU samples of the International HapMap Project.
To verify the effectiveness of our proposed method, we first generate replicated data sets at the 169 SNP loci on the HTR2A gene in nine different scenarios using HAPGEN2, where each data set includes 500 cases and 500 controls. For each scenario, we then choose one of the replicated data sets and randomly draw from it 200 haplotypes of 50 cases and 50 controls; these haplotypes are used to form the tag SNP-set with the selection algorithm described in the Methods. In the first scenario, 5000 replicated data sets are generated under the null disease model. In the other scenarios, 1000 replicated data sets are generated under different disease models, all of which assume the same heterozygote disease risk of 1.25 and the same homozygote disease risk of 1.5. We assume there is only one causal SNP in scenario 2 and two randomly specified causal SNPs in scenarios 3–9. Both causal SNPs are genotyped by the Illumina Human Hap 650v3 array in scenarios 3–5, only one is genotyped in scenarios 6–8, and neither is genotyped in scenario 9. The minor allele frequency (MAF), the mean \( R^2 \) with genotyped SNPs and the distance between the causal SNPs also differ. The detailed parameters for scenarios 2–9 are listed in Table 2.
Results
The preliminary validation using KBAT
Type I error rate evaluation
We simulate 5000 replicated data sets to estimate the type I error rate in scenario 1. The detailed results are listed in Table 3 at the significance levels of 0.005, 0.01 and 0.001 respectively. Table 3 indicates that the type I error of our method can be controlled.
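The empirical rate used in such an evaluation is the fraction of replicated data sets whose p-value falls below the significance level. A minimal sketch follows; the function name and the p-values in the usage comment are hypothetical and are not results from the study.

```python
def rejection_rate(p_values, alpha):
    """Fraction of replicates rejected at level alpha.

    On data simulated under the null model this estimates the type I
    error rate; on data simulated under a disease model it estimates power.
    """
    p_values = list(p_values)
    return sum(p < alpha for p in p_values) / len(p_values)

# Usage with made-up p-values from four replicates:
# rejection_rate([0.0005, 0.02, 0.3, 0.004], alpha=0.005)
```

With 5000 null replicates, a well-calibrated test should give a rate close to the chosen level, up to sampling error.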
Power evaluation
To evaluate the power of KBAT, KBATtag, weighted KBAT and weighted KBATtag, we simulate 1000 replicated data sets in each of scenarios 2–9. Figure 1 plots their power in scenario 2. On the whole, the tag SNP-set-based tests built on KBAT have higher power than the corresponding original SNP-set-based tests. That is, the selected tag SNPs play an important role in increasing the power of the test by capturing information from the SNPs in high LD with them. However, when the 6th, 7th, 8th or 9th SNP is taken as the causal SNP, the power of the test based on the tag SNP-set is evidently lower than that of the test based on the original SNP-set. We think the main reason is the high LD between SNPs: very high LD exists between multiple SNPs and the causal SNP, so too much information is lost when forming the tag SNP-set and the power falls. Clearly, each tag SNP in the tag SNP-set plays a different role in detecting disease association. We therefore assign each SNP in the tag SNP-set a different weight given by the \( \chi^2 \) statistic of that SNP. Figure 1 shows that, in the weighted case, the power of the test based on the tag SNP-set is higher than that based on the original SNP-set.
To study the performance of our method on more complex simulated data sets, we conduct scenarios 3–9, in which each data set has two randomly designated causal SNPs. Table 4 lists the power of KBAT, KBATtag, weighted KBAT and weighted KBATtag in scenarios 3–9. In the unweighted case, the power of KBAT based on the tag SNP-set is higher than that based on the original SNP-set except in a few scenarios, while these exceptions do not arise in the weighted case.
The further validation using SKAT
To further verify the performance of our method, we apply it to SKAT. Table 5 shows that the type I error of our method can be controlled. Figure 2 plots the power comparison of SKAT, SKATtag, weighted SKAT and weighted SKATtag in scenario 2, and Table 6 lists their power in scenarios 3–9. The results also demonstrate that our proposed weighted tag SNP-set analytical method is effective in detecting disease association. To estimate the influence of the selection of the tag SNP-set on test power, we compare the power of the weighted SKATtag based on four types of tag SNP-sets: the original SNP-set, all tag SNPs selected by our proposed selection algorithm, all remaining SNPs and a randomly selected subset. Figure 3 indicates that the power of the weighted SKATtag based on the tag SNP-set selected by our proposed algorithm is the largest.
Discussion
In this research, we proposed a novel and powerful method, the weighted tag SNP-set analytical method, which uses the weighted tag SNP-set-based test instead of the original SNP-set-based test. We also designed a new fast algorithm for selecting tag SNPs and used the \( \chi^2 \) statistic of each individual SNP as its weight in the study of disease association. In our method, we only need to genotype the tag SNPs instead of all SNPs in the original SNP-set, which greatly reduces the cost of genotyping. To illustrate the effectiveness of our method, we applied it to SKAT and KBAT and conducted intensive simulations under nine scenarios. The results indicated that the weighted tag SNP-set analytical method is an attractive alternative approach in SNP-set analysis. It is worth mentioning that we only applied our method to SKAT and KBAT for qualitative traits, but in theory it is also suitable for other statistical tests of qualitative and quantitative traits. We will verify its effectiveness in future work.
Power improved
Power and type I error are two important criteria for a statistical test. In our proposed weighted tag SNP-set analytical method, power is increased greatly while the type I error is controlled. We also note that, regardless of the tag SNP-set, the curve patterns of the power are very similar in Figure 3. This indicates that the relative size of the power is determined by the LD structure between the causal SNP and other SNPs. From Table 4 and Table 6, we also find that the power has no direct relationship with whether the causal SNP is genotyped, and that it is positively correlated with the mean \( R^2 \) between the causal SNP and all genotyped SNPs. This further verifies that the LD structure between causal SNPs and other SNPs affects the relative size of the power.
New fast algorithm of selecting tag SNPs
The quality of the tag SNP-set affects test power directly, because our test is performed between the tag SNP-set and the disease phenotype. In the study, we selected the tag SNP-set using the LD structure information among SNPs. We first established a complex network whose nodes are SNPs and whose edges are the LD relationships between SNPs, then divided it into subsets using a threshold, and finally selected one SNP from each subset as a tag SNP to form a new set regarded as the tag SNP-set. It took less than 1 minute to select 58 tag SNPs from 169 SNPs on a server (Intel(R) Core(TM) i3-3240T CPU @ 2.90 GHz, 4 GB, Windows 8). In forming the tag SNP-set, the threshold t is an important parameter. When t = 1, each SNP represents itself and the tag SNP-set is the same as the original SNP-set. When t = 0, only one SNP is included in the tag SNP-set and the analysis is similar to the MaxSingle method. We tested different values of t in our simulations, and the comparison showed that the threshold has a great influence on power and that t = 0.9 performed best among the values tested.
Reduction of the cost of genotyping
Our proposed tag SNP-based analytical method only needs to genotype the tag SNP loci, rather than all loci, for all subjects. For example, the original SNP-set used in our simulations consists of 169 SNPs, and the 58 SNPs (about 1/3 of the original SNP-set) that form the tag SNP-set are shown in Table 7 for the case where rs3803189 is regarded as the causal SNP in scenario 1. That is, the tag SNP-set-based method saves nearly 2/3 of the cost of genotyping relative to the original SNP-set-based one. Similar savings occur in other situations; how much can be saved depends on the LD structure of the original SNP-set and on the choice of threshold.
Although our method has many advantages, it also has limitations. We only used simulated data sets to evaluate the effectiveness of our method and did not apply it to real disease data. In addition, the threshold t is difficult to set, and it determines the size of the tag SNP-set, which in turn strongly affects both the test power and the cost of genotyping.
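As a rough worked check of this saving, assume that genotyping cost is proportional to the number of loci typed (an assumption made here for illustration, not a cost model from the study). For a subject typed only at the tag loci, with p = 169 and s = 58:

\[ 1 - \frac{s}{p} = 1 - \frac{58}{169} \approx 0.657 \]

If the m = 100 subjects used to form the tag SNP-set (50 cases and 50 controls) are typed at all p loci and the remaining n − m = 900 of the n = 1000 subjects are typed only at the s tag loci, the overall fraction saved is somewhat smaller:

\[ 1 - \frac{m p + (n - m) s}{n p} = 1 - \frac{100 \cdot 169 + 900 \cdot 58}{1000 \cdot 169} = 1 - \frac{69100}{169000} \approx 0.591 \]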
Conclusions
We proposed a weighted tag SNP-set analytical method that involves selecting a tag SNP-set from the original SNP-set and describing the status of each tag SNP. Based on the gene HTR2A and the LD structure of the CEU samples of the International HapMap Project under various model parameters, our simulation studies confirmed that the weighted tag SNP-set analytical method is efficient in SNP-set analysis of GWAS. In our simulation experiments, we also demonstrated that the tag SNP-set strongly affects test power. We therefore designed a fast algorithm for selecting a tag SNP-set that retains most of the information in the original SNP-set, and the power of the test based on our selected tag SNP-set is the highest in our simulations. The proposed weight function provides a better description of the status of each tag SNP, according to the comparisons between weighted and unweighted cases.
Abbreviations
GWAS: Genome-wide association study
LD: Linkage disequilibrium
SNP: Single nucleotide polymorphism
KBAT: Kernel-based association test
SKAT: Sequence kernel association test
MDMR: Multivariate distance matrix regression
AM: Allele match kernel
AS: Allele share kernel
PCA: Principal component analysis
PC: Principal component
References
Dering C, Hemmelmann C, Pugh E, Ziegler A. Statistical analysis of rare sequence variants: an overview of collapsing methods. Genet Epidemiol. 2011;35 Suppl 1:S12–7.
Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics. 1997;53:1253–61.
Wang R, Peng J, Wang P. SNP set analysis for detecting disease association using exon sequence data. BMC Proc. 2011;5 Suppl 9:S91.
Hunter DJ, Kraft P, Jacobs KB, Cox DG, Yeager M, Hankinson SE, et al. A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat Genet. 2007;39:870–4.
Yeager M, Orr N, Hayes RB, Jacobs KB, Kraft P, Wacholder S, et al. Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat Genet. 2007;39:645–9.
Hageman GS, Anderson DH, Johnson LV, Hancox LS, Taiber AJ, Hardisty LI, et al. A common haplotype in the complement regulatory gene factor H (HF1/CFH) predisposes individuals to age-related macular degeneration. Proc Natl Acad Sci U S A. 2005;102:7227–32.
Moskvina V, Schmidt KM. On multiple-testing correction in genome-wide association studies. Genet Epidemiol. 2008;32:567–73.
Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, et al. Powerful SNP-set analysis for case–control genome-wide association studies. Am J Hum Genet. 2010;86:929–42.
Schaid DJ, McDonnell SK, Hebbring SJ, Cunningham JM, Thibodeau SN. Nonparametric tests of association of multiple genes with human disease. Am J Hum Genet. 2005;76:780–93.
Fan R, Knapp M. Genome association studies of complex diseases by case–control designs. Am J Hum Genet. 2003;72:850–68.
Mukhopadhyay I, Feingold E, Weeks DE, Thalamuthu. Association tests using kernel-based measures of multi-locus genotype similarity between individuals. Genet Epidemiol. 2010;34:213–21.
Wessel J, Schork NJ. Generalized genomic distance–based regression methodology for multilocus association analysis. Am J Hum Genet. 2006;79:792–806.
Jin L, Zhu W, Yu Y, Kou C, Meng X, Tao Y, et al. Nonparametric tests of associations with disease based on U-statistics. Ann Hum Genet. 2014;78:141–53.
Gauderman WJ, Murcray C, Gilliland F, Conti D. Testing association between disease and multiple SNPs in a candidate gene. Genet Epidemiol. 2007;31:383–95.
Cristianini N, Shawe-Taylor J. An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press; 2000.
Zhang K, Deng M, Chen T, Waterman MS, Sun F. A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci. 2002;99:7335–9.
Ke X, Cardon LR. Efficient selective screening of haplotype tag SNPs. Bioinformatics. 2003;19:287–8.
Stram DO, Haiman CA, Hirschhorn JN, Altshuler D, Kolonel LN, Henderson BE, et al. Choosing haplotype-tagging SNPS based on unphased genotype data using a preliminary sample of unrelated subjects with an example from the Multiethnic Cohort Study. Hum Hered. 2003;55:27–36.
Haplotype tagging SNP (htSNP) selection in the Multiethnic Cohort Study [http://wwwhsc.usc.edu/~stram/tagsnps.html]
Hill WG, Robertson A. Linkage disequilibrium in finite populations. Theor Appl Genet. 1968;38:226–31.
Miller R, Siegmund D. Maximally selected chi square statistics. Biometrics. 1982;38:1011–6.
Hoeffding W. A class of statistics with asymptotically normal distribution. Ann Math Stat. 1948;19:293–325.
Kimeldorf G, Wahba G. Some results on Tchebycheffian spline functions. J Math Anal Appl. 1971;33:82–95.
Liu D, Ghosh D, Lin X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinf. 2008;9:1–11.
Zhang D, Lin X. Hypothesis testing in semiparametric additive mixed models. Biostatistics. 2003;4:57–74.
Basile VS, Ozdemir V, Masellis M, Meltzer HY, Lieberman JA, Potkin SG, et al. Lack of association between serotonin-2A receptor gene (HTR2A) polymorphisms and tardive dyskinesia in schizophrenia. Mol Psychiatry. 2001;6:230–4.
Frisch A, Michaelovsky E, Rockah R, Amir I, Hermesh H, Laor N, et al. Association between obsessive-compulsive disorder and polymorphisms of genes encoding components of the serotonergic and dopaminergic pathways. Eur Neuropsychopharmacol. 2000;10:205–9.
International HapMap Consortium. A haplotype map of the human genome. Nature. 2005;437:1299–320.
UCSC Genome Bioinformatics website Illumina Human Hap 650v3 array [https://cgwb.nci.nih.gov/cgibin/hgTrackUi?g=snpArray]
Su Z, Marchini J, Donnelly P. HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011;27:2304–5.
Acknowledgements
The research is supported by grants 61170183 and 11371230 from the National Natural Science Foundation of China, BS2011SW025 from the Excellent Young and Middle-Aged Scientists Fund of Shandong Province of China, 2014TDJH102 from the SDUST Research Fund and the Shandong Joint Innovative Center for Safe and Effective Mining Technology and Equipment of Coal Resources of China, and YC140359 from the SDUST Graduate Innovation Foundation of China.
Additional information
Competing interests
The authors declare that they have no competing interest.
Authors' contributions
BY conceived the study and carried out data simulation. SDW and BY developed the methods, interpreted the results and drafted the manuscript. HQJ, XL and XZW participated in the analysis of results. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Yan, B., Wang, S., Jia, H. et al. An efficient weighted tag SNP-set analytical method in genome-wide association studies. BMC Genet 16, 25 (2015). https://doi.org/10.1186/s1286301501823
DOI: https://doi.org/10.1186/s1286301501823