Score-based Statistic
We denote Y
i
as observed binary trait outcome of individual i(i = 1, 2, …, n) in the GWAS data set and let the genotype data be (X
11, X
12, …, X
1k
, …, X
1K
) for gene A with K SNPs and (X
21, X
22, …, X
2j
, …, X
2J
) for gene B with J SNPs. Particularly, for the k
th loci of gene A and j
th loci of gene B, we can firstly define the variability score for each sample by \( {u}_{kji}=\left({X}_{1ki}-{\overline{X}}_{1k}\right)\left({X}_{2ji}-{\overline{X}}_{2j}\right) \), where \( {\overline{X}}_{1k} \) and \( {\overline{X}}_{2j} \) indicate the mean level of k
th loci of gene A and j
th loci of gene B respectively. Then, the score-based statistic for their co-association effect can be defined as \( {u}_{kj}={\displaystyle {\sum}_{i=1}^n\left({Y}_i-\overline{Y}\right)}\left({X}_{1ki}-{\overline{X}}_{1k}\right)\left({X}_{2ji}-{\overline{X}}_{2j}\right) \), where \( \overline{Y} \) is the sample mean of disease status. Furthermore, the score vector with the length of K*J can be defined as U = (u
11, u
12, …, u
1K
, u
21, …, u
2K
, …, u
kj
, …, u
K1, …, u
KJ
), and covariance matrix for the score vector can be easily obtained as
$$ \varSigma =\left(\begin{array}{l}cov\left({u}_{11},{u}_{11}\right),cov\left({u}_{11},{u}_{12}\right),cov\left({u}_{11},{u}_{13}\right),\dots, cov\left({u}_{11},{u}_{kj}\right),\dots, cov\left({u}_{11},{u}_{KJ}\right)\\ {}cov\left({u}_{12},{u}_{11}\right),cov\left({u}_{12},{u}_{12}\right),cov\left({u}_{12},{u}_{13}\right),\dots, cov\left({u}_{12},{u}_{kj}\right),\dots, cov\left({u}_{12},{u}_{KJ}\right)\\ {}\vdots \\ {}cov\left({u}_{kj},{u}_{11}\right),cov\left({u}_{kj},{u}_{12}\right),cov\left({u}_{kj},{u}_{13}\right),\dots, cov\left({u}_{kj},{u}_{kj}\right),\dots, cov\left({u}_{kj},{u}_{KJ}\right)\\ {}\vdots \\ {}cov\left({u}_{KJ},{u}_{11}\right),cov\left({u}_{KJ},{u}_{12}\right),cov\left({u}_{KJ},{u}_{13}\right),\dots, cov\left({u}_{KJ},{u}_{kj}\right),\dots, cov\left({u}_{KJ},{u}_{KJ}\right)\end{array}\right) $$
Finally, the new score-based statistic for detecting gene-gene co-association can be constructed as SBS = UΣ
− 1
U
T, which follows chi-square distribution with K*J degree freedom (χ
2
K ∗ J
) under the null hypothesis that there is no co-association between these two genes.
Data simulation
Simulation studies were conducted to assess the type I error rate and power of the SBS comparing with other methods for testing gene-gene co-association. We simulated three co-association scenarios as follows: Type I co-association (under nearly independent condition between gene A and gene B, i.e. the traditional interaction β
3), Type II co-association (only caused by correlation between gene A and gene B, i.e. r(β
1 + β
2)),Type III co-association (caused by both correlation and independent term A × B between gene A and gene B, i.e. β
3 + r(β
1 + β
2)). Specifically, the null hypothesis for all three simulation scenarios can be described as inexistence of co-association between two genes. Reference phased haplotype data was downloaded from the HapMap website (http://hapmap.ncbi.nlm.nih.gov/) [21]. Subsequently, a large CEU population of 100,000 individuals was obtained by gs2.0 [22, 23] under the additive genetic model. In all simulations, the causal SNPs were removed to assess the performances of the SBS. For each parameter setting, 1000 simulations were repeated with a significant level of 0.05 and N individuals were sampled from the whole 100,000 population randomly.
For scenario 1 (Type I co-association), we chose 7 SNPs at Chr17:1650000215…1650011216 and 7 SNPs at Chr18:1700258917…1700276475. The case-control statuses were generated from a logistic regression model Logit(P) = β
0 + β
1 × SNP
1 + β
2 × SNP
2 + β
3 × (SNP
1 × SNP
2), where SNP1 and SNP2, correlated with coefficient r were causal SNPs, and the 1st SNP of gene A and 5th SNP of gene B were defined as the causal SNPs. Three different main effects were set to make our simulations more practical, two marginal effects (β
1 = log(1.3), β
2 = log(1.5)), one marginal effect (β
1 = 0, β
2 = log(1.5)) and no marginal effects (β
1 = β
2 = 0). Different β
3 were chosen to evaluate the type I error rate (r = 0, β
3 = 0) under various sample sizes N (N/2 cases and N/2 controls, N = 400, …, 2000) and power (β
3 was specified from log(1.1) to log(1.9) stepped by log(0.2)) under fixed sample size 1200. In addition, we also fixed the interaction odds ratio and main effects to assess the performance of the SBS under different sample sizes.
For scenario 2 (Type II co-association), we chose 7 SNPs at Chr22:2126161008…2126164539 and 7 SNPs at Chr22:2126166075…2126177318. In this situation, the case-control statuses were generated from the logistic regression model Logit(P) = β
0 + β
1 × SNP
1 + β
2 × SNP
2. Different r were specified to evaluate the type I error rate (β
1 = β
2 = β
3 = 0, r = 0.1, 0.2, 0.3, 0.4, 0.5, 0.9) and power under fixed main effects β
1 = 0, β
2 = log(1.5) and β
1 = log(1.3), β
2 = log(1.5) for the two causal SNPs with given sample size 1200. To evaluate the performance under different MAF of causal SNP pairs, different correlation structures between two causal SNPs were chosen from the two regions.
For scenario 3 (Type III co-association), we selected the same gene region as in the scenario 2. The case-control statuses were generated from the model Logit(P) = β
0 + β
1 × SNP
1 + β
2 × SNP
2 + β
3 × (SNP
1 × SNP
2). Two situations were considered: β
3 was specified from log(1.1) to log(1.9) stepped by log(0.2) under fixed r, and r was set from 0.1to 0.5 by 0.1under fixed β
3. All the simulations were conducted under sample size 1200 and different main effect patterns (β
1 = β
2 = 0, β
1 = 0, β
2 = log(1.5) and β
1 = log(1.3), β
2 = log(1.5)).
For the single SNP-based logistic regression model,we considered each pair-wise interaction separately, and selected the most significant one (smallest p-values). Significane levels were assessed using permutations to adjust the multiple testing [10].
Applications
The SBS was also applied to a GWAS of North American Rheumatoid Arthritis (RA) Consortium containing 868 RA cases and 1194 controls [24] and all datasets used were publically available [25, 26]. We chose four genes (VEGFA, PADI4, C5, ITGAV) to detect gene-gene co-association with RA susceptibility, involving four, six, eight and eight SNPs in each gene respectively. Meanwhile, the other six methods mentioned above were also used to detect co-association contributing to RA and their computation time was also calculated by R 3.1.0 on a desktop computer (Intel Core 2 with 3.00 GHz CPU using 4 GB of RAM).