Here we describe existing variance heterogeneity tests and the newly proposed test, which is suitable for the analysis of imputed genetic data (subsection "Variance heterogeneity tests"). Next, we describe the setup of the simulations, which were used to study statistical properties of the new test (subsection "Simulations") and outline the details of implementation of our software (section "The VariABEL package").
Variance heterogeneity tests
For measuring variance heterogeneity we have implemented two tests: Levene's test [6, 7] and the test where linear regression is performed on squared residual values of a trait (Squared residual Value Linear Modeling, SVLM).
Levene's (the Brown-Forsythe) test is defined as:
(1)
where Z
i
= |y
i
- ỹgi| is the deviation of the value of the trait of i-th individual, y
i
, who has genotype g
i
from the median value of the trait in individuals having that genotype, ỹ
gi
; N is the total sample size, n
j
is the number of individuals with genotype j, k is the number of possible genotypes, is mean deviation from the median for individuals having genotype j ( is an indicator variable which takes value of one if g
i
is equal to j and zero otherwise), and is the mean deviation from the median across all individuals.
Under the null hypothesis of variance homogeneity, the value of the test statistic, T2, has an F distribution with df1 = (k - 1) and df2 = (N - k) degrees of freedom. In a case of large N, T2 is approximated well by the distribution. With three possible genotypes, k - 1 = 2.
Genetic imputations routinely used in GWAS nowadays increase the power in the analysis of individual studies and also allow meta-analysis of the studies using different SNP arrays. In case of imputations the posterior probability of a genotype is estimated for each subject for a given SNP. Because standard variance heterogeneity tests assume that an observation should be known to belong to a certain group (i.e. an individual is known to have specific genotype with full confidence), they can not be directly applied to the imputed data.
To allow for variance heterogeneity test for imputed SNPs we propose a simple procedure (SVLM) described below. It is known from elementary statistics that by definition the variance is:
(2)
where Y is a random variable, Var(Y) is the variance of Y, E[Y] and E[Y2] are expected values of the variable Y and Y 2 correspondingly. In our case Y is a trait. The variance of Y conditional on the genotype g is V (Y |g) = E[(Y - E[Y |g])2|g]. This means that for each genotype the variance is equal to the mean of the squared residual of the trait conditional on the genotype.
To explain this idea we provide Figure 1. Panel 1A shows the relation between the trait value and the number of B alleles in the genotype. It is assumed that allele B is interacting with some quantitative factor, hence the variance of the trait is increasing as the number of B alleles, present in an individual's genotype, increases. Figure 1B shows the same data, but the points correspond to the squared residuals after subtracting genotypic mean from the trait's value. The means of these squared residuals in each genotypic group shown in panel B is equal to the variance within genotypic groups shown in panel A. Thus, taking squared residuals conditional on the genotype changes the task of estimation of the conditional variances into the task of estimation of the conditional means, which can be approached with using conventional methods such as regression analysis. Important covariates having large effects on means, can be easily accommodated in the model if necessary by modifying the expression used to compute the conditional mean.
Technically, the SVLM method consists of two steps. First, a regression analysis is applied where the trait is adjusted for a possible SNP effect and other covariates. Second, a regression analysis is applied to the squared values of residuals obtained from the first stage, using the SNP as the predictor.
Simulations
To study Type I error and power of the SVLM test, we performed a simulations study. Similar to our previous work [3], we simulated the trait under following linear model
(3)
where y
i
is the value of the trait for ithindividual, μ is the intercept, β
g
is the direct effect of the SNP, β
F
is the direct effect of the interacting factor, β
gF
is the effect of interaction between the SNP and the factor, g
i
~ B(n
g
,P
B
) is a SNP, which is assumed to be binomially distributed with n
g
= 2 (number of alleles in the genotype) and P
B
∈ [0; 1] (frequency of the interacting B allele). is a factor, which is assumed to be normally distributed with mean μ
F
and variance . ϵ
i
is the random error, which follows a normal distribution with a zero mean and a variance of one. We assumed that the distributions of g
i
, F
i
, and ϵ
i
are independent. For our simulations, without loss of generality we can assume that μ = μ
F
= 0, and .
Without loss of generality for both type I error and power the SNP effect was set to zero β
g
= 0. We studied four different frequencies of the interacting allele: 5%, 40%, 60% and 95%. Results for 40% are presented in the text below. Results for other frequencies are shown in Additional file 1, Figure S1, Additional file 2, Figure S1 and Additional file 3, Figure S1. From the GWAS it is known that the regression analysis may lead to spurious results when the frequency of the minor allele is very low.
Therefore additionally we have studied type I error of the SVLM test for allele frequencies 0.0005, 0.00075, 0.001, 0.002, 0.003, 0.004 and 0.005. As SVLM requires squaring of the trait's residuals, the presence of extreme values can affect the type I error and power of this test. To check this, we studied three types of distribution of the residual error term ϵ
i
: one normal distribution and two types of χ2distributions with degrees of freedom df = 1 and df = 5 respectively. We simulated data for 10000 individuals.
For studying Type I error effect of the factor and the effect of interaction term were both set to zero (β
F
= β
gF
= 0). Twenty thousands simulations were performed.
For studying dependence between power and the interaction effect, 1000 simulations were done under each simulation scenario. As we demonstrated previously [3], the magnitude of the genotypic variance difference and hence the power of the variance heterogeneity test depends not only on the effect of interaction between the genotype and the environmental factor, but also on the magnitude of the main effect of the interacting factor F. This dependence is not monotonic and, given other parameters are fixed, there is a certain optimal main effect of the factor under which the magnitude of variance difference and, therefore, the power to detect interaction is maximal. The value of the optimal effect depends on the interaction effect, variance of the factor and the variance of error term. As in our study variance of the factor and the variance of error term is fixed to one, the optimal effect of the factor depends on the interaction effect only. For simplicity, power was studied using the optimal main effect of the interacting factor. The range of the optimal effects of the factor used in this study can be found in the Additional file 4, Figure S1.
In both type I error and power estimation, the null hypothesis was rejected when threshold p-value ≤ 0.05 was reached.
The VariABEL package
The VariABEL software implementing the SVLM is designed as an R package written in C++ and R languages. For regression analysis used by the SVLM method, the LAPACK functions "dgeqrf" and "ch2inv", which are part of the R distribution, were used. The package was compiled with gcc version 4.1.2 under Linux with version 2.6.18-274.7.1.el5 (Red Hat 4.1.2-51) and tested in R of version 2.13.1. The package is distributed under the GNU GPL license (v. 2.0 or later).
Stable version of the VariABEL package can be downloaded from the Comprehensive R Archive Network, CRAN [8] (http://www.r-project.org/). Installation is possible from R directly by running the command "install.packages("VariABEL")". Documentation is available as a part of the distribution and also on-line at the GenABEL project web-site (http://www.genabel.org). Developmental version of the package is available from the GenABEL project development pages (http://genabel.r-forge.r-project.org) located at R-forge [9].
The first stage of SVLM analysis consists of standard regression analysis which is used to access association between mean values of the trait and SNPs. VariABEL output contains results from both stages of analysis (modeling of means and variances). Thus, the VariABEL can be used for regular GWAS as well.