Genotypes
Data were derived from a 2007 MS case-control study conducted by the International Multiple Sclerosis Genetics Consortium [19] and were comprised of genotypes for a total of 325,807 SNPs (Affymetrix GeneChip Human Mapping 500K array) in 931 MS cases and 2,431 controls (n = 3,362). Stringent quality control (QC) analyses were applied to the dataset as previously described, including the removal of population outliers [19]. SNPs with greater than 10% missing data were removed. The genetic inflation factor was 1.06, indicating negligible population stratification [19].
Less than 1% of the genetic data contained missing values. There are a few different ways missing data can be handled within RF. However, since the data were derived from a dense SNP marker panel and had minimal missingness, any missing values were imputed with Beagle 2.13 [20]. Allelic data were then recoded into genotype format using PLINK 1.05 [21], producing three categories for each SNP (0, 1 and 2 copies of the minor allele). Since the optimal binary split is found at each node, this allows for the algorithm to be agnostic to recessive, dominant or additive effects. An allelic chi-square test (df = 1) was performed to calculate marginal associations for comparison.
RF Implementation
The RF code was originally written in Fortran by Breiman and Cutler. There is also an R package randomForest based on the same Fortran code [22]. Neither implementation could be used for the large GWA dataset in the current study. The original RF code has been licensed to Salford Systems [23], and they recently optimized the Fortran version, v.6.4.0.179, for application to large datasets. In preliminary testing of small datasets, similar results were found between the three implementations of RF (data not shown). RF was implemented in a server environment with 8 2/GHz cpus and 32 GB of memory. Run time was dependent on data size and mtry, ranging from a few seconds per tree to over 10 minutes per tree (~ 1 week for a full forest).
Tuning Parameters Considered
Number of variables to choose per node (mtry)
The primary tuning parameter in RF is the number of variables to search at each node (mtry). This parameter controls the bias-variance trade-off. Searching over fewer variables per node will produce less correlated trees, reducing the overall variance of the prediction. However, this will also decrease the accuracy of each individual tree, increasing the bias. The mtry can also be viewed as controlling the complexity of the model, with a smaller value leading to a more complex, less sparse solution (see below). Breiman originally suggested choosing the int(log 2 p + 1) of the number of predictors per node. In the R implementation, the default value is the square root of the number of predictors.
For a GWA dataset, this would entail examining approximately 550 SNPs per node. As noted by Breiman, when there are many weak predictors, this number may need to be increased. It has also been noted that mtry is more important for VI calculation than for prediction, and that with sparse data, mtry = p leads to greatest stability [18]. A coarse search for the optimal mtry was performed in the current study using mtry values of 1,
, 0.1p, 0.5p and p. The parameterization that produced the lowest final OOB-ER was chosen as the optimal mtry.
Number of Trees to Grow (ntree)
Another important consideration is how many trees to grow. This is also a dataset dependent factor, where stronger predictors lead to quicker convergence. While for prediction purposes few trees are often necessary, and the OOB-ER will generally converge rapidly, for VI, more trees will generally lead to refinement and stability in VI [18].
The main trade-off with growing a larger number of trees is the computation cost required. In the current study, trees were grown until the OOB-ER stabilized. Additional trees were then grown to ensure stability.
Weighting
The final tuning parameter, which was not considered in this analysis, is weighting. In classification, with uneven classes, an unweighted classification scheme will be biased towards the majority class. The typical strategy is to re-weight the classes so that they are balanced, the practice used within the Salford Systems implementation of RF, and the default in the R implementation. Unfortunately, class weighting cannot be altered in the Salford Systems version, so it could not be tested as a tuning parameter. However, internal testing on a more flexible version of RF showed no added benefit to changing the weighting.
Data Configurations
Sparsity Pruning
As noted, it is expected that the vast majority of SNPs in a GWA study do not impact risk for disease, and therefore, are simply noise. The goal of any algorithm should be to separate noise from signal, providing a sparse solution. A sparse solution is indicated when the VI is either 0 or negative. Such a VI indicates that the variable was either never selected into a tree, or when it was selected, permutation did not increase the prediction error. Sparse solutions provide a convenient way to remove unimportant data from the analysis.
Sparsity is a function of both mtry and ntree, with a higher mtry leading to greater sparsity and a higher ntree leading to less sparsity. One proposed strategy is to sequentially remove genes by dropping the bottom 20% or 50%, and perform successive runs until there is a noticeable increase in prediction error [7]. Utilizing the natural sparseness in the dataset, the results of each RF run were examined and sparse SNPs were dropped. The RF analysis was then re-run until prediction error stabilized. While this will give a biased estimate of the prediction error for the model [24], it can still be used to judge model quality. This sub-sampling process was repeated in the current study until the final OOB error-rate stabilized or increased.
Removing Strong Associations
RF searches over multiple variables finding solutions based on joint and conditional effects. Variable(s) with strong effects may mask weaker, yet important effects. It is well established that the HLA region within the major histocompatibility complex (MHC) on chromosome 6p is strongly associated with MS [25]. Therefore, to search for weaker non-MHC effects, RF analysis was performed in the current study after removing chromosome 6p marker data.
Linkage Disequilibrium
An important consideration when applying RF to GWA data is the large degree of LD among SNPs. VI is calculated from the number of trees in which a variable appears. Therefore, two SNPs that are in perfect LD will appear in trees about half as often as each individual one may appear by itself, effectively lowering the VI of each SNP. While this does not present a problem for prediction, it can skew the VI rankings [18]. Two proposed solutions have been to calculate VI independently of the number of trees in which the variable appears [10] or as conditional on other variables in the tree [26].
PLINK [21] provides two methods of LD pruning based on r 2 and R 2. r 2 is a traditional pairwise LD measure, though not based on phased haplotypes. R 2 is the multiple correlation coefficient based on a sliding window. Using PLINK, SNPs with a multiple correlation coefficient (R 2) of 0.99, 0.90, 0.80, 0.50 and 0.33 were removed from the MS case-control dataset for comparison. This resulted in pruning between 22% and 76% of the original data which had the side benefit of increasing computational efficiency.
Reliability of Results Obtained from RF
Since RF is a Monte-Carlo process, random variation may influence VI results, particularly if enough trees are not grown. While, work has indicated that RF results are relatively stable [18] and our own internal testing has confirmed this, it is important to grow large forests and do multiple runs when possible. Reliability of final RF results was examined by re-running RF with the final dataset configuration, parameterization and sub-sampling process, changing just the seed in the random number generator. While more than one re-run would be ideal, the VI measures are unlikely to be unstable given that two runs were performed.
Comparison of RF Results to Original GWA study
The original MS GWA study identified, with replication, 16 SNPs across 13 genes as associated with MS [19]. An important consideration for the current study was whether RF could identify additional genes of interest, as well as duplicate the original findings based on univariate testing. Duplication was considered present when a SNP identified by RF was: (1) among the original 16 SNPs, or (2) a SNP that was tagged by one of the 16 SNPs identified in the original GWA study. PLINK was used to identify tagged SNPs using an r 2 threshold of 0.5.
Analysis Strategy
Figure 2 presents the analysis plan. The primary method for choosing tuning parameters was minimization of the OOB-ER, as this is the best indication of model quality. Determining how many results to report is more subjective since the VI measure does not constitute a formal hypothesis test. To help guide interpretation of RF results for follow-up, we plotted VI scores. A sloping line with an "elbow" (Figure 3) was observed most often around the top 25, so this was chosen as the cutoff for an important result.