Data set
The data set provided by GAW20 organizers included a simulated pharmacoepigenetic effect of a fictitious drug on TG response, in which the major effects are interactions that depend on an individual’s genotype and corresponding methylation state. Specifically, five simulated causal single nucleotide polymorphisms (SNPs) exert their influence on TG response to treatment when their five corresponding nearby cytosine-phosphate-guanine (CpG) sites are sufficiently unmethylated posttreatment. This analysis used genome-wide genotypes, simulated posttreatment genome-wide methylation, and TGs measured on two consecutive days pretreatment and simulated on two consecutive days posttreatment, for a total of four TG measures.
Consistent with the simulation model, we calculated TG response by subtracting the average of the log-transformed pretreatment TG measures from the average of the log-transformed posttreatment TG measures, and then adjusted this difference for baseline TG level using linear regression. The resulting residuals were used as the outcome; SNPs and posttreatment CpG sites were the predictors in RF.
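As a concrete illustration, this outcome derivation can be sketched as follows (a minimal Python sketch, not the authors' code; the function name `tg_response_residuals` and the two-column input layout are assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def tg_response_residuals(pre_tg, post_tg):
    """TG response adjusted for baseline: residuals of (mean log post -
    mean log pre) regressed on baseline (mean log pretreatment) TG.

    pre_tg, post_tg: (n_subjects, 2) arrays holding the two pretreatment
    and two posttreatment TG measurements (hypothetical layout).
    """
    log_pre = np.log(pre_tg).mean(axis=1)    # average log pretreatment TG
    log_post = np.log(post_tg).mean(axis=1)  # average log posttreatment TG
    diff = log_post - log_pre                # raw TG response
    baseline = log_pre.reshape(-1, 1)        # baseline TG level
    fit = LinearRegression().fit(baseline, diff)
    return diff - fit.predict(baseline)      # residuals used as the outcome
```

Because the residuals come from an ordinary least-squares fit with an intercept, they have mean zero and are uncorrelated with the baseline covariate by construction.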
A total of 680 participants had all three data types and were included in the analyses. Because of the computational demands of the analyses, we focused on chromosomes 1, 6, 8, 10, and 17 (202,919 SNPs and 153,422 simulated posttreatment CpGs, for a total of 356,341 variables), which contained the five causal SNPs and their corresponding methylation sites. Furthermore, we used the 84th simulation replicate, which was suggested by the GAW20 organizers to be most representative of the 200 simulations provided. Correlation between predictors was calculated using the Pearson \( r^2 \). Regional association plots displaying results from both RF and RF-RFE were created using LocusZoom v1.3 [10].
Random forest
RF is a machine-learning algorithm that ranks the importance of each predictor included in a model by constructing a multitude of decision trees [4]. Each node of a tree considers a different subset of randomly selected predictors, of which the best predictor is selected and split on. The criterion used to determine the best predictor was the decrease in node impurity, measured with the estimated response variance, which is the default method for regression trees in the ranger implementation of RF used in this study [11]. Each tree is built using a different random bootstrap sample, which consists of approximately two-thirds of the total observations and is used as a training set to predict the data in the remaining out-of-bag (OOB) sample, or testing set. Predictions for each observation are aggregated across all trees, and the mean square error (MSE) of the OOB estimates is calculated. The MSEOOB and percentage of variance explained are used to evaluate the performance of each RF.
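The OOB evaluation described above can be reproduced on toy data with any standard RF implementation. The sketch below uses scikit-learn's RandomForestRegressor rather than the ranger (R) implementation used in the study, with illustrative data and settings, and derives MSEOOB and the percentage of variance explained from the OOB predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))             # toy predictors (not GAW20 data)
y = X[:, 0] + 0.5 * rng.normal(size=200)   # outcome driven by one predictor

rf = RandomForestRegressor(n_estimators=500, oob_score=True,
                           random_state=0).fit(X, y)

# Each observation is predicted only by the trees whose bootstrap sample
# excluded it; aggregating these gives the OOB prediction per observation.
mse_oob = np.mean((y - rf.oob_prediction_) ** 2)
pct_var_explained = 100 * (1 - mse_oob / np.var(y))
```

Because the OOB sample plays the role of a testing set, MSEOOB approximates out-of-sample error without a separate holdout split.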
Recursive feature elimination
To assess whether RF-RFE improved upon RF alone, we compared the importance scores attained after running RF once with those attained after running it recursively, using the initial RF as the first of the recursive runs in the RF-RFE approach. The RF-RFE approach consisted of (a) running RF to determine initial importance scores, (b) removing the 3% of variables with the lowest importance scores from the data set (3% was chosen because of the high computational demands of using a lower threshold; this resulted in a total of 324 RF runs), and (c) assigning ranks to removed variables according to the order in which they were removed and their most recent importance scores (ie, importance scores are only compared within runs, not between runs). These steps were repeated on the reduced data set created in step (b) until 3% of the number of remaining variables rounded to zero (ie, no further variables could be removed).
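The steps above can be sketched as a loop (a Python/scikit-learn sketch, not the authors' code; the function name `rf_rfe` is hypothetical, impurity-based importances stand in for the permutation importances used in the study, and the forest size is kept small for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_rfe(X, y, names, drop_frac=0.03):
    """Recursive feature elimination with RF: repeatedly drop the
    drop_frac of variables with the lowest importance scores, ranking
    removed variables by removal order and within-run importance."""
    ranking = []                                     # built worst-to-best
    X, names = np.asarray(X, dtype=float), list(names)
    while True:
        rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
        order = np.argsort(rf.feature_importances_)  # ascending importance
        n_drop = round(drop_frac * len(names))
        if n_drop == 0:                              # 3% rounds to zero: stop
            ranking += [names[i] for i in order]     # survivors rank highest
            break
        ranking += [names[i] for i in order[:n_drop]]  # remove lowest 3%
        keep = np.sort(order[n_drop:])               # keep original column order
        X, names = X[:, keep], [names[i] for i in keep]
    return ranking[::-1]                             # best variable first
```

Note that ranks are assigned only from the run in which a variable was removed, so importance scores are never compared across runs, matching step (c).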
Parameter tuning and model runtimes
In RF, the number of predictors sampled for splitting at each node, mtry, and the number of trees in the forest are the two primary tuning parameters [12]. For this analysis, 8000 trees were used. When the majority of features are noise or a very large number of features is being used, an mtry of 0.1*p, where p is the number of predictors in the model, has been suggested as a more appropriate choice than the default mtry = \( \sqrt{p} \) [13, 14]. Thus, when p > 80 and the model likely still contained many noisy variables, we used an mtry of 0.1*p; after features were recursively removed from the model and p ≤ 80, we used the default mtry. These parameters produced reasonably low MSEOOBs compared with other settings tested. Permutation variable importance mode was used.
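The mtry rule amounts to the following (a hypothetical helper for illustration; the text does not state how \( \sqrt{p} \) or 0.1*p are rounded, so floor and round-to-nearest, respectively, are assumed):

```python
import math

def choose_mtry(p):
    """mtry rule from the text: 0.1*p while many noisy predictors likely
    remain (p > 80), otherwise the default sqrt(p). Rounding is assumed."""
    return max(1, round(0.1 * p)) if p > 80 else max(1, math.isqrt(p))
```

For example, with the full set of 356,341 variables this rule gives mtry = 35,634.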
The initial RF took approximately 6 h and the RF-RFE took approximately 148 h to run on a Linux server with 16 cores and 320 GB of RAM.