GAW20 data
GAW20 real data were used in this study. They were provided by the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) study, which aimed to identify the genetic determinants of the response of circulating lipid levels to fenofibrate treatment. In total, 1053 individuals from families with at least 2 siblings were recruited, all of whom self-reported as being of white ethnicity [10]. TG levels were measured at visits 1, 2, 3, and 4: visits 1 and 2 took place before the fenofibrate intervention, and visits 3 and 4 after it. At visit 1, a lipid profile was measured after an overnight fast, and a repeat lipid profile was obtained the next day at visit 2. The treatment period lasted 3 weeks, after which participants returned to the clinic on 2 consecutive days for visits 3 and 4 [10]. DNA methylation levels were measured at visits 2 and 4. DNA was isolated from CD4+ T cells harvested from stored buffy coats, and the proportion of methylation was quantified at more than 450,000 cytosine-phosphate-guanine (CpG) sites [10].
Data quality control
In the quality control process, 39 participant outliers were removed, and only subjects without any missing data for the key variables (TG levels at visits 1 to 4, methylation value at visit 2, and genotypes) were used. A total of 523 participants were included in the analysis. For the genotype data, single-nucleotide polymorphisms (SNPs) with a minor allele frequency < 0.01 were excluded. Missing variants were imputed according to the probability distribution of the genotype in all subjects. For the methylation data, cross-reactive probes and probes containing common variants were filtered. Beta-mixture quantile normalization was used to correct for the Infinium Type I/II bias [11], and participant outliers were identified by hierarchical clustering and Eigenstrat [12].
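The two genotype steps above can be illustrated with a minimal Python sketch. It assumes genotypes are coded as 0, 1, or 2 copies of one allele with `None` for a missing call, and it reads "imputed according to the probability distribution of the genotype in all subjects" as sampling from the observed genotype frequencies at that SNP. The function names and the coding are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter

def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts, ignoring missing calls (None)."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        return 0.0
    allele_freq = sum(observed) / (2 * len(observed))
    return min(allele_freq, 1.0 - allele_freq)

def passes_maf_filter(genotypes, threshold=0.01):
    """SNPs with MAF below the threshold are excluded."""
    return minor_allele_frequency(genotypes) >= threshold

def impute_missing(genotypes, rng):
    """Replace each missing call with a draw from the observed genotype
    distribution at this SNP (an assumed reading of the imputation step)."""
    observed = [g for g in genotypes if g is not None]
    if not observed:
        raise ValueError("cannot impute a SNP with no observed genotypes")
    counts = Counter(observed)
    values = sorted(counts)
    weights = [counts[v] for v in values]
    return [g if g is not None else rng.choices(values, weights=weights)[0]
            for g in genotypes]

# Hypothetical SNP with two missing calls.
snp = [0, 0, 1, None, 2, None]
if passes_maf_filter(snp):
    snp = impute_missing(snp, random.Random(0))
```

Sampling keeps the genotype distribution of the imputed SNP close to the observed one, whereas filling with the most common genotype would shrink its variance.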
Drug-response definition
Drug response, the dependent variable, was derived from the percentage change in the TG level:
$$ \mathrm{TG\ change\ percentage}=\frac{\mathrm{TG\ post}-\mathrm{TG\ pre}}{\mathrm{TG\ pre}} $$
where TG pre is the average of the TG levels at visits 1 and 2, and TG post is the average of the TG levels at visits 3 and 4. Fenofibrate, the intervention drug in the GAW20 real data, has been reported to reduce the plasma TG level by approximately 30 to 60% in hyperlipoproteinemia patients at a dosage of 200–400 mg daily [13]. We therefore coded the drug-response variable as 1 when the TG level was reduced by more than 30% after treatment, meaning that the drug worked for the patient, and as 0 otherwise, meaning that the drug did not work as expected. As shown in Fig. 1, 301 participants were coded as 1 and 222 as 0.
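The definition above can be written out as a short Python sketch. The function names and the example TG values are hypothetical; only the averaging of visits 1 and 2 and of visits 3 and 4, and the 30% cut-off, come from the text.

```python
from statistics import mean

def tg_change_fraction(tg_visits):
    """Relative TG change, given TG levels at visits 1 to 4 in order."""
    tg_pre = mean(tg_visits[:2])    # visits 1 and 2, before treatment
    tg_post = mean(tg_visits[2:])   # visits 3 and 4, after treatment
    return (tg_post - tg_pre) / tg_pre

def drug_response(tg_visits, threshold=0.30):
    """1 if TG fell by more than the threshold fraction, otherwise 0."""
    return 1 if tg_change_fraction(tg_visits) < -threshold else 0

# Hypothetical participants (TG values are made up for illustration).
responder = [150.0, 170.0, 100.0, 108.0]      # pre 160, post 104
non_responder = [120.0, 120.0, 100.0, 100.0]  # pre 120, post 100
labels = [drug_response(responder), drug_response(non_responder)]
```

For the first participant the change is (104 - 160) / 160 = -0.35, a reduction of more than 30%, so the response is coded 1; the second participant's reduction of about 17% is coded 0.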
Stratified variable selection and prediction modeling
The features related to drug response were selected in a stratified manner [14]: they were first selected within each data type and then aggregated in an artificial neural network (ANN) to predict the drug response [15]. ANNs perform learning tasks using a collection of computational units and a system of interlinking connections [16]. The central idea of an ANN is to extract features by linearly combining the inputs and then to model the targets with nonlinear functions of those features. A neural network can therefore be thought of as a nonlinear generalization of linear models that can be used for both classification and regression [17]. We used the AMORE package in R 3.3.2 (GUI 1.68, Mavericks build 7288) to conduct the ANN analysis [15]. The stratification enables precise variable selection within each data type, and the ANN allows interaction effects within and across data types to be considered [18]. Five-fold cross-validation error rates and their standard deviation were calculated to evaluate prediction performance.
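The evaluation step can be sketched as follows, assuming the five cross-validation groups are random, roughly equal-sized folds and that the reported standard deviation is taken across the five per-fold error rates. The `fit` argument is a hypothetical callback standing in for the ANN training routine, and `fit_majority` is a placeholder classifier used only to make the sketch runnable.

```python
import random
from statistics import mean, stdev

def five_fold_indices(n, rng, k=5):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_error(X, y, fit, k=5, seed=0):
    """Mean and standard deviation of the per-fold misclassification rate."""
    folds = five_fold_indices(len(y), random.Random(seed), k)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # fit() returns a function mapping one feature vector to a 0/1 label.
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict(X[i]) != y[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return mean(errors), stdev(errors)

def fit_majority(X_train, y_train):
    """Placeholder model: always predicts the majority training label."""
    label = 1 if 2 * sum(y_train) >= len(y_train) else 0
    return lambda x: label
```

Any model that exposes the same `fit` interface can be passed in place of `fit_majority`, so the same loop can score each stage of a stepwise model.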
A generalized estimating equation (GEE) model was used to select significant SNPs while adjusting for family relatedness [19]. CpG sites were selected with a linear mixed model (LMM) that used an empirical kinship matrix to adjust for family structure [20]. Both the mixed-effect model and GEE are theoretically suitable for selecting SNPs and CpG sites while controlling for family structure; the two methods differ in how they estimate the coefficients and treat the population correlation structure. Our main consideration in choosing between them was the ability of the available software packages to handle a binary phenotype, control for family structure, and treat continuous random-effect variables. An arbitrary p value threshold of $10^{-4}$ was applied to filter the biomarkers from GEE and LMM so that a moderate number of predictors could be used in the prediction model. To avoid the strong influence of SNP clusters, SNPs were pruned using snpgdsLDpruning with a linkage disequilibrium threshold of 0.2 [21, 22]. The empirical kinship matrix was calculated from the pruned SNPs to control for family relatedness. Other clinical variables, including sex, age, and smoking status, were also used as predictors.
Predictors were added to the prediction model step by step, one data type at a time. The selected SNPs were entered into the ANN first, followed by the significant CpG sites, and finally age, sex, and smoking status. This stratified approach made it easy to identify the contribution of each category of information to prediction.
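The stepwise procedure amounts to building nested predictor sets, one per data type, and fitting a model on each. A minimal sketch is given below; the function name and the feature identifiers are hypothetical placeholders, not variables from the study.

```python
def staged_feature_sets(blocks):
    """Return cumulative predictor lists, one per stage, in the order given.

    blocks: list of (data_type_name, feature_names) pairs.
    """
    cumulative, stages = [], []
    for name, features in blocks:
        cumulative = cumulative + list(features)
        stages.append((name, list(cumulative)))
    return stages

# Hypothetical identifiers; the order follows the text: SNPs, CpG sites, clinical.
stages = staged_feature_sets([
    ("SNP", ["snp_a", "snp_b"]),
    ("CpG", ["cpg_a"]),
    ("clinical", ["age", "sex", "smoking"]),
])
```

Comparing the cross-validation error of consecutive stages then shows how much each data type adds beyond the ones entered before it.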
A three-layer ANN with one hidden layer was applied. The hyperbolic tangent sigmoid transfer function (tansig) was used as the activation function for the hidden layer. With n denoting the net input to a neuron and a its output, it has the following form:
$$ a=\mathrm{tansig}(n)=-1+\frac{2}{1+{e}^{-2n}} $$
A linear function (purelin) was used as the activation function for the output layer:
$$ a=\mathrm{purelin}(n)=n $$
The learning rate and global momentum were set at 0.01 and 0.4, respectively. The network was trained by adaptive gradient descent with momentum. The least mean squares criterion was used to measure the proximity of the neural network prediction to its target during training.
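To make the architecture concrete, the following Python sketch implements a network with one tansig hidden layer, a linear output, a squared-error criterion, and the stated learning rate (0.01) and momentum (0.4). It is not the AMORE implementation: it uses plain per-sample gradient descent with momentum rather than AMORE's adaptive variant, and the class name, hidden-layer size, and weight initialization are assumptions made for illustration.

```python
import math
import random

def tansig(n):
    """Hyperbolic tangent sigmoid, a = -1 + 2 / (1 + exp(-2n))."""
    return -1.0 + 2.0 / (1.0 + math.exp(-2.0 * n))

class TinyANN:
    """Input layer, one tansig hidden layer, one linear output unit."""

    def __init__(self, n_inputs, n_hidden, rng, learning_rate=0.01, momentum=0.4):
        self.lr, self.mu = learning_rate, momentum
        # The last weight in each row multiplies a constant 1.0 (bias term).
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
        self.v1 = [[0.0] * (n_inputs + 1) for _ in range(n_hidden)]
        self.v2 = [0.0] * (n_hidden + 1)

    def forward(self, x):
        xb = list(x) + [1.0]
        hb = [tansig(sum(w * xi for w, xi in zip(row, xb))) for row in self.w1]
        hb.append(1.0)
        output = sum(w * h for w, h in zip(self.w2, hb))  # linear output: a = n
        return xb, hb, output

    def train_step(self, x, target):
        """One gradient step with momentum on the loss 0.5 * (output - target)^2."""
        xb, hb, output = self.forward(x)
        err = output - target
        # Compute all gradients before any weight is changed.
        grad_w2 = [err * h for h in hb]
        grad_w1 = [[err * self.w2[j] * (1.0 - hb[j] ** 2) * xi for xi in xb]
                   for j in range(len(self.w1))]  # d tansig / dn = 1 - a^2
        for j in range(len(self.w2)):
            self.v2[j] = self.mu * self.v2[j] - self.lr * grad_w2[j]
            self.w2[j] += self.v2[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                self.v1[j][i] = self.mu * self.v1[j][i] - self.lr * grad_w1[j][i]
                row[i] += self.v1[j][i]
        return 0.5 * err ** 2

# Hypothetical usage: three predictors, two hidden units, a 0/1 target.
net = TinyANN(n_inputs=3, n_hidden=2, rng=random.Random(0))
losses = [net.train_step([0.2, -1.0, 0.5], 1.0) for _ in range(500)]
```

Because the output unit is linear, a binary prediction would still require thresholding the output (for example at 0.5); the threshold is a further assumption and is not stated in the text.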