PCA
As a traditional multivariate statistical technique, PCA has been widely applied in genetic analysis, both for reducing redundant information and for interpreting multiple SNPs. The basic idea of PCA is to represent the data efficiently by decomposing the data space into a linear combination of a small collection of bases consisting of orthogonal axes that maximally decorrelate the data. Assume that M SNPs in a candidate gene or genome region of interest have coded values {x_i ∈ R^M | i = 1, 2, ..., N}, where N is the sample size, under a given genetic model (the additive model here). PCA diagonalizes the covariance matrix of the centered observations x_i (i.e., Σ_{i=1}^{N} x_i = 0), defined as

C = (1/N) Σ_{j=1}^{N} x_j x_j^T.   (1)

To do this, one has to solve the eigenvalue problem

λν = Cν,   (2)

where ν are the eigenvectors of C and λ are the corresponding eigenvalues. As Cν = (1/N) Σ_{j=1}^{N} (x_j · ν) x_j, all solutions ν with λ ≠ 0 must lie in the span of {x_i ∈ R^M | i = 1, 2, ..., N}; hence (2) is equivalent to

λ(x_i · ν) = (x_i · Cν) for all i = 1, ..., N,

where the dot product of two vectors a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N) is defined as a · b = Σ_{i=1}^{N} a_i b_i.
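For concreteness, here is a minimal R sketch of equations (1) and (2); the genotype matrix geno (N rows of individuals, M columns of additively coded SNPs) is an illustrative name, not an object from our code:

X <- scale(geno, center = TRUE, scale = FALSE)  # center the observations x_i
C <- crossprod(X) / nrow(X)                     # covariance matrix C, eq. (1)
eig <- eigen(C, symmetric = TRUE)               # eigenvalues lambda and eigenvectors nu of eq. (2)
pc_scores <- X %*% eig$vectors                  # linear principal component scores

In practice the same scores (up to sign) can be obtained with prcomp(geno), which R computes via a singular value decomposition.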
KPCA
Given the observations, we first map the data nonlinearly into a feature space F by

Φ: R^M → F, x ↦ Φ(x).

Again, we make the assumption that the data mapped into the feature space, Φ(x_1), ..., Φ(x_N), are centered, i.e. Σ_{i=1}^{N} Φ(x_i) = 0. To do PCA for the covariance matrix

C̄ = (1/N) Σ_{j=1}^{N} Φ(x_j) Φ(x_j)^T,   (3)

we have to find eigenvalues λ ≥ 0 and eigenvectors ν ∈ F\{0} satisfying

λν = C̄ν.

By the same argument as above, the solutions ν lie in the span of Φ(x_1), ..., Φ(x_N). This implies that we may consider the equivalent equation

λ(Φ(x_i) · ν) = (Φ(x_i) · C̄ν) for all i = 1, ..., N,   (4)

and that there exist coefficients α_i (i = 1, ..., N) such that

ν = Σ_{i=1}^{N} α_i Φ(x_i).   (5)

Substituting (3) and (5) into (4), we arrive at

NλKα = K²α,   (6)

where α denotes the column vector with entries α_1, ..., α_N, and K is a symmetric N × N matrix defined by

K_ij := (Φ(x_i) · Φ(x_j)).   (7)

K has a set of eigenvectors which spans the whole space; thus

Nλα = Kα   (8)

gives all solutions α of equation (6).
Let λ_1 ≤ λ_2 ≤ ... ≤ λ_N denote the eigenvalues of the matrix K, with α^1, α^2, ..., α^N being the corresponding complete set of eigenvectors, and let λ_p be the first nonzero eigenvalue. We normalize the solutions α^p, ..., α^N by requiring that the corresponding vectors in F be normalized, i.e. ν^k · ν^k = 1 for all k = p, p + 1, ..., N. By (5), (6) and (8), this translates into

λ_k (α^k · α^k) = 1.   (9)

To extract principal components, we need to compute the projections onto the eigenvectors ν^k in F. Let x be the SNP set, within the previously defined gene or genome region, of an individual, with image Φ(x) in F; then

ν^k · Φ(x) = Σ_{i=1}^{N} α_i^k (Φ(x_i) · Φ(x))   (10)

are its nonlinear principal components corresponding to Φ.
Note that neither (7) nor (10) requires Φ(x_i) in explicit form: the mapped data are needed only in dot products. We are therefore able to use kernel functions to compute these dot products without actually performing the map Φ. For some choices of a kernel k(x_i, x_j), it can be shown by methods of functional analysis that there exists a map Φ into some dot product space F (possibly of infinite dimension) such that k(x_i, x_j) computes the dot product in F, i.e. k(x_i, x_j) = (Φ(x_i) · Φ(x_j)). This property is often called the "kernel trick" in the literature.
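The following minimal R sketch puts equations (7)-(10) together via the kernel trick, using the RBF parameterization k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)); the function and variable names, and the explicit centering of K (used in practice in place of the assumption Σ_i Φ(x_i) = 0), are our illustrative choices:

rbf_kernel_matrix <- function(X, sigma) {
  D2 <- as.matrix(dist(X))^2                    # squared pairwise Euclidean distances
  exp(-D2 / (2 * sigma^2))                      # K[i, j] = k(x_i, x_j), eq. (7)
}

kpca_scores <- function(X, sigma, L) {
  N <- nrow(X)
  K <- rbf_kernel_matrix(X, sigma)
  J <- matrix(1 / N, N, N)                      # center K in feature space, since
  K <- K - J %*% K - K %*% J + J %*% K %*% J    #   Phi(x_i) cannot be centered directly
  eig <- eigen(K, symmetric = TRUE)             # solves N*lambda*alpha = K*alpha, eq. (8)
  alpha <- eig$vectors[, 1:L, drop = FALSE]     # eigenvectors of the L largest eigenvalues
  alpha <- sweep(alpha, 2, sqrt(eig$values[1:L]), "/")  # normalization, eq. (9)
  K %*% alpha                                   # projections nu^k . Phi(x_i), eq. (10)
}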
Theoretically, a proper kernel function can be constructed for any data set based on Mercer's theorem of functional analysis [29]. The most common kernel functions include the linear kernel, the polynomial kernel, the radial basis function (RBF) kernel, the sigmoid kernel [30], the IBS kernel and the weighted IBS kernel [32]. In particular, KPCA with the linear kernel is identical to standard linear PCA. It is worth noting that, in general, the above kernel functions show similar performance if appropriate parameters are chosen. In the present work, we chose the RBF kernel owing to its flexibility in choosing the associated parameter [33].
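For reference, a usage sketch with the kernlab package employed later in this paper; note that kernlab's rbfdot kernel is parameterized as exp(-σ||x_i - x_j||^2), so its sigma argument is an inverse squared bandwidth, and the values below are illustrative only:

library(kernlab)
kpc <- kpca(geno, kernel = "rbfdot", kpar = list(sigma = 0.1), features = 5)
kpc_scores <- rotated(kpc)   # kernel principal component scores of the samples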
There are two widely used approaches to selecting the parameters of a given kernel function. The first chooses a series of candidate values for the kernel parameter empirically, runs the learning algorithm with each candidate value, and finally assigns to the kernel parameter the value yielding the best performance. The second is cross-validation. Both approaches, however, are time-consuming and carry a high computational burden [34]. For the RBF kernel applied in the present study, a popular way of choosing the bandwidth parameter σ is to set it to the median of all pairwise Euclidean distances ||x_i - x_j||, 1 ≤ i < j ≤ N, in the set {x_k ∈ R^M | k = 1, 2, ..., N} [35–37].
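In R this median heuristic is a one-liner (geno again denotes the illustrative N × M genotype matrix):

sigma <- median(as.vector(dist(geno)))   # median of the N(N-1)/2 pairwise ||x_i - x_j||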
Models
To test the associations between multiple SNPs and disease, the PCA-LRT and KPCA-LRT models are defined as follows:

logit[P(Y = 1)] = β_0 + β_1 PC_1 + ··· + β_L PC_L,   (11)

logit[P(Y = 1)] = γ_0 + γ_1 KPC_1 + ··· + γ_L KPC_L,   (12)

where Y denotes disease status (1 = case, 0 = control) and the PC_l and KPC_l are the first L linear and nonlinear (kernel) principal component scores of the SNPs, respectively. The value of L can be chosen such that the cumulative proportion of the total variability explained by the first L PCs, (λ_1 + λ_2 + ··· + λ_L)/(λ_1 + λ_2 + ··· + λ_M), exceeds some threshold. For comparison, we set the same threshold of 80% in both PCA-LRT and KPCA-LRT, as in Gauderman et al. [34].
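A hedged R sketch of the resulting PCA-LRT (model (11) tested against an intercept-only null by a likelihood ratio test); substituting kernel PC scores, e.g. from kpca_scores() above, gives the KPCA-LRT, and all names here are illustrative:

pca_lrt <- function(geno, y, threshold = 0.80) {
  pc <- prcomp(geno, center = TRUE)               # linear PCA
  cumprop <- cumsum(pc$sdev^2) / sum(pc$sdev^2)   # cumulative explained variability
  L <- which(cumprop >= threshold)[1]             # smallest L reaching the 80% threshold
  scores <- pc$x[, seq_len(L), drop = FALSE]      # first L PC scores
  full <- glm(y ~ scores, family = binomial)      # model (11)
  null <- glm(y ~ 1, family = binomial)           # intercept-only null model
  stat <- as.numeric(2 * (logLik(full) - logLik(null)))
  pchisq(stat, df = L, lower.tail = FALSE)        # LRT p-value on L degrees of freedom
}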
Data simulation
To assess the performance of KPCA-LRT and compare it with PCA-LRT, we carry out a statistical simulation based on HapMap data under both the null hypothesis (H0) and the alternative hypothesis (H1). The steps of the simulation are as follows:
Step 1. Download the phased haplotype data of a genome region from the HapMap web site (http://snp.cshl.org). We select the protein tyrosine phosphatase, non-receptor type 22 (PTPN22) gene region to generate the simulated genotype data for the CEU population using the HapMap Phase 1 & 2 full dataset. This region is located at Chr 1: 114168639..114197803 and includes 11 SNPs. Figure 1 shows their pairwise R^2 structure and minor allele frequencies (MAF).
Step 2. Based on the HapMap phased haplotype data, we generate large samples of 100 000 cases and 100 000 controls for the CEU population using the software HAPGEN [38]. To investigate the performance of the two methods on causal SNPs with different MAFs and different LD patterns, each of the 11 SNPs is in turn defined as the causal variant. We remove the causal SNP in each simulation to assess the indirect association with disease via correlated markers. Under H0, we set the relative risk per allele to 1.0 to assess the type I error. Under H1, different levels of relative risk are set (1.1, 1.2, 1.3, 1.4 and 1.5 per allele) to assess the power. The SNPs in this region are coded according to the additive genetic model.
Step 3. From the remaining SNPs, we sample the simulation data and perform the PCA-LRT and KPCA-LRT under different sample sizes N (N/2 cases and N/2 controls, N = 1000, 2000, ..., 12000) using the R packages kernlab (http://cran.r-project.org/web/packages/kernlab/index.html) and Design (http://cran.r-project.org/web/packages/Design/index.html). Under H0, we repeat 10 000 simulations at two significance levels (0.05 and 0.01). Under H1, for each model with a given relative risk, we repeat 10 000 simulations at four significance levels (0.05, 0.01, 1E-5 and 1E-7). A sketch of this rejection-rate loop is given below.
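For concreteness, a hedged sketch of the Step 3 loop, assuming cases and controls hold the large HAPGEN pools with the causal SNP already removed, and pca_lrt() as sketched above; under H0 the returned proportion estimates the type I error, under H1 the power:

estimate_rejection_rate <- function(cases, controls, N, alpha = 0.05, n_rep = 10000) {
  mean(replicate(n_rep, {
    dat <- rbind(cases[sample(nrow(cases), N / 2), ],        # draw N/2 cases
                 controls[sample(nrow(controls), N / 2), ])  # and N/2 controls
    y <- rep(1:0, each = N / 2)                              # case/control indicator
    pca_lrt(dat, y) < alpha                                  # reject at level alpha?
  }))
}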
Application
The proposed method is applied to the rheumatoid arthritis (RA) data from GAW16 Problem 1. The data consist of Illumina 550k SNP chip genotypes for 2062 individuals: 868 RA patients and 1194 normal controls collected by the North American Rheumatoid Arthritis Consortium (NARAC) [39]. In the present study, only the 1493 females (641 cases and 852 controls) are analyzed, to avoid potential bias arising from the fact that rheumatoid arthritis is two to three times more common in women than in men [40].
To illustrate the performance of PCA-LRT and KPCA-LRT, we mainly focus on four specific regions on chromosome 1, involving the genes PTPN22, ANKRD35, DUSP23 and RNF186, respectively. The reasons are as follows: 1) both the PTPN22 gene (R620W, rs2476601) and the ANKRD35 gene have been reported to be associated with RA [41–43]; 2) DUSP23 can activate mitogen-activated protein kinase kinase [43], which may regulate a pathway involved in rheumatoid arthritis [44, 45]; 3) RNF186 harbors an ulcerative colitis risk locus (rs3806308) [44], and RA may be associated with ulcerative colitis [45].