PCA
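Assume that $p$ SNPs in a candidate region of interest have coded values $(X_1, X_2, \cdots, X_p)$ according to a given genetic model (e.g., additive model) whose correlation matrix is $C$. PCA solves the following equation,
$$C l_i = \lambda_i l_i,$$
where $\lambda_i$ is the $i$th largest eigenvalue of $C$, $l_i' l_i = 1$, $i = 1, 2, \cdots, p$, and $l_i = (l_{i1}, l_{i2}, \cdots, l_{ip})'$ are the loadings of the PCs. The score for an individual subject is
$$F_i = l_{i1} X_1 + l_{i2} X_2 + \cdots + l_{ip} X_p,$$
where $\mathrm{cov}(F_i, F_j) = 0$, $i \neq j$, and $\mathrm{var}(F_1) \geq \mathrm{var}(F_2) \geq \cdots \geq \mathrm{var}(F_p)$.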
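The decomposition above can be illustrated with a minimal NumPy sketch. It assumes additive 0/1/2 coding, simulated genotypes, and standardization of each SNP before scoring (natural when working from the correlation matrix, but an assumption here); the function name `pca_correlation` is hypothetical.

```python
import numpy as np

def pca_correlation(X):
    """Eigen-decompose the correlation matrix of X (subjects x SNPs).

    Returns eigenvalues (descending), loadings (one column per PC)
    and PC scores computed from the standardized genotypes.
    """
    # Standardize each SNP (assumption: scores are built on standardized codes).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    C = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending variance
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Z @ loadings
    return eigvals, loadings, scores

# Hypothetical data: 200 subjects, 5 SNPs, additive coding 0/1/2.
rng = np.random.default_rng(0)
X = rng.binomial(2, 0.3, size=(200, 5)).astype(float)
eigvals, loadings, scores = pca_correlation(X)
```

Each column of `loadings` has unit length, the columns of `scores` are mutually uncorrelated, and their variances equal the eigenvalues in decreasing order, which matches the constraints stated above.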
Methods of extracting PCs
Potentially, PCA can be conducted via four distinct extracting strategies (ES) using case-control data: 0. calculate PC scores of individuals in cases and controls separately (SES); 1. use cases only (CAES) to obtain loadings for calculation of PC scores for subjects in both cases and controls; 2. use controls only (COES) to obtain the loadings for both groups; and 3. use combined cases and controls (CES) to obtain the loadings for both groups. In a case-control association study, loadings calculated separately from cases and from controls are likely to have different connotations, so the resulting scores are not directly comparable between groups; hence we only consider scenarios 1-3 hereafter. More formally, let $(X_1, X_2, \cdots, X_p)$ and $(Y_1, Y_2, \cdots, Y_p)$ be $p$-dimensional vectors of SNPs at a given candidate region for cases and controls respectively; then we have,
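Strategy 1 (CAES):
$$C_{XX}\, l_i^{X} = \lambda_i^{X}\, l_i^{X},$$
where $C_{XX}$ is the correlation matrix of $(X_1, X_2, \cdots, X_p)$, $l_i^{X} = (l_{i1}^{X}, l_{i2}^{X}, \cdots, l_{ip}^{X})'$, and $(l_i^{X})' l_i^{X} = 1$, $i = 1, 2, \cdots, p$. The $i$th PC for cases is calculated by
$$F_i^{X} = l_{i1}^{X} X_1 + l_{i2}^{X} X_2 + \cdots + l_{ip}^{X} X_p,$$
and for controls
$$F_i^{Y} = l_{i1}^{X} Y_1 + l_{i2}^{X} Y_2 + \cdots + l_{ip}^{X} Y_p.$$
Strategy 2 (COES):
$$C_{YY}\, l_i^{Y} = \lambda_i^{Y}\, l_i^{Y}, \quad (l_i^{Y})' l_i^{Y} = 1,$$
where $C_{YY}$ is the correlation matrix of $(Y_1, Y_2, \cdots, Y_p)$. The $i$th PC for controls is calculated by
$$F_i^{Y} = l_{i1}^{Y} Y_1 + l_{i2}^{Y} Y_2 + \cdots + l_{ip}^{Y} Y_p,$$
and for cases, the $i$th PC, $i = 1, 2, \cdots, p$, is calculated by
$$F_i^{X} = l_{i1}^{Y} X_1 + l_{i2}^{Y} X_2 + \cdots + l_{ip}^{Y} X_p.$$
Strategy 3 (CES):
$$C\, l_i = \lambda_i\, l_i,$$
where $C$ is the correlation matrix obtained from the pooled data of cases and controls, $l_i = (l_{i1}, l_{i2}, \cdots, l_{ip})'$, and $l_i' l_i = 1$. The $i$th PC of cases is calculated by
$$F_i^{X} = l_{i1} X_1 + l_{i2} X_2 + \cdots + l_{ip} X_p.$$
The $i$th PC of controls is calculated by
$$F_i^{Y} = l_{i1} Y_1 + l_{i2} Y_2 + \cdots + l_{ip} Y_p.$$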
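The three strategies differ only in which sample supplies the loadings, which the following sketch tries to make explicit. The function names are hypothetical, and standardizing both groups with the mean and standard deviation of the loading-generating sample is an assumption of this sketch, not something stated in the text.

```python
import numpy as np

def loadings_from(ref):
    """Eigenvalues (descending) and loadings from the correlation matrix of ref."""
    vals, vecs = np.linalg.eigh(np.corrcoef(ref, rowvar=False))
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def pc_scores(cases, controls, strategy):
    """PC scores of cases and controls under CAES, COES or CES."""
    reference = {
        "CAES": cases,                          # loadings from cases only
        "COES": controls,                       # loadings from controls only
        "CES": np.vstack([cases, controls]),    # loadings from pooled data
    }[strategy]
    vals, L = loadings_from(reference)
    # Assumption: both groups are standardized with the reference sample's
    # mean and SD, so that scores of the two groups stay on a common scale.
    mu, sd = reference.mean(axis=0), reference.std(axis=0, ddof=1)
    return ((cases - mu) / sd) @ L, ((controls - mu) / sd) @ L, vals

# Hypothetical genotype data (additive coding), 4 SNPs.
rng = np.random.default_rng(3)
demo_cases = rng.binomial(2, 0.4, size=(150, 4)).astype(float)
demo_controls = rng.binomial(2, 0.3, size=(180, 4)).astype(float)
fx, fy, vals = pc_scores(demo_cases, demo_controls, "CES")
```

Note that standardizing each group with its own mean would force every PC mean to zero in both groups, which is why a single reference sample is used for both in this sketch.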
PCA-BCIT
Given a sample of $N$ cases and $M$ controls with $p$-SNP genotypes $(X_1, X_2, \cdots, X_N)^T$ and $(Y_1, Y_2, \cdots, Y_M)^T$, where $X_i = (x_{1i}, x_{2i}, \cdots, x_{pi})$ for the $i$th case and $Y_i = (y_{1i}, y_{2i}, \cdots, y_{pi})$ for the $i$th control, a PCA-BCIT is carried out in three steps:
Step 1: Sampling
Replicate samples of cases and controls, $(X_1^{(b)}, X_2^{(b)}, \cdots, X_N^{(b)})^T$ and $(Y_1^{(b)}, Y_2^{(b)}, \cdots, Y_M^{(b)})^T$, $b = 1, 2, \cdots, B$ ($B = 1000$), are obtained by sampling with replacement separately from the observed cases and controls.
Step 2: PCA
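For each replicate sample obtained at Step 1, PCA is conducted and PCs are retained until they jointly explain at least 80% of the variance, for all three strategies[16]. The scores of the $k$th retained PC for cases and controls in replicate $b$ are expressed as $F_k^{X^{(b)}}$ and $F_k^{Y^{(b)}}$.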
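As an illustration of the 80% rule, the number of retained PCs can be read off the cumulative proportion of explained variance. This is a minimal sketch; the function name `n_retained` and the example eigenvalues are hypothetical.

```python
import numpy as np

def n_retained(eigvals, threshold=0.80):
    """Smallest number of leading PCs whose cumulative share of the
    total variance reaches the threshold."""
    vals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum = np.cumsum(vals) / vals.sum()
    # First index where the cumulative proportion is >= threshold, as a count.
    k = int(np.searchsorted(cum, threshold)) + 1
    return min(k, len(vals))   # guard against floating-point rounding at 1.0

# Hypothetical eigenvalues of a 5-SNP correlation matrix (they sum to 5).
example_k = n_retained([2.5, 1.25, 0.75, 0.35, 0.15])
```

With these eigenvalues the cumulative proportions are 0.50, 0.75, 0.90, 0.97 and 1.00, so three PCs are retained.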
Step 3: PCA-BCIT
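3a) For each replicate, the mean of the $k$th PC in cases is calculated by
$$\bar{F}_k^{X^{(b)}} = \frac{1}{N} \sum_{i=1}^{N} F_{ik}^{X^{(b)}},$$
and that of the $k$th PC in controls is calculated by
$$\bar{F}_k^{Y^{(b)}} = \frac{1}{M} \sum_{i=1}^{M} F_{ik}^{Y^{(b)}},$$
where $F_{ik}^{X^{(b)}}$ and $F_{ik}^{Y^{(b)}}$ denote the $k$th PC score of the $i$th case and the $i$th control in replicate $b$.
3b) Given confidence level $(1 - \alpha)$, the confidence interval of the mean of the $k$th PC in cases, $\bar{F}_k^{X}$, is estimated by the percentile method, with form
$$\left(\bar{F}_{k,(\alpha/2)}^{X},\ \bar{F}_{k,(1-\alpha/2)}^{X}\right),$$
where $\bar{F}_{k,(\alpha/2)}^{X}$ is the $100 \cdot \alpha/2$th percentile of $\bar{F}_k^{X^{(b)}}$, $b = 1, 2, \cdots, B$, and $\bar{F}_{k,(1-\alpha/2)}^{X}$ is the $100 \cdot (1 - \alpha/2)$th percentile.
The confidence interval of the mean of the $k$th PC in controls, $\bar{F}_k^{Y}$, is estimated by
$$\left(\bar{F}_{k,(\alpha/2)}^{Y},\ \bar{F}_{k,(1-\alpha/2)}^{Y}\right),$$
where $\bar{F}_{k,(\alpha/2)}^{Y}$ is the $100 \cdot \alpha/2$th percentile of $\bar{F}_k^{Y^{(b)}}$, $b = 1, 2, \cdots, B$, and $\bar{F}_{k,(1-\alpha/2)}^{Y}$ is the $100 \cdot (1 - \alpha/2)$th percentile.
3c) Confidence intervals of cases and controls are compared. The null hypothesis is rejected if the two intervals do not overlap, in which case $\bar{F}_k^{X}$ and $\bar{F}_k^{Y}$ are statistically different[19], indicating the candidate region is significantly associated with disease at level $\alpha$. Otherwise, the candidate region is not significantly associated with disease at level $\alpha$.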
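The three steps can be put together in a short sketch for a single PC. Everything here is illustrative: the function names, the simulated genotypes, the choice to standardize both groups with the loading-generating sample, and the sign convention for the loadings (eigenvectors are only defined up to sign, so some convention is needed to keep replicates comparable) are assumptions, not part of the described method.

```python
import numpy as np

def pc_means(cases, controls, strategy="CES", k=0):
    """Mean of the k-th PC score (0-based) in cases and in controls."""
    ref = {"CAES": cases, "COES": controls,
           "CES": np.vstack([cases, controls])}[strategy]
    mu, sd = ref.mean(axis=0), ref.std(axis=0, ddof=1)
    vals, vecs = np.linalg.eigh(np.corrcoef(ref, rowvar=False))
    l_k = vecs[:, np.argsort(vals)[::-1][k]]
    if l_k.sum() < 0:          # assumed sign convention across replicates
        l_k = -l_k
    return (((cases - mu) / sd) @ l_k).mean(), (((controls - mu) / sd) @ l_k).mean()

def pca_bcit(cases, controls, strategy="CES", k=0, B=1000, alpha=0.05, seed=0):
    """Bootstrap percentile intervals of the PC means and the overlap decision."""
    rng = np.random.default_rng(seed)
    N, M = len(cases), len(controls)
    means = np.empty((B, 2))
    for b in range(B):
        # Step 1: resample cases and controls separately, with replacement.
        xb = cases[rng.integers(0, N, size=N)]
        yb = controls[rng.integers(0, M, size=M)]
        # Steps 2 and 3a: PCA on the replicate, then group means of the PC.
        means[b] = pc_means(xb, yb, strategy, k)
    # Step 3b: percentile confidence intervals.
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    ci_x = np.percentile(means[:, 0], q)
    ci_y = np.percentile(means[:, 1], q)
    # Step 3c: reject when the two intervals do not overlap.
    reject = bool(ci_x[0] > ci_y[1] or ci_y[0] > ci_x[1])
    return ci_x, ci_y, reject

def simulate(n, freq, rng, p=5, ld=0.8):
    """Hypothetical correlated genotypes: each SNP copies a shared latent
    genotype with probability ld, otherwise takes an independent draw."""
    shared = rng.binomial(2, freq, size=(n, 1))
    own = rng.binomial(2, freq, size=(n, p))
    return np.where(rng.random((n, p)) < ld, shared, own).astype(float)

rng = np.random.default_rng(1)
cases, controls = simulate(300, 0.6, rng), simulate(300, 0.2, rng)
ci_x, ci_y, reject = pca_bcit(cases, controls, "CES", B=200)
```

With allele frequencies this far apart the two intervals are well separated. Under CAES (or COES) the mean of the reference group is zero in every replicate by construction, so its interval collapses to a point in this sketch.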
Simulation studies
We examine the performance of PCA-BCIT through simulations with data from the North American Rheumatoid Arthritis (RA) Consortium (NARAC) (868 cases and 1194 controls)[20], taking advantage of the fact that association between protein tyrosine phosphatase non-receptor type 22 (PTPN22) and the development of RA has been established[21–24]. Nine SNPs have been selected from the PTPN22 region (114157960-114215857), and most of the SNPs are within the same LD block (Figure 1). Females are more predisposed (73.85% of cases) and are used in our simulation to ensure homogeneity. The corresponding steps for the simulation are as follows.
Step 1: Sampling
The observed genotype frequencies in the study sample are taken to be their true frequencies in populations of infinite size. Replicate samples of cases and controls of a given size ($N$, $N = 100, 200, \cdots, 1000$) are generated whose estimated genotype frequencies are expected to be close to the true population frequencies while both the allele frequencies and LD structure are maintained. Under the null hypothesis, replicate cases and controls are both sampled with replacement from the controls. Under the alternative hypothesis, replicate cases and controls are sampled with replacement from the cases and controls respectively.
Step 2: PCA-BCIT
For each replicate sample, PCA-BCITs of the association between PC scores and disease (RA) are conducted using the three strategies of extracting PCs outlined above.
Step 3: Evaluating performance of PCA-BCITs
Repeat steps 1 and 2 $K$ ($K = 1000$) times under both null and alternative hypotheses, and obtain the frequencies ($P_\alpha$) of rejecting the null hypothesis at level $\alpha$ ($\alpha = 0.05$).
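The resampling design of the simulation can be sketched as follows. The helper `rejection_rate` and the simulated genotypes are hypothetical, and `toy_test` is only a placeholder (a z-test on per-subject genotype sums) standing in for PCA-BCIT so that the sketch stays short; any function that returns a reject/accept decision for a pair of samples could be plugged in.

```python
import numpy as np

def rejection_rate(cases, controls, test, n, K=1000, null=True, seed=0):
    """Proportion of K replicates in which `test` rejects.

    Under the null, replicate cases and controls are both drawn with
    replacement from the controls; under the alternative, from the
    cases and the controls respectively.
    """
    rng = np.random.default_rng(seed)
    source = controls if null else cases
    rejections = 0
    for _ in range(K):
        rep_cases = source[rng.integers(0, len(source), size=n)]
        rep_controls = controls[rng.integers(0, len(controls), size=n)]
        rejections += bool(test(rep_cases, rep_controls))
    return rejections / K

def toy_test(x, y, z_crit=1.96):
    """Placeholder test (NOT PCA-BCIT): z-test on per-subject genotype sums."""
    sx, sy = x.sum(axis=1), y.sum(axis=1)
    se = np.sqrt(sx.var(ddof=1) / len(sx) + sy.var(ddof=1) / len(sy))
    return abs(sx.mean() - sy.mean()) > z_crit * se

# Hypothetical source samples with clearly different allele frequencies.
rng = np.random.default_rng(2)
src_cases = rng.binomial(2, 0.6, size=(300, 5)).astype(float)
src_controls = rng.binomial(2, 0.2, size=(300, 5)).astype(float)
type1 = rejection_rate(src_cases, src_controls, toy_test, n=200, K=200, null=True)
power = rejection_rate(src_cases, src_controls, toy_test, n=200, K=200, null=False)
```

The value obtained with `null=True` estimates the type I error rate, and the value with `null=False` estimates power, for the chosen replicate size `n`.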
Applications
PCA-BCITs are applied to both the NARAC data on PTPN22 in 1493 females (641 cases and 852 controls) described above and a data set containing nine SNPs near the μ-opioid receptor gene (OPRM1) in Han Chinese from Shanghai (91 cases and 245 controls) with the endophenotype of heroin-induced positive responses on first use[25]. There are two LD blocks in the region of gene OPRM1 (Figure 2).