In this section, we will first give the test statistic of the MPDT. Then, we will discuss a searching algorithm and how to find a set of susceptibility genes by the searching method and the MPDT. Finally, we will describe a two-stage approach used to incorporate the information of parental phenotypes if it is available.
The MPDT
As with the PDT proposed by Martin et al. [14], the MPDT is designed for pedigrees of any size. In the following discussion, for simplicity of presentation, we give the statistic only for nuclear families with affected children. It is straightforward to extend the statistic to general pedigrees. Suppose we have genotyped m markers across the genome or in a candidate region for each sampled individual. Consider a sample of n nuclear families with $n_i$ affected children in the ith family. For a biallelic marker with two alleles A and a, we code the three genotypes aa, Aa, and AA as 0, 1, and 2, respectively.
Let $F_{ij}$, $M_{ij}$, and $u_{ijk}$ denote the genotype codes of the father, the mother, and the kth child in the ith family at the jth marker, respectively ($i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, m$; $k = 1, 2, \ldots, n_i$). Considering each affected child as a case, we define a pseudo-control matching each case. The pseudo-control matching the kth child in the ith family has a genotype code $u^{*}_{ijk}$ at the jth marker, where $u^{*}_{ijk}$ is the genotype code of the two alleles not transmitted to the kth child by the parents. For example, if the genotypes of the father, mother, and a child are Aa, Aa, and AA, respectively, then the pseudo-control matching this child has a genotype of aa and a genotype code of 0.
It is easy to see that the genotype codes of the parents, children, and pseudo-controls satisfy the relationship
$$u_{ijk} + u^{*}_{ijk} = F_{ij} + M_{ij}.$$
Let $U_{ijk} = u_{ijk} - u^{*}_{ijk} = 2 u_{ijk} - F_{ij} - M_{ij}$. Define a multi-marker score for the kth child in the ith family as $U_{ik} = (U_{i1k}, \ldots, U_{imk})^T$. The multi-marker score of the ith family is defined as
$$U_i = \sum_{k=1}^{n_i} U_{ik}.$$
Let $U = \sum_{i=1}^{n} U_i$ and $V = \sum_{i=1}^{n} U_i U_i^T$. The statistic of the MPDT is defined as
$$T_C = U^T V^{\oplus} U,$$
where $V^{\oplus}$ is the generalized inverse of V. Under the null hypothesis of no association between the markers and the trait, the MPDT has approximately a $\chi^2$ distribution with k degrees of freedom, where k here denotes the rank of V (not the child index). If only one marker is considered, $T_C$ coincides with the squared test statistic of the PDT.
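The construction of the pseudo-control codes and of the statistic can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. It assumes that U is the sum of the family scores and V the sum of their outer products, as in the PDT. It also uses the Moore-Penrose pseudo-inverse (`numpy.linalg.pinv`) as one particular choice of generalized inverse. The function names and the input layout are hypothetical.

```python
import numpy as np

def pseudo_control_code(father, mother, child):
    """Genotype code of the two parental alleles NOT transmitted to the child."""
    return father + mother - child

def mpdt_statistic(families):
    """Sketch of T_C = U^T V^+ U for nuclear families.

    families: list of (F, M, children), where F and M are length-m sequences of
    parental genotype codes (0/1/2) and children is an (n_i, m) array-like with
    the codes of the affected children. This layout is an assumption.
    Returns (T_C, rank of V); the rank is the approximate chi-square df.
    """
    family_scores = []
    for F, M, children in families:
        F = np.asarray(F, dtype=float)
        M = np.asarray(M, dtype=float)
        children = np.asarray(children, dtype=float)
        # U_ijk = u_ijk - pseudo-control code = 2 * u_ijk - F_ij - M_ij
        child_scores = 2.0 * children - F - M
        # Family score U_i: sum of the child scores over affected children
        family_scores.append(child_scores.sum(axis=0))
    Ui = np.vstack(family_scores)            # shape (n, m)
    U = Ui.sum(axis=0)                       # assumed: U = sum_i U_i
    V = Ui.T @ Ui                            # assumed: V = sum_i U_i U_i^T
    # Moore-Penrose pseudo-inverse as one choice of generalized inverse
    T = float(U @ np.linalg.pinv(V) @ U)
    return T, int(np.linalg.matrix_rank(V))
```

With a single marker, the returned value equals (sum of family scores)^2 divided by the sum of squared family scores, which is how the sketch can be sanity-checked by hand.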
Searching algorithm and overall p-value
In this section, we consider a genome-wide association study. Suppose we have genotyped M markers across the genome. Our aim is to find a set of markers that jointly have significant association with the trait. We propose two searching algorithms: Conditional Search (CS) and Sequential Forward Search (SFS). In both algorithms, each of the M markers is tested by using the PDT first. Then, the markers are ordered according to their p-values of the PDT. Suppose the p-values of markers 1, 2,..., M are in ascending order. Based on the ordered markers, the two algorithms are given below:
CS
The CS algorithm searches marker-sets $A_1, \ldots, A_L$, where marker-set $A_i$ consists of markers $1, \ldots, i$ ($i = 1, \ldots, L$) and L is a pre-specified value. We calculate the p-value of the MPDT for each set of markers and call the p-value from this step a raw p-value.
SFS
The SFS algorithm begins with marker-set $A_1$, which consists of marker 1. Then, by adding one marker to $A_1$, we get all of the two-locus combinations that include the first marker. We test all of these two-locus combinations by the MPDT and choose the combination with the smallest p-value (also called a raw p-value) as marker-set $A_2$. Continuing in this way, at each step adding the one remaining marker that yields the smallest raw p-value, we get a series of marker-sets $A_1, \ldots, A_L$.
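The two searches can be sketched as follows. This is a minimal sketch under stated assumptions: `raw_p` stands for any function that returns the raw MPDT p-value of a marker-set, and here it is replaced by a made-up toy function (`toy_raw_p`, with invented weights) only so that the example runs. All names are hypothetical.

```python
import math

def conditional_search(ordered, L, raw_p):
    """CS: nested sets made of the top 1, 2, ..., L markers in PDT order."""
    L = min(L, len(ordered))
    sets = [tuple(ordered[:i]) for i in range(1, L + 1)]
    return [(s, raw_p(s)) for s in sets]

def sequential_forward_search(ordered, L, raw_p):
    """SFS: start from the top marker; at each step add the single remaining
    marker whose inclusion gives the smallest raw p-value."""
    L = min(L, len(ordered))
    current = (ordered[0],)
    path = [(current, raw_p(current))]
    while len(current) < L:
        scored = [(raw_p(current + (m,)), current + (m,))
                  for m in ordered if m not in current]
        best_p, current = min(scored, key=lambda t: t[0])
        path.append((current, best_p))
    return path

# Toy stand-in for the MPDT p-value (purely illustrative, not a real test):
# markers 0 and 2 are given a made-up joint effect.
WEIGHTS = {0: 0.01, 1: 0.02, 2: 0.5, 3: 0.03}

def toy_raw_p(marker_set):
    p = math.prod(WEIGHTS[m] for m in marker_set)
    if 0 in marker_set and 2 in marker_set:
        p *= 0.001
    return p

# Markers ordered by their single-marker p-value, as both algorithms require
ordered_markers = sorted(WEIGHTS, key=WEIGHTS.get)
cs_sets = [s for s, _ in conditional_search(ordered_markers, 3, toy_raw_p)]
sfs_sets = [s for s, _ in sequential_forward_search(ordered_markers, 3, toy_raw_p)]
```

In this toy setting CS keeps the three top-ranked markers, whereas SFS picks up marker 2 at the second step because of its joint effect with marker 0, which illustrates how the two series of candidate marker-sets can differ.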
Both searching algorithms give a series of candidate marker-sets and the corresponding raw p-values of the MPDT. The problems that remain are choosing the "best" or final marker-set and evaluating the overall p-value of the final marker-set. An intuitive idea is to choose the marker-set with the smallest raw p-value as the final marker-set and use a permutation procedure to evaluate the overall p-value. However, our simulation studies (results not shown) show that in most cases, the more markers a marker-set contains, the smaller the p-value of the marker-set will be. Thus, instead of using the raw p-values, we propose to use a permutation procedure recently proposed by Ge et al. [16] and further discussed by Becker and Knapp [17] to adjust the raw p-values, and to use the adjusted p-values to choose the final marker-set. This procedure also gives the overall p-value of the final marker-set. Let $A_1, \ldots, A_L$ denote the candidate marker-sets and $P_{01}, \ldots, P_{0L}$ denote the associated raw p-values of the MPDT. The permutation procedure includes the following steps:
1. Generate S (say, 1,000) permuted data sets. In each permutation, the multi-marker genotype (the genotype across all of the M markers) of each child is exchanged with that of the corresponding pseudo-control with probability 50%. We exchange the genotypes across all M markers simultaneously in order to preserve the LD structure in each permuted data set.
2. For each permuted data set, search for the L candidate marker-sets by either of the two algorithms. Based on the permuted data set, test for the association between each marker-set and the trait using the MPDT. For the sth permuted data set, denote the L candidate marker-sets by $A_{s1}, \ldots, A_{sL}$ and the associated raw p-values by $P_{s1}, \ldots, P_{sL}$. Then, the adjusted p-value corresponding to the candidate marker-set $A_i$ is estimated by
$$p_{0i} = \frac{1}{S} \sum_{s=1}^{S} I(P_{si} \le P_{0i}),$$
where $I(\cdot)$ is an indicator function. We choose the marker-set with the smallest adjusted p-value, $p_0 = \min(p_{01}, p_{02}, \ldots, p_{0L})$, as the final marker-set.
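3. To evaluate the overall p-value of the final marker-set, we first adjust the raw p-values $P_{s1}, \ldots, P_{sL}$ for the sth permuted data set, $s = 1, \ldots, S$. The adjusted value of $P_{sl}$ is given by
$$p_{sl} = \frac{1}{S} \sum_{t=1}^{S} I(P_{tl} \le P_{sl}).$$
Let $p_s = \min\{p_{s1}, \ldots, p_{sL}\}$. Then, the overall p-value of the final marker-set is given by
$$p_{\text{overall}} = \frac{1}{S} \sum_{s=1}^{S} I(p_s \le p_0). \qquad (1)$$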
Usually, $p_{\text{overall}}$ would be obtained through another layer of permutation in a standard double permutation procedure. According to Ge et al. [16], however, $p_{\text{overall}}$ can be estimated by (1), which needs only one layer of permutation.
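The three steps above can be sketched as follows. This is an illustrative sketch, not the authors' code. It assumes that adjusted p-values are simple proportions of permuted raw p-values that are less than or equal to the value being adjusted, with denominator S. The exact counting convention used by Ge et al. [16] may differ. It also assumes that exchanging a child with its pseudo-control amounts to negating that child's whole score vector. Function names and data layout are hypothetical.

```python
import random

def flip_children(child_scores, rng):
    """Step 1 (sketch): with probability 1/2, exchange a child with its
    pseudo-control at all markers at once. Since U = u - u*, this negates the
    child's entire multi-marker score vector, which keeps the LD structure."""
    return [[-x for x in row] if rng.random() < 0.5 else list(row)
            for row in child_scores]

def adjusted_p(raw, permuted_column):
    """Proportion of permuted raw p-values that are <= raw (assumed convention)."""
    return sum(p <= raw for p in permuted_column) / len(permuted_column)

def overall_p_value(P0, P):
    """Steps 2 and 3 (sketch).

    P0: raw p-values of the L candidate marker-sets on the observed data.
    P:  S x L nested list; P[s][l] is the raw p-value of the l-th candidate
        marker-set found in the s-th permuted data set.
    Returns (index of the final marker-set, adjusted p-values, overall p-value).
    """
    S, L = len(P), len(P0)
    columns = [[P[s][l] for s in range(S)] for l in range(L)]
    # Step 2: adjust the observed raw p-values and pick the smallest one
    p0_adj = [adjusted_p(P0[l], columns[l]) for l in range(L)]
    best = min(range(L), key=lambda l: p0_adj[l])
    p0 = p0_adj[best]
    # Step 3: adjust each permuted data set against the same S permutations,
    # so that no second layer of permutation is needed
    p_s = [min(adjusted_p(P[s][l], columns[l]) for l in range(L))
           for s in range(S)]
    p_overall = sum(ps <= p0 for ps in p_s) / S
    return best, p0_adj, p_overall
```

Reusing the same S permuted data sets both as the reference distribution and as the objects being adjusted is what replaces the inner loop of a double permutation procedure.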
A two-stage approach to incorporate parental phenotypes
If parental phenotypes are available, we propose a two-stage approach to incorporate the parental phenotypes. The basic idea of the two-stage approach is that the test used in the first stage is independent of the association test used in the second stage; the test in the first stage is used to select promising SNPs, and the association test in the second stage can be performed on a smaller set of the selected SNPs.
Stage one
The test that we propose to use in this stage is based on a test statistic for a case-control study. Consider a case-control study with $N_1$ cases and $N_2$ controls, in which each sampled individual has a genotype at a biallelic marker with two alleles A and a. To test the association between the marker and the disease, one can use the test statistic
$$T_p = \frac{(p - q)^2}{\hat{\sigma}^2},$$
where p and q are the sample frequencies of allele A in cases and controls, respectively; $\hat{\sigma}^2 = \left(\frac{1}{2N_1} + \frac{1}{2N_2}\right) p_0 (1 - p_0)$ is the estimated variance of p - q; and $p_0$ is the sample frequency of allele A in the whole sample. Under the null hypothesis of no association, this test statistic asymptotically follows a chi-squared distribution with one degree of freedom. To use this test statistic in the first stage, we consider the affected parents of the sampled nuclear families as cases and the unaffected parents of the sampled nuclear families as controls. We propose to use the statistic $T_p$ on each of the M markers and get a corresponding p-value for each marker. We then select the $M_1$ markers with the smallest p-values, where $M_1$ is a pre-specified number, usually smaller than M. We will discuss how to choose $M_1$ later. In this stage, we use only the parental information of the nuclear families.
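A minimal sketch of the stage-one screening is given below. It assumes the usual allele-frequency form of the statistic, T_p = (p - q)^2 / var, with var = p0(1 - p0)(1/(2 N1) + 1/(2 N2)). The one-degree-of-freedom chi-squared tail probability is computed as erfc(sqrt(T_p / 2)). The function names are hypothetical, and how M1 should be chosen is not addressed here.

```python
import math

def allele_frequency_test(case_codes, control_codes):
    """Allele-frequency test for one marker.

    case_codes / control_codes: genotype codes (0, 1, 2 = copies of allele A)
    of affected and unaffected parents. Returns (T_p, p_value).
    """
    n1, n2 = len(case_codes), len(control_codes)
    p = sum(case_codes) / (2 * n1)                 # frequency of A in cases
    q = sum(control_codes) / (2 * n2)              # frequency of A in controls
    p0 = (sum(case_codes) + sum(control_codes)) / (2 * (n1 + n2))
    # Assumed variance estimate of p - q under the null hypothesis
    var = p0 * (1 - p0) * (1 / (2 * n1) + 1 / (2 * n2))
    if var == 0:
        return 0.0, 1.0                            # monomorphic marker
    t = (p - q) ** 2 / var
    # Survival function of a chi-squared variable with 1 degree of freedom
    return t, math.erfc(math.sqrt(t / 2))

def select_markers(p_values, m1):
    """Indices of the m1 markers with the smallest stage-one p-values."""
    order = sorted(range(len(p_values)), key=lambda j: p_values[j])
    return order[:m1]
```

For instance, four affected parents coded 2, 2, 1, 1 and four unaffected parents coded 0, 0, 1, 1 give p = 0.75, q = 0.25, p0 = 0.5, and a statistic of 4.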
Stage two
Apply the searching algorithm (including the permutation procedure) to the $M_1$ selected markers to find a final marker-set and the overall p-value of the MPDT for the association between the final marker-set and the trait. Since all the calculations, including the searching and permutation procedures, are applied to the data set of $M_1$ markers, the calculation is much faster than applying the method directly to the original M markers. If the parental phenotypes and genotypes carry sufficient information to keep most of the disease susceptibility loci among the selected markers and to remove many noise markers in the first stage, then the two-stage approach should be more powerful than the one-stage approach. Otherwise, the two-stage approach may lose power.
Other method compared
We compared the proposed MPDT (plus the searching algorithms) with the single-marker TDT. We also compared the power of the tests with (two-stage approach) and without (one-stage approach) the use of parental phenotypes. For the single-marker TDT, we search for a set of significant markers by controlling the false discovery rate (FDR) [18], the expected ratio of the number of falsely rejected null hypotheses to the total number of rejected null hypotheses. We use the one-stage approach as an example to explain the procedure. Calculate the TDT for each of the M markers and denote the ordered p-values by $p_{(1)} \le \cdots \le p_{(M)}$. Declare a marker significant if the p-value of the TDT at this marker is no greater than a threshold $\delta_M$ chosen such that the FDR is controlled at level $\alpha$. The threshold $\delta_M$ is determined by
$$\delta_M = \max\left\{ p_{(i)} : p_{(i)} \le \frac{i}{M}\,\alpha \right\}.$$
The marker-set that consists of all the markers declared significant is called the final marker-set.
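The thresholding step can be sketched as below, assuming that the FDR procedure of [18] is the standard step-up rule in which the threshold is the largest ordered p-value p(i) satisfying p(i) <= i * alpha / M. This is an assumption about the cited procedure, and the function names are hypothetical.

```python
def fdr_threshold(p_values, alpha):
    """Largest ordered p-value p_(i) with p_(i) <= i * alpha / M, or None if
    no p-value satisfies the condition (assumed step-up rule)."""
    m = len(p_values)
    ordered = sorted(p_values)
    passing = [p for i, p in enumerate(ordered, start=1) if p <= i * alpha / m]
    return max(passing) if passing else None

def significant_markers(p_values, alpha):
    """Indices of markers whose TDT p-value does not exceed the threshold."""
    delta = fdr_threshold(p_values, alpha)
    if delta is None:
        return []
    return [j for j, p in enumerate(p_values) if p <= delta]
```

With p-values 0.001, 0.045, 0.02, 0.5, 0.012 and alpha = 0.05, the ordered values are compared with 0.01, 0.02, 0.03, 0.04, 0.05 in turn; the largest one that passes is 0.02, so the first, third, and fifth markers form the final marker-set in this sketch.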