BSA
For completeness, we briefly describe the BSA method of Hoh et al. [1] for use in genome-wide association studies based on bi-allelic markers, such as SNPs. The proposed statistic is based on allelic association (AA) and Hardy-Weinberg Disequilibrium (HWD). Consider a case-control study design with n cases and n controls genotyped at a marker locus with k alleles. A 2 × k contingency table is constructed and a statistic of AA is then computed as follows:
where p
s
and q
s
are the frequency of the allele s for the cases and controls, respectively. The above statistic has an approximate χ2 distribution with k-1 degrees of freedom.
HWD [5], which also has an approximate χ2 distribution with k(k - 1)/2 degrees of freedom, is then calculated among the cases according to the following statistic:
where q
s
and q
t
are the frequencies of alleles s and t, respectively, while p
ss
and p
st
are the frequencies of genotypes ss and st, respectively. For BSA, k = 2.
Hoh et al. [1] suggested that extremely high HWD values at a locus might indicate genotyping errors at the locus. Hence, they suggested trimming the d largest case-based HWD values to 0, thereby removing these problematic loci. d is empirically determined from the number of loci whose control-based HWD statistic is greater than
, the β quantile of the standard χ2 distribution with one degree of freedom. However it should also be noted that both AA and HWD values are high if the locus is in strong LD with one of the disease variants.
Now suppose that there are m bi-allelic marker loci. Hoh et al. [1] . defined a new statistic s
i
= AA
i
* HWD
i
, where AA
i
and HWD
i
are the AA and HWD statistic for marker locus i, 1 ≤ i ≤ m. The values s1, s2,...,s
m
are then ordered from the largest to the smallest with s(1) ≥ s(2) ≥ ... ≥ s(m). Then they defined
Next, Monte Carlo permutations were used to find the p-value of BSA. The collection of cases and controls are permuted 1000 times and for each permutation, j, an analogous sum statistic
is calculated. Let p
i
be the fraction of times that
is smaller than S
i
. The minimum of p
i
,
is treated as the final statistic.
To evaluate the overall significance, the above process is repeated such that for the jth permuted case-control data set,
is obtained. The overall significance level, P
overall
, can be approximated by the fraction of times that P
j
is smaller than P. The null hypothesis of no association of the region with the disease is rejected if P
overall
is less than a pre-specified type I error, for example, 0.05.
MSA
BSA cannot be directly applied to multiallelic markers because the χ2 statistics might have varying degrees of freedom for markers with different number of alleles. Therefore, BSA is extended to consider multiallelic markers. For each marker locus, i, calculate the χ2 statistics, AA
i
, based on the allele frequencies in the cases and the controls, and HWD
i
from cases. The corresponding p-values are then obtained for AA, p
AA
(i), and HWD, p
HWD
(i), on the basis of χ2 approximation, respectively.
Trimming was conducted similar to the method of Hoh et al. [1]. However, the d smallest HWD p-values were removed from MSA analysis along with the corresponding p
AA
. The remaining p-values for AA and HWD were then multiplied and the products were used as a score denoted as s
i
, that is,
s
i
= p
AA
(i) * p
HWD
(i).
The scores for all the markers are sorted from the smallest to the largest such that s(1) ≤ s(2) ≤ ... ≤ s(m). We define a sum statistic similar to BSA as
Finally, the p-value was again determined by Monte Carlo permutations.
Bi-allelic coalescent simulations
The coalescent theory, first introduced by Kingman [6], allowed the inference of genealogies from the observed genotypic data. The process was to take extant k individuals and trace backward in time to a common ancestor [6]. Unlike Kingman, who considered no crossover events, Hudson [7] and Kaplan and Hudson [8] provided the framework to consider the coalescence with a constant population recombination rate θ. Once the genealogy is determined, SNPs are added using an infinite-many-sites model with population mutation rate μ. This model assumes that mutation rates occur uniformly along the region of interest and that each mutation generates a novel SNP that does not already exist in the population.
We simulated 2000 haplotypes consisting of a large number of consecutive SNPs across a genomic region using the above coalescent process with recombination. θ and μ were both set to 200, which correspond to a 200-kb genomic region of human DNA on the basis of a large survey of the human genome. This strategy has been used by Nordborg and Tavaré [9].
One hundred cases and 100 controls with genotypes for approximately 130 adjacent polymorphic sites were sampled from this simulated population of haplotypes according to the following disease model. A disorder with two disease genes was considered. We assumed that the first disease gene to be located between the 35th and the 45th marker loci and the second disease gene to be located between the 75th and the 85th marker loci. The frequency of the high-risk allele of the first disease gene and the second disease gene is denoted as ε and φ, respectively. ε and φ were set to be either 0.1 ± 0.05 or 0.2 ± 0.05 throughout the simulation. The first condition constrains the location of the disease locus to be isolated to two specified regions while the second condition constrains the disease allele frequency. If no such marker loci exist, the sample set is discarded [10].
Consider the phenotypic aspect of the disease model with D
i
, i = 1, 2 as the high-risk allele and d
i
, i = 1, 2 as the low risk allele, respectively. The haplotype risks for d1d2,D1d2, D2d1, and D1D2 were assumed to be δ, δλ, δλ, and δλ2, respectively, where δ is the phenocopy rate and λ is the genotype relative risk. Furthermore, a haplotype multiplicative disease model was assumed, i.e., the penetrance of an individual equals the product of the two corresponding haplotype risks. As an example, if the two haplotypes of an individual are D1D2/D1d2, the penetrance for the individual is δλ2 * δλ = δ2λ3 [10].
We drew 100 times from the simulated population of haplotypes with replacement to generate a sample set, each containing 100 cases and 100 controls. The entire process of generating a population of haplotypes and sample sets was repeated 10 additional times giving a total of 1000 simulated data sets. By modifying the above parameters, the data sets created under various conditions allowed the exploration of type I error and power for individual marker analysis, BSA, and MSA.
Multiallelic simulations
To conduct the multiallelic simulations, we utilized the haplotypes generated in the previous section. Here, we considered overlapping two-locus haplotypes as alleles for a single marker locus. Thus, we have a set of multiallelic marker loci with at most four alleles at each marker locus. Finally, 100 cases and 100 controls were also generated according to the same procedure and disease model as previously described.
GAW 13 data set
Based on the Framingham Heart Study, the simulated data sets attempted to model the observed longitudinal family data with 399 genome-wide microsatellite markers and 50 trait genes. In addition to genetic effects, environmental covariates (smoking and alcohol consumption), along with hypertension diagnosis and treatment were used to simulate the observed phenotypes. The pedigree structure was generated to be nearly identical to the Framingham Heart Study with varying degrees of heritability of eight longitudinal quantitative traits [4].
The first measured cholesterol level was chosen as the phenotype of interest. The empirical distribution of cholesterol level in a replicate set was approximated from the cholesterol levels of all the individuals. Next, the upper 15% and lower 15% quantiles were then estimated from this observed distribution. We considered an extreme sampling design in which an individual was considered affected if the first measured cholesterol level exceeded the upper limit and unaffected if it was below the lower limit. Cases and controls were randomly selected based on these measured cholesterol levels with no two members from the same family. We required that there were no missing genotypes for the selected individuals.