### Genotype data and populations

I used genotype data from release 21 (phase II) of the International HapMap Project [19], drawing on all four populations studied by the project: Yoruba in Ibadan, Nigeria (YRI); Japanese in Tokyo, Japan (JPT); Han Chinese in Beijing, China (CHB); and CEPH, Utah residents with ancestry from northern and western Europe (CEU). Following the HapMap project's own analysis, I combined genotypes from the JPT and CHB populations into a joint JPT+CHB population. For each of the three resulting populations, I removed SNPs with a minor allele frequency (MAF) below 0.05 in that population; the remaining SNPs are considered "common." Table 1 summarizes the number of SNPs remaining in each population. Where phased data were needed, I used the phased chromosomes from release 21.
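The pooling and filtering step can be sketched in a few lines. This is an illustrative Python sketch, not the C program described under Implementation. The function names, the `(n_aa, n_ab, n_bb)` genotype-count layout, and the choice to pool only SNPs typed in both samples are assumptions made for the example.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from genotype counts (homozygous A, heterozygous, homozygous B)."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    if total_alleles == 0:
        raise ValueError("no genotypes observed")
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


def pool_counts(counts_1, counts_2):
    """Pool genotype counts from two samples (e.g. JPT and CHB), SNP by SNP.

    Assumption: only SNPs present in both inputs are kept.
    """
    pooled = {}
    for snp in counts_1.keys() & counts_2.keys():
        pooled[snp] = tuple(a + b for a, b in zip(counts_1[snp], counts_2[snp]))
    return pooled


def common_snps(counts, maf_threshold=0.05):
    """Drop SNPs whose MAF is below the threshold; keep the rest."""
    return {snp: c for snp, c in counts.items()
            if minor_allele_frequency(*c) >= maf_threshold}
```

For example, `common_snps(pool_counts(jpt, chb))` would return the common SNPs of the joint population, where `jpt` and `chb` are hypothetical dictionaries mapping SNP identifiers to genotype counts.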

### Calculation of power

To compute the overall power of an association study, I use three steps. First, I find the best tag SNP for each genotyped SNP in the data set. Then, I compute the power for each SNP assuming the specified genotype relative risk (GRR) and sample size. Finally, I average the power over all the SNPs to obtain the overall power.

To find the best tag SNP for each genotyped SNP, I examine the linkage disequilibrium (LD) between that SNP and every tag SNP within 300 kb of it. For each pair of SNPs, I infer the two-locus haplotype frequencies using expectation maximization and compute r^{2} from the inferred haplotype frequencies [12]. The best tag is then taken to be the tag SNP with the highest value of r^{2}.
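A minimal sketch of this step follows, assuming unphased genotypes summarized as a 3 × 3 table of counts. It uses the textbook two-locus EM, in which only double heterozygotes have ambiguous phase; the exact procedure of [12] and of the author's C program may differ. The function names, the table layout, the uniform starting frequencies, and the convergence settings are all assumptions.

```python
def em_r_squared(n, tol=1e-10, max_iter=1000):
    """Estimate r^2 between two biallelic SNPs by EM.

    n[i][j] is the number of individuals carrying i copies of allele A at
    the first SNP and j copies of allele B at the second SNP (i, j in 0..2).
    """
    total_haps = 2 * sum(sum(row) for row in n)
    if total_haps == 0:
        raise ValueError("no genotypes observed")
    # Haplotype counts whose phase is unambiguous.
    known_AB = 2 * n[2][2] + n[2][1] + n[1][2]
    known_Ab = 2 * n[2][0] + n[2][1] + n[1][0]
    known_aB = 2 * n[0][2] + n[1][2] + n[0][1]
    known_ab = 2 * n[0][0] + n[1][0] + n[0][1]
    double_het = n[1][1]  # phase is AB/ab or Ab/aB
    freqs = (0.25, 0.25, 0.25, 0.25)  # AB, Ab, aB, ab (assumed start)
    for _ in range(max_iter):
        f_AB, f_Ab, f_aB, f_ab = freqs
        cis, trans = f_AB * f_ab, f_Ab * f_aB
        # E step: expected share of double heterozygotes that are AB/ab.
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M step: re-estimate haplotype frequencies.
        new = ((known_AB + w * double_het) / total_haps,
               (known_Ab + (1 - w) * double_het) / total_haps,
               (known_aB + (1 - w) * double_het) / total_haps,
               (known_ab + w * double_het) / total_haps)
        change = max(abs(a - b) for a, b in zip(new, freqs))
        freqs = new
        if change < tol:
            break
    f_AB, f_Ab, f_aB, _ = freqs
    p_A, p_B = f_AB + f_Ab, f_AB + f_aB
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom <= 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = f_AB - p_A * p_B
    return d * d / denom


def best_tag(snp_position, candidate_tags, window=300_000):
    """Return (tag_id, r2) for the highest-r^2 tag within the window, or None.

    candidate_tags is an iterable of (tag_id, position, genotype_table).
    """
    best = None
    for tag_id, tag_position, table in candidate_tags:
        if abs(tag_position - snp_position) > window:
            continue
        r2 = em_r_squared(table)
        if best is None or r2 > best[1]:
            best = (tag_id, r2)
    return best
```

Recomputing r^{2} for every pair on demand keeps the sketch short; a real implementation over millions of SNPs would more likely precompute and index the pairs within each window.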

To compute the power for a SNP, I assume that we are testing for genotype frequency differences using a two-degree-of-freedom *χ*^{2} test. The power of this test is computed using a non-central *χ*^{2} distribution with non-centrality parameter *λ*. Equations for *λ* have been previously derived for a general *χ*^{2} test [22] and for application to genetic association [23]. Specifically, for genotypic association *λ* is given by:

$$
\lambda = N_{A} N_{U} \left[ \frac{(p_{00} - p_{10})^{2}}{N_{A} p_{00} + N_{U} p_{10}} + \frac{(p_{01} - p_{11})^{2}}{N_{A} p_{01} + N_{U} p_{11}} + \frac{(p_{02} - p_{12})^{2}}{N_{A} p_{02} + N_{U} p_{12}} \right]
$$

where *N*_{A} and *N*_{U} are the numbers of case (affected) and control (unaffected) individuals, respectively; *p*_{00}, *p*_{01}, and *p*_{02} are the genotype frequencies in the cases; and *p*_{10}, *p*_{11}, and *p*_{12} are the genotype frequencies in the controls. If, instead of a 3 × 2 table, we use a 2 × 2 table for a one-degree-of-freedom test of allelic association, the non-centrality parameter is given by

$$
\lambda = 2 N_{A} N_{U} (p_{A} - p_{U})^{2} \frac{N_{A} + N_{U}}{(N_{A} p_{A} + N_{U} p_{U})(N_{A} + N_{U} - N_{A} p_{A} - N_{U} p_{U})}
$$

where *p*_{A} and *p*_{U} are the frequencies of allele 0 in the cases and controls, respectively.
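Both non-centrality parameters translate directly into code. The sketch below is illustrative only; the function names are hypothetical, and treating a cell with zero expected count as contributing nothing is an assumption, not something stated in the text.

```python
def ncp_genotypic(n_cases, n_controls, case_freqs, control_freqs):
    """Non-centrality parameter for the 2-df genotypic test (3 x 2 table).

    case_freqs and control_freqs are (p_0, p_1, p_2) genotype frequencies.
    """
    total = 0.0
    for p_case, p_ctrl in zip(case_freqs, control_freqs):
        denom = n_cases * p_case + n_controls * p_ctrl
        if denom > 0:  # assumption: skip genotypes absent from both groups
            total += (p_case - p_ctrl) ** 2 / denom
    return n_cases * n_controls * total


def ncp_allelic(n_cases, n_controls, p_case, p_ctrl):
    """Non-centrality parameter for the 1-df allelic test (2 x 2 table).

    p_case and p_ctrl are the frequencies of the same allele in cases
    and controls.
    """
    n_total = n_cases + n_controls
    pooled = n_cases * p_case + n_controls * p_ctrl
    denom = pooled * (n_total - pooled)
    if denom <= 0:
        return 0.0  # allele fixed or absent in the pooled sample
    return 2 * n_cases * n_controls * (p_case - p_ctrl) ** 2 * n_total / denom
```

With 1,000 cases and 1,000 controls, for instance, `ncp_allelic(1000, 1000, 0.30, 0.25)` evaluates to about 12.5.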

I use the Bonferroni correction for multiple testing and require a *p*-value below 0.05/M (where M is the number of tag SNPs genotyped) for statistical significance. When association is tested directly (the SNP is a tag SNP), I use the actual number of cases and controls to compute the power. For indirect association (the SNP is in LD with a tag SNP), I reduce the number of cases and controls by a factor of r^{2} for the power computation [2].
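The power calculation for the two-degree-of-freedom test can be sketched with the standard library alone. The sketch relies on two standard facts: a non-central *χ*^{2} variable is a Poisson mixture of central *χ*^{2} variables, and for two degrees of freedom the critical value at level *α* is −2 ln *α*. The function names and the series truncation rule are assumptions, and the author's C program may compute the distribution differently.

```python
import math


def power_two_df(ncp, alpha):
    """Power of a 2-df chi-square test at level alpha, given non-centrality ncp."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if ncp <= 0.0:
        return alpha  # under the null, the rejection probability is alpha
    half_crit = -math.log(alpha)  # half the critical value c = -2 ln(alpha)
    half_ncp = ncp / 2.0
    # Assumed truncation: go well past the bulk of the Poisson weights.
    n_terms = int(half_ncp + 10.0 * math.sqrt(half_ncp) + 50.0)
    tail_term = alpha  # exp(-c/2) * (c/2)^0 / 0!
    tail = tail_term   # P(central chi-square with 2 df > c)
    power = 0.0
    for j in range(n_terms):
        # Poisson(j; ncp/2) weight, computed in log space to avoid underflow.
        log_weight = -half_ncp + j * math.log(half_ncp) - math.lgamma(j + 1)
        power += math.exp(log_weight) * min(tail, 1.0)
        # Advance the central tail from 2(j+1) to 2(j+2) degrees of freedom.
        tail_term *= half_crit / (j + 1)
        tail += tail_term
    return min(power, 1.0)


def association_power(n_cases, n_controls, case_freqs, control_freqs,
                      n_tags, r2=1.0):
    """Power at the Bonferroni threshold 0.05 / n_tags.

    r2 = 1 corresponds to a directly typed tag SNP; r2 < 1 shrinks both
    sample sizes by r2, as described for indirect association.
    """
    alpha = 0.05 / n_tags
    eff_cases = n_cases * r2
    eff_controls = n_controls * r2
    ncp = eff_cases * eff_controls * sum(
        (p - q) ** 2 / (eff_cases * p + eff_controls * q)
        for p, q in zip(case_freqs, control_freqs)
        if eff_cases * p + eff_controls * q > 0
    )
    return power_two_df(ncp, alpha)
```

Because the genotypic *λ* scales linearly when both sample sizes are multiplied by the same factor, reducing the case and control counts by r^{2} is equivalent to multiplying *λ* by r^{2}.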

I assume that the disease has a low enough prevalence in the population that the risk allele frequency in those without the disease approximates the risk allele frequency in the population. I can set the disease to follow a multiplicative, additive, dominant, or recessive model with a specified genotype relative risk (GRR) for the SNP of interest [1]. Given that genotype 0 is the wildtype, and taking *p*_{10}, *p*_{11}, and *p*_{12} from the observed genotype frequencies in the population, *p*_{00}, *p*_{01}, and *p*_{02} are computed as follows:

Multiplicative

$$
\begin{aligned}
p_{00} &= \frac{\frac{p_{10}^{2}}{p_{11} p_{12}}}{\gamma^{2} \frac{p_{10}}{p_{11}} + \gamma \frac{p_{10}}{p_{12}} + \frac{p_{10}^{2}}{p_{11} p_{12}}} \\
p_{01} &= \frac{\gamma \frac{p_{10}}{p_{12}}}{\gamma^{2} \frac{p_{10}}{p_{11}} + \gamma \frac{p_{10}}{p_{12}} + \frac{p_{10}^{2}}{p_{11} p_{12}}} \\
p_{02} &= \frac{\gamma^{2} \frac{p_{10}}{p_{11}}}{\gamma^{2} \frac{p_{10}}{p_{11}} + \gamma \frac{p_{10}}{p_{12}} + \frac{p_{10}^{2}}{p_{11} p_{12}}}
\end{aligned}
$$
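As a quick check on the multiplicative expressions, multiplying each numerator and denominator by *p*_{11}*p*_{12}/*p*_{10} (assuming all three control frequencies are non-zero) gives an equivalent form that parallels the other three models:

$$
\begin{aligned}
p_{00} &= \frac{p_{10}}{\gamma^{2} p_{12} + \gamma p_{11} + p_{10}} \\
p_{01} &= \frac{\gamma p_{11}}{\gamma^{2} p_{12} + \gamma p_{11} + p_{10}} \\
p_{02} &= \frac{\gamma^{2} p_{12}}{\gamma^{2} p_{12} + \gamma p_{11} + p_{10}}
\end{aligned}
$$

In this form the relative risks 1, *γ*, and *γ*^{2} for genotypes 0, 1, and 2 are visible directly, and the expressions remain defined when *p*_{11} or *p*_{12} is zero.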

Additive

$$
\begin{aligned}
p_{00} &= \frac{p_{10}}{2\gamma p_{12} + \gamma p_{11} + p_{10}} \\
p_{01} &= \frac{\gamma p_{11}}{2\gamma p_{12} + \gamma p_{11} + p_{10}} \\
p_{02} &= \frac{2\gamma p_{12}}{2\gamma p_{12} + \gamma p_{11} + p_{10}}
\end{aligned}
$$

Dominant

$$
\begin{aligned}
p_{00} &= \frac{p_{10}}{\gamma p_{12} + \gamma p_{11} + p_{10}} \\
p_{01} &= \frac{\gamma p_{11}}{\gamma p_{12} + \gamma p_{11} + p_{10}} \\
p_{02} &= \frac{\gamma p_{12}}{\gamma p_{12} + \gamma p_{11} + p_{10}}
\end{aligned}
$$

Recessive

$$
\begin{aligned}
p_{00} &= \frac{p_{10}}{\gamma p_{12} + p_{11} + p_{10}} \\
p_{01} &= \frac{p_{11}}{\gamma p_{12} + p_{11} + p_{10}} \\
p_{02} &= \frac{\gamma p_{12}}{\gamma p_{12} + p_{11} + p_{10}}
\end{aligned}
$$
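All four models share one pattern: weight each control genotype frequency by that genotype's relative risk, then renormalize. The sketch below makes this explicit. The function name and the dictionary of per-model relative risks are hypothetical, and the multiplicative entry uses the algebraically equivalent weights 1, *γ*, *γ*^{2}.

```python
# Relative risks for genotypes 0, 1, 2 under each model, as functions of the GRR.
RELATIVE_RISKS = {
    "multiplicative": lambda g: (1.0, g, g * g),
    "additive": lambda g: (1.0, g, 2.0 * g),
    "dominant": lambda g: (1.0, g, g),
    "recessive": lambda g: (1.0, 1.0, g),
}


def case_genotype_freqs(control_freqs, grr, model):
    """Genotype frequencies in cases, given control frequencies and a GRR.

    control_freqs is (p_10, p_11, p_12); the result is (p_00, p_01, p_02).
    """
    if model not in RELATIVE_RISKS:
        raise ValueError(f"unknown disease model: {model!r}")
    risks = RELATIVE_RISKS[model](grr)
    weighted = [risk * freq for risk, freq in zip(risks, control_freqs)]
    total = sum(weighted)
    if total <= 0:
        raise ValueError("control genotype frequencies must not all be zero")
    return tuple(w / total for w in weighted)
```

For control frequencies (0.64, 0.32, 0.04) and a GRR of 1.5 under the multiplicative model, the weighted values are 0.64, 0.48, and 0.09, which sum to 1.21, so the case frequencies are approximately (0.529, 0.397, 0.074).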

After the power is computed for each SNP, I take the overall power to be the average power over all the SNPs. In taking the average power over all SNPs, I give less weight to the tag SNPs since they are over-represented in the set of SNPs being analyzed. Assume that of the *S* SNPs under consideration (for which we have linkage disequilibrium [LD] data from, for instance, the HapMap project), *M* are tags that will be genotyped on the chip and *S* − *M* are not tags. Further assume that there are *T* common SNPs in total in this population, which includes both those *S* SNPs for which we have LD data and SNPs for which we do not know their LD with surrounding SNPs. Let 1 − *β*_{i} be the power for SNP *i*, where *i* ranges from 1 to *S* and SNP *i* is a tag SNP when *i* ≤ *M* and a non-tag otherwise. Then, the overall power is given by:

$$
\text{Power} = \sum_{i=1}^{M} \left[ \frac{1}{T} (1 - \beta_{i}) \right] + \sum_{i=M+1}^{S} \left[ \frac{T - M}{T (S - M)} (1 - \beta_{i}) \right]
$$

In this manner, the tag SNPs are only considered representative of themselves, while the non-tag SNPs for which we have LD data are considered representative of all common non-tag SNPs. For these calculations, I use *T* = 2 × 10^{7}.
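The weighted average can be sketched as follows. The function name and the choice to pass tag and non-tag powers as two separate lists are assumptions made for the example; the weights themselves follow the formula above and sum to one.

```python
def overall_power(tag_powers, non_tag_powers, total_common_snps=2e7):
    """Weighted average power over tag and non-tag SNPs.

    tag_powers holds 1 - beta_i for the M tag SNPs; non_tag_powers holds
    1 - beta_i for the S - M non-tag SNPs with LD data. total_common_snps
    is T, the assumed number of common SNPs in the population.
    """
    m = len(tag_powers)
    n_non_tag = len(non_tag_powers)
    t = total_common_snps
    # Each tag SNP represents only itself: weight 1 / T.
    tag_part = sum(tag_powers) / t
    if n_non_tag == 0:
        return tag_part
    # The non-tag SNPs stand in for all T - M untyped common SNPs.
    non_tag_part = (t - m) / (t * n_non_tag) * sum(non_tag_powers)
    return tag_part + non_tag_part
```

As a small check with *M* = 2, *S* = 6, and *T* = 10, tag powers of 1 and non-tag powers of 0.5 give 2/10 + (8/40) × 2 = 0.6.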

### Implementation

A computer program to implement these calculations was written in C. The source code is available upon request from the author.