Frequencies of single nucleotide polymorphisms in genes regulating inflammatory responses in a community-based population

Background: Allele frequencies reported in public databases or articles are mostly based on small sample sizes. Differences in genotype frequencies by age, race and sex have implications for studies designed to examine genetic susceptibility to disease. In a community-based cohort of 9,960 individuals, we compared the allele frequencies of 49 single nucleotide polymorphisms (SNPs) of genes involved in inflammatory pathways to the frequencies reported in public databases, and examined the genotype frequencies by age and sex. The genes in which SNPs were analyzed include CCR2, CCR5, COX1, COX2, CRP, CSF1, CSF2, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, IL13, IL18, LTA, MPO, NOS2A, NOS3, PPARD, PPARG, PPARGC1 and TNF.
Results: Mean (SD) age was 53.2 (15.5); 98% were Caucasians and 62% were women. Only 1 out of 33 SNPs differed from the SNP500Cancer database in allele frequency by >10% in Caucasians (n = 9,831), whereas 12 SNPs differed by >10% (up to 50%) in African Americans (n = 105). Two out of 15 SNPs differed from the dbSNP database in allele frequencies by >10% in Caucasians, and 5 out of 15 SNPs differed by >10% in African Americans. Age was similar across most genotype groups. Genotype frequencies did not differ by sex except for TNF (rs1799724), IL2 (rs2069762), IL10 (rs1800890), PPARG (rs1801282), and CRP (rs1800947), with differences of less than 4%.
Conclusion: When estimating the size of samples needed for a study, particularly if a reference sample is used, one should take into consideration the size and ethnicity of the reference sample. Larger sample sizes are needed for public databases that report allele frequencies in non-Caucasian populations.

When designing a study to investigate genetic susceptibility to diseases, information on the allele frequencies of single nucleotide polymorphisms (SNPs) in the source population is crucial for ensuring sufficient statistical power. To date, estimates of allele or genotype frequencies of candidate genes have been based primarily on limited numbers of individuals. For example, the SNP500Cancer database, a useful resource often used for referencing sequences and allele frequencies of validated SNPs, was based on 102 anonymous individuals with self-described heritage (24 African Americans, 31 Caucasians, 23 Hispanics, and 24 of Pacific Rim heritage) [19]. In the dbSNP database, summary allele frequencies were calculated from data on various ethnic groups, and often several hundred samples were included [20]. The International HapMap Project analyzed 270 individuals, including 30 sets of samples from two parents and an adult child among the Yoruba people of Ibadan, Nigeria, 45 unrelated individuals from Tokyo, 45 unrelated individuals from Beijing, and 30 U.S. trios with northern and western European ancestry [21]. Published studies of genetic polymorphisms have included no more than a few hundred individuals. These limited sample sizes create uncertainty in estimating allele frequencies in the general population.
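To make this sampling uncertainty concrete, the following minimal Python sketch compares approximate 95% confidence intervals for a minor allele frequency estimated from 31 individuals (the number of Caucasians in the SNP500Cancer panel) and from 9,831 individuals (the number of Caucasians in the present cohort). The assumed frequency of 5%, the function name `wilson_interval`, and the choice of the Wilson score interval are illustrative assumptions, not the method used in this study; the sketch also assumes that each individual contributes two independent chromosomes.

```python
import math


def wilson_interval(minor_allele_count, n_chromosomes, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z = 1.96)."""
    if n_chromosomes <= 0:
        raise ValueError("n_chromosomes must be positive")
    p_hat = minor_allele_count / n_chromosomes
    z2 = z * z
    denom = 1 + z2 / n_chromosomes
    centre = (p_hat + z2 / (2 * n_chromosomes)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / n_chromosomes + z2 / (4 * n_chromosomes ** 2)
        )
        / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical scenario: the minor allele frequency is about 5% in both samples.
ASSUMED_MAF = 0.05

for label, n_individuals in [
    ("reference panel (31 individuals)", 31),
    ("community cohort (9,831 individuals)", 9831),
]:
    n_chromosomes = 2 * n_individuals  # two alleles per diploid individual
    count = round(ASSUMED_MAF * n_chromosomes)
    low, high = wilson_interval(count, n_chromosomes)
    print(f"{label}: observed {count}/{n_chromosomes}, "
          f"95% CI {low:.3f} to {high:.3f}")
```

Under these assumptions, the interval from the small panel spans more than ten percentage points, whereas the interval from the large cohort spans well under one, which is why a reported 5% frequency from a small reference panel is a weak basis for power calculations.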
Age and sex are often used as matching factors in epidemiological association studies because in most cases, they are associated with disease risk or survivorship. When genotype frequency is associated with age or sex, by matching on these factors in case-control studies, one would make the genotype frequency artificially similar between cases and controls. Under this circumstance, over-matching may occur.
To assess how well allele frequencies reported in public databases agree with allele frequencies in the general population, we compared the allele frequencies of selected SNPs in candidate genes involved in inflammatory pathways in a large, community-based population in Washington County, Maryland to the allele frequencies reported in the SNP500Cancer database and, if unavailable there, in the dbSNP database. In addition, we compared genotype frequencies among age and sex groups to explore whether overmatching by age or sex may be of concern in an association study of genetic polymorphisms and disease risk. The candidate genes included were CCR2, CCR5, COX1, COX2, CRP, CSF1, CSF2, IFNG, IL1A, IL1B, IL2, IL4, IL6, IL8, IL10, IL13, IL18, LTA, MPO, NOS2A, NOS3, PPARD, PPARG, PPARGC1 and TNF.

Results
Characteristics of the study population, comprising the Odyssey cohort and the CLUE II subcohort, are presented in Table 1. The Odyssey participants were older than the CLUE II subcohort, reflecting the fact that individuals in the Odyssey cohort had participated in CLUE I 15 years prior to CLUE II. The Odyssey cohort also had more women, a higher body mass index (BMI), and more years of school education.
Age was similar among the genotype groups with a few exceptions (Table 3; see Additional file 2). Specifically, statistically significant differences of one to three years of age were observed for the genotypes of TT vs. CC of IL4 (rs2243250), AT vs. TT of IL10 (rs1800890), AG vs. GG of IL10 (rs1800896), AG vs. AA of NOS2A (rs2297518), and GG vs. AA of PPARG (rs709158). Genotype frequencies did not differ by sex except for TNF (rs1799724), IL2 (rs2069762), IL10 (rs1800890), PPARG (rs1801282), and CRP (rs1800947) with differences of no more than 4% between two groups (Table 4; see Additional file 3).

Discussion
We report allele frequencies and genotype frequencies in a large community-based population, predominantly of Caucasians. Although the CLUE cohorts were not enrolled through a random sampling process, we find no particular reason to suggest that the genetic composition of CLUE participants would have affected research participation. This notion is supported by the findings that the frequency distributions of genetic polymorphisms did not differ between the Odyssey and the CLUE II subcohort.
The similarity between the allele frequencies in CLUE's Caucasian participants and those reported in the SNP500Cancer database has implications for the design of studies on genetic polymorphisms. When the SNP500Cancer database is used as a reference for SNP selection with allele frequencies as one of the selection criteria, and both the wild-type and variant alleles have fairly high frequencies (30%-70%), even a discrepancy in allele frequency of up to 20% between the study samples and the samples used in the SNP500Cancer project may not influence investigators' decision to include a particular SNP in a study. However, for rarer alleles, sampling errors resulting in variation in allele frequency estimates can affect SNP selection. For example, we chose to study a SNP (rs1726803) of the POLD1 gene and a SNP (rs6413413) of the ADH2 gene that were reported in the SNP500Cancer database to have a minor allele frequency of 5%. After analyzing approximately 3,000 samples, we found no variation in the SNP allele frequency and stopped this genotyping analysis.
Table 1 footnotes: * age adjusted odds ratio of being an Odyssey cohort participant. † per year increment. ‡ Normal blood pressure was defined as individuals with systolic pressure <120 mmHg and diastolic blood pressure of 80 mmHg and not on antihypertensive medication; Hypertension was defined as under anti-hypertensive medication or systolic pressure >140 mmHg or diastolic pressure of 90 mmHg. Individuals with systolic blood pressure between 120 and 140 mmHg and/or diastolic blood pressure between 80 mmHg and 90 mmHg were considered pre-hypertensive.
Among the 49 SNPs examined, 10 SNPs did not follow Hardy-Weinberg (H-W) equilibrium. Genotyping is not 100% accurate, and failures to call genotypes might have been one reason for the H-W disequilibrium. On the other hand, the sample size in this study is fairly large, and the larger the sample size, the easier it is for any discrepancy between the observed genotype frequencies and those expected under H-W equilibrium to reach statistical significance.
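The dependence of the H-W test on sample size can be illustrated with a minimal Python sketch. It applies a chi-square goodness-of-fit test with one degree of freedom to two hypothetical samples that have identical genotype proportions (50%, 40%, 10%) but different sizes. The genotype counts and the function name `hwe_chi_square` are assumptions for illustration only, and this is a common textbook form of the test rather than necessarily the procedure used in this study.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg proportions for a biallelic SNP.

    Returns the chi-square statistic and its p-value.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("monomorphic SNP: test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For one degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical samples with the same genotype proportions (AA, AB, BB)
# but a 50-fold difference in size.
for counts in [(100, 80, 20), (5000, 4000, 1000)]:
    chi2, p_value = hwe_chi_square(*counts)
    print(f"n = {sum(counts):>6}: chi-square = {chi2:.2f}, p = {p_value:.2g}")
```

With fixed genotype proportions the statistic grows in proportion to the sample size, so the same small departure from H-W proportions is not significant at n = 200 but is highly significant at n = 10,000.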
Although the number of African American participants in the present study was limited (n = 105), it exceeded the number included in the SNP500Cancer database (n = 24). As expected, there were greater differences in allele frequencies between the present study and the SNP500Cancer database for African Americans; allele frequencies differed significantly for 35% of the SNPs for which comparisons could be made. This finding raises concerns about the usefulness of the SNP500Cancer database as a reference for candidate gene selection for African Americans.
We observed sex- or age-related differences in genotype frequencies for some of the SNPs. Chance alone cannot be excluded as a possible explanation for the statistically significant differences in genotype frequencies between age groups or sex groups, particularly because these differences were small in the present study. Replication is needed to test the robustness of these findings.

Conclusion
In conclusion, when estimating the size of samples needed for a study, particularly if a reference sample is used, one should take into consideration the size and the ethnicity of the reference sample. The greater differences in allele frequencies among African Americans between the present study and public databases indicate a need for basing public databases on a larger sample size for this ethnic group. The small differences in genotype frequencies by age or sex for some candidate genes may be explained by chance alone, and more published data are needed for replication.

Study population
The study population consists of participants in two community-based cohorts, CLUE I and CLUE II, in Washington County, Maryland. Washington County has a slowly growing population.

Selection of candidate genes
Candidate genes were selected based on the following criteria: (a) estimated allele frequencies of ≥ 5% in Caucasians in published literature or databases, (b) known or promising importance in the development of cancer, cardiovascular diseases, and/or longevity, (c) validated allele substitutions, and/or (d) functional changes linked to allele substitutions that have been published in the literature.

Laboratory analysis
At CLUE II enrollment, blood samples were collected into 20-ml heparinized Vacutainers (Fisher Scientific, Pittsburgh, PA). Samples were refrigerated at 4°C; most were centrifuged within 2 to 6 hours after blood collection, and none more than 24 hours later. Plasma aliquots from each participant were placed in two 5-ml Cryotubes (Sumitomo Bakelite, Neptune, NJ) and were stored at -70°C. Buffy coat samples were stored in separate vials at -70°C until extraction. Barcoding was performed as part of the blood collection process. Labels were printed with the study numbers barcoded so that they could be scanned for accuracy in data entry and inventory maintenance.