For resequencing of the SLC6A1 gene, 40 genomic DNA samples were collected from unrelated individuals representing 5 different populations: EA (n = 7), AA (n = 9), Finnish (n = 8), Thai (n = 8), and Hmong (n = 8). The Finnish subjects were unrelated parents of adolescent subjects who were participating in an epidemiological study focusing on the identification of risk factors for early-onset mental illness and substance dependence in Finland. The Thai and Hmong samples were collected in Thailand as part of an ongoing genetic association and population genetic study. The Thais selected for resequencing had grandparents and parents of Thai ancestry (Thai-Thai) or had mixed Thai and Chinese ancestry (Thai-Chinese), by subject report. These samples were obtained from a blood drive in Bangkok, Thailand. The Hmong subjects were recruited in a Hmong village in the northern part of Thailand. The AA and EA samples have been described elsewhere. Both EA and AA samples were self-identified and confirmed as such by Bayesian marker clustering. All subjects provided informed consent as approved by the appropriate institutional review boards. In addition, 46 EA, 60 AA, 59 Thai, 47 Finnish and 48 Hmong individuals were genotyped for 16 SLC6A1 SNPs to examine linkage disequilibrium (LD) in this gene. The Thai subjects selected for examination of linkage disequilibrium were Thai-Thai. Recruitment and population characteristics of the subjects selected for SLC6A1 genotyping were identical to those of the subjects selected for resequencing [35, 36]. The participants were recruited from non-clinical populations. No detailed medical information was available for most of the participants. Therefore, biases deriving from undetected medical conditions in the control sample could not be controlled. All studies described in this article were conducted according to the Declaration of Helsinki. 
The studies were approved by the institutional review boards of Yale University School of Medicine, West Haven VA Hospital, Northern Ostrobothnia Hospital District (University of Oulu, Finland) and Chulalongkorn University (Bangkok, Thailand). All subjects signed a written informed consent form for participation in this study.
The ElDorado program of the Genomatix software package was used to predict the location of the SLC6A1 promoter region. The sequence of the SLC6A1 gene submitted to the promoter region analysis was obtained from the National Center for Biotechnology Information (NCBI).
Amplification and sequencing
For sequencing of SLC6A1, the upper and lower promoter regions, all 16 SLC6A1 exons (total of 4.4 kb) and 7.3 kb of flanking intronic regions were amplified. About 70 bp of the predicted 601 bp of the lower promoter region were not included in the sequence analysis. Approximately 12.4 kb of the SLC6A1 gene was amplified, corresponding to about 25% of the total length of the gene. All primers were designed with the PRIMER3 software. Primers were obtained from Invitrogen (Carlsbad, CA). PCR amplification was optimized before sequencing by testing different cycling conditions. Betaine (Sigma Aldrich, St. Louis, MO) at a final concentration of 0.5–1 M was added to the reactions, as needed, to enhance the specificity and yield of PCR amplification. PCR reactions were carried out in 15 μl volumes containing 20 ng genomic DNA, 200 μM of dNTP mix (Stratagene, La Jolla, CA), 1 μM of mixed forward and reverse primers, 1X PC2 buffer, 0.75 U of KlenTaq1™ (Ab Peptides, St Louis, MO) and 0.5–1 M betaine when needed. Thermocycling conditions consisted of an initial denaturation step at 95°C for 5 min, followed by 30 cycles of denaturation at 95°C for 30 sec, annealing at 60–65°C for 30 sec, and extension at 72°C. The duration of the extension step varied from 30 sec to 2 min depending on the length of the amplicon. After optimization, genomic DNA samples from each population were PCR amplified, and the products were either purified with MinElute PCR purification columns (Qiagen, Valencia, CA) or treated with ExoSAP-IT (USB, Cleveland, OH) to remove excess nucleotides and primers. Purified PCR samples were sequenced in the forward and reverse directions at the Yale University W.M. Keck Foundation Biotechnology Resource Laboratory. Sequencing reactions were conducted using the BigDye Terminator v3.1 cycle sequencing kit and an ABI 9800 Thermocycler (Applied Biosystems, Foster City, CA). 
Sequencing reactions were analyzed on an ABI 3730 xl DNA Analyzer (Applied Biosystems, Foster City, CA). Owing to technical problems, approximately 300 bp in exon 7 and intron 8 (2.4% of the total 12.4 kb sequenced region) are missing from the sequencing data for the Thai and Hmong populations (see additional data file 1).
Owing to the repeat elements contained in the identified upper promoter sequence and to homologous sequences within SLC6A1, the upper promoter region and parts of exon 1 were amplified using nested PCR. In addition, for amplification of a 180 bp fragment located at the junction of the 5' upstream region and exon 1, a region with very high GC content, 7-deaza-dGTP (New England BioLabs, Beverly, MA) was added to the reactions.
Genotyping and linkage disequilibrium study
A total of 16 SNPs were chosen for genotyping in population samples to examine the haplotype structure of the SLC6A1 gene. Nine of these SNPs were identified through resequencing: -24321A/C, -1529A/G, 949A/G, 3164C/T, 14351A/G, 16009A/G, 16116C/T, 20172C/T and 20622A/G. The remaining seven SNPs, -29477C/T, -17590C/T, -13071A/G, -9765C/T, 7772A/G, 13269C/T, and 16605C/T, were chosen from the NCBI dbSNP collection. Of the 16 SNPs studied, 14 were available through the Applied Biosystems Assay-On-Demand service (Applied Biosystems, Foster City, CA). One assay was custom designed and obtained through the Applied Biosystems Assay-by-Design service (Applied Biosystems, Foster City, CA). PCR amplification for the 5' nuclease assays was conducted using 1 ng of DNA, 1X TaqMan universal PCR master mix (Applied Biosystems, Foster City, CA), and 0.5X SNP genotyping assay mix (Applied Biosystems, Foster City, CA). PCR conditions were as follows: a denaturation step at 95°C for 10 min, followed by 50 cycles of 95°C for 15 sec and 60°C for 1 min. Amplification was performed on PTC-200 cyclers (MJ Research, Hercules, CA) and data were analyzed using the ABI Prism 7900HT Sequence Detector System and software version 2.1 (Applied Biosystems, Foster City, CA). All samples were run in duplicate for quality control purposes. Based on comparison of the duplicate runs, we estimated the genotyping error rate to be less than 0.05%. The remaining SNP, -24321A/C, was genotyped by sequencing with 7-deaza-dGTP because its location inside a GC-rich region made it very difficult to design a 5' nuclease assay for this SNP.
Genotyping of the length polymorphisms
Amplification of the region containing the 21 bp short/long VNTR and the 2 bp GG/-GG insertion/deletion polymorphisms was accomplished using primers 5'AAGGAGAGAGATTGGAGCG 3' and 5'CTTCTTTCCTCTCGCATTC 3' (Invitrogen, Carlsbad, CA). PCR reactions were conducted in 15 μl volumes containing 20 ng genomic DNA, 200 μM of dNTP mix (Stratagene, La Jolla, CA), 1 μM of mixed reverse and forward primers, 1X PC2 buffer, 0.75 U of KlenTaq1™ (Ab Peptides, St Louis, MO) and 1 M betaine. The thermocycling conditions consisted of an initial denaturation step at 95°C for 5 min, followed by 30 cycles of denaturation at 95°C for 30 sec, annealing at 60°C for 30 sec, and extension at 72°C for 30 sec. The lengths of the PCR products corresponding to the long and short alleles are 166 bp and 145 bp, respectively. The long and short alleles were separated by electrophoresis on 3% MetaPhor agarose gels (ISC BioExpress, Kaysville, UT). The GG/-GG insertion/deletion polymorphism was genotyped by direct sequencing of the PCR product as described above.
Indices of sequence variation in SLC6A1 were calculated using the web application SLIDER. These indices included the number of polymorphic sites, nucleotide diversity per base pair (π) and Watterson's estimator of theta (θ). Nucleotide diversity per base pair (π) is the mean number of differences per site between two sequences chosen at random from a sample of sequences. Watterson's estimator of theta (θ) is derived from the observed number of SNPs adjusted for the sample size, and it estimates the population mutation parameter, which reflects the rate at which new mutations are expected to occur in each generation. In addition, for each subject we calculated the number of heterozygous SNPs observed in the sequence data. The number of heterozygous SNPs was compared between populations using ANOVA followed by the post hoc Fisher's Least Significant Difference test.
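The two diversity indices defined above can be illustrated with a minimal Python sketch. It uses the standard textbook definitions (mean pairwise differences per site for π, and S / (a_n · L) with a_n = 1 + 1/2 + ... + 1/(n − 1) for Watterson's θ) and is not a description of how SLIDER computes them internally. The function names and the toy alignment are hypothetical, and the sequences are assumed to be aligned, equal in length, and free of missing data.

```python
from itertools import combinations


def segregating_sites(seqs):
    """Count alignment columns that carry more than one allele (S)."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def nucleotide_diversity(seqs):
    """Mean number of pairwise differences per site (pi)."""
    length = len(seqs[0])
    pairs = list(combinations(seqs, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return total_diffs / (len(pairs) * length)


def watterson_theta(seqs):
    """Watterson's estimator per site: S / (a_n * L)."""
    n, length = len(seqs), len(seqs[0])
    a_n = sum(1.0 / i for i in range(1, n))  # harmonic correction for sample size
    return segregating_sites(seqs) / (a_n * length)


# Hypothetical toy alignment: 4 chromosomes, 8 sites, 2 polymorphic sites.
toy = ["ACGTACGT", "ACGTACGA", "ACCTACGA", "ACGTACGT"]

if __name__ == "__main__":
    print("S     =", segregating_sites(toy))
    print("pi    =", round(nucleotide_diversity(toy), 4))
    print("theta =", round(watterson_theta(toy), 4))
```

In a diploid resequencing sample, each subject contributes two chromosomes, so n would be twice the number of subjects.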
PHASE software, which implements a Bayesian algorithm for haplotype reconstruction, was used to estimate haplotype frequencies [18, 19]. The PHASE options -X10 and -MR were used to estimate recombination rates across SLC6A1 [20, 21]. The value on the Y-axis of Figure 2 shows the factor by which the recombination parameter (ρ) per base pair of SLC6A1 exceeds the background recombination rate [20, 21]. The average recombination rate was estimated based on 1,000 burn-in iterations and 1,000 iterations. Recombination frequencies at SLC6A1 were compared visually between our data and HapMap data. No statistical analyses were performed. To evaluate haplotype diversity among populations, we studied how often the most common haplotypes were shared or disjoint. The haplotypes were identified using a sliding window analysis across every three consecutive SLC6A1 SNPs. The four most common three-SNP haplotypes in each window and in each population were identified. The four most common haplotypes were chosen for this analysis because visual inspection of the haplotype frequencies indicated that, in each window and in each population, they captured virtually all of the haplotype diversity. On average, the top four haplotypes captured 96% of all haplotypes. We then calculated how many times each of the common three-SNP haplotypes was disjoint between the populations. A summary pairwise score derived for the populations is presented in Table 2.
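The sliding-window comparison described above could be sketched as follows. This is an illustrative reading rather than the exact scoring used for Table 2: it assumes phased haplotypes are available as allele strings (for example, reconstructed by PHASE), and it counts a common haplotype as "disjoint" when it is among the top four in one population but not in the other. The function names, the scoring rule, and the toy data are all assumptions.

```python
from collections import Counter


def window_counts(haplotypes, start, width=3):
    """Count the haplotypes spanning SNPs start .. start + width - 1."""
    return Counter(h[start:start + width] for h in haplotypes)


def top_haplotypes(haplotypes, start, width=3, k=4):
    """Set of the k most frequent haplotypes in one window."""
    counts = window_counts(haplotypes, start, width)
    return {hap for hap, _ in counts.most_common(k)}


def top_coverage(haplotypes, start, width=3, k=4):
    """Fraction of chromosomes that carry one of the k most common haplotypes."""
    counts = window_counts(haplotypes, start, width)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(haplotypes)


def disjoint_score(pop_a, pop_b, width=3, k=4):
    """Sum, over all windows, of common haplotypes seen in only one population."""
    n_snps = len(pop_a[0])
    score = 0
    for start in range(n_snps - width + 1):
        a = top_haplotypes(pop_a, start, width, k)
        b = top_haplotypes(pop_b, start, width, k)
        score += len(a ^ b)  # symmetric difference = not shared
    return score


# Hypothetical phased haplotypes over 4 SNPs for two populations.
pop_a = ["ACGT", "ACGT", "ACTT", "GCGT"]
pop_b = ["ACGT", "ACGA", "ACGA", "GCGT"]

if __name__ == "__main__":
    print("pairwise disjoint score:", disjoint_score(pop_a, pop_b))
```

With 16 SNPs and a window of three, this loop would visit 14 windows per population pair. Ties in haplotype frequency at the cut-off are broken by order of first appearance here, which a real analysis would need to handle explicitly.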
LD patterns in SLC6A1 were visualized using HAPLOVIEW version 3.2. We used the Tagger algorithm, implemented in HAPLOVIEW, to search for haplotype tagging SNPs in SLC6A1. We used the default Tagger thresholds r2 > 0.8 and LOD score > 3. POWERMARKER was used to calculate allele frequencies and to test for Hardy-Weinberg equilibrium (HWE). To illustrate differences in the span of LD in the five populations, r2 was plotted against physical distance. To do this, r2 was calculated for all SNP pairs. Because these values were not normally distributed, median values are presented. Physical distance (bp) was divided into distance bins to illustrate population differences in LD span across a range of physical distances. Median r2 in distance bins (0.1–10 kb, 10.01–20 kb, etc.) in the different populations is presented in Figure 3. No statistical analyses were performed on these data. The CENSOR software was used to search for repeat elements within the recombination hotspots [43, 44].
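The binning of pairwise r2 by physical distance can be sketched in Python as follows. This is a simplified illustration, not the HAPLOVIEW procedure: it assumes phased haplotypes and biallelic SNPs, whereas LD software typically estimates haplotype frequencies from unphased genotypes. The function names, the 10 kb bin width parameter, and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import median


def r_squared(haps, i, j):
    """r^2 between biallelic SNPs i and j; None if either SNP is monomorphic."""
    n = len(haps)
    ref_i, ref_j = haps[0][i], haps[0][j]  # arbitrary reference alleles
    p_i = sum(h[i] == ref_i for h in haps) / n
    p_j = sum(h[j] == ref_j for h in haps) / n
    p_ij = sum(h[i] == ref_i and h[j] == ref_j for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return None
    d = p_ij - p_i * p_j  # coefficient of linkage disequilibrium D
    return d * d / denom


def median_r2_by_bin(haps, positions, bin_size=10_000):
    """Median r^2 per distance bin; bin 0 covers 1..bin_size bp, bin 1 the next, etc."""
    bins = {}
    for i, j in combinations(range(len(positions)), 2):
        r2 = r_squared(haps, i, j)
        if r2 is None:
            continue
        distance = abs(positions[j] - positions[i])
        bins.setdefault((distance - 1) // bin_size, []).append(r2)
    return {index: median(values) for index, values in sorted(bins.items())}


# Hypothetical data: 4 phased chromosomes, 3 SNPs at the given bp positions.
haps = ["AAA", "AAG", "GGA", "GGG"]
positions = [0, 5_000, 15_000]

if __name__ == "__main__":
    print(median_r2_by_bin(haps, positions))
```

Using the median rather than the mean within each bin follows the text above, since r2 values are bounded between 0 and 1 and are typically skewed.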