 Software
 Open access
 Published:
ALDsuite: Dense marker MALD using principal components of ancestral linkage disequilibrium
BMC Genetics volumeÂ 16, ArticleÂ number:Â 23 (2015)
Abstract
Background
Mapping by admixture linkage disequilibrium (MALD) is a whole genome gene mapping method that uses LD from extended blocks of ancestry inherited from parental populations among admixed individuals to map associations for diseases, that vary in prevalence among human populations. The extended LD queried for marker association with ancestry results in a greatly reduced number of comparisons compared to standard genome wide association studies. As ancestral population LD tends to confound the analysis of admixture LD, the earliest algorithms for MALD required marker sets sufficiently sparse to lack significant ancestral LD between markers. However current genotyping technologies routinely provide dense SNP data, which convey more information than sparse sets, if this information can be efficiently used. There are currently no software solutions that offer both local ancestry inference using dense marker data and disease association statistics.
Results
We present here an R package, ALDsuite, which accounts for local LD using principal components of haplotypes from surrogate ancestral population data, and includes tools for quality control of data, MALD, downstream analysis of results and visualization graphics.
Conclusions
ALDsuite offers a fast, accurate estimation of global and local ancestry and comes bundled with the tools needed for MALD, from data quality control through mapping of and visualization of disease genes.
Background
It is well established that a subset of disease and trait phenotypes differ among human populations. Observed differences between ancestral groups can be attributed to two general causes: a difference in environmental exposures or factors or a difference in underlying genetic composition. Individuals with mixed ancestry provide an effective way to map phenotype/genotype associations to specific loci for diseases that show populationspecific prevalence differences not fully explained by environmental factors [1,2]. When two populations combine to form a new admixed population, large chromosomal segments from each of the ancestral populations remain in circulation for many generations. The difference in allele and haplotype frequencies between the populations induces admixture linkage disequilibrium (ALD) that extends over much greater distances than the local LD inherited from ancestral populations. With each new generation chromosomes recombine and the extent of ALD becomes smaller, but with the sequencing of the human genome and the advances in genotyping technology of the last decade, the ancestral origin of chromosomal segments can be inferred with high accuracy for many generations postadmixture [3]. Admixture mapping using sparse SNP arrays have been used to identify the genetic bases for several traits and diseases, including renal disease, white blood count, and chronic obstructive pulmonary disease in African Americans [46].
The application of ALD information to gene mapping studies, also referred to as Mapping by Admixture Linkage Disequilibrium (MALD), is a statistically powerful method to identify genetic associations with disease in admixed populations when there is a difference in disease risk among ancestral groups not attributable to environmental factors [7]. The key advantage of this approach over the standard genome wide association study (GWAS) approach is that the effective number of statistical comparisons, for associations between markers and disease, is inversely related to the length of LD between markers and the causal disease locus. In African Americans, for example, ALD between loci as distant as 20 cM has been identified, while LD in nonadmixed populations rarely extends longer than 0.1 cM [8,9]. This increases the power over classical GWAS by drawing focus to a specific region of interest with 200500 fold fewer comparisons that must be corrected using multiple comparisons techniques [9].
As computational power has increased and the cost of genotyping and sequencing has decreased, MALD studies have become more common and successfully applied to identify a number of genetic variants associated with common diseases [4]. Several software packages, ADMIXMAP, ANCESTRYMAP and STRUCTURE, provided good estimates of global ancestry (i.e. the proportion of ancestors from each admixing population for an individual), as well as statistics for association between phenotype and local ancestry (i.e. the population each haplotype was inherited from at a particular locus) [1012]. These early software packages were limited in their ability to analyze dense marker sets, due to their reliance on the lack of local LD among sampled markers. This reliance on sparse marker sets results from the additional complexity involved with the modeling of local LD. An attempt was made in one software package, SABER, to model 2way LD of a marker with its immediate neighbors, but this was later shown to allow bias into the model from higher order local LD with more distantly linked markers [13,14]. The consequences of this bias include a tendency to overestimate the divergence of admixing populations and possible inference of significant admixture in unadmixed individuals [3].
Two recent software packages, HAPAA and HAPMIX, have modeled local LD in a Bayesian framework similar to that used for genotype imputation, with very good results [15,16]. These methods, however, can be computationally intensive, especially with increasingly dense marker sets [3]. Other recent algorithms, including LAMPLD, MULTIMIX and RFMix, have mainly focused on local ancestry inference using disjoint haplotype blocks [1719] (see Table 1 for a list of all currently available ancestry inference software). While this approach is much more computationally efficient and scales well with increasing marker density, many regions do not segregate well into haplotype blocks. Additionally, most of these methods bin markers in an arbitrary way, including a predetermined number of markers in each bin along the chromosome. This can lead to vastly differing window sizes.
With the R package described here we provide local ancestry estimates using a hidden Markov model (HMM) algorithm similar to that used by existing software [1012], with higher order local LD modeled indirectly using principal components of neighboring markers in groups designed to maintain consistent window size in cM. Additional features not provided in most admixture software packages include MALD association statistics, quality control measures and data formatting tools. Followup statistical and graphical analysis using the powerful tool set available in R is readily available.
Implementation
Principal component regression
We use a Hidden Markov Model (HMM) to model switches between ancestral states across each individualâ€™s chromosomes (see Figure 1). The genome is split into equally sized windows (default is 0.1 cM), and an analysis of modernday representatives of ancestral populations (e.g. West Africans and Europeans for evaluation of African American individuals), is used to obtain starting values for the HMM priors. This is done with a phased data set such as that provided by the International HapMap Project [36], which can be found in the accompanying companion package, ALDdata. These methods can be extended to unphased data, but phased data are currently required by ALDsuite.
Higher order ancestral LD information is approximated in this method using principal components (PCs) of the surrounding, linked markers, and a principal component analysis (PCA) is performed. Samples from modernday surrogate populations chosen to represent ancestral, admixing populations are analyzed in the PCA, and PCs accounting for 80% of the observed variation are chosen to model the likelihood of each ancestral population within each window. The transformation of the genotype data using the PC loadings from the l ^{th} surrogate ancestral population is illustrated in Equation 1 where the principal component matrix for a window is the matrix multiplication of the phased haplotype matrix, A (one row per chromosome, one column per marker in the haplotype) with the eigenvector matrix, v:
A logistic Principal Component Regression (PCR) is then performed to infer the likelihood of each ancestral state within each window as a function of these PCs, and the regression coefficients are used as starting points for local ancestral state probability calculation in the HMM. In the case of two ancestral populations, this simplifies to a logistic regression (see Equation 2); a multinomial logistic regression is used to model admixture between more than two populations (see Appendix).
where g indicates the proposed ancestral population the haplotype originated from. In sparsely sampled regions, where only one marker was sampled within the bounds of the window, observed alleles are used in the model instead of PCR.
HMM algorithm
The HMM is an iterative, twostep process: in the first step, ancestral state probabilities, Î³, are calculated for each individual in the sample at each window, followed in the second step by an update of the parameters on which Î³ is conditioned (see Figure 1). A basic overview is given here; complete details are given in the Appendix section.
We calculate ancestral state probabilities using a forwardbackward algorithm similar to other admixture HMMs [1012], but using the PC loadings discussed above to account for local LD. The ancestral state probabilities in each Markov chain (i.e. one starting at each end of the chromosome, called the forward and reverse chains) consist of the ancestral state probabilities defined in Equation 2, conditioned on the ancestral state probability of the previous marker in the chain and the likelihood of recombination between the two:
where Î³ _{ j } is a vector of ancestral state probabilities for the j ^{th} window, P(gA) is defined in Equation 2, G is the global ancestry or proportion of the genome inherited from each ancestral population, and P(r) is the probability of recombination between the midpoints of the current and previous windows. These probabilities are further dependent on the number of generations since admixture, and the genetic distance between window midpoints, d. The product of the forward and reverse Markov chains, Î³ _{ f } and Î³ _{ r }, is normalized (so that they sum to one) to obtain the final ancestral state probabilities for each window, conditional on admixture linkage disequilibrium with nearby windows,
The local ancestral state at each window is sampled using these ancestral state probabilities. Parameters informing the HMM, particularly those on which Î³ is conditioned (e.g. PCR coefficients in Equation 2, estimated global ancestry and estimated number of generations since admixture), are updated at the conclusion of each iteration, using the sampled ancestral states discussed in the preceding paragraphs (see Appendix section for more details).
ALDsuite retains computation efficiency as the number and density of markers increases by analyzing PCs of small chromosomal regions. Additional computational efficiency can be achieved in multicore environments with support for the parallelization of ALDsuite using a distributed MCMC approach in which a separate analysis, or chain, is run for each parallel process [37,38]. In order to avoid unnecessary duplication of effort during the burnin phase, each chain reports back to the main process after each iteration, where a remote proposal of each parameter is calculated based on the average of all parallel chains. Each chain then updates its own local parameter space using a weighted sum of the local and remote proposals:
where iter is the current iteration and n b u r n is the total number of burnin iterations. This results in a quicker convergence to the equilibrium distribution while allowing each chain to start sampling at an independent state.
Error checking
Marker checks
Several quality control checks can be performed on each marker using ALDsuite to identify potential genotyping errors, mapping errors, flipped markers and irregular variations in allele frequency:

1.
HardyWeinberg Equilibrium is tested using the hwexact() function in the hwde package [39].

2.
Markers with a missing data rate exceeding a userdefined threshold are screened (default threshold is 5%).

3.
Allele frequencies from genotypic data coded as A/C/T/G are compared among populations to identify potential AT/GC flips that may have occurred in data originating from different sources. The default is to drop these markers from the analysis set.

4.
Allele frequencies in the admixed population are compared with modernday, ancestral surrogate population allele frequencies to identify potentially irregular loci.
Individual checks
ALDsuite also includes several quality control checks for individuals, to identify potentially bad samples which the user may wish to remove:

1.
Individuals with a missing data rate exceeding a userdefined threshold are screened (default threshold is 5%).

2.
When sex chromosome data are available, simple gender checks are performed and possible issues are flagged.

3.
The sample is screened for potentially related individuals, and matches are flagged.
Parameter checks
The parameter state space can be saved at each iteration during the analysis for evaluation of convergence.

1.
A function is provided to graphically display the desired parameters over the course of the burnin and followon phases of the analysis. Greater parameter variability can be expected during the burnin phase, and multiple MCMC chains can be compared to evaluate how variable parameters are across independent chains. Parameters whoâ€™s mean values change significantly during the followon phase indicate the need for a longer burnin phase.

2.
To evaluate the representativeness of chosen modernday surrogate samples, the value of Ï„ should be checked (see Appendix section for more details). Higher values indicate a better fit; instances where Ï„<50âˆ’100 either indicate poorly chosen modernday surrogates or the presence of allele flips. In the analysis of African American data, using the YRI and CEU HapMap data as modernday surrogate samples, we have observed Ï„âˆˆ(200âˆ’1000), depending on the density of the marker set.
Statistical association
Local and global ancestry estimates across the genome are reported for each individual. With this information the user can use one of several statistical association techniques for mapping disease genes and/or fine mapping of diseaseassociated loci. When mapping disease genes by ALD, an association with local ancestry at a locus is the primary association being tested. The caseonly regression model, defined in Equation 6, compares the difference between local ancestry and global ancestry. Other data (e.g. casecontrol) can be similarly modeled as defined in Equation 7. In both of these models, regions with statistically significant regression coefficients for local ancestry are inferred to harbor disease modifying genes.
When a disease locus is identified, a fine mapping analysis is needed to identify specific variants most strongly associated with the disease outcome. In a fine mapping analysis both ancestral and genotype data are included in the model (see Equation 8), and an association between genotype and disease is the primary association being tested.
These generalized linear models are very flexible, allowing for multiple types of disease phenotypes (e.g. continuous, dichotomous, timetoevent) and any covariates deemed appropriate by the investigator. Wrapper functions for these models along with support for parallel computation is included in ALDsuite.
Simulations and power
Control populations
Chromosomes with known ancestry at each marker were simulated in a two step process: 1) recombination points were assigned to each chromosome based on the number of generations since admixture; 2) chromosomal segments were randomly selected from the YRI and CEU HapMap samples to fill in each chromosomal region, with the probablility of sampling a given HapMap chromosome conditional upon the assigned global ancestry for the simulted chromosome. In this way, admixed chromosomes were simulated with appropriate admixture linkage patterns across the chromosome without regard to how windows are chosen.
Random recombination rates, conditional upon the number of generations since admixture, and global ancestral proportions, G, were sampled, and 400 chromosomes were simulated. Values for the number of generations since admixture were Gamma distributed with a mean of 6 and standard deviation of 2, and values for G were Beta distributed with a mean of 0.82 and standard deviation of 0.1. These parameters were chosen to simulate a typical African American sample. The CEU and YRI populations were also used as modernday representative populations, but with the initial PCR estimates randomly modified to simulate imperfect surrogates. This was done by adding a normal random value to each of the regression estimates, the variance of which was scaled by each estimateâ€™s standard error.
A sample of 100 individuals from each simulation above was analyzed using ALDsuite, MULTIMIX and PCAdmix [18,28], and the proportion of correct and incorrect inferences are reported.
Empirical data
The ASW population from the International HapMap Project, a sample of African Americans from the Southwest USA, were analyzed using YRI and CEU populations as surrogate ancestral populations. These populations were analyzed using ALDsuite as well as MULTIMIX and PCAdmix [18,28], and a representative sample of the results on chromosome 20 are shown.
Additional tools
Several tools are included in the R package, additional to the local ancestry inference and disease association statistics described above. These include input and output data formatting aids, quality control and analysis of the data, and useful data sets. Formatting functions are available for generating prior parameter estimates for different populations using HapMap populations contained in the ALDdata package, and calculation of genetic distance in humans is performed using one of several maps, including the International HapMap Project and those generated by Matise et al. [36,40,41]. Error checking functions for quality control measures discussed in the Error Checking section are included as well as some basic graphics. Additional downstream statistical analysis and custom generation of graphics using the diverse and powerful toolset provided by R is also directly available [42].
Results and discussion
While sparse marker panels are more cost effective and have proven powerful in the detection several important disease risk genes, dense data provide more accurate ancestry inference and a finer resolution of recombination points [13]. One strategy that has been used is to follow up a MALD study with fine typing around an associated locus [43]. With ALDsuite both sparse and dense marker data are analyzed in combination, resulting in better global ancestry estimates, while being able to infer local ancestry on a much finer scale in areas of particular interest. This program should increase the utility of dense marker datasets available from many large cohort studies that include African Americans.
ALDsuite provides accurate inference of local ancestry, while indirectly modeling local, higher order LD remaining from ancestral populations. The analysis of our simulation resulted in 96.3% accuracy of local ancestry inference, compared to the 98.1% accuracy of PCAdmix and the 98.7% accuracy of MULTIMIX, which is on par with other leading analysis software [16,32]. Comparison of chromosomes from an analysis of the ASW population using ALDsuite, MULTIMIX and PCAdmix also shows a good degree of concordance between the methods used (see Figure 2).
One striking difference between the results shown in Figure 2 are the differing window sizes. The binning of markers in MULTIMIX and PCAdmix is done by arbitrarily grouping a fixed number of markers into each bin. In more densely sampled areas, such as those closer to the center of the chromosome, the window sizes are quite small, while other less densely sampled areas have much larger window sizes. The region at the beginning of the chromosomes in Figure 2, for example, cover as much as 4 cM. Binning of markers in ALDsuite is done by genetic distance, rather than the number of markers, creating a more constant window size across the genome. In more densely sampled regions, this helps maintain better computational properties, since fewer windows can be used to cover the same region, while in sparsely sampled regions a more precise estimate of the boundaries of ancestral haplotypes can be obtained.
Another key feature of ALDsuite that all other densemarker admixture software lacks is direct access to statistical methods needed to map disease phenotypes. Not only does ALDsuite provide utilities directly supporting admixture mapping and fine mapping studies (see Implementation section), but many other proposed methods can be easily implemented in R, using the output provided by ALDsuite [4446]. Also, of eleven MALD studies published in 2013 and early 2014, six used sparse marker panels for disease gene mapping, at least two of which explicitly thinned their dense marker data to accommodate the software used [47,48]. An additional 15 GWAS studies we identified from 2013 used various software listed in Table 1 to control for population substructure resulting from admixture, mostly using dense marker strategies (citations not listed here). This trend highlights the need for a dense marker software package that, like most sparse marker software, includes disease association statistics for MALD.
Conclusion
Admixture inference software can be categorized using a few different metrics including the number of admixing populations it can simultaneously infer, the way it models local LD when analyzing dense marker data, the number of admixing populations it will simultaneously infer and support of disease gene mapping (see Table 1). There are currently no software solutions which both offer analysis of dense marker data from more than two admixing populations and disease association statistics, requiring the use of several software programs, often with very different input and output data formats. ALDsuite offers a fast, accurate estimation of global and local ancestry with the tools needed from data quality control through mapping of disease genes, along with the rich statistical and graphical utilities provided with R.
Availability and requirements
Project name: ALDsuite Project homepage: https://github.com/johnsonra/ALDsuite Operating systems(s): Windows, Mac, Linux Programming language: R and C Other Requirements: R, version 3.0 or greater with the parallel, mvtnorm and hwde packages installed. The gdata and ncdf R packages are also recommended. License: GPL
Appendix
ALDsuite: Dense marker MALD using principal components of ancestral linkage disequilibrium
Computational details for the algorithm used to sample the joint distribution of the HMM for inferring local ancestry. Throughout, parameters are indexed by i (individual), j (marker), c (chromosome) and k (ancestral population).
Initialization of the parameter space
Distances, d, are calculated as the number of centimorgans to the previous marker, with each chromosome starting with a missing value.
The modern allele frequencies on chromosome segments originating from ancestral populations, Î©, parameterize the prior distribution of ancestral allele frequencies, P. Eigen vectors for groups of markers used in modeling of ancestral LD within each ancestral population are either given by the user or estimated from HapMap data by the software. Prior estimates of logistic regression coefficients, H, and their associated variancecovariance matrices, Î£, for inference of modern allele frequencies as a function of nearby, linked markers are also either provided by the user or estimated from HapMap data. All associated markers within a user definable window (default is 2 cM) are chosen to model ancestral LD, and the number of principal components, m1, accounting for 80% of the genetic variation in each subset are chosen to be included in the model, making a total of m coefficients, including the intercept.
Initial values for ancestry, A, are obtained using a quick frequentist algorithm, and global ancestry estimates for each parent are initially equal.
Initial values for average number of generations since admixture, Î», and effective population size of each prior population, Ï„, can also be specified by the user. When unspecified, default values tuned to the analysis of African Americans are used.
MCMC Algorithm
Step 1. Sample Ancestral States
Ancestral state probabilities are calculated using a forwardbackward algorithm similar to that used by admixture software for sparse marker sets [1012]. The main differences in our algorithm being that ancestral LD is indirectly modeled, allowing analysis of dense marker sets, and we estimate marginal ancestral state probabilities for each inherited chromosome, requiring the genotype data to be phased prior to analysis. These differences motivate the majority of differences between our package and other admixture software. In the forward portion of the algorithm, ancestral state probabilities, Î³, are calculated at each locus, dependent on the genotype at each locus (probability that the ith individuals jth locus of chromosome c originated from the kth ancestral population).
Before we treat the probabilities in Equation A1, we note that the probability of an observed recombination event, r _{ ijc }, over a distance of d _{ j } cM is a function of the number of generations since admixture, Î» _{ ic }:
and the probability of any crossovers happening in one haplotype since admixture over a window of size w cM follows a Poisson distribution:
The probability of an individuals genotype at a locus, a, conditional on the ancestral state, g, is a function of the allele frequencies in each population and the principal components of nearby, linked markers, spanning a region of w cM.
where the probability of one or more crossovers in the haplotype block of w cM, which informs the principal components regression, is defined in Equation A3, and p _{ jk } is the allele frequency in chromosomes with k ancestry. We highlight the dependence of Equation A5 on the probability of observing crossovers within the window supporting the principal components regression. If there is a crossover, the resulting haplotype is no longer representative of the ancestral population, and we rely upon the allele frequency instead.
The probabilities of each ancestral state are further dependent on the ancestral probabilities at the previous locus, Î³ _{ i(jâˆ’1)K }, the distance, d _{ j }, between these loci (missing if it is the first locus on a chromosome), the individuals recombination rates, Î» _{ ic }, and the individuals global ancestry, A _{ ick } (the distance between loci is in cM).
Now we treat the probability of the ancestral state, k, of a locus, dependent on the ancestral state at the previous locus in the Markov chain, k ^{âˆ—}:
For the first locus on each chromosome, the only prior information available is the global ancestry of the parents. We essentially treat this scenario as if there were a known recombination event, i.e. P(r _{1})=1.
This also applies to the marginal probability of the observed genotype, a, which depends Equation A4 and Equation A7:
The reverse chain is nearly identical, starting from the opposite end of each chromosome and working back. The final probabilities at each locus are obtained by multiplying the forward and reverse chains and normalizing,
and a sample, G, of Î³ is taken for use in Step 2:
Step 2: Parameter Updates
Updates of A and A ^{X}, global ancestry
The prior of A is Dirichlet distributed and parameterized by Ï‰. The posterior is Dirichlet distributed, parameterized by the sum of Ï‰ and Î³, for all autosomal markers.
We accept the sampled values for each MetropolisHastings sample, \(\dot {A}\), with probability
Patterson et. al. [11] have noted that sex chromosome ancestry is highly correlated with autosomal chromosome ancestry. Sex chromosome ancestry proportions are parameterized the same way here, by a scalar value, o m e g a _{ X }, conditional on A. The posterior is Dirichlet distributed, parameterized by the product of A and Ï‰ ^{X} and the sum of Î³ over the X chromosome.
We accept the sampled values for each MetropolisHastings sample, \(\dot {A}^{X}\), with probability
Update of Î», mean number of generations since admixture
The prior of Î³ is Gamma distributed, parameterized by a shape parameter, Î± _{1} and a rate parameter, Î± _{2}.
The posterior is Gamma distributed:
As noted in Equation A3, the number of crossovers is Poisson distributed. To sample the number of crossovers in each individual, conditional on there being at least 1 crossover, we generate a random uniform number for each locus, q _{ j }, such that
and the number of corresponding crossovers for each locus, n _{ xj }, such that
We then calculate the probability of 0 crossovers given G, p x _{ j0}, at each locus,
where
We keep the number of crossovers we sampled, n x _{ j }, at that locus with probability 1âˆ’p x _{ j0}. The sum of these sampled crossovers, we can sample the updated value, \(\dot {\lambda }\), which we keep with probability
Updates of p and Î², parameterizing allele frequencies for each population
The prior allele frequency of p is Beta distributed, parameterized by the product of Ï„ and P. The posterior is Beta distributed, parameterized by sum of the product of Ï„ with P and the number of reference/variant alleles sampled in Step 2.
Each proportion is individually updated and is kept with probability
For principal component regression modeling of the allele probabilities, conditional on local ancestry, Î² is multivariate normally distributed, parameterized by the prior B and the diagonal of Î£. The posterior is additionally parameterized by Ï„ and the logistic regression coefficients, \(\hat {\beta }\), of the principal component regression model of the haplotypes sampled at the end of Step 1.
The sampled value, \(\dot {\beta }\), is kept with probability
Update of P, B and Ï„, hyper parameters for p and Î²
The prior of P is Beta distributed, parameterized by the number of observed alleles in the modern day equivalent to the founder populations (e.g. Africans and Europeans for African Americans).
Î© is a vector of the number of variant alleles and the number of reference alleles in the modernday ancestral surrogate population sample. After each update of P, \(\dot {P}_{\textit {jk}}\), the change is kept with probability
The prior of B is multivariate normally distributed as a function of H and Î£, as estimated from the modernday surrogate ancestral population.
Sampled updates, \(\dot {B}\), are kept with probability
The prior of Ï„ is log normally distributed such that log _{10}(Ï„) has a mean of 2 and standard deviation of 0.5,
Samples values, \(\dot {\tau }\), are kept with respective probabilities,
where, given the length of Î²=l,
Update of Ï‰ and Ï‰ ^{X}, hyper parameters for A and A ^{X}
The prior of Ï‰ and Ï‰ ^{X} are log normally distributed, such that log _{10}(Ï‰) has mean 1 and standard deviation 0.5.
Updated values for Ï‰ and Ï‰ ^{X}, \(\dot {\omega }\) and \(\dot {\omega }^{X}\), are kept with probability
Update of Î±, hyper parameters for Î»
Similar to other admixture software, updates of are a function of the mean of the Gamma distribution, Î± _{1}/Î± _{2}=m, and the variance of the Gamma distribution, \(\alpha _1/\alpha _{2}^{2} = v\). Each is log normally distributed, such that log _{10}(m) and log _{10}(v) each have mean 1 and standard deviation 0.5.
Values for m and v are updated independently, parameterized by \(\dot {\alpha }\), and are kept with probability
References
MacLean CJ, Workman PL. Genetic studies on hybrid populations. I. Individual estimates of ancestry and their relation to quantitative traits. Ann Human Genet. 1973; 36(3):341â€“51.
Thoday JM. Limitations to genetic comparison of populations. J Biosocial Sci. 1969; Suppl 1:3â€“14.
Seldin MF, Pasaniuc B, Price AL. New approaches to disease mapping in admixed populations. Nature Reviews Genetics. 2011; 12(8):523â€“8.
Kopp JB, Smith MW, Nelson GW, Johnson RC, Freedman BI, Bowden DW, et al. MYH9 is a MajorEffect Risk Gene for Focal Segmental Glomerulosclerosis. Nat Genet. 2008; 40(10):1175â€“84.
Nalls MA, Wilson JG, Patterson NJ, Tandon A, Zmuda JM, Huntsman S, et al. Admixture mapping of white cell count: genetic locus responsible for lower white blood cell count in the Health ABC and Jackson Heart studies. Am J Human Genet. 2008; 82(1):81â€“7.
Parker MM, Foreman MG, Abel HJ, Mathias RA, Hetmanski JB, Crapo JD, et al. Admixture mapping identifies a quantitative trait locus associated with FEV1/FVC in the COPDGene study. Genet Epidemiol. 2014; 37(7):652â€“9.
McKeigue PM. Mapping genes underlying ethnic differences in disease risk by linkage disequilibrium in recently admixed populations. Am J Human Genet. 1997; 60(1):188.
Parra EJ, Marcini A, Akey J, Martinson J, Batzer MA, Cooper R, et al. Estimating African American admixture proportions by use of populationspecific alleles. Am J Human Genet. 1998; 63(6):1839â€“51.
Smith MW, Oâ€™Brien SJ. Mapping by admixture linkage disequilibrium: advances, limitations and guidelines. Nat Genet. 2005; 6:623â€“32.
Hoggart CJ, Shriver MD, Kittles RA, Clayton DG, McKeigue PM. Design and analysis of admixture mapping studies. Am J Human Genet. 2004; 74(5):965â€“78.
Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, Oâ€™Brien SJ, Altshuler D, Daly MJ, Reich D. Methods for highdensity admixture mapping of disease genes. Am J Human Genet. 2004; 74(5):979â€“1000.
Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics. 2003; 164(4):1567â€“87.
Tang H, Coram M, Wang P, Zhu X, Risch N. Reconstructing genetic ancestry blocks in admixed individuals. Am J Human Genet. 2006; 79(1):1â€“12.
Price AL, Weale ME, Patterson N, Myers SR, Need AC, Shianna KV, et al. Longrange LD can confound genome scans in admixed populations. Am J Human Genet. 2008; 83(1):132â€“5.
Sundquist A, Fratkin E, Do CB, Batzoglou S. Effect of genetic divergence in identifying ancestral origin using HAPAA. Genome Res. 2008; 18(4):676â€“82.
Price AL, Tandon A, Patterson N, Barnes KC, Rafaels N, Ruczinski I, et al. Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet. 2009; 5(6):1000519.
Baran Y, Pasaniuc B, Sankararaman S, Torgerson DG, Gignoux C, Eng C, et al. Fast and accurate inference of local ancestry in Latino populations. Bioinformatics. 2012; 28(10):1359â€“67.
Churchhouse C, Marchini J. Multiway admixture deconvolution using phased or unphased ancestral panels. Genet Epidemiol. 2013; 37(1):1â€“12.
Maples BK, Gravel S, Kenny EE, Bustamante CD. RFMix: A Discriminative Modeling Approach for Rapid and Robust LocalAncestry Inference. Am J Human Genet. 2013; 93(2):278â€“88.
Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000; 155(2):945â€“59.
McKeigue PM, Carpenter JR, Parra EJ, Shriver MD. Estimation of admixture and detection of linkage in admixed populations by a Bayesian approach: application to AfricanAmerican populations. Ann Human Genet. 2000; 64(Pt 2):171â€“86.
McKeigue PM, Colombo M, Agakov F, Datta I, Levin A, Favro D, et al. Extending admixture mapping to nuclear pedigrees: application to sarcoidosis. Genet Epidemiol. 2013; 37(3):256â€“66.
Tang H, Peng J, Wang P, Risch NJ. Estimation of individual admixture: analytical and study design considerations. Genet Epidemiol. 2005; 28(4):289â€“301.
Sankararaman S, Sridhar S, Kimmel G. Estimating local ancestry in admixed populations. Am J Human Genet. 2008; 82(2):290â€“303.
Sankararaman S, Kimmel G, Halperin E, Jordan MI. On the inference of ancestries in admixed populations. Genome Res. 2008; 18(4):668â€“75.
Pasaniuc B, Sankararaman S, Kimmel G, Halperin E. Inference of locusspecific ancestry in closely related populations. Bioinformatics. 2009; 25(12):213â€“21.
Alexander DH, Novembre J, Lange K. Fast modelbased estimation of ancestry in unrelated individuals. Genome Res. 2009; 19(9):1655â€“64.
Brisbin A, Bryc K, Byrnes J, Zakharia F, Omberg L, Degenhardt J, et al. PCAdmix: Principal componentsbased assignment of ancestry along each chromosome in individuals with admixed ancestry from two or more populations. Human Biol. 2012; 84(4):343â€“64.
Hu Y, Willer C, Zhan X, Kang HM, Abecasis GR. Accurate localancestry inference in exomesequenced admixed individuals via offtarget sequence reads. Am J Human Genet. 2013; 93(5):891â€“99.
Loh PR, Lipson M, Patterson N, Moorjani P, Pickrell JK, Reich D, et al. Inferring admixture histories of human populations using linkage disequilibrium. Genetics. 2013; 193(4):1233â€“54.
Rodriguez JM, Bercovici S, Elmore M, Batzoglou S. Ancestry inference in complex admixtures via variablelength Markov chain linkage models. J Comput Biol. 2013; 20(3):199â€“211.
Yang JJ, Li J, Buu A, Williams LK. Efficient inference of Local ancestry. Bioinformatics. 2013; 29(21):2750â€“6.
Zhang Y. De novo inference of stratification and local admixture in sequencing studies. BMC Bioinf. 2013; 14 Suppl 5:17.
Liu EY, Li M, Wang W, Li Y. MaCHadmix: genotype imputation for admixed populations. Genet Epidemiol. 2013; 37(1):25â€“37.
Guan Y. Detecting structure of haplotypes and local ancestry. Genetics. 2014; 196(3):625â€“42.
International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature. 2007; 449(7164):851â€“61.
Murray L. Distributed Markov chain Monte Carlo. In: LCCC: NIPS workshop on learning on cores, clusters and clouds. Perth, Western Australia: CSIRO Mathematics, Informatics and Statistics: 2010.
Wu XL, Sun C, Beissinger TM, Rosa GJ, Weigel KA, Gatti NdL, et al. Parallel Markov chain Monte Carlo  bridging the gap to highperformance Bayesian computation in animal breeding and genetics. Genet Sel Evol: GSE. 2012; 44:29.
Maindonald JH. The hwde Package. 2013. http://cran.rproject.org/web/packages/hwde/.
Matise TC, Chen F, Chen W, De La Vega FM, Hansen M, He C, et al. A secondgeneration combined linkage physical map of the human genome. Genome Res. 2007; 17(12):1783â€“6.
Nato AJ, Buyske S, Matise TC. The Rutgers Map: A thirdgeneration combined linkagephysical map of the human genome. 2014. http://compgen.rutgers.edu/download_maps.shtml.
R Development Core Team. R: A language and environment for statistical computing. Manual. 2013. http://cran.rproject.org.
Nelson GW, Freedman BI, Bowden DW, Langefeld CD, An P, Hicks PJ, et al. Dense mapping of MYH9 localizes the strongest kidney disease associations to the region of introns 13 to 15. Human Mol Genet. 2010; 19(9):1805â€“15.
Zhu B, AshleyKoch AE, Dunson DB. Generalized admixture mapping for complex traits. G3 (Bethesda, Md.) 2013; 3(7):1165â€“75.
Redden DT, Divers J, Vaughan LK, Tiwari HK, Beasley TM, FernÃ¡ndez JR, et al. Regional admixture mapping and structured association testing: conceptual unification and an extensible general linear model. PLoS Genet. 2006; 2(8):137.
Shriner D, Adeyemo A, Rotimi CN. Joint Ancestry and Association Testing in Admixed Individuals. PLoS Comput Biol. 2011; 7(12):1002325.
KimHoward X, Sun C, Molineros JE, Maiti AK, Chandru H, Adler A, et al. Allelic heterogeneity in NCF2 associated with systemic lupus erythematosus (SLE) susceptibility across four ethnic populations. Human Mol Genet. 2013; 23(6):1656â€“68.
Jeff JM, Armstrong LL, Ritchie MD, Denny JC, Kho AN, Basford MA, et al. Admixture mapping and subsequent finemapping suggests a biologically relevant and novel association on chromosome 11 for type 2 diabetes in African Americans. PloS One. 2014; 9(3):86931.
Acknowledgements
This project has been funded in whole or in part with federal funds from the National Cancer Institute, National Institutes of Health, under contract HHSN26120080001E. The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products, or organizations imply endorsement by the U.S. Government. This Research was supported [in part] by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authorsâ€™ contributions
RCJ performed the majority of the programming and wrote the first manuscript draft. GWN contributed to the admixture inference methods. RCJ, GWN, JFZ and CAW contributed intellectually to the manuscript and development of the software package. All authors read and approved the final manuscript.
Rights and permissions
Open Access Â This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the articleâ€™s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the articleâ€™s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Johnson, R.C., Nelson, G.W., Zagury, JF. et al. ALDsuite: Dense marker MALD using principal components of ancestral linkage disequilibrium. BMC Genet 16, 23 (2015). https://doi.org/10.1186/s128630150179y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s128630150179y