A hierarchical statistical model for estimating population properties of quantitative genes

Wu, Samuel S; Ma, Chang-Xing; Wu, Rongling; Casella, George

doi:10.1186/1471-2156-3-36

Methodology article
Open access
Published: 12 June 2002

BMC Genetics volume 3, Article number: 36 (2002)

Abstract

Background

Earlier methods for detecting major genes responsible for a quantitative trait rely critically upon a well-structured pedigree in which the segregation pattern of genes exactly follows Mendelian inheritance laws. However, for many outcrossing species, such pedigrees are not available and genes also display population properties.

Results

In this paper, a hierarchical statistical model is proposed to monitor the existence of a major gene based on its segregation and transmission across two successive generations. The model is implemented with an EM algorithm to provide maximum likelihood estimates for genetic parameters of the major locus. This new method is successfully applied to identify an additive gene having a large effect on stem height growth of aspen trees. The estimates of population genetic parameters for this major gene can be generalized to the original breeding population from which the parents were sampled. A simulation study is presented to evaluate finite sample properties of the model.

Conclusions

A hierarchical model was derived for detecting major genes affecting a quantitative trait based on progeny tests of outcrossing species. The new model takes into account the population genetic properties of genes and is expected to enhance the accuracy, precision and power of gene detection.

Background

The identification of individual genes governing phenotypic variation is crucial to understanding the mechanistic basis of quantitative inheritance and, ultimately, to providing information about the design of optimal breeding strategies in both plants and animals. Quantitative genetics based on new statistical and computational technologies has advanced to the point at which individual genes can be detected and mapped on chromosomes [1, 2]. The detection and mapping of genes is founded on their particular segregation pattern. For a pedigree derived from inbred lines, the segregation pattern of a gene can be exactly predicted. In this situation, quantitative genetic theory can lay a sufficient foundation for gene detection and, thus, the properties of a detected gene can be adequately described by its effect and genomic location. However, for outcrossing populations, such as forest trees, in which inbred lines are not available, genes also display population genetic properties [3]. Hence, the foundation for gene detection in these populations should be established on a combination of quantitative genetics and population genetics. The estimation of the population genetic properties of genes, such as allele frequencies and the degree of disequilibrium, not only can enhance the precision and power of gene detection, but is also fundamentally important to our understanding of population differentiation and evolution [4, 5].

We present a strategy for investigating the existence of genes in an arbitrarily complex progeny test based on phenotype data and estimating the effects of these genes on a quantitative trait and their population properties under the maximum likelihood framework. Only a limited number of statistical studies have been performed to detect genes based on the phenotypic data [6–13], and none of these studies have considered the population and evolutionary genetic properties of genes. Our study population can be derived from a natural population instead of inbred lines as used for general agronomic crops. A finite number of unrelated parents are randomly selected from a natural population and are mated to produce progeny tests under a mating design such as factorial or diallel. In a recent paper, we used such progeny tests to derive a model for detecting major genes affecting a quantitative trait [14]. This model has power to estimate the effect of drift errors on genetic variation during hybridization, thus providing information about dynamic changes of the genetic architecture of a population under artificial hybridization. However, the estimates of population genetic parameters from the progeny test using the above model cannot be generalized to the original population from which the mating design is derived. In this paper, we developed a hierarchical model for analyzing the pattern of gene segregation at both population and family levels, which thus can provide estimates of genetic parameters in two successive generations of the population. Estimates of population genetic parameters using this hierarchical model can reasonably infer the genetic structure of the original population. A forest tree example is used to demonstrate the application of the new statistical model for gene detection.

Model

Population genetic structure

Suppose there is a segregating major gene responsible for a quantitative trait in a natural or experimental diploid population. This gene has s different codominant alleles, A₁,...,A_s, whose frequencies in the population are denoted by p₁,...,p_s with ∑_κ p_κ = 1. These s alleles randomly unite to form s(s + 1)/2 distinguishable genotypes including s homozygotes and s(s - 1)/2 heterozygotes. The population frequencies of the genotypes are denoted by p_κι (κ, ι = 1,...,s). If the population is at Hardy-Weinberg disequilibrium (HWD), the genotypic frequency is the product of the two corresponding allelic frequencies, plus the coefficient of HWD (d_κι), i.e., p_κι = p_κ p_ι + d_κι, where d_κι has the restrictions max{-p_κ (1 - p_ι), -p_ι (1 - p_κ)} ≤ d_κι ≤ p_κ p_ι and ∑_υ≠κ d_κυ - ∑_υ≠κ d_υυ ≥ - [15].
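For the diallelic case (s = 2), the link between allele frequencies, the disequilibrium coefficient and genotype frequencies can be sketched in a few lines. The sketch assumes the common single-coefficient parametrization for two alleles (P11 = p1² + D, P12 = 2·p1·p2 − 2D, P22 = p2² + D with max(−p1², −p2²) ≤ D ≤ p1·p2), which is a simplification of the general d_κι notation above; the function name `genotype_freqs` is hypothetical.

```python
from fractions import Fraction

def genotype_freqs(p1, D):
    """Return (P11, P12, P22) for a diallelic locus.

    Assumed parametrization (not taken from the paper's d_kl notation):
    P11 = p1^2 + D, P12 = 2*p1*p2 - 2*D, P22 = p2^2 + D.
    """
    p2 = 1 - p1
    lower = max(-p1 * p1, -p2 * p2)   # keeps both homozygote frequencies >= 0
    upper = p1 * p2                   # keeps the heterozygote frequency >= 0
    if not (lower <= D <= upper):
        raise ValueError("D outside the admissible range for these allele frequencies")
    return (p1 * p1 + D, 2 * p1 * p2 - 2 * D, p2 * p2 + D)

# Hardy-Weinberg equilibrium is the special case D = 0.
print(genotype_freqs(Fraction(3, 5), Fraction(0)))
```

Exact fractions are used so that the three frequencies can be checked to sum to one without rounding noise.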

Mating design

A finite number of unrelated individuals are randomly selected from the population as female and male parents to generate an I × J factorial mating design. Based on sampling theory, these selected individuals have genotypes at the major locus whose counts follow the same multinomial distribution as those in the original population, with probability vector p = (p_κι, κ, ι = 1,...,s). The segregation of progeny genotypes in the mating design depends on how the major-locus genotypes segregate (1) among the selected female or male parents and (2) within each of the full-sib families (Table 1). The segregation among the I female or J male parents, which has the same pattern as in the original population, is viewed as the population-level segregation. The segregation within a full-sib family of size n_ij derived from the ith female and jth male parent follows the Mendelian segregation ratio and is viewed as the family-level segregation. To understand the genetic structure of the original population, the different genetic information derived from this two-level segregation is not mixed, but rather combined within a maximum likelihood framework.
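The family-level segregation for a diallelic locus can be written out as a small lookup. This is a minimal sketch: coding genotypes by their number of A₁ copies (A₁A₁ = 2, A₁A₂ = 1, A₂A₂ = 0) and the name `mendelian_probs` are illustrative choices, not notation from the paper.

```python
def mendelian_probs(gf, gm):
    """P(offspring genotype | parental genotypes) under Mendelian segregation.

    Genotypes are coded as the number of A1 copies (2, 1 or 0); this coding
    is an assumption made for the example.
    """
    tf, tm = gf / 2.0, gm / 2.0          # chance each parent transmits A1
    return {
        2: tf * tm,                          # A1 from both parents
        1: tf * (1 - tm) + (1 - tf) * tm,    # A1 from exactly one parent
        0: (1 - tf) * (1 - tm),              # A2 from both parents
    }

# Segregation ratios for all nine female x male genotype combinations.
for gf in (2, 1, 0):
    for gm in (2, 1, 0):
        print(gf, gm, mendelian_probs(gf, gm))
```

Only families with at least one heterozygous parent segregate, which is why differences in the shape of the phenotypic distribution among families carry information about the parental genotypes.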

Table 1 Factorial mating design using female and male parents sampled from a natural population.

Estimation method

Let y_ijk denote the phenotypic measurement for the kth offspring from the ith female parent and the jth male parent and g_ijk denote its genotype at the major gene, i = 1, 2, ..., I, j = 1, 2, ..., J, k = 1, 2, ..., n_ij. The statistical model describing the phenotype-genotype relationship can be expressed as

where μ_κι is the genotypic value of genotype A _κ A _ι; and e _ijk is the error term with the normal distribution N(0,σ²).

Let be the genotype for the ith female parent and for the jth male parent. We denote by y the collection of the observed data and by z = (g,g _f,g _m) the collection of the major-locus genotypes for offspring, female parents and male parents. We assume the following hierarchical model:

where f denotes the normal distribution of the phenotype N(μ_κι, σ²), h is distributed according to the Mendelian segregation pattern and ψ is multinomially distributed with probability vector p. Note that genotypes for siblings are dependent and, given genotypes, the traits are independent of each other. Under the above model (y, z) has a joint distribution as follows:
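As an illustration of the three levels described in the text (ψ for the parents, h for Mendelian segregation, f for the phenotype), the observed-data likelihood of a small diallelic design can be obtained by summing over every parental genotype configuration. This is a sketch under stated assumptions rather than the authors' formula: genotypes are coded as A₁ copy numbers, parental genotypes are treated as independent draws with probabilities p, and all function names are hypothetical. Brute-force enumeration visits 3^(I+J) configurations, so it is only practical for small designs.

```python
import math
from itertools import product

GENOS = (2, 1, 0)  # copies of allele A1: A1A1, A1A2, A2A2 (assumed coding)

def mendel(g, gf, gm):
    """h: Mendelian probability of offspring genotype g given both parents."""
    tf, tm = gf / 2.0, gm / 2.0
    return {2: tf * tm,
            1: tf * (1 - tm) + (1 - tf) * tm,
            0: (1 - tf) * (1 - tm)}[g]

def normal_pdf(y, mu, sigma2):
    """f: normal density of a phenotype with mean mu and variance sigma2."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def family_lik(ys, gf, gm, mu, sigma2):
    """Likelihood of one full-sib family given its parents' genotypes.

    Sibs are independent given the parents, each being a mixture over the
    offspring genotypes allowed by Mendelian segregation.
    """
    lik = 1.0
    for y in ys:
        lik *= sum(mendel(g, gf, gm) * normal_pdf(y, mu[g], sigma2) for g in GENOS)
    return lik

def marginal_loglik(data, n_females, n_males, mu, sigma2, p):
    """Log-likelihood of the phenotypes, summing out all unobserved genotypes.

    data maps (i, j) to the list of phenotypes in that family; p maps a
    genotype code to its population frequency (psi level).
    """
    total = 0.0
    for config in product(GENOS, repeat=n_females + n_males):
        gf, gm = config[:n_females], config[n_females:]
        term = 1.0
        for g in config:
            term *= p[g]
        for (i, j), ys in data.items():
            term *= family_lik(ys, gf[i], gm[j], mu, sigma2)
        total += term
    return math.log(total)

# Toy 1 x 2 design with made-up phenotypes.
data = {(0, 0): [62.0, 43.0, 38.0], (0, 1): [41.0, 27.0]}
mu = {2: 65.0, 1: 40.0, 0: 25.0}
p = {2: 0.36, 1: 0.48, 0: 0.16}
print(marginal_loglik(data, 1, 2, mu, 225.0, p))
```

For large families the raw products can underflow; a production version would accumulate log-likelihoods instead.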

Statistically, equation (2) is a mixture model in which each component is represented by a genotype within a full-sib family [16–18]. The maximum likelihood estimates for θ = (μ, σ², p) can be obtained using the EM algorithm [19, 20]. At the tth step of the EM sequence, the expected log-likelihood is proportional to

where Σ_G sums over all possible genotypes, are the posterior distributions of offspring genotypes and iG, jG are the posterior distributions of parent genotypes. Conditional on , these posterior probabilities can be evaluated as follows:

The EM sequence is defined by

If the major gene has two codominant alleles, the posterior probabilities in equation (4) can be calculated exactly for a usual mating design. However, for large designs with many parents, or when the major gene has many different alleles, we have to rely on a Monte Carlo version of the EM algorithm. For example, we can sample from the conditional distribution of z = (g, g_f, g_m) given the observed data y and the current parameter estimates using the following Gibbs sampler [20]:
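One way such a sampler could look for a diallelic locus is sketched below. It is an illustration under stated assumptions, not the authors' sampler: each sweep draws every offspring genotype from its full conditional (Mendelian prior times normal density) and then each parental genotype from its full conditional (population frequency times the Mendelian probabilities of its offspring). Genotypes are coded as A₁ copy numbers and all names are hypothetical.

```python
import math
import random

GENOS = (2, 1, 0)  # copies of allele A1: A1A1, A1A2, A2A2 (assumed coding)

def mendel(g, gf, gm):
    """Mendelian probability of offspring genotype g given both parents."""
    tf, tm = gf / 2.0, gm / 2.0
    return {2: tf * tm,
            1: tf * (1 - tm) + (1 - tf) * tm,
            0: (1 - tf) * (1 - tm)}[g]

def normal_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def draw(weights, rng):
    """Draw one genotype from unnormalized weights listed in GENOS order."""
    return rng.choices(GENOS, weights=weights, k=1)[0]

def parent_weights(index, is_female, gf, gm, g_off, p):
    """Unnormalized full conditional of one parent's genotype."""
    weights = []
    for cand in GENOS:
        w = p[cand]
        for (i, j), gs in g_off.items():
            if (i if is_female else j) != index:
                continue
            for g in gs:
                w *= mendel(g, cand, gm[j]) if is_female else mendel(g, gf[i], cand)
        weights.append(w)
    return weights

def gibbs_sweep(data, gf, gm, g_off, mu, sigma2, p, rng):
    """One in-place sweep over offspring, female and male genotypes."""
    for (i, j), ys in data.items():
        for k, y in enumerate(ys):
            w = [mendel(g, gf[i], gm[j]) * normal_pdf(y, mu[g], sigma2) for g in GENOS]
            g_off[(i, j)][k] = draw(w, rng)
    for i in range(len(gf)):
        gf[i] = draw(parent_weights(i, True, gf, gm, g_off, p), rng)
    for j in range(len(gm)):
        gm[j] = draw(parent_weights(j, False, gf, gm, g_off, p), rng)

# Toy 2 x 2 design with made-up phenotypes; all-heterozygous parents give a
# starting state that is compatible with any offspring genotype.
rng = random.Random(1)
data = {(0, 0): [66.0, 41.0, 38.0], (0, 1): [63.0, 44.0],
        (1, 0): [39.0, 27.0], (1, 1): [24.0, 42.0]}
gf, gm = [1, 1], [1, 1]
g_off = {fam: [1] * len(ys) for fam, ys in data.items()}
mu = {2: 65.0, 1: 40.0, 0: 25.0}
p = {2: 0.36, 1: 0.48, 0: 0.16}
for _ in range(200):
    gibbs_sweep(data, gf, gm, g_off, mu, 225.0, p, rng)
print(gf, gm)
```

In a Monte Carlo EM step, the draws from many sweeps would replace the exact posterior probabilities. Single-site updates of this kind can mix slowly when families are large, because a parent's genotype is tightly constrained by its sampled offspring.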

Example

We use data from aspen trees to illustrate the application of our statistical model for detecting a major gene. The example is derived from a 6 × 6 factorial mating design of Populus tremuloides and P. tremula. The parents for the crosses were randomly selected from a mixed breeding population of two different species established at the University of Minnesota. Originally, the P. tremuloides trees used as the female parents came from the Upper Peninsula of Michigan and northern Wisconsin, and the P. tremula male parents came from northeastern Poland and Germany [21]. The seedlings from the progeny population were planted in two different locations, one on a cutover forest site near Grand Rapids, Minnesota, and the other on farmland near Rhinelander, Wisconsin. Both trials were laid out in a randomized complete block design with 10 replicates and six seedlings per family in each replicate at 2.5 × 2.5 spacing. At the end of the second year in the field, all trees were measured for stem height.

Figure 1 presents density estimates of the second year height from each family pooled over the two locations. Trees from eight families did not have growth data due to mortality. The phenotypic distributions suggest the existence of a major gene in the progeny test. For example, in the family derived from Clone-5 × TA-1-68, the phenotypic distribution is a mixture of at least two components, showing the existence of a segregating gene. Figure 1 also indicates different segregation patterns among the families, with some distributions smooth and others multimodal, thus suggesting that some parents are heterozygous with their alleles segregating in the progeny.

The effect and genotypic frequencies of the major gene are estimated using the hierarchical model assuming two alleles. Since no significant difference is detected between the two locations, we only report the result based on pooled data. With probability of almost one, parents TA-1-91, TA-2-68, TA-4-91 and XTA-3-6 have genotype A₁A₁, parents TA-1-68, TA-1-75, CLONE-5 and TA-5-61 have genotype A₁A₂, and the remaining parents have genotype A₂A₂. The estimates of the genotypic values of A₁A₁, A₁A₂ and A₂A₂ are μ̂₁₁ = 56.7 ± 0.80, μ̂₁₂ = 40.6 ± 0.32 and μ̂₂₂ = 30.0 ± 1.02, respectively. The estimate of residual variance is σ̂² = 218.676 ± 6.09. Here, the estimates of standard errors for the parameter estimators are obtained based on the Fisher information matrix [22]. We also fitted the model under the null hypothesis that there is no major gene, using a single normal distribution. The likelihood ratio test is highly significant (p < 0.0001), suggesting the existence of a major gene for height growth. The ratio of the dominance to additive effect for this major gene is less than 0.2, which implies that it affects stem height growth in an additive manner.

The population genetic parameters of this major gene are estimated based on the genotypes of the parents detected. The genotype frequencies estimated are identical for the three genotypes, i.e. p̂₁₁ = p̂₁₂ = p̂₂₂ = 1/3. In consequence, the allele frequencies are also equal (p̂₁ = p̂₂ = 1/2). Thus, the mixed breeding population of P. tremuloides and P. tremula is not at Hardy-Weinberg equilibrium, because the genotype frequencies are not products of the corresponding allele frequencies. The coefficient of Hardy-Weinberg disequilibrium is estimated as . The estimates of allele frequencies and Hardy-Weinberg disequilibrium from our hierarchical model can be generalized to the original breeding population from which the parents were sampled. However, one should be cautious about the accuracy of the estimators because of the small sample size.
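The arithmetic behind these population estimates can be reproduced from the inferred parental genotypes, four parents in each genotype class. The sketch below assumes the common single-coefficient parametrization for two alleles (P11 = p1² + D, P12 = 2·p1·p2 − 2D, P22 = p2² + D); the paper's own d_κι notation may scale or sign the heterozygote term differently, so the D computed here is an illustration under that assumption. The function name `hwd_from_counts` is hypothetical.

```python
from fractions import Fraction

def hwd_from_counts(n11, n12, n22):
    """Allele frequency of A1 and disequilibrium D from genotype counts.

    Assumes the single-coefficient diallelic parametrization
    P11 = p1^2 + D, so that D = P11 - p1^2.
    """
    n = n11 + n12 + n22
    P11, P12 = Fraction(n11, n), Fraction(n12, n)
    p1 = P11 + P12 / 2            # each heterozygote carries one A1 allele
    D = P11 - p1 * p1
    return p1, D

# Twelve parents: 4 A1A1, 4 A1A2, 4 A2A2.
p1, D = hwd_from_counts(4, 4, 4)
print(p1, D)  # 1/2 1/12
```

Under Hardy-Weinberg equilibrium with p1 = 1/2 the expected genotype frequencies would be 1/4, 1/2 and 1/4, so the observed 1/3 for each class corresponds to an excess of homozygotes.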

Simulation

The behavior of the EM estimators is evaluated by two simulation experiments. In the first simulation, the same 6 × 6 mating design as in our real example is assumed. The major gene is assumed to have two codominant alleles A₁ and A₂ whose frequencies in the population are p₁ = 0.6 and p₂ = 0.4. The genotypic values are set to be 65, 40 and 25 with standard deviation 15. Based on the hierarchical model defined in equation (1), we first simulated a set of parental genotypes (A₁A₁, A₁A₁, A₁A₂, A₁A₁, A₁A₂, A₁A₁ for females and A₁A₂, A₂A₂, A₂A₂, A₁A₁, A₁A₂, A₁A₂ for males). Then we randomly generated 50 offspring genotypes and observations for each family.
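A data set with this structure can be generated with the standard library alone. The sketch uses the parental genotypes listed above, coded as A₁ copy numbers, and assumes the genotypic values 65, 40 and 25 belong to A₁A₁, A₁A₂ and A₂A₂ in that order; the seed and the function names are arbitrary illustrative choices, so the output will not match the authors' simulated data.

```python
import random

MU = {2: 65.0, 1: 40.0, 0: 25.0}   # assumed order: A1A1, A1A2, A2A2
SD = 15.0
FEMALES = [2, 2, 1, 2, 1, 2]        # A1A1, A1A1, A1A2, A1A1, A1A2, A1A1
MALES = [1, 0, 0, 2, 1, 1]          # A1A2, A2A2, A2A2, A1A1, A1A2, A1A2

def transmit(parent, rng):
    """Return 1 if the parent passes on A1, else 0 (Mendelian segregation)."""
    return 1 if rng.random() < parent / 2.0 else 0

def simulate_design(females, males, n_per_family, rng):
    """Simulate (genotype, phenotype) pairs for every full-sib family."""
    data = {}
    for i, gf in enumerate(females):
        for j, gm in enumerate(males):
            family = []
            for _ in range(n_per_family):
                g = transmit(gf, rng) + transmit(gm, rng)   # offspring genotype
                y = rng.gauss(MU[g], SD)                    # phenotype given genotype
                family.append((g, y))
            data[(i, j)] = family
    return data

rng = random.Random(2002)
data = simulate_design(FEMALES, MALES, 50, rng)
print(len(data), "families,", sum(len(f) for f in data.values()), "offspring")
```

Keeping the simulated genotypes alongside the phenotypes makes it possible to compare estimates from the mixture model with the "ideal" estimates obtained when offspring genotypes are known.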

Our EM estimating procedure converged to the correct parental genotypes after only a few iterations (normally 2 or 3). A likelihood ratio test is highly significant, with a log-likelihood ratio of 125 (df = 4). The estimates of genotypic values from a single simulation replicate are 64.1 ± 0.74 for A₁A₁, 39.5 ± 0.59 for A₁A₂ and 26.8 ± 1.64 for A₂A₂. The estimate of residual variance is 241.6 ± 10.30.

In our second simulation we studied a 5 by 5 mating design, using 25 replications to investigate the empirical accuracy of the estimators. We assumed that the major gene has two alleles and randomly generated the parental genotypes based on allele frequency p₁ = 0.5. This resulted in female genotypes (A₁A₁, A₁A₂, A₂A₂, A₁A₂, A₂A₂) and male genotypes (A₁A₂, A₁A₁, A₂A₂, A₁A₁, A₁A₂). For each family we simulated the genotypes of 50 offspring and then simulated their trait measurements using either a normal or a log-normal distribution.

The results from the second simulation study are summarized in Figure 2. The top panel of the plots is for the simulation where trait measurements are simulated from a normal distribution. The bottom panel corresponds to lognormal data. In each plot we graphed the ideal estimates of the parameter (assuming that we know the genotypes of each offspring) versus our estimates based on the EM algorithm. Mean squared errors are also indicated in the plots. It is seen from Figure 2 that all four parameters can be estimated quite accurately in the normal case while they are slightly biased for the lognormal data.

Discussion

We have derived a hierarchical model for detecting major genes affecting a quantitative trait based on progeny tests of outcrossing species. The new model takes into account the population genetic properties of genes and is expected to enhance the accuracy, precision and power of gene detection. The information about the population behavior of genes estimated from this model is fundamentally important to our understanding of the genetic architecture of a natural or experimental population and is also useful for the design of sound breeding strategies for agricultural crops and forest trees.

We used an example from interspecific aspen hybrids to illustrate the application of our gene-detecting model. The model has successfully identified a major gene with additive effects on second year stem growth in the hybrid aspens. The additive effect of genes on height growth in young poplars was also observed in a molecular marker experiment [23]. In this marker experiment, an additive quantitative trait locus was found to explain almost a quarter of the total phenotypic variation in second year height growth. The result from the aspen hybrids was consistent with that from a simulation experiment based on a mating scheme analogous to that used in our real example. The consistency of the results from the simulation study suggests that the model proposed here can be adequately used to detect genes of large effect on the phenotypes.

The model will have immediate applications for many species in which progeny tests have been established for a number of years [24]. Most of these tests are maintained in different locations and measured annually, thus offering desirable opportunities to address two important questions: (1) How do the genes interact with the environment? (2) What is the developmental plasticity of gene expression? In our real example, no evidence has been obtained for a change of gene expression at the major locus detected over the two field trials, despite their dramatic differences in climate, soil properties and silvicultural measures [21]. If the stability of the expression of this gene over a range of environments is confirmed by more accurate genotype determination methods, e.g. based on genetic markers, it will have a tremendous application in breeding for stable cultivars of forest tree species.

Our analysis and simulation are based on diallelic inheritance. Multiallelic inheritance can be considered similarly, although more parameters must be introduced. Also, our model assumes the segregation of a single gene in progeny tests. The principle behind the model can be extended to consider two or more genes. Modelling multiple genes may be closer to biological reality, because the linkage, epistatic interaction and genetic association among genes are considered. However, the consideration of these relationships requires estimating an increased number of unknown genetic parameters. Finally, our model is based on a balanced factorial design. For some populations, like mammals, mating designs are frequently unbalanced because of limited resources or reproductive difficulties. It is very important to extend our analysis and model to an unbalanced mating design of any complexity. For all of the extensions mentioned above, maximum likelihood approaches may not be sufficient owing to the increasing dimension of the parameter space. A more powerful computational tool, e.g., Markov chain Monte Carlo methods, may be needed [20] to make our model more tractable in real applications.

References

1. Hoeschele I, Uimari P, Grignola FE, Zhang Q, Gage KM: Advances in statistical methods to map quantitative trait loci in outbred populations. Genetics 1997, 147:1445–1457.
2. Balding JD, Bishop M, Cannings C: Handbook of statistical genetics. John Wiley & Sons, New York 2001.
3. Wu RL: Mapping quantitative trait loci by genotyping haploid tissues. Genetics 1999, 152:1741–1752.
4. Wright S: Evolution and the genetics of populations. II. The theory of gene frequencies. Univ. of Chicago Press, Chicago 1969.
5. Elston RC: Segregation analysis. Adv Hum Genet 1981, 11:372–3.
6. Hoeschele I: Genetic evaluation with data presenting evidence of mixed major gene and polygenic inheritance. Theoretical and Applied Genetics 1988, 76:81–92.
7. Hoeschele I: Statistical techniques for detection of major genes in animal breeding data. Theoretical and Applied Genetics 1988, 76:311–319.
8. Le Roy P, Naveau J, Elsen JM, Sellier P: Evidence for a new major gene influencing meat quality in pigs. Genetical Research 1990, 55:33–40.
9. Knott SA, Haley CS, Thompson R: Methods of segregation analysis for animal breeding data: a comparison of power. Heredity 1992, 68:299–311.
10. Le Roy P, Elsen JM: Simple test statistics for major gene detection: a numerical comparison. Theoretical and Applied Genetics 1992, 83:635–644.
11. Loisel R, Goffinet B, Monod H, De Oca GM: Detecting a major gene in an F2 population. Biometrics 1994, 50:512–516.
12. Janss LL, Thompson R, Van Arendonk JAM: Application of Gibbs sampling for inference in a mixed major gene-polygenic inheritance model in animal population. Theoretical and Applied Genetics 1995, 91:1137–1147.
13. Janss LL, Van Arendonk JAM, Brascamp EW: Bayesian statistical analysis for presence of single genes affecting meat quality traits in a crossed pig population. Genetics 1997, 145:395–408.
14. Wu RL, Li BL, Wu SS, Casella G: A maximum likelihood-based method for mining major genes affecting a quantitative character. Biometrics 2001, 57:764–768.
15. Weir BS: Genetic Data Analysis II. Sinauer Associates, Sunderland, MA 1996.
16. Titterington DM, Smith AFM, Makov UE: Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons, NY 1985.
17. Feng ZD, McCulloch CE: On the likelihood ratio test statistic for the number of components in a normal mixture with unequal variances. Biometrics 1994, 50:1158–1162.
18. Fernando RL, Stricker C, Elston RC: The finite polygenic mixed model: an alternative formulation for the mixed model of inheritance. Theoretical and Applied Genetics 1994, 88:573–580.
19. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 1977, 39:1–38.
20. Robert C, Casella G: Monte Carlo Statistical Methods. Springer, NY 1999.
21. Li BL, Wu RL: Heterosis and genotype x environment interactions of juvenile aspens in two contrasting sites. Can. J. Forest Res. 1997, 27:1525–1537.
22. Oakes D: Direct calculation of the information matrix via the EM algorithm. J. R. Statist. Soc. B 1999, 61:479–482.
23. Bradshaw HD, Stettler RF: Molecular-genetics of growth and development in Populus. 4 Mapping QTLs with large effects on growth, form, and phenology traits in a forest tree. Genetics 1995, 139:963–973.
24. McKeand SE, Bridgwater FE: A strategy for the third breeding cycle of loblolly pine in the southeastern US. Silvae Genetica 1998, 47:223–234.

Acknowledgements

We thank Dr. Jim Hobert, Dr. Dudley Huber, Dr. Bailian Li and Dr. Tim White for helpful discussions and reviews regarding this study. This manuscript was approved as Journal Series No. R-08032 by the Florida Agricultural Experiment Station.

Author information

Authors and Affiliations

Department of Statistics, University of Florida, Gainesville, FL, 32611, USA
Samuel S Wu, Chang-Xing Ma, Rongling Wu & George Casella
Department of Statistics, Nankai University, Tianjin, 300071, P. R. China
Chang-Xing Ma

Corresponding author

Correspondence to Rongling Wu.

About this article

Wu, S.S., Ma, CX., Wu, R. et al. A hierarchical statistical model for estimating population properties of quantitative genes. BMC Genet 3, 36 (2002). https://doi.org/10.1186/1471-2156-3-36

Received: 15 March 2002
Accepted: 12 June 2002