An EM algorithm for mapping segregation distortion loci

Background Chromosomal region that causes distorted segregation ratios is referred to as segregation distortion locus (SDL). The distortion is caused either by differential representation of SDL genotypes in gametes before fertilization or by viability differences of SDL genotypes after fertilization but before genotype scoring. In both cases, observable phenotypes are distorted for marker loci in the chromosomal region close to the SDL. Under the quantitative genetics model for viability selection by proposing a continuous liability controlling the viability of individual, a simplex algorithm has been used to search for the solution in SDL mapping. However, they did not consider the effects of SDL on the construction of linkage maps. Results We proposed a multipoint maximum-likelihood method to estimate the position and the effects of SDL under the liability model together with both selection coefficients of marker genotypes and recombination fractions. The method was implemented via an expectation and maximization (EM) algorithm. The superiority of the method proposed under the liability model over the previous methods was verified by a series of Monte Carlo simulation experiments, together with a working example derived from the MAPMAKER/QTL software. Conclusion Our results suggested that the new method can serve as a powerful alternative to existing methods for SDL mapping. Under the liability model, the new method can simultaneously estimate the position and the effects of SDL as well as the recombinant fractions between adjacent markers, and also be used to probe into the genetic mechanism for the bias of uncorrected map distance and to elucidate the relationship between the viability selection and genetic linkage.


Background
In a segregation population derived from a cross between two inbred lines, some molecular markers often show distorted segregation ratios from Mendelian expectations [1][2][3]. The distortion is frequently related to gamete gene, sterile gene and chromosome translocation [4]. So the detection of the gene or locus, known as segregation distortion locus (SDL) mapping, is warranted. However, the challenge encountered in SDL mapping is mainly caused by the unavailability of phenotypic data for the underlying trait. In fact, molecular markers linked to the SDL frequently show segregation distortion and the degree of distortion depends on the size and the position of SDL. Therefore, it is possible to detect SDL by means of the distortion.
Mapping SDL is usually studied at the population level by examining the change of gene (or genotypic) frequencies of markers [5]. In the past a single marker was often used to detect the linkage between the marker and SDL [6,7]. Its shortcomings are very similar to those of single-marker approaches in quantitative trait loci (QTL) mapping [8].
Since the introduction of interval mapping of QTL [9], Hedrick and Muona [10] developed a flanking-marker analysis to estimate the fitness parameters for a viability locus. The model of Hedrick and Muona [10] is actually a complete recessive model. Mitchell-Olds [11] detected one putative viability locus at a time and then scanned the entire genome for every putative position to provide a test statistic profile for the detection of SDL. However, his model only test and estimate the degree of dominance. Luo and Xu [12] extended the maximum-likelihood (ML) method to estimate degree of dominance and selection coefficients using an outbred full-sib family as an example. Wang et al. [13] developed a multipoint ML method to estimate the position and the genotypic frequencies of SDL in an F 2 population. However, the efficacies of the methods mentioned above have been seldom addressed in simulation studies. Recently, Luo et al. [14] developed a quantitative genetics model for viability selection. This approach makes it possible to carry out simulation studies, to partition the selection into additive and dominant effects and to remove the effects of non-genetic cofactors from the analysis [14,15]. However, this approach raises two issues. Firstly, they assumed that segregation distortion didn't affect the construction of genetic linkage map. In fact, marker segregation distortion is known to affect the estimates for both recombination fractions in pairwise analysis of markers and the order of the markers on a linkage group [16][17][18]. As for the genetic parameters, then, Luo et al [14] adopted the Simplex algorithm [19] to search for the solutions at the cost of computational consuming. Under the liability model proposed by Luo et al [14], therefore, in this paper it is necessary to extend the multipoint approach by combining the estimations of the genetic parameters of SDL with the reconstruction of genetic linkage maps. The new method for SDL mapping was implemented via an expectation and maximization (EM) algorithm rather than Simplex procedure. The genetic factors that might affect the estimates of recombination fractions between adjacent markers would be discussed in detail. A series of Monte Carlo simulation experiments together with a working example from the Mapmaker/QTL software were carried out to verify our approach.

Genetic model
Considering an SDL in an F 2 population derived from a cross between two inbred lines, we assumed three genotypes at this locus, AA, Aa and aa, to have genotypic values a -d, d and -a -d, respectively, with a and d indi-cating additive and dominant effects, and an imaginary trait, liability, invisible to the investigators but visible to nature, controlled the viabilities of individuals. It should be noted that the genetic variance in an F 2 population was a 2 + d 2 rather than as usual. The phenotypic value of the jth individual was described by the following linear model, where g j was the genotypic value for the jth individual, and ε j a normally distributed residual variable with mean zero and standard deviation 1.0, which accounted for polygenes that were linked to the markers and for environmental variation [14,18]. Provided that the liability was subject to natural selection, an individual would survive if z j ≥ 0 and would be eliminated from the population if z j < 0. Since all the sampled individuals had survived from the viability selection, the liability of each genotype followed a truncated distribution with a cumulative probability, where h indexed the genotypes of the SDL, and f h was referred to as the relative fitness of the hth genotype [14]. The expected frequencies of the three genotypes were similarly

Mapping SDL under a liability model
We assumed that there was no crossing-over interference among the markers on the linkage group considered, an SDL caused segregation distortion of some or all markers linked to the SDL, and three genotypes for each marker had different viability coefficients. Let the order of the m markers on a same chromosome be M 1 , M 2 ,...,M m ; x k be a dummy variable defined as x k = 1, 0, -1 for a homozygote of P 1 , a heterozygote and a homozygote of P 2 at the kth marker, respectively; z k be indicator for phenotype of the kth marker (M k ); r k (or r k,k+1 ) be the recombination fraction between the kth and (k+1)th markers; and s k,1 and s k,2 (0 ≤ s k,1 < +∞ and 0 ≤ s k,2 < +∞ for k = 1, 2,...,m) be the viability coefficients of M k m k and m k m k relative to M k M k at the kth marker. . . .
Now let an SDL locate between the kth and (k+1)th markers, and φ jh be the indicator function, taking the value of 1, if the jth individual belonged to the hth possible genotype in the F 2 population, otherwise taking the value of zero.
The parameters were Ω = (p AA , p Aa , p aa , δ) or Ω = (a, d, δ), with δ indicating the SDL location. The distribution of φ jh was described as where n was sample size. The likelihood defined with matrix notation was There were several ways to find the ML estimates (MLEs) of model parameters. We here adopted an EM algorithm [20] and treated φ jh as missing data. We regarded δ as constant for the moment, now the parameter set was θ = (a, d)'. For the EM algorithm, we needed to obtain the expectation of the complete data log-likelihood function, where the constant C didn't depend on the parameters of interest, and but did depend on the viability coefficients and map distance between adjacent markers, which could be determined by Zhu et al [18]. The EM algorithm was described as follows.

M-step
The MLEs of parameters were obtained by the Fisher-scoring algorithm as it was impossible to get their explicit solutions [22]. The θ could be updated by where S(θ (0) ) was the score function, and I was the Fisher information matrix (more details were given in Appendix). And θ (1) would replace θ (0) in all subsequent estimating steps, and the procedure was iterated until the convergence occurred. The converged θ (1) was the MLEs of θ in this M-step.
The E and M steps were iterated until the convergence occurred.
The MLE for the SDL position could be obtained by examining the likelihood-ratio profile along the chromosome as was commonly done in interval mapping of QTL [9].
Following parameter estimation, we tested an overall null hypothesis that was no effect of SDL at the locus of interest (δ). The null hypothesis was formulated as H 0 : a = d = 0.0, which was tested using the likelihood-ratio (LR) test statistic: Under the null hypothesis, the statistic LR approximately followed chi-square distribution with two degrees of freedom.
The critical value for power calculation was determined by computing 1,000 permutations [23], the experiment-wise type I error was set at 5%, and the confidence interval of an SDL location was determined by the bootstrapping method [24].

Simulation model
We simulated one chromosome of 100 cM (or 50 cM) long covered by m evenly spaced codominant markers (m = 6, 11 or 21) and put a single SDL at position 25 cM (another SDL was put at position 75 cM if necessary). The dominance ratio of the SDL was denoted by dr = d/a. Given the broad heritability (h 2 ) and dr, the additive and dominant effects could be obtained using numerical algorithm [25]. Based on the method described in Luo et al. [14], all genotypes of both distorted markers and SDL for each individual in an F 2 population were simulated. All simulations were replicated 100 or 1000 times depending on the purpose of the analyses. Empirical power was calculated by counting the number of runs in which test statistics were greater than the critical values [26].

Effects of various factors on SDL mapping
In this simulated experiments, the effects of sample size, SDL heritability and marker interval length on SDL mapping were studied, respectively. The performance of the proposed method was evaluated by statistical power, average and standard deviation of estimates with 100 replicates. All parameters and results were listed in Table 1. The results showed the general behavior of QTL mapping, i.e., the estimate for each parameter was very close to its corresponding true value, the power and the precision for SDL mapping increased with the increase in sample size and SDL heritability, respectively. However, marker interval length had slight effect on the power under the three levels studied.

Mapping multiple SDL
Similar to the interval mapping procedure of Lander and Botstein [9], the single-locus model for SDL mapping was used to search for multiple loci. Eleven markers were evenly placed on a simulated chromosome of length 100 cM. Two SDL each with a 0.5 dominance-ratio and a 0.15 heritability were respectively located at positions 25 cM and 75 cM on the simulated chromosome. One hundred independent simulation runs were performed for a sample size of 200. The results were listed in Table 2. Both loci were identified at almost 100% power. The results from simulation experiments demonstrated that the new method based on single-SDL model may be considered as an approximate approach to search for multiple loci if the SDL are sufficiently separated by markers.

A working example
As a demonstration of the proposed method in this paper, we re-analyzed a sample dataset (the source filename: sample.raw) in the MAPMAKER/QTL software [27]. It consisted of 333 F 2 individuals from a cross between two inbred lines in tomato. Each plant was genotyped for 12 marker loci that were divided into two linkage groups. Single-marker chi-square test showed that 5 and 2 markers on the first and second linkage groups deviated from Mendelian segregation ratios, respectively (data not shown). Given the reconstructed linkage maps using the method of Zhu et al. [18], 1000 simulated datasets without segregation distortion were simulated and used to determine the critical value [23]. The confidence interval of a SDL location was determined by the Bootstrap method [24].
The map distances between consecutive markers were calculated twice with and without considering SDL. The former was corrected map distance obtained from the method of Zhu et al. (2007) [18]; and the latter was uncorrected one using the Mapmaker/EXE 3.0 software [27]. The results were listed in Table 3. The results showed that the corrected map distances differed from the uncorrected ones when there were distorted markers. The genetic reason of these inconsistencies would be discussed in the following section. Using the proposed method here, a total of four SDL were mapped (Table 4, Fig 1). Two SDL were on the first linkage group and the others on the second one. The genetic parameters for the four SDL were listed in Table 4. The results showed that the distortion was stronger for the first linkage group than for the second one (Fig 1). It resulted in a maximum difference between the corrected and uncorrected map distances for the first marker interval on the first linkage group. Moreover, two linked SDL on the second linkage group also gave rise to two big differences ( Table 3). As compared to a single SDL, therefore, linked SDL had a larger effect on the estimate of map distance.

Effect of genetic model of SDL on the estimation of map distance
In this section, our purpose was to make clear the genetic reason for the inconsistencies between corrected and uncorrected map distances when there were distorted markers. Six evenly spaced codominant markers were simulated on a single-chromosome segment of length 50 cM. Two linked SDL with locations at positions 10 and 20 cM (exactly the 2nd and 3rd marker loci) were simulated on the simulated chromosome. One hundred simulation runs were performed for a sample size of 300. Each of datasets was analyzed twice by the method of Zhu et al. (2007) [18] and the Mapmaker/EXE 3.0 software [27]. The former was corrected map distance and the latter uncorrected one. For an additive-dominant model, all genetic parameters and the results were listed in Table 5. Results showed that uncorrected map distance was underestimated for most cases, overestimated for opposite dominant effects, and unbiased for all negative additive effects. The results from the real dataset analysis above partly confirmed the result that opposite dominant effects of the two linked SDL on the second linkage group (Table  4) gave rise to the overestimation (Table 3). For an epistatic model, all genetic parameters and the results were listed in Table 6. Results showed that uncorrected map distance was underestimated for most situations, overestimated for negative additive-by-additive or negative dominant-by-dominant effects, and unbiased for additive-bydominant effect. As we expected, corrected genetic distances were unbiased when considering SDL (Table 5 and  6). Hence, corrected linkage maps were recommended to

Discussion
For SDL mapping, most researchers concentrate their attention upon detecting and testing either the selection coefficients or the degree of dominance under the fitness model [7,10,11]. Luo et al. [14] pioneered in the development of SDL mapping under a liability model. Zhu et al. [18] proposed a new method for the reconstruction of linkage maps with distorted, dominant and missing markers. Under the liability model, we developed a method to simultaneously estimate the position and the effects of SDL as well as the recombination fractions between adjacent markers. This approach remains the merits of Luo et al. [14] but differs from others in several aspects. Firstly, it combines the detection of SDL with the reconstruction of marker linkage map. The position and the effect of SDL can be estimated along with the selection coefficient and the degree of dominance. Then, the proposed method may be used to elucidate the relationship between the viability selection and genetic linkage. Thirdly, the likelihood function is involved in the distribution of genotypes of SDL rather than that of marker genotypes in the previous studies [11,28]. Finally, we adopted an EM algorithm rather than the Simplex procedure to estimate the genetic parameters. Of course, we should notice one common assumption of the mentioned-above approaches that marker segregation distortion is caused by some genetic or viability reasons. For genetic reason, there are two different mechanisms for segregation distortion, one at the gametic level and the other at the zygotic level. In both cases, observable phenotypes are distorted for marker loci in the chromosomal region close to the SDL. Thus the two mechanisms are included in our proposed method. Although we have no way to distinguish them in SDL mapping, the results from the genotype and allele tests [29] for the marker closest to the SDL can be used to infer the presence of zygotic or gametic viability selection in an F 2 population but not in backcross, double haploid and recombinant inbred line populations. Moreover, it should be noted that genetic linkage between distorted markers has been carefully discussed in Wu et al. (2007) [30].
There are two primary routes by which selection can affect the extent of linkage disequilibrium [31]. The first is a hitchhiking effect, in which an entire haplotype that flanks a favored variant can be rapidly swept to high frequency or even fixation [32]. The second way in which selection can affect linkage disequilibrium is through epistatic selection for combination of alleles at two or more loci on the same chromosome [33]. This selection form leads to the association of the particular alleles at different loci. The major difficulty in linkage disequilibrium-based mapping is to quantify the relationship between recombination fraction and linkage disequilibrium measurement. Our analyses are confined to exclude the factors that influence linkage disequilibrium except linkage and selection. We first combine the viability selection with quantitative genetics model, and then explore the relationship between genetic modes of the viability genes and the estimates of the recombination fraction. The simulation studies indicated that most of the genetic modes of the viability genes at the two linked SDL may result in underestimation of genetic distance. We hope that the tentative attempt will make for elucidating the genetic relationship between viability selection and genetic linkage.
The likelihood-ratio (LR) score profile for segregation distor-tion loci mapping against the tomato genome Figure 1 The likelihood-ratio (LR) score profile for segregation distortion loci mapping against the tomato genome. The tomato genome derived from Mapmaker 3.0 software (Lander et al. 1987) was composed of two linkage groups. In addition, it will be interesting and challenging to combine the SDL analysis with QTL mapping to see what the effects of distorted markers has on the results of QTL mapping. While doing this, one may take a risk of detecting false QTL not due to their genetic effects on the quantitative traits but due to violation of the Mendelian segregation law. It will be a great breakthrough in quantitative genetics area if we can develop a method to separate the effects of viability loci from the effects of QTL [14]. By reason of the complexity of the combined analysis, the related investigations will be discussed separately elsewhere.

Conclusion
Our results suggested that the proposed method can serve as a powerful alternative to existing methods. Under the liability model, the new method can simultaneously estimate the position and the effects of SDL as well as the recombination fractions between adjacent markers, and also be used to probe into the genetic mechanism for the bias of uncorrected map distance and to elucidate the relationship between the viability selection and genetic linkage.

Authors' contributions
CZ designed and carried out the simulation study, and drafted the manuscript. YMZ conceived of the study, participated in the design, coordinated it and revised the manuscript. All authors read and approved the final manuscript.

Appendix: Fisher-scoring algorithms for obtaining MLEs of parameters
The Fisher-scoring algorithm can be used to estimate parameters in the M-step of EM algorithm. Let θ = (a, d) T . The newly estimated θ can be expressed by the score-function vector S and the Fisher information matrix I, where S = ∂lnL/∂θ = (∂lnL/∂a, ∂lnL/∂d) T is score function, and is Fisher information matrix.
More specifically, the score function and the Fisher information index of the expected complete data log-likelihood can be derived using