Statistical Framework
As the framework for our comparison, and in conjunction with the previous simulations and conclusions of Rebaï et al. [4], we consider a backcross experimental design originating from a cross of two homozygous inbred lines that differ in the trait of interest, producing heterozygous lines that are backcrossed to one of the initial homozygous parental lines. We examine both normal and binomial phenotypic distributions. In general, we denote the markers as $M_1, \ldots, M_k$, where $k$ is the number of markers being examined, and allow each marker to have two alleles, $M_{11}, M_{12}, \ldots, M_{k1}, M_{k2}$. The $2^k$ phenotypic means are differentiated via subscripts (e.g., $\mu_{M_{11} \ldots M_{k1}/M_{11} \ldots M_{k1}}$ or $\mu_{M_{11} \ldots M_{k1}/M_{12} \ldots M_{k2}}$), and the frequencies of these classes are denoted as $p_{11}, p_{21}, \ldots, p_{k1}$ under the binomial scenario (i.e., the corresponding mean is $np_{11}$).
Single Marker Model and Hypotheses
A simple linear regression backcross model is employed for single marker QTL detection:

$$Y_j = \beta_0 + \beta^* X^*_j + \varepsilon_j, \quad j = 1, \ldots, n \qquad (1)$$
where $Y_j$ is the quantitative trait value, $X^*_j$ is an indicator variable that denotes the state of a particular marker, $\beta_0$ is the overall mean, and $\beta^*$ is the effect of an allelic substitution at the marker. Ideally, if the marker and QTL are completely linked, the effect of an allelic substitution is the effect of the QTL. If $k$ markers are considered independently, $k$ linear regression models can be considered (i.e., one for each marker $M_1, M_2, \ldots, M_k$) by denoting the allelic substitution associated with marker $M_i$ as $\beta^* = \beta_i$, for $i = 1, \ldots, k$. For $k = 2$ markers, we denote the allelic substitution associated with marker $M_1$ as $\beta^* = \beta_1$, where $\beta_0 = \mu_{M_{11}/M_{11}}$ and $\beta_1 = \mu_{M_{11}/M_{12}} - \mu_{M_{11}/M_{11}}$; and the allelic substitution associated with marker $M_2$ as $\beta^* = \beta_2$, where $\beta_0 = \mu_{M_{21}/M_{21}}$ and $\beta_2 = \mu_{M_{21}/M_{22}} - \mu_{M_{21}/M_{21}}$.
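As a rough illustration of the single marker model in Equation 1, the sketch below fits the regression by least squares for one marker. The 0/1 coding of the marker indicator (0 for the $M_{11}/M_{11}$ class, 1 for the $M_{11}/M_{12}$ class), the function name `single_marker_fit`, and the toy data are assumptions made for this example and are not taken from the original analysis.

```python
from statistics import mean

def single_marker_fit(x, y):
    """Least-squares fit of y = b0 + b1 * x for a 0/1 marker indicator x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx             # estimated effect of an allelic substitution
    b0 = y_bar - b1 * x_bar    # estimated mean of the class coded 0
    return b0, b1

# Hypothetical toy data: x = 0 for M11/M11, x = 1 for M11/M12 (assumed coding).
x = [0, 0, 0, 1, 1, 1]
y = [10.0, 11.0, 12.0, 13.0, 15.0, 14.0]
b0, b1 = single_marker_fit(x, y)
print(b0, b1)
```

With this coding, the fitted intercept is the mean of the class coded 0 and the fitted slope is the difference between the two class means, which mirrors the definitions of $\beta_0$ and $\beta_1$ given above.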
A compound hypothesis testing the effect of an allelic substitution at either or both of these two independent markers is

$$H_0: \beta_1 = 0 \text{ and } \beta_2 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0 \text{ or } \beta_2 \neq 0.$$

Rejection of this compound null hypothesis indicates an association between a QTL and either or both of the markers, $M_1$ and $M_2$, hence the term intersection test. From a statistical perspective the relative position of the two markers is irrelevant. However, comparing this test to a two marker model carries the implicit assumption that the markers considered form an interval, that is, that they are adjacent to one another. This marks a departure from traditional single marker analysis, where marker order is not considered. To define an overall level $\alpha$ test, the significance level of the individual tests must be adjusted to account for multiple testing. There are many ways to do so. Assuming the markers are independent, the Bonferroni correction can be applied [9]. The Bonferroni correction is conservative for the intersection test, and a lack of independence between markers would tend to make it more difficult for the intersection test to reject.
More generally, for $k$ markers, the compound hypothesis testing the effect of an allelic substitution at any of the independent markers $M_1, \ldots, M_k$ is

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0 \quad \text{versus} \quad H_1: \beta_i \neq 0 \text{ for at least one } i.$$

Rejection of this compound null hypothesis indicates an association between a QTL and at least one of the markers $M_1, \ldots, M_k$. To define an overall level $\alpha$ test using a Bonferroni correction [9], each $\beta^*$ is tested at an adjusted significance level of $\alpha/k$. An association between a QTL and a marker is then indicated when the individual single marker test rejects the null hypothesis at the adjusted level.
The practical advantage of the intersection test is its computational simplicity: it requires only the single marker test statistics, with a correction for multiple testing.
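The decision rule of the intersection test can be sketched in a few lines. The function name `intersection_test` and the example p-values are hypothetical; the sketch only assumes that one p-value per marker is already available from the single marker regressions and that the Bonferroni-adjusted level is the overall level divided by the number of markers.

```python
def intersection_test(p_values, alpha=0.05):
    """Bonferroni-adjusted intersection test over k single marker p-values.

    Returns (reject, adjusted_level, flagged), where flagged lists the
    indices of the markers whose own test rejects at the adjusted level.
    """
    k = len(p_values)
    adjusted_level = alpha / k
    flagged = [i for i, p in enumerate(p_values) if p < adjusted_level]
    return bool(flagged), adjusted_level, flagged

# Hypothetical p-values for two adjacent markers (illustrative only).
reject, level, flagged = intersection_test([0.010, 0.200])
print(reject, level, flagged)
```

The compound null hypothesis is rejected as soon as any single marker test rejects at the adjusted level, and the flagged indices identify which markers drive the rejection.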
Two Marker Regression Model and Test of the Corresponding Interval
Extending the backcross notation defined previously, a multiple linear regression model based on two markers can be employed for QTL detection. The model is defined as

$$Y_j = \beta_0 + \beta_1 X_{1j} + \beta_2 X_{2j} + \beta_3 X_{3j} + \varepsilon_j, \quad j = 1, \ldots, n$$

where $X_{1j}$ and $X_{2j}$ are the genotypic states of the respective markers $M_1$ and $M_2$, with respective allelic substitution effects $\beta_1$ and $\beta_2$, and $X_{3j}$ is the combined genotypic state of markers $M_1$ and $M_2$, with allelic substitutions at both markers having effect $\beta_3$. Notably, under selective genotyping the information in $\beta_3$ is maximized.
In other words,
Based upon this two marker model with four parameters, the hypothesis employed to perform a level $\alpha$ test for association between a trait and the marker loci $M_1$ and $M_2$ is the test of $\beta_3$, where

$$H_0: \beta_3 = 0 \quad \text{versus} \quad H_1: \beta_3 \neq 0.$$

The null hypothesis for this test is that there is no association between either marker ($M_1$ or $M_2$) and the trait. A similar set of hypotheses follows for an $F_2$ experimental design.
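To make the two marker model concrete, the following sketch computes its least-squares estimates for a backcross. It rests on an assumption that is not spelled out in the text here: both marker indicators are coded 0/1 and the combined state is taken to be their product, so that $X_{3j} = 1$ only when both markers carry the substitution. Under that coding the model has one parameter per two-marker genotype class, and the least-squares estimates reduce to contrasts of the four class means. The function name `two_marker_fit` and the toy data are hypothetical.

```python
from statistics import mean

def two_marker_fit(x1, x2, y):
    """Fit y = b0 + b1*x1 + b2*x2 + b3*(x1*x2) for 0/1 indicators x1, x2.

    Assumes X3 = X1 * X2 (an illustrative coding). With four classes and
    four parameters, least squares reproduces the four class means.
    Every genotype class must be observed at least once.
    """
    cells = {(a, b): [] for a in (0, 1) for b in (0, 1)}
    for a, b, yi in zip(x1, x2, y):
        cells[(a, b)].append(yi)
    m = {key: mean(values) for key, values in cells.items()}
    b0 = m[(0, 0)]
    b1 = m[(1, 0)] - m[(0, 0)]
    b2 = m[(0, 1)] - m[(0, 0)]
    b3 = m[(1, 1)] - m[(1, 0)] - m[(0, 1)] + m[(0, 0)]
    return b0, b1, b2, b3

# Hypothetical toy data covering all four two-marker genotype classes.
x1 = [0, 0, 1, 1, 0, 0, 1, 1]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
y = [1.0, 2.0, 3.0, 7.0, 1.0, 2.0, 3.0, 7.0]
b0, b1, b2, b3 = two_marker_fit(x1, x2, y)
print(b0, b1, b2, b3)
```

Only the estimate of $\beta_3$ enters the two marker test; a p-value for it would still have to be obtained separately, for example by permutation.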
This model parameterization differs from the least squares interval mapping approach first introduced by Knott and Haley [2]. In the parameterization proposed here, only one test is performed for the pair of markers. In contrast, the regression based interval mapping approach [2] recalculates the values of the independent variables for each putative position in the interval: flanking markers are used to define the coefficients of the regression as mean, additive, or dominance effects, and for $s$ steps along the interval between two markers $M_1$ and $M_2$, values of $X$ are calculated according to the conditional probability of a QTL at that location.
The regression based interval mapping parameterization thus provides a mechanism to test for additive and dominance effects using tests of the regression parameters. In our parameterization, the tests of the regression coefficients are tests for detection. Thus, the two parameterizations have different null hypotheses for the tests of the regression coefficients and are not directly comparable in terms of power. We chose the alternative parameterization so that the interpretation of the tests is comparable in the single marker and two marker regression models, which allows the power of the two tests to be compared directly.
Simulations
Data were simulated for two marker backcross and $F_2$ populations with binomial trait distributions and for two marker backcross populations with normal trait distributions. A total of 339 parameter combinations were examined (Table 1). For each combination of parameters, 1000 data sets were simulated. Traits were simulated from a binomial distribution $\mathrm{Bin}(n, p)$ with sample sizes $n = 100$ and $n = 500$, and from a normal distribution $N(\mu, 1.0)$ with $n = 500$. The effect of the binary trait [13] varied based on $\mu = np_i$ (Table 1). The binomial probabilities $p_1$, $p_2$, and $p_3$ represent the probability that a binary trait is present given a specific BTL genotype (GT), or the penetrance of the trait for the specific genotypes $Q_1/Q_1$, $Q_1/Q_2$, and $Q_2/Q_2$, respectively. The location of the locus relative to marker loci $M_1$ and $M_2$ also varied. Similarly, the effect under the normally distributed phenotype was allowed to vary (Table 2) over 75 parameter combinations. The effect size is the difference between the penetrances (for binary traits) or between the means (for normally distributed traits). For each phenotypic trait distribution and each parameter combination (Tables 1 and 2), we analyzed 1000 simulated data sets via least squares using both the single marker regression model and the two marker regression model.
For the intersection test, the null hypothesis was rejected when the empirical p-value for either single marker regression test statistic was less than $\alpha/2 = 0.025$ (Bonferroni adjustment). For the comparable two marker test (i.e., $\beta_3 = 0$), the null hypothesis was rejected when the empirical p-value was less than $\alpha = 0.05$. Under each parameter combination, statistical power was estimated from the 1000 simulated data sets as the proportion of times the empirical (permutation) p-values were less than the specified $\alpha$ level.
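The power calculation for the intersection test can be sketched as follows. This is a simplified stand-in and not the simulation design of Tables 1 and 2: it assumes a normally distributed trait with unit variance, a QTL located exactly at marker 1, and a second marker that recombines with the first at a hypothetical rate `r`. All function names, the permutation statistic (the absolute difference in class means, which is the absolute least-squares slope for a 0/1 indicator), and the small default sizes are assumptions chosen to keep the example fast.

```python
import random

def abs_slope(x, y):
    """Absolute least-squares slope of y on a 0/1 indicator x."""
    y1 = [yi for xi, yi in zip(x, y) if xi == 1]
    y0 = [yi for xi, yi in zip(x, y) if xi == 0]
    if not y0 or not y1:
        return 0.0  # degenerate sample: only one genotype class present
    return abs(sum(y1) / len(y1) - sum(y0) / len(y0))

def permutation_p(x, y, n_perm, rng):
    """Empirical p-value from shuffling the trait values across individuals."""
    observed = abs_slope(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs_slope(x, y_perm) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def simulate_backcross(n, effect, r, rng):
    """Hypothetical generator: QTL at marker 1, marker 2 recombines at rate r."""
    x1 = [rng.randint(0, 1) for _ in range(n)]
    x2 = [g if rng.random() >= r else 1 - g for g in x1]
    y = [rng.gauss(effect * g, 1.0) for g in x1]
    return x1, x2, y

def intersection_power(n_sets, n, effect, r, n_perm, alpha=0.05, seed=1):
    """Proportion of simulated data sets in which the intersection test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sets):
        x1, x2, y = simulate_backcross(n, effect, r, rng)
        p1 = permutation_p(x1, y, n_perm, rng)
        p2 = permutation_p(x2, y, n_perm, rng)
        if min(p1, p2) < alpha / 2:  # Bonferroni-adjusted level, two markers
            rejections += 1
    return rejections / n_sets

power = intersection_power(n_sets=20, n=60, effect=2.0, r=0.1, n_perm=99)
print(power)
```

Increasing `n_sets` to 1000 and `n` to 100 or 500 would bring the sketch closer to the scale described above, at a correspondingly higher running time.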
Drosophila Analysis
The population of Drosophila melanogaster used in our analysis was a set of 98 recombinant inbred lines (RILs) derived from a cross of two isogenic lines, as described in Wayne et al. [8], and the trait analyzed was ovariole number. There were 76 informative markers on 4 chromosomes. The markers were the cytological map positions of the insertion sites of roo transposable elements, with the exception of the fourth chromosome, where a visible mutation (spa) was used as a marker [12]. A complete linkage map was obtained for chromosome 1 (the X) and chromosome 3, with 15 adjacent marker pairs (16 markers) on chromosome 1 and 36 adjacent marker pairs (37 markers) on chromosome 3. There was a centromeric break in the genetic map for chromosome 2, such that there were 18 adjacent pairs (19 markers) on the left arm and 2 adjacent pairs (3 markers) on the right arm.
To compare the intersection test to the two marker test, the 71 pairs of markers identified above were examined. For each pair, the two marker regression with the test of the β3 parameter was conducted at α = 0.05. The two individual markers were then separately modeled in a linear regression model (see Equation 1), and the intersection test was conducted. For the 71 unique pairs of markers, concordance between the intersection test and two marker test was estimated using the Kappa coefficient, and McNemar's test [14] was conducted to determine whether systematic differences existed between the two methods. Regression based interval mapping was performed according to the Haley and Knott parameterization [1, 2]. Analysis was conducted using S-PLUS 2000 (Insightful Corp.).
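The concordance measures can be illustrated with a short sketch that takes the paired reject or fail-to-reject decisions of the two methods. The function name `kappa_and_mcnemar` and the ten example decisions are hypothetical, and the McNemar statistic is computed without a continuity correction, which is an assumption because the variant used in the analysis is not stated.

```python
import math

def kappa_and_mcnemar(test_a, test_b):
    """Cohen's kappa and McNemar's test for paired boolean decisions.

    test_a, test_b: lists of booleans (True = null hypothesis rejected).
    Kappa is undefined when the expected agreement equals 1.
    """
    n = len(test_a)
    both = sum(1 for a, b in zip(test_a, test_b) if a and b)
    only_a = sum(1 for a, b in zip(test_a, test_b) if a and not b)
    only_b = sum(1 for a, b in zip(test_a, test_b) if b and not a)
    neither = n - both - only_a - only_b
    p_obs = (both + neither) / n
    p_exp = ((both + only_a) * (both + only_b)
             + (neither + only_b) * (neither + only_a)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    discordant = only_a + only_b
    if discordant == 0:
        return kappa, 0.0, 1.0  # no discordant pairs, no evidence of asymmetry
    stat = (only_a - only_b) ** 2 / discordant
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return kappa, stat, p_value

# Hypothetical decisions for ten marker pairs (illustrative only).
a = [True] * 4 + [True] * 3 + [False] * 1 + [False] * 2
b = [True] * 4 + [False] * 3 + [True] * 1 + [False] * 2
kappa, stat, p_value = kappa_and_mcnemar(a, b)
print(kappa, stat, p_value)
```

Kappa summarizes agreement beyond chance, while McNemar's test uses only the discordant pairs to ask whether one method rejects systematically more often than the other.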