Multivariate linear model
Consider n individuals from a backcross (BC) population derived from two inbred lines, with observations on densely distributed codominant markers and on m quantitative traits. Suppose that the maximum number of QTL is p. The phenotypic value $y_{ki}$ of individual i for the k-th trait can then be described by the following multivariate linear model:
$$y_{ki} = b_{k0} + \sum_{j=1}^{p} \gamma_{kj}\, x_{kij}\, b_{kj} + e_{ki} \tag{1}$$
for i = 1, 2, ..., n and k = 1, 2, ..., m, where $\gamma_{kj}$ is the model indicator variable, indicating whether the j-th QTL of the k-th trait is included in (1) or excluded from (0) the model; $b_{k0}$ is the population mean; $b_{kj}$ is the QTL effect; $x_{kij}$ is the QTL genotype, with $x_{kij} = 1$ if the genotype is homozygous and $x_{kij} = -1$ otherwise; and $e_{ki}$ is the residual error, assumed to follow a multivariate normal distribution. In matrix notation, equation (1) can be expressed as:
$$y_i = b_0 + \sum_{j=1}^{p} \Phi_j X_{ij} b_j + e_i \tag{2}$$
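for i = 1, 2, ..., n, where $y_i = [y_{1i}, y_{2i}, \ldots, y_{mi}]^T$, $b_0 = [b_{10}, b_{20}, \ldots, b_{m0}]^T$, $b_j = [b_{1j}, b_{2j}, \ldots, b_{mj}]^T$ and $e_i = [e_{1i}, e_{2i}, \ldots, e_{mi}]^T$ are all (m × 1) column vectors. The QTL genotype matrix $X_{ij}$ (equation 3) and the model indicator matrix $\Phi_j$ (equation 4) are both (m × m) diagonal matrices:
$$X_{ij} = \mathrm{diag}(x_{1ij}, x_{2ij}, \ldots, x_{mij}) \tag{3}$$
$$\Phi_j = \mathrm{diag}(\gamma_{1j}, \gamma_{2j}, \ldots, \gamma_{mj}) \tag{4}$$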
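As an illustration of the matrix form, the following minimal sketch computes the expected phenotype vector of one individual, assuming equation (2) reads $y_i = b_0 + \sum_j \Phi_j X_{ij} b_j + e_i$ with diagonal $\Phi_j$ and $X_{ij}$. The function name `model_mean` and all numerical values are hypothetical and chosen only for illustration.

```python
def model_mean(b0, effects, genotypes, indicators):
    """Mean of y_i under the assumed form of equation (2).

    b0         : list of m population means
    effects    : p lists of m QTL effects (the vectors b_j)
    genotypes  : p lists of m genotype codes, +1 or -1 (diagonal of X_ij)
    indicators : p lists of m inclusion flags, 0 or 1 (diagonal of Phi_j)
    """
    m = len(b0)
    mean = list(b0)
    for b_j, x_ij, phi_j in zip(effects, genotypes, indicators):
        for k in range(m):
            # Phi_j and X_ij are diagonal, so the matrix product reduces to
            # an element-wise product for each trait k.
            mean[k] += phi_j[k] * x_ij[k] * b_j[k]
    return mean


# Hypothetical toy values: m = 2 traits, p = 2 candidate QTL.
b0 = [10.0, 5.0]
effects = [[1.0, 0.5], [2.0, -1.0]]
genotypes = [[1, -1], [-1, -1]]
indicators = [[1, 1], [0, 1]]
print(model_mean(b0, effects, genotypes, indicators))  # [11.0, 5.5]
```

Because both matrices are diagonal, no general matrix multiplication is needed; a QTL with indicator 0 for a trait simply drops out of that trait's mean.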
Prior specification
The prior distribution of each QTL effect vector $b_j$ is multivariate normal, $b_j \sim N(0, \Sigma_{b_j})$, where the covariance matrix $\Sigma_{b_j}$ is a hyper-parameter whose form is a direct extension of the one used in Bayesian single-trait analysis [15]. The importance of the choice of this hyper-parameter will be discussed later. In a large backcross population and under the definition of $x_{kij}$ (-1 or 1), it simplifies to $\Sigma_{b_j} = \Sigma_e$. The prior of the covariance matrix of the residual error is inverse Wishart, $\Sigma_e \sim \mathrm{Wishart}^{-1}(v_e, S_e)$, where $v_e$ and $S_e$ are the prior degrees of freedom and the prior covariance matrix of the residual error, respectively; they can be obtained from another method, such as CIM-based multitrait analysis [2]. The prior distribution of the population mean $b_0$ is normal, with mean and variance equal to those calculated from the phenotypic values. The prior distribution of the QTL position $\lambda_{kj}$ is uniform between the two flanking markers, $p(\lambda_{kj}) = 1/d_j$, where $d_j$ is the length of the interval to which the j-th QTL is confined. Assuming that epistatic effects are absent, the prior inclusion probability for the j-th effect can be expressed as $p(\gamma_{kj} = 1) = 1 - [1 - l_k/L_k]^{1/N}$ (see also [15]), where $l_k$ is the prior expected number of main-effect QTL, which can be roughly estimated from standard genome scans; N is the number of possible main effects for each QTL, equal to 1 in a BC family [15]; and $L_k$ is the upper bound on the number of QTL. In our simulation study $L_k$ equals the number of marker intervals, whereas in the approach suggested by Yi [15] $L_k$ is taken as 3 + 3·, which reduces the model space dramatically [15].
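The prior inclusion probability can be evaluated directly. The sketch below assumes the expression has the form $1 - [1 - l_k/L_k]^{1/N}$, as in [15]; the function name and the example numbers (3 expected QTL, 60 marker intervals) are hypothetical.

```python
def prior_inclusion_probability(expected_qtl, upper_bound, n_effects=1):
    """Prior probability that one candidate effect is included in the model.

    expected_qtl : prior expected number of main-effect QTL (l_k)
    upper_bound  : upper bound on the number of QTL (L_k)
    n_effects    : number of possible main effects per QTL (N; 1 in a backcross)
    """
    if not 0 <= expected_qtl <= upper_bound:
        raise ValueError("expected_qtl must lie between 0 and upper_bound")
    return 1.0 - (1.0 - expected_qtl / upper_bound) ** (1.0 / n_effects)


# Hypothetical setting: 3 expected QTL for a trait, 60 marker intervals.
print(prior_inclusion_probability(3, 60))
```

With N = 1, as in a backcross, the expression reduces to $l_k/L_k$, so a smaller upper bound $L_k$ gives each candidate locus a larger prior inclusion probability.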
Joint posterior density
The observable variables include the phenotypic values, $y = \{y_i\}$, and the marker information, $m = \{m_{i,j}\}$. The unobservable variables include the population mean, $b_0$; the QTL effects, $b = \{b_j\}$; the QTL genotypes, $X = \{X_{ij}\}$; the model indicator variables, $\Phi = \{\Phi_j\}$; the (co)variance of the residual error, $\Sigma_e$; and the QTL positions, $\lambda = \{\lambda_j\}$. Let θ be the vector of hyper-parameters and $\Theta = \{b_0, b, \Sigma_e, \lambda, X, \Phi\}$; the joint prior density of the unobservable variables is then denoted by p(Θ|θ). The joint posterior probability of Θ, given the observable variables y and m, can be expressed as:
$$p(\Theta \mid y, m) \propto p(\Theta \mid \theta) \cdot p(y, m \mid \Theta), \tag{5}$$
where p(y, m|Θ) is the likelihood, which can be written as:
$$p(y, m \mid \Theta) = p(y \mid \Theta) \cdot p(m \mid \Theta), \tag{6}$$
where p(y|Θ) is a multivariate normal density and p(m|Θ) can be derived from a Markov model [14].
MCMC sampling
The MCMC algorithm generates samples from Markov chains that converge to the posterior distribution of the parameters, without the constant of proportionality having to be calculated. Summary statistics of the posterior distribution can then be calculated from these posterior samples. The MCMC algorithm proceeds as follows:
a. Initialize all parameters with values in their legal domain.
b. Update the population mean $b_0$.
c. Update the QTL effect vectors $b_j$.
d. Update the variance-covariance matrix $\Sigma_e$ of the residual error.
e. Update the QTL genotype matrices $X_{ij}$ and the QTL location vectors $\lambda_j$ jointly, for j = 1, 2, ..., p.
f. Update the model indicator matrices $\Phi_j$.
The conditional posterior distribution of the population mean b0 is multivariate normal with mean
(7)
and variance-covariance matrix
(8)
The conditional posterior distribution of the QTL effect vector $b_j$ is multivariate normal with mean
(9)
and variance-covariance matrix
(10)
The conditional posterior distribution of the residual (co)variance matrix $\Sigma_e$ is inverse Wishart,
(11)
where the degrees of freedom are $df_e = n$.
In step e, the QTL locations and QTL genotype matrices are updated jointly. For locus j, we first sample a new QTL position for each trait from their prior distribution (described later), then sample the QTL genotype matrices at the new positions using equation (15), and finally update them with the efficient Metropolis-Hastings algorithm [20, 21]. Because the sampling of $X_{ij}$ is the most complicated part, we describe it first. Because the QTL genotype $x_{kij}$ takes one of two possible values (-1 or 1) in a BC line, when m traits are investigated jointly $X_{ij}$ has $2^m$ possible configurations, whose general pattern can be written as:
$$Z = \mathrm{diag}(z_1, z_2, \ldots, z_m) \tag{12}$$
where $z_1, z_2, \ldots, z_m \in \{-1, 1\}$. For clarity, we omit the subscript ij from $Z$ in this and the following formulas; it denotes the genotype matrix of the i-th individual at the j-th locus. Because the QTL genotypes $x_{kij}$ of the i-th individual in the j-th interval may be correlated across traits, the joint prior probability of the genotype matrix $X_{ij}$ cannot simply be expressed as the product
$$p(X_{ij} = Z \mid m_{i,j}, \lambda_j, m_{i,j+1}) = \prod_{k=1}^{m} p(x_{kij} = z_k \mid m_{i,j}, \lambda_{kj}, m_{i,j+1}) \tag{13}$$
Instead, it can be derived from the Markov model (see equation 14), assuming that the order of markers and QTL is $M_j Q_1 Q_2 \ldots Q_m M_{j+1}$ (see Figure 7), where $Q_1, Q_2, \ldots, Q_m$ denote the QTL affecting trait 1, trait 2, ..., trait m, respectively, in the j-th marker interval, and the indicator variables $x_{1ij}, x_{2ij}, \ldots, x_{mij}$ denote the genotypes of these QTL.
$$p(X_{ij} = Z \mid m_{i,j}, \lambda_j, m_{i,j+1}) = p(x_{1ij} = z_1 \mid m_{i,j}, \lambda_j, m_{i,j+1}) \prod_{k=2}^{m} p(x_{kij} = z_k \mid m_{i,j}, x_{1ij} = z_1, \ldots, x_{(k-1)ij} = z_{k-1}, \lambda_j, m_{i,j+1}) \tag{14}$$
If no segregation interference is considered, the joint prior probability can be factorized as in equation (14), and each term in equation (14) can be derived from the Haldane map function. Only the first term in equation (14) is conditional on the two flanking markers alone; the others are conditional not only on the two flanking markers but also on the genotypes of all QTL preceding the one of interest. If double recombination is ignored [2], each term in equation (14) can be inferred from only the genotype of the nearest locus on the left (marker or QTL) and the right marker, and equation (14) simplifies to:
$$p(X_{ij} = Z \mid m_{i,j}, \lambda_j, m_{i,j+1}) = p(x_{1ij} = z_1 \mid m_{i,j}, \lambda_j, m_{i,j+1}) \prod_{k=2}^{m} p(x_{kij} = z_k \mid x_{(k-1)ij} = z_{k-1}, \lambda_j, m_{i,j+1}) \tag{15}$$
Each term in equation (15) can be easily inferred.
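The chain of conditional probabilities can be sketched in a few lines. The example below is only an illustration under stated assumptions: genotypes of markers and QTL are coded +1/-1, recombination fractions come from the Haldane map function, and each QTL genotype is conditioned on the nearest locus to its left and on the right flanking marker. The function names, positions and interval length are hypothetical; the QTL are sorted by position first, so any ordering within the interval is handled.

```python
import math
from itertools import product


def haldane_recombination(distance_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def transition(left, right, distance_cm):
    """P(right genotype | left genotype) for two backcross loci coded +1/-1."""
    r = haldane_recombination(distance_cm)
    return 1.0 - r if left == right else r


def joint_genotype_prior(z, positions, left_marker, right_marker, interval_cm):
    """Joint prior of the QTL genotypes z (one per trait) in one marker interval.

    positions are distances in cM from the left marker, one per trait, and are
    assumed to lie strictly inside the interval.
    """
    # Ascertain the current order of the QTL along the interval.
    order = sorted(range(len(z)), key=lambda k: positions[k])
    prob = 1.0
    prev_geno, prev_pos = left_marker, 0.0
    for k in order:
        # P(x_k | nearest locus on the left, right marker) by Bayes' rule.
        numerator = (transition(prev_geno, z[k], positions[k] - prev_pos)
                     * transition(z[k], right_marker, interval_cm - positions[k]))
        denominator = transition(prev_geno, right_marker, interval_cm - prev_pos)
        prob *= numerator / denominator
        prev_geno, prev_pos = z[k], positions[k]
    return prob


# Hypothetical example: 3 traits, QTL order Q3, Q1, Q2 inside a 10 cM interval,
# left marker homozygous (+1) and right marker heterozygous (-1).
positions = [5.0, 8.0, 2.0]
total = sum(joint_genotype_prior(z, positions, 1, -1, 10.0)
            for z in product((-1, 1), repeat=3))
print(total)  # the 2^m joint prior probabilities should sum to about 1
```

Summing over all $2^m$ configurations is a convenient sanity check that the conditional terms have been assembled consistently.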
It is worth mentioning that we have assumed the order of markers and QTL to be $M_j Q_1 Q_2 \ldots Q_m M_{j+1}$, whereas in fact the order of the QTL may change in each round of updating. Therefore, we should first ascertain the order in each round and then construct the appropriate formula for the joint prior probability of the QTL genotype, $p(X_{ij} = \mathrm{diag}(z_1, \ldots, z_m) \mid m_{i,j}, \lambda_j, m_{i,j+1})$, according to the above rules. For example, consider three QTL, $Q_1$, $Q_2$ and $Q_3$, affecting three traits respectively, within one interval. If in a certain round the order of markers and QTL is $M_j Q_3 Q_1 Q_2 M_{j+1}$, then the joint prior probability of the QTL genotype can be written as:
$$p(X_{ij} = \mathrm{diag}(z_1, z_2, z_3) \mid m_{i,j}, \lambda_j, m_{i,j+1}) = p(x_{3ij} = z_3 \mid m_{i,j}, \lambda_j, m_{i,j+1})\; p(x_{1ij} = z_1 \mid x_{3ij} = z_3, \lambda_j, m_{i,j+1})\; p(x_{2ij} = z_2 \mid x_{1ij} = z_1, \lambda_j, m_{i,j+1})$$
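Once we obtain the joint prior probability of the QTL genotype, the joint conditional posterior probability of $X_{ij}$ can be expressed as:
$$p(X_{ij} = Z \mid y_i, \ldots) = \frac{p(X_{ij} = Z \mid m_{i,j}, \lambda_j, m_{i,j+1})\; p(y_i \mid X_{ij} = Z, \ldots)}{\sum_{Z'} p(X_{ij} = Z' \mid m_{i,j}, \lambda_j, m_{i,j+1})\; p(y_i \mid X_{ij} = Z', \ldots)} \tag{16}$$
where $Z = \mathrm{diag}(z_1, \ldots, z_m)$ is one of the $2^m$ possible genotype matrices, the sum runs over all $2^m$ of them, and $p(y_i \mid X_{ij} = Z, \ldots)$ is the likelihood, which follows a multivariate normal distribution,
$$p(y_i \mid X_{ij} = Z, \ldots) = N\Big(y_i;\; b_0 + \Phi_j Z b_j + \sum_{j' \neq j} \Phi_{j'} X_{ij'} b_{j'},\; \Sigma_e\Big) \tag{17}$$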
Once the $2^m$ posterior probabilities of the corresponding QTL genotype matrices have been calculated, we sample one genotype matrix according to these probabilities. We first construct the cumulative probability function F(d) by accumulating the $2^m$ probabilities in an arbitrary order for d = 1, 2, ..., $2^m$, with F(0) = 0, which defines a discrete distribution; we then sample a random number from the uniform distribution, u ~ U[0, 1]; finally, we compare u with F(d): if F(d - 1) < u ≤ F(d), the d-th genotype matrix is selected.
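The cumulative-probability draw described above can be sketched as follows. This is a minimal illustration, not the authors' code; the function name `sample_index` is hypothetical, and the weights are normalized inside the function so that unnormalized posterior values can be passed in.

```python
import random


def sample_index(weights, rng=random):
    """Draw d in 1..len(weights) with probability proportional to weights[d-1].

    Implements the rule F(d-1) < u <= F(d), where F is the cumulative
    distribution built by accumulating the normalized weights in order.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # rng.random() lies in [0, 1), so 1 - rng.random() lies in (0, 1].
    u = 1.0 - rng.random()
    cumulative = 0.0
    for d, w in enumerate(weights, start=1):
        cumulative += w / total
        if u <= cumulative:
            return d
    # Guard against rounding error leaving F(last) slightly below 1.
    return len(weights)


# Hypothetical posterior probabilities for the 2^2 = 4 genotype matrices.
rng = random.Random(2024)
draws = [sample_index([0.1, 0.6, 0.25, 0.05], rng) for _ in range(10000)]
print(draws.count(2) / len(draws))  # relative frequency of the second matrix
```

The order in which the probabilities are accumulated does not matter, as long as index d is mapped back to the same genotype matrix that contributed the d-th probability.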
The newly sampled QTL genotype matrices are only proposal values, which should be updated together with the proposal QTL position vector $\lambda_j = [\lambda_{1j}, \lambda_{2j}, \ldots, \lambda_{mj}]$ by the Metropolis-Hastings algorithm [20, 21]. For each trait, the new proposal position is sampled around the existing one from a uniform distribution, $\lambda_{kj}^* \sim U[\lambda_{kj} - \delta, \lambda_{kj} + \delta)$, where δ is a tuning parameter, usually taking a value of 1 or 2 cM. The new position vector is denoted by $\lambda_j^*$; the new QTL genotype matrix is then sampled conditionally on the new position using equation (16); finally, the position vector and genotype matrices are accepted jointly with probability min(1, α), where
(18)
$p(\lambda_j^*)$ and $p(\lambda_j)$ are the prior probabilities of the new position $\lambda_j^*$ and the old position $\lambda_j$, respectively, and cancel out under the uniform prior distribution; $p(X_{ij} \mid \lambda_j, \ldots)$ is the prior probability of the QTL genotype conditional on the new or old position, as described in detail above; and the remaining terms form the proposal ratio.
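The proposal and acceptance mechanics of this step can be sketched as below. The sketch does not reproduce the terms of equation (18); it assumes the logarithm of α has already been computed elsewhere and is passed in. The function names and the numerical values are hypothetical.

```python
import math
import random


def propose_position(current_cm, delta_cm, rng=random):
    """Uniform proposal on [current - delta, current + delta), in cM."""
    return current_cm - delta_cm + 2.0 * delta_cm * rng.random()


def accept(log_alpha, rng=random):
    """Accept with probability min(1, alpha), working on the log scale."""
    if log_alpha >= 0.0:
        return True
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return math.log(1.0 - rng.random()) < log_alpha


# Hypothetical usage: current position 20 cM, tuning parameter delta = 2 cM.
rng = random.Random(1)
new_position = propose_position(20.0, 2.0, rng)
log_alpha = -0.3  # placeholder for the log of the ratio in equation (18)
if accept(log_alpha, rng):
    print("accepted", new_position)
else:
    print("rejected, keep 20.0")
```

Working with log α avoids numerical underflow when the likelihood ratio involves products over many individuals.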
In step f, block sampling of the indicator matrix $\Phi_j$ is expected to perform better than updating each $\gamma_{kj}$ in $\Phi_j$ separately. Because each model indicator $\gamma_{kj}$ has two possible values (0 or 1), when m traits are investigated jointly each model indicator matrix $\Phi_j$ has $2^m$ possible configurations, whose general form can be written as:
$$W = \mathrm{diag}(w_1, w_2, \ldots, w_m) \tag{19}$$
where $w_k \in \{0, 1\}$, for k = 1, 2, ..., m. Because the prior probabilities of the $\gamma_{kj}$ are independent, the joint prior probability of each configuration can be written as $p(\Phi_j = W) = \prod_{k=1}^{m} p(\gamma_{kj} = w_k)$. The conditional posterior probability of $\Phi_j$ can then be written as
$$p(\Phi_j = W \mid y, \ldots) = \frac{p(\Phi_j = W) \prod_{i=1}^{n} p(y_i \mid \Phi_j = W, \ldots)}{\sum_{W'} p(\Phi_j = W') \prod_{i=1}^{n} p(y_i \mid \Phi_j = W', \ldots)} \tag{20}$$
The approach to sampling $\Phi_j$ is similar to that described above for the QTL genotypes.
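To make the block update concrete, the sketch below enumerates the $2^m$ indicator configurations of one locus together with their joint prior, assuming independent priors so that the joint prior is the product of the per-trait inclusion or exclusion probabilities. The function name and the probabilities are hypothetical.

```python
from itertools import product


def indicator_configurations(inclusion_probs):
    """List every configuration (w_1, ..., w_m) with its joint prior probability.

    inclusion_probs[k] is the prior probability that gamma_kj = 1 for trait k.
    """
    configurations = []
    for w in product((0, 1), repeat=len(inclusion_probs)):
        prior = 1.0
        for w_k, pi_k in zip(w, inclusion_probs):
            # Independent priors: multiply inclusion or exclusion probabilities.
            prior *= pi_k if w_k == 1 else 1.0 - pi_k
        configurations.append((w, prior))
    return configurations


# Hypothetical prior inclusion probabilities for m = 2 traits.
for w, prior in indicator_configurations([0.1, 0.2]):
    print(w, prior)
```

Each prior would then be multiplied by the likelihood of the data under that configuration, and one configuration drawn from the normalized products in the same way as a genotype matrix.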
Post-MCMC analysis
To summarize the posterior sample, we use the posterior mean to estimate the QTL effects and the residual (co)variance, and the mode of the posterior probability or the peak of the $2\log_e BF$ statistic to localize QTL. The $2\log_e BF$ statistic was introduced into QTL mapping by Yi et al. [17]; the BF statistic is defined as the ratio of the posterior odds to the prior odds for inclusion against exclusion of the locus [24]. The critical value for declaring the existence of a QTL is BF = 3, or $2\log_e BF = 2.1$.
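The BF statistic follows directly from its definition as posterior odds over prior odds. The sketch below is illustrative only; the function names and the example probabilities are hypothetical.

```python
import math


def bayes_factor(posterior_prob, prior_prob):
    """Posterior odds divided by prior odds for including a locus."""
    posterior_odds = posterior_prob / (1.0 - posterior_prob)
    prior_odds = prior_prob / (1.0 - prior_prob)
    return posterior_odds / prior_odds


def two_log_bf(posterior_prob, prior_prob):
    """The 2 * log_e(BF) statistic for one locus."""
    return 2.0 * math.log(bayes_factor(posterior_prob, prior_prob))


# Hypothetical locus: prior inclusion probability 0.05, posterior 0.40.
bf = bayes_factor(0.40, 0.05)
print(bf, two_log_bf(0.40, 0.05), bf >= 3.0)
```

Both probabilities must lie strictly between 0 and 1; a locus that is included in every iteration has infinite posterior odds and needs separate handling.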
In single-trait analysis, we can pick out the QTL by plotting the profile of the posterior probability or the $2\log_e BF$ statistic along the genome. In multitrait analysis, if only two traits are considered jointly, a three-dimensional graph can be used to summarize the statistic for both traits jointly (e.g., Figure 2 in [19]). However, if the number of traits is greater than 2, they cannot be plotted in one graph. Instead, we plot the marginal posterior probability distribution. If we divide the genome into H bins and denote each bin of the k-th trait by $\zeta_{kg}$, for g = 1, 2, ..., H, then the marginal posterior probability of $\zeta_{kg}$ is defined as $p(\zeta_{kg} \mid y) = p[(\zeta_{kg} = \lambda_{kq}) \cap (\gamma_{kq} = 1)]$, where q indicates the q-th interval in which locus $\zeta_{kg}$ resides. This probability can be calculated at each possible locus for each trait separately.
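One way to estimate the marginal posterior probability of each bin from the MCMC output is sketched below. It assumes, as an illustration rather than the authors' stated estimator, that the probability is approximated by the proportion of saved iterations in which an included QTL of the trait falls in the bin. The data layout, function name and values are hypothetical.

```python
def marginal_bin_posterior(samples, genome_length_cm, n_bins):
    """Estimate p(zeta_kg | y) for one trait from saved MCMC iterations.

    samples is a list over iterations; each iteration is a list of
    (position_cm, gamma) pairs, one pair per candidate QTL of the trait.
    """
    bin_width = genome_length_cm / n_bins
    counts = [0] * n_bins
    for iteration in samples:
        occupied = set()
        for position_cm, gamma in iteration:
            if gamma == 1:
                # Clamp so a position at the genome end falls in the last bin.
                g = min(int(position_cm / bin_width), n_bins - 1)
                occupied.add(g)
        # Count each bin at most once per iteration.
        for g in occupied:
            counts[g] += 1
    return [c / len(samples) for c in counts]


# Hypothetical output of 4 iterations for one trait on a 100 cM genome.
samples = [
    [(12.0, 1), (55.0, 0)],
    [(14.0, 1), (57.0, 1)],
    [(33.0, 0), (58.0, 1)],
    [(11.0, 1), (90.0, 0)],
]
print(marginal_bin_posterior(samples, 100.0, 10))
# [0.0, 0.75, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
```

Running the function once per trait gives one profile per trait, which can be plotted against genome position regardless of how many traits are analysed jointly.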