In this section, I show results from applying the ABCreg software to the inference scheme of [11], who used rejection sampling (Method 2 above) to infer the parameters of a simple population bottleneck model from sequence data obtained from a European population sample of Drosophila melanogaster. This model has three parameters, t
r
, the time at which the population recovered from the bottleneck, d, the duration of the bottleneck, and f, the bottleneck severity. The parameters t
r
and d are scaled in units of 4N0 generations, where N0 is the effective population size at the present time, and f = N
b
/N0, the ratio of the bottlenecked size to the current size (0 <f ≤ N0). See [11] for more details of the model. The data consist of 105 X-linked, non-coding loci surveyed by [25] and another ten from [10]. For each of these 115 non-coding fragments, sequence variation was surveyed in population samples from Zimbabwe, and the Netherlands. Thornton and Andolfatto used a two-step approach for the parameter inference. First, a relatively wide uniform prior was used in conjunction with a fairly liberal tolerance for acceptance. Then, the 1stand 99thquantiles of the resulting posterior distributions were used as the bounds on a new, uniform prior, and the acceptance criteria were made more strict. Three summary statistics were used: the variances across loci of nucleotide diversity (π, [26]), the number of haplotypes in the sample, and a summary of the site-frequency spectrum of mutations [27]. The rejection sampling scheme took two weeks to run on a large computer cluster.
I repeated the analysis using the local regression approach using the same data and uniform priors on parameters (see Table one of [11]). The analysis was done assuming that ρ = 4N
e
r (the population recombination rate) is equal to 10θ (see [11] for details), and the value of θ (the population mutation rate, see [28], p. 92) at each locus was obtained by the method of [29] using data from a Zimbabwe population sample. C++ code was written using the GSL and the coalescent routines in libsequence[21] to sample 5 × 106 draws from the prior distribution on the three parameters, and to record the resulting summary statistics. Simulating from the prior took 24 hours on four 2 gigahertz AMD Opteron processors. The tolerance was set such that 103 acceptances were recorded for the regression. The model has three parameters, and three summary statistics are used. Once the simulations from the prior distribution are complete, the entire ABC analysis was performed with one command:
reg -P 3 -S 3 -p prior -d data -b data -t 0.0002 -T,
where the arguments specify the number of parameters (-P), number of summary statistics (-S), names of files containing the prior (-p) and data (-d), the prefix of the output file names (-b), the tolerance (-t), and -T specifies the transformation described in [24]. The reg command takes seconds to run on a desktop CPU. Thus, the entire inference procedure took roughly 1 day using 4 CPU, compared to the original analysis based on rejection sampling, which took many CPU-months [11]
Figure 1 shows the comparison of the output from reg to the rejection sampling results of [11]. The regression and rejection approaches give very similar results for the time the population recovered from the bottleneck (Figure 1a), but the regression approach gives posterior distributions that are slightly left-shifted and have smaller variances, relative to the rejection sampling for both the duration (Figure 1b) and severity (Figure 1c) of the bottleneck. The major difference between the methods, however, is the total computation time required-approximately one day on four processors for the regression approach, compared to 14 days on 100 processors for the rejection-sampling approach.
Because the method is quite rapid, the performance of the estimator is easily evaluated. Figure 2 shows the result of testing the estimator on 103 random samples from the prior model used for the inference in Figure 1. The properties of the estimator are qualitatively similar to those reported in [11], but were much faster to obtain (about 20 minutes of computation time on a desktop computer, compared to 160 minutes when the procedure is scripted in R).