Given the increasing prevalence of genome-wide data, and the development of methodologies for the analysis of such data, there is a growing need for tools that can simulate data appropriate for long genomic regions. Two options suggest themselves:

1. **Model and simulation:** The traditional approach has been to use a model that is

(a) thought to be a reasonable approximation to the evolutionary history for the organism of interest, and

(b) easy to simulate.

By far the most popular such model is the *coalescent* [1, 2]. However, use of the coalescent becomes less practical for long genomic regions.

2. **Existing data and perturbation:** An alternate, newer approach is to take an existing data set and then perturb it in some fashion to produce "new data from old". A simple example of such an approach would be re-sampling. More specific examples can be found in [3, 4].

The first approach has the advantage of producing data that is not dependent on an existing data set. However, the model it uses is, by definition, an approximation to the evolutionary processes that produced the real data. The second approach, while dependent on the presence of an initial data set, has the advantage that the evolutionary model underlying the unperturbed data is correct: although we do not know how the data arose, they are real data, and therefore arose via the true evolutionary history. However, the need to then perturb the initial data to produce new data sets adds noise to the evolutionary process, so the resulting data are again only an approximation to reality. Furthermore, the extent to which the new data sets depend on the initial data set is unclear, and it is therefore not obvious how typical such data might be of other, unobserved, real data.

We believe both of these approaches have merit. In this paper we restrict ourselves to a discussion of the former approach, in which we use an evolutionary model to simulate new data sets. The use of the standard coalescent model becomes impractical as the length of the simulated region increases, yet the coalescent has proven to be a powerful simulation tool in these contexts (e.g., [5]). Thus, in this paper we exploit an approximation to the full coalescent algorithm. This approximation, the sequentially Markov coalescent (SMC), was introduced by McVean and Cardin [6]. It is able to simulate significantly longer regions while maintaining the properties of short-range summary statistics. Since our particular interest is in the development of such algorithms as a tool for the testing of disease mapping methodologies, we pay close attention to the behavior of linkage disequilibrium (LD) in data simulated under the SMC model.

The coalescent was introduced in [1]. It provides an elegant and efficient model for the evolution of a population of randomly-mating, neutral, haploid individuals, and as such it has become a very widely used tool. Over time, generalizations have been introduced to deal with the more obviously restrictive aspects of the original model. For example, recombination was introduced in [2], and selection in [7, 8]. Useful reviews are found in [9–11].

Our interest here centers on the use of the coalescent algorithm to simulate long chromosomal regions. When long regions are considered, and the recombination rate is therefore very high, the coalescent algorithm becomes somewhat problematic to use. Run-times become longer (see "Results") and memory requirements become greater.

In a case in which two widely-separated regions were being considered, one might simulate these two regions independently, relying on the fact that the regions would be essentially unlinked. However, when one is studying a long, continuous region such a strategy becomes inappropriate since linkage disequilibrium is likely to be present along the entire region. (In a situation in which recombination hotspots were present, one might try to independently simulate regions between hotspots.)

Rapid simulation of coalescent ancestries is central to estimation methods such as rejection algorithms, or to the use of simulation-studies as a test-bed for new methodologies. Thus we use a simple approximation to the coalescent in which the difficulties associated with simulating long chromosomal regions are mitigated.

Hudson [2] introduced recombination into the coalescent model. Griffiths and Marjoram [12] then embedded this within the ancestral recombination graph (ARG), a more tractable description of the coalescent model in the presence of recombination. Shortly thereafter, Wiuf and Hein [13, 14] introduced an alternate description of the ancestral process with recombination in which the sample is constructed by moving "along the chromosome". Their algorithm gains efficiency by ignoring a class of recombination events that do not affect the present day sample.

In order to discuss this further, we review the concept of *ancestral* material. A chromosomal region in an individual is considered ancestral if it is eventually inherited by at least one member of the sample of interest drawn from the present-day population. Thus, individuals in previous generations are likely to carry both ancestral and non-ancestral chromosomal regions.

In essence there are five types of recombination events that occur on the full ARG:

1. Recombination in ancestral material;

2. Recombination in non-ancestral material that has ancestral material to both sides;

3. Recombination in non-ancestral material that has ancestral material only to the left;

4. Recombination in non-ancestral material that has ancestral material only to the right;

5. Recombination in an individual that carries no ancestral material.

We illustrate some of these events in Figure 1 [see Additional file 1]. Only the first two types of event actually impact the composition of the sample of interest.
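The five-way classification above can be made concrete with a small sketch. The function below is a hypothetical illustration (the name `classify_recombination` and the interval representation are our own, not part of any published implementation): it assumes the ancestral material carried by a single lineage is stored as sorted, disjoint sub-intervals of [0,1], and returns the event type for a given breakpoint.

```python
# Hypothetical sketch: classify a recombination breakpoint on one lineage
# against the ancestral intervals that lineage carries. The event taxonomy
# (types 1-5) follows the list above.

def classify_recombination(breakpoint, ancestral_intervals):
    """Return the event type (1-5) for a breakpoint on a lineage.

    ancestral_intervals: sorted, disjoint (start, end) tuples within [0, 1]
    marking the chromosomal segments this lineage passes on to the sample.
    """
    if not ancestral_intervals:
        return 5  # lineage carries no ancestral material at all
    if any(s <= breakpoint < e for s, e in ancestral_intervals):
        return 1  # breakpoint falls inside ancestral material
    left = any(e <= breakpoint for _, e in ancestral_intervals)
    right = any(s > breakpoint for s, _ in ancestral_intervals)
    if left and right:
        return 2  # non-ancestral gap flanked by ancestral material
    if left:
        return 3  # ancestral material only to the left
    return 4      # ancestral material only to the right
```

Only breakpoints classified as type 1 or 2 can change the composition of the final sample; the remaining types are the candidates for exclusion discussed below.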

As the recombination parameter, *ρ*, increases, the number of recombination events in the ARG, which is of order *ρ* log(*n*) for a sample of size *n* (e.g., [12]), grows. A simulation of the full ARG would contain all such recombination events, and hence be highly inefficient. This is not primarily due to the large number of recombination events *per se*, but rather to the growing size of the ARG, which makes increasing demands on computer memory.
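To give a rough sense of this growth, the snippet below tabulates the *ρ* log(*n*) scaling for a fixed sample size. Constant factors are omitted, so these are order-of-magnitude figures only, not exact expectations.

```python
import math

# Order-of-magnitude illustration of the rho * log(n) growth in the number
# of ARG recombination events; constants are omitted, so the values are
# indicative only.

def arg_recomb_scale(rho, n):
    return rho * math.log(n)

for rho in (1, 10, 100, 1000):
    print(f"rho={rho:>5}: ~{arg_recomb_scale(rho, 50):.0f} recombination events")
```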

### Simulating the ancestral recombination graph

We begin by introducing some notation. Denote the length of chromosome being considered by the unit interval [0,1], and let *x* ∈ [0,1] denote a point within the region of interest. McVean and Cardin's SMC method introduces an approximation to an elegant scheme introduced by Wiuf and Hein [13, 14], which we describe fully in "Implementation". In summary, Wiuf and Hein's method moves from left to right along the chromosome. Starting with the tree appropriate for *x* = 0, it finds the (exponentially distributed) distance along the chromosome to the next recombination event, then picks a point uniformly at random on the graph constructed so far and introduces a recombination at that point. The left-emerging line from that recombination follows the path of the existing line (indicated in green in Figure 2 [see Additional file 2]), but the right-emerging line, which is the newly-introduced line, follows a new path (drawn from the usual coalescent prior and indicated in red in Figure 2). Once we have constructed the path for the new line we are left with a new graph that consists of the old graph plus this new line. This procedure is iterated until the end of the chromosome is reached. Note that the size of the graph increases as *x* increases. (For details see "Implementation".)
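The outer loop of this left-to-right sweep can be sketched as follows. This is a deliberately reduced toy: it generates only the breakpoint positions, with the waiting-distance rate held constant, whereas in the full algorithm the rate depends on the total branch length of the graph built so far, and each breakpoint triggers the grafting of a new coalescent lineage. All names here are illustrative, not the FastCoal API.

```python
import random

# Toy sketch of the left-to-right sweep in the Wiuf-Hein construction:
# only the breakpoint positions are generated; the graph-grafting step at
# each breakpoint is indicated by a comment.

def breakpoint_positions(rho, seed=1):
    """Sweep x from 0 to 1, collecting recombination breakpoints placed
    at (here, constant-rate) exponential distances along the chromosome."""
    rng = random.Random(seed)
    points, x = [], 0.0
    while True:
        x += rng.expovariate(rho)  # exponential distance to the next event
        if x >= 1.0:
            return points
        # At this point the full algorithm would pick a point uniformly at
        # random on the current graph and add a recombination there: the
        # left line keeps its existing path, the right line is given a
        # fresh coalescent path, and the graph grows.
        points.append(x)
```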

There is a class of recombination events that occur on lines of the full ARG but do not affect the properties of samples generated from that graph. These are recombinations that occur in regions that are non-ancestral for that line (i.e., are not passed on to the sample of interest at the bottom of the graph) and that do not have ancestral regions to both their left and right (corresponding to events of types 3, 4 and 5 in the list of the previous section). The Wiuf and Hein algorithm gains efficiency over an algorithm based on the full ARG by excluding recombinations of types 4 and 5. In particular, it excludes recombinations that occur in non-ancestral material and have ancestral material only to their right on that line, but not those that have ancestral material only to their left. This has the curious feature of making the density of recombination events in the simulated graph increase as we move along the chromosomal region. (However, it is important to note that these 'extra' recombination events occur in non-ancestral material and do not influence the composition of the final sample.) Since, for large *ρ*, the efficiency of algorithms that simulate (a subset of) the ARG is largely a function of the amount of memory required to store the graph, this makes the Wiuf and Hein algorithm more efficient than algorithms based on the full ARG. The popular *ms* algorithm of Hudson [15] also excludes recombinations in non-ancestral material that have ancestral material only to their left (i.e., of type 3). Thus, all events that do not affect the current sample are ignored in the *ms* algorithm, which is therefore more efficient than the Wiuf and Hein algorithm.

The novelty of the SMC scheme proposed by McVean and Cardin is that, before adding the new line corresponding to a recombination event, they delete the old (existing) line for that recombination (i.e., all parts of the old line between the point at which the recombination occurred and the point at which the old line coalesced are deleted). Thus, each graph we construct is in fact a tree, and knowledge of the deleted lines is lost rather than being stored within the total graph constructed so far (as in the algorithm of Wiuf and Hein). Consequently, the SMC algorithm explicitly disallows coalescence between lineages with no overlapping ancestral material. The motivation is to significantly increase algorithmic efficiency, due to lower memory requirements, while retaining most of the LD structure [6]. This is possible largely because the process by which trees are constructed as we move along the chromosome using the SMC is now Markovian.

In our implementation of their algorithm we include a slight modification in which the old (existing) line for the recombination is deleted *after* (rather than before) the new line is added. Thus, in this modified version, which we refer to as SMC', the intuitive interpretation is that we include only recombination events that occur in ancestral material and ignore all events occurring in non-ancestral regions. Our motivation for doing this is as follows. The original specification of the SMC algorithm has the consequence of excluding a class of recombination events that occur in ancestral material but do not affect the pattern of LD in the data (because the two lines resulting from the recombination coalesce with each other before coalescing with any other line). We denote this class of recombinations by *R*. Since all recombinations are forced to be of type other than *R*, the rate at which recombinations of type other than *R* occur will be slightly higher than it would normally be under the full coalescent model (for a given recombination parameter *ρ*). This suggests that LD will decay slightly more quickly under the original, SMC, version of the algorithm (see "Results" for more details). Our *FastCoal* software implements both the SMC and SMC' versions of the algorithm.
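The ordering difference can be illustrated for the simplest case, a sample of size *n* = 2, where the marginal tree reduces to a single pairwise coalescence time *T* (in coalescent units, so a pair of lineages coalesces at rate 1). The functions below are a sketch under these assumptions, not the FastCoal implementation: under SMC the old line is gone before the new line coalesces, so the new line has only one possible partner; under SMC' the old line survives below the old root, so there are two possible partners there, and rejoining the old line (an event of class *R*) leaves the tree unchanged.

```python
import random

# Toy n = 2 contrast of one recombination step under SMC vs SMC'.
# T is the current pairwise coalescence time; u is the breakpoint height,
# uniform on (0, T) by symmetry of the two branches.

def smc_step(T, rng):
    """SMC: delete the old line first; the new line can only coalesce
    with the single remaining lineage (pairwise rate 1)."""
    u = rng.uniform(0.0, T)
    return u + rng.expovariate(1.0)

def smc_prime_step(T, rng):
    """SMC': keep the old line until the new line has coalesced. Below
    the old root there are two possible partners (total rate 2); rejoining
    the old line, a class-R event, leaves the tree unchanged."""
    u = rng.uniform(0.0, T)
    t = u + rng.expovariate(2.0)
    if t < T:
        # each partner is equally likely; rejoining the old line -> unchanged
        return T if rng.random() < 0.5 else t
    return T + rng.expovariate(1.0)  # past the old root: one partner left
```

Under `smc_prime_step` a substantial fraction of events return the tree unchanged; under `smc_step` none can, which is the intuition for why LD decays slightly faster under SMC at a given *ρ*.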

We give results to supplement those in [6], demonstrating that the SMC/SMC' approximation is much more efficient than the coalescent for high *ρ* and produces data that is almost indistinguishable from that produced by an exact coalescent algorithm. The algorithm simulates an *approximation* to the true coalescent model, but the degree of approximation is extremely close, at least in terms of patterns of LD. Furthermore, implementation of the algorithm results in software that is very significantly faster, and has much lower memory requirements, when *ρ* is large. In fact, the memory required by the algorithm is independent of *ρ* (in direct contrast to *ms*, for example).

Note that, by discarding the old line associated with each recombination, the algorithm stores at any given moment the information for a single coalescent tree rather than a more complicated and memory-intensive graph. This allows the SMC/SMC' algorithm to run efficiently for high *ρ* (when the size of the corresponding graph becomes very large). Other than that, the SMC/SMC' and Wiuf & Hein algorithms are essentially the same. Thus it follows, in a manner directly analogous to that in [13, 14], that *T*(*x*), the marginal genealogy at a particular location *x*, is still exactly described by the traditional coalescent process. See [6] for an extended discussion of the properties of the SMC algorithm and a derivation of theoretical results.
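The marginal distribution that the construction is stated to preserve is just that of a standard coalescent tree, which is easy to simulate and check directly. The sketch below (our own illustration, with hypothetical names) draws times to the most recent common ancestor for a sample of size *n* and compares the mean against the known expectation 2(1 − 1/*n*), in coalescent time units.

```python
import random

# Sanity sketch: the marginal genealogy T(x) at any single point is stated
# to be exactly coalescent. Here we simulate standard coalescent TMRCAs
# and compare the mean with the known expectation 2 * (1 - 1/n).

def coalescent_tmrca(n, rng):
    """TMRCA of an n-sample under the standard coalescent: while k
    lineages remain, the next coalescence is Exp with rate k(k-1)/2."""
    t = 0.0
    for k in range(n, 1, -1):
        t += rng.expovariate(k * (k - 1) / 2.0)
    return t

rng = random.Random(42)
n = 10
mean_tmrca = sum(coalescent_tmrca(n, rng) for _ in range(20000)) / 20000
# theoretical expectation for n = 10: 2 * (1 - 1/10) = 1.8
```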

We now consider the degree to which data produced by the approximation algorithm is similar to that produced by traditional coalescent algorithms and then demonstrate the relative computational efficiencies for a variety of parameter values.