Fully Bayesian tests of neutrality using genealogical summary statistics

Background Many data summary statistics have been developed to detect departures from neutral expectations of evolutionary models. However questions about the neutrality of the evolution of genetic loci within natural populations remain difficult to assess. One critical cause of this difficulty is that most methods for testing neutrality make simplifying assumptions simultaneously about the mutational model and the population size model. Consequentially, rejecting the null hypothesis of neutrality under these methods could result from violations of either or both assumptions, making interpretation troublesome. Results Here we harness posterior predictive simulation to exploit summary statistics of both the data and model parameters to test the goodness-of-fit of standard models of evolution. We apply the method to test the selective neutrality of molecular evolution in non-recombining gene genealogies and we demonstrate the utility of our method on four real data sets, identifying significant departures of neutrality in human influenza A virus, even after controlling for variation in population size. Conclusion Importantly, by employing a full model-based Bayesian analysis, our method separates the effects of demography from the effects of selection. The method also allows multiple summary statistics to be used in concert, thus potentially increasing sensitivity. Furthermore, our method remains useful in situations where analytical expectations and variances of summary statistics are not available. This aspect has great potential for the analysis of temporally spaced data, an expanding area previously ignored for limited availability of theory and methods.


Background
The field of population genetics has a long history in the development of tests of selective neutrality. This is both because of the difficulty of developing a tractable alternative to the neutral theory and because of the ongoing debate about how well the neutral theory can explain real data. Although a number of important steps have been made to develop powerful tests of neutrality [1][2][3] there are evident problems with many currently available tests. For example many of the tests, such as Tajima's D (D T ) and Fu and Li's D (D F ) have difficulty in accurately discriminating between selection and changes in population size.
In fact, most available tests of neutrality can only test constant population size neutrality against alternatives that include both population growth and selection. Furthermore, most tests require accurate knowledge of the number of mutations that have occurred or the branch lengths in the gene tree, and do not adequately take into account the uncertainty in these quantities (i.e. most tests implicitly assume an infinite-sites model of evolution). Finally, tree-based summary statistics are often based on one estimate of the genealogy, despite the fact that the true genealogy and branch lengths are seldom known.
Broadly speaking, on the basis of the sequence information used, statistics for testing neutrality can be placed into three classes: 1. statistics that use the mutation (segregating site) frequency spectrum [1,2,4,5], 2. statistics that use the haplotype distribution [3,6,7] and 3. statistics that use the pair-wise distance (mismatch) distribution [8,9].
A recent comprehensive survey of the power of these different classes of tests for detecting population expansion found that classes 1 and 2 were generally more powerful than the best class 3 statistics [10]. Some of the bestknown test statistics come from class 1 and essentially work by comparing aspects of the mutation frequency spectrum with neutral expectations. This class of test statistics include D T [1], D F [2] and the H statistic [5]. In the simplest case, these statistics can be used to measure deviations from the null hypothesis of constant population size, random mating and no recombination. For example D F measures the normalized difference between the number of mutations on the external branches and the total number of mutations in the genealogy. Under the null hypothesis of neutral evolution the expectation of D F is zero, and a significant departure from zero signifies selection (balancing, directional, negative), recombination or changes in population size. The last of these alternatives is problematic because exponential growth is expected to give results similar to directional or purifying selection. For this reason it would seem desirable to develop a method that directly accounts for alternative demographic models of population size through time. In this context, several studies have combined the use of summary statistics and demographic models [11][12][13][14][15].
Apart from biasing the mutation frequency distribution, selection may also affect the shape of the gene tree [16]. Although few attempts have been made to use this expectation in a rigorous test of neutrality (c.f. [17]), a number of branching models and summary statistics measuring tree imbalance exist in the literature of speciation models [18][19][20][21]. A method that could use information both from the mutation frequency spectrum and from the shape of the gene tree may be more powerful than either used individually.
If all sequences comprising a gene tree are sampled from the same time point (as is required by most tests of neutrality) then there is very little power to distinguish between selection and exponential growth. However if sequence data is available from different times, during which measurable evolution has taken place, as in RNA viruses and ancient mitochondrial DNA (mtDNA) data [22,23] then the power to distinguish between these two alternatives is potentially much greater. Unfortunately, the expectations and variances of crucial quantities (such as tree length) are not yet available for serially sampled data, so this potential power has not been tapped.
Apart from analyses of intra-population sequence variation, evidence for non-neutrality can also be detected by comparing within-and between-species sequence variation [24]. For example, it has been widely observed that in some species there is an excess number of polymorphic non-synonymous sites segregating within the species relative to the number of non-synonymous sites with fixed differences between closely related species [19,25]. This effect is consistent with the conclusion that a substantial fraction of non-synonymous mutations are slightly deleterious mutations (SDMs) that often persist as polymorphisms within populations for some time but have a low probability of eventual fixation [26]. However this pattern is not universal. In fact, at least in Drosophila the pattern appears to be the reverse [27], possibly implying a prominent role for recurring positive selection [28]. Regardless of the direction of non-neutral evolution this test may suggest, it has been shown that, as with summary statistics of the mutation frequency spectrum, the accuracy of these methods is compromised by the effects of unrecognized historical demographic change [29]. Both within-species and between-species methods rely on the fact that SDMs become increasingly rare relative to neutral mutations at higher frequencies. For example, within a panmictic population, the distribution of SDMs is expected to predominate near the tips of a population genealogy [30], so that SDMs are on average younger than neutral mutations [25]. Thus the older branches (and associated mutations) within a population will tend to consist of relatively fewer SDMs (as purifying selection has had longer to act).
Although a number of researchers have observed nonneutral behaviour of non-synonymous polymorphism in protein-coding regions, few have considered the effect of SDMs on linked genetic variation in non-coding regions. This is particularly pertinent to the study of the control region of mitochondrial genes, which is extensively used for within-population genetic sampling of animal mtDNA [31]. The action of Hill-Robertson interference is expected to exacerbate the persistence of SDMs in populations [32], because it reduces the efficiency of purifying selection. Even moderately deleterious mutations, which would otherwise be removed by selection very quickly, can persist in the population if there is substantial genetic linkage between sites [30]. Therefore, in non-recombining genetic elements such as the mitochondrial genome and the genomes of negatively stranded RNA viruses, mutations that are themselves selectively neutral will nevertheless tend to share the fate of linked deleterious mutations.
In this paper we extend an existing Bayesian method originally applied to investigating non-neutrality in HIV evolution [33], that can be used to test for selective neutrality in both coding and non-coding genetic regions sampled from within a single population. The method assumes no knowledge of ancestral mutation frequencies and takes into account the confounding effects of demographic history. We demonstrate the utility of this method on four examples comprised of one non-coding data set and three coding data sets. This method assumes a single genealogy describes the evolutionary history of the sequences under study, but makes no assumptions about ancestral mutation frequencies and takes into account the confounding effects of demographic history. We demonstrate the utility of this method on four non-recombinant examples comprised of one non-coding data set and three coding data sets.

Results and discussion
We employed a suite of summary statistics to test the assumption of neutrality on four example data sets. Because selection is expected to change both the distribution of mutations on the tree and the shape of the sample genealogy [30], statistics that measure both of these departures were included in the analysis. Fu and Li [2] compared two estimates of population parameter θ that can be derived for a sample of n sequences:

Summary statistics
1. the total number of singleton polymorphisms and 2. the total number of segregating sites divided by . Under neutrality the difference between these two measures is expected to be zero, and the variance in the difference can be calculated. The resulting normalized test statistic D F assumes an infinite sites model of mutation, because it equates mutations with branch lengths in the underlying coalescent tree and does not therefore account for the possibility of multiple mutations at a single site. To avoid this assumption we employ a genealogy-based version of D F , which compares the length of terminal branches to the total length of the coalescent genealogy (we term this the genealogical D F ). In addition to the genealogical D F , two other measures of branch length distribution (age of most recent common ancestor, and total tree length; see Table 1) and three measures of tree imbalance B 1 , I c and C n were also employed.
The B 1 statistic is the maximum number of nodes between an internal node and the tips of the tree, summed over all internal nodes and excluding the root [34]. Higher values of B 1 are expected with increasing symmetry of the phylogeny. Colless's tree imbalance index I c considers each internal node of a bifurcating tree and partitions the number of terminal sequences that descend from it into two groups, r and s, where r ≥ s. Symmetry is measured based on the difference between r and s, summed over all internal nodes [18]. The measure increases from 0 for a perfectly symmetrical tree, to 1 if the tree is completely asymmetric. The final tree-asymmetry measure, Cherry count C n , is simply the number of pairs of sequences joined by their most recent common ancestor [20]. More symmetrical trees are identified by higher values of C n . All six summary statistics used are listed in Table 1.

Summary Statistic Reference
Description The total length of all branches of the tree. t MRCA -The difference in age between the most recent common ancestor and the most modern individual. D F [2] A classic summary statistic for testing neutrality. Normalized difference between external branch lengths and total tree length. B 1 [34] A measure of tree-imbalance. C n [20] The number of internal nodes with exactly two terminal children(the number of cherries). I c [18] A measure of tree-imbalance. Ranges is [0,1]. Larger numbers signify more imbalanced trees.

Data analysis
Brown bear mitochondrial DNA An alignment of non-coding mitochondrial DNA from the d-loop of brown bears Ursus ursus was compiled as an example of non-coding molecular sequence data that is assumed to be evolving neutrally. The data set comprised 30 previously published ancient DNA sequences [35], along with 44 modern brown bear sequences obtained from GenBank. The software BEAST [36] was used to conduct Bayesian MCMC analysis on the full data set (n = 74), yielding estimates of evolutionary rate, population size and ancestral genealogy ( Table 2). The substitution model chosen allowed for different rates of transitions and transversions [37] as well as Γ-distributed rate heterogeneity among sites [38]. Both constant-size and exponentialgrowth models of demography were investigated. To test if the assumption of neutrality was warranted, posterior and posterior predictive values were calculated for each of the summary statistics in Table 1, along with their corresponding multivariate posterior predictive p-value. A Bayes factor computed via importance sampling [39,40] was used as a model choice criterion to compare the relative marginal likelihoods of the two models, resulting in rejection of the exponential growth model in favour of a constant-size population. However under both models, differences between the posterior and posterior predictive values did not suggest any significant departures from neutrality in any of the six summary statistics investigated. The multivariate p-values for constant and exponential growth were 0.219 and 0.284 respectively. This result suggests, at least in terms of tree asymmetry and branch length distribution, that selective neutrality cannot be rejected for the d-loop of brown bears.

RNA virus data sets
Three RNA virus data sets were also analyzed under the same model conditions as described above. The first was a multiple sequence alignment (n = 129) of the g gene (L = 629 bp) of human respiratory syncytial virus (HRSV) spanning 46 years from 1956 to 2002 [41]. This virus was used as an example of a coding gene of an RNA virus that exhibits only a weak signal of non-neutrality in terms of its tree shape. The estimates of mutation rate and population size are shown in Table 2. A constant population size was preferred over exponential growth using a Bayes factor. The multivariate posterior predictive p-values did not reject neutrality (p = 0.33). We followed up with a series of univariate analyses using the individual summary statistics. The tree length T, age of the root t MRCA and D F statistics are all close to significance under the assumption of constant population size, as shown in Table 3, while the remaining univariate statistics are less suggestive. Therefore, there is only marginal evidence for low levels of nonneutrality in the tree shape of HRSV.
To demonstrate the ability of this method to detect nonneutrality, two additional data sets were analyzed. The first was a previously published data set of the E gene of the dengue-4 virus (n = 69, L = 1485) from Puerto Rico [42] spanning 17 years. The second was a data set of hemagglutinin sequences from human influenza A virus selected to have a similar time frame (1981)(1982)(1983)(1984)(1985)(1986)(1987)(1988)(1989)(1990)(1991)(1992)(1993)(1994)(1995)(1996)(1997)(1998). These two viral data sets are both expected to exhibit the effects of adaptive selection, particularly influenza A virus, given the nature of their life histories [42][43][44]. As for the previous data sets, posterior and posterior predictive values were calculated for each of the summary statistics in Table  1. Under constant population size, the multivariate posterior predictive p-value = 0.0269 for the dengue-4 virus data set and = 0.0240 for the influenza A virus data set. Posterior parameter estimates from the MCMC analyses. The effective population size is reported only as a product with generation time (N e τ) and the compound parameter has unit of years for virus data sets and radiocarbon years for the brown bear data set. Posterior means are reported for all model parameters. For each data set, the demographic model chosen by a Bayes factor is marked (*). 1 marginal likelihood, 2 exponential growth rate, 3 substitution rate, 4 shape parameter of the Γ-distribution.
Both of these data sets exhibited significantly more negative D F then expected under neutrality, suggesting that the relative length of the terminal branches is larger than expected in both data sets. Additionally, the human influenza A data set also had marginally more tree-imbalance than expected under neutrality, and for the dengue-4 data set, the age of the most recent common ancestor was significantly smaller than expected under neutrality. Figure 1 shows the posterior and predictive distributions for D F and tree length T for all four data sets. Figure 2 shows (A) two human influenza A virus trees from the posterior distribution of the exponential growth analysis along with (B) the corresponding simulated trees in the predictive distribution. These observations provide qualitative evidence for the ability to detect non-neutral evolutionary dynamics from tree shape as suggested in a recent review of the nascent field of phylodynamics [16].
Distinguishing selection from exponential growth A criticism often leveled at tests of neutrality such as D F , is that significantly negative values of D F could signify exponential growth rather than non-neutral evolution. As demonstrated above the methodology employed here allows the demographic history to be described parametrically as part of the model. Therefore, inference and testing can both be achieved under a model of exponential growth. In this case, any additional departure from expectations cannot be attributed to exponential growth as the demographic signal is incorporated into the test via the predictive distribution. The results of the tests including exponential growth are also presented in Table 3. Interestingly, model selection by Bayes factors can not strongly reject constant populations in all of the data sets except for dengue-4. In the case of dengue-4, the log Bayes factor in favor of the exponential population model is approximately 8. However, the multivariate p-value for dengue-4 is no longer significant once exponential growth is incorporated. We can therefore distinguish between selection and growth in the dengue-4 and influenza data sets. In dengue-4, the departure from neutral expectations can be explained by an incorrect choice of demographic functions. Whereas in influenza, significant departures from neutral expectations are observed under both demographic scenarios. In contrast, there is little evidence of non-neutrality in the bear and HRSV data sets.

Simulations
For infinitely long sequences, for which no uncertainty in the underlying genealogy exists, p B behaves like a classical p-value. In the infinite data situation, the posterior distribution of T(·) collapses to a single point and equation (5) then returns the probability of observing a test statistic under the null hypothesis of selective neutrality as extreme as the test statistic of the data. In finite data situations, p B is stochastically less variable than a uniform distribution but with the same mean. This implies that the distribution of p B is more centered about 1/2 than a uniform random variable, leading to slightly more conservative tests when one choses small Type I error rates. To test the assertion that p B can still be interpreted as a p-value even when sequences are of short length and there is significant uncertainty in the underlying genealogy, a simulation study was undertaken. A number of replicate data sets (n = 100) were simulated and analyzed as follows: 1. A time-structured coalescent tree was simulated with sample times at 0, 300, 600, and 900 days, with 10 sequences at each time and a constant population-size parameter N e τ = 1500 (the product of N e and generation length in days).  3. A Bayesian MCMC analysis was run on the resulting DNA sequence alignments using BEAST (Drummond and Rambaut 2004), assuming a constant population and an HKY85 + Γ model of substitution. The demographic and Posterior and predictive distributions of tree length T and D F Figure 1 Posterior and predictive distributions of tree length T and D F . Posterior and predictive distributions of tree length T and D F for all four data sets. The dengue-4 data is from an analysis assuming exponential growth, while the other three analyses assumed a constant population size. Human influenza A virus shows the largest departure from neutrality, with the posterior distribution completely disjoint from the predictive distribution. 5. Using equation (5), the posterior distribution of genealogies was compared with the predictive distribution of genealogies, resulting in a p B value (using the D F statistic to summary the genealogies, as D F proved most powerful on the real data sets).
In the above scheme, the model used to simulate the data is the same as the model that we are testing against. Therefore we would expect the p B values to be distributed approximately uniformly between 0 and 1 under the null hypothesis. Figure 3 shows the cumulative probability distribution for the p B statistics calculated using the above scheme as well as for a 100 replicates where exponential growth (N e τ = 5000, r = 2 × 10 -3 ) was assumed instead of a constant population size. For the case of constant population size, the p B values are distributed approximately uniformly, with 8 false positives (i.e. p B < 0.05) suggesting that the test is neither overly sensitive nor too conservative ( Figure 3). However for the case of exponential growth the values of p B do not appear uniform with too few extreme values suggesting that the test would be conservative. These results strengthen the conclusion of a relative abundance of external-branch mutations in the viral data sets analyzed in this paper, because the significant statistics observed for real data sets both under constant-size and exponential growth assumptions would not be expected under neutrality.

Conclusion
The results presented here demonstrate the utility of posterior predictive simulation for testing the goodness-of-fit of population genetic models of molecular evolution. In particular we tested the assumption of neutrality under both constant population size and exponential growth on four example data sets where temporally spaced data was available. In both dengue-4 and the human influenza A viruses there was a significant excess of mutations on terminal branches whether or not exponential growth was assumed. In contrast gene trees of HRSV and the d-loop of brown bears did not exhibit any significant departure from neutral expectations in terms of tree shape or genealogical distribution of mutations, although all four data sets had greater than average numbers of mutations on terminal branches relative to internal branches when compared to neutral expectations. Furthermore all four data sets had below average age of the root and below average tree length. In terms of tree-imbalance, both above and below average imbalances were observed for all three tree-imbalance statistics measured (B 1 , C n , I c ).
This paper has been primarily concerned with demonstrating the utility of using existing summary statistics for testing neutrality in temporally spaced data sets. While we have demonstrated that existing statistics, such as D F can be successfully used to uncover non-neutral evolution it remains likely that better summary statistics may exist. We have described a method for comparing measures of tree shape with their expectations even if the tree shape statistic cannot be directly calculated from the sequence data. We hope that further development of test statistics of tree shape explicitly designed for temporally spaced data will proceed. By doing this we hope tests of recent phylodynamic theories [16] of genetic diversity and evolution in viral pathogens can be constructed. With the posterior predictive framework outlined here, new statistics should greatly increase our ability to detect non-neutral evolution and other departures from standard models of molecular evolution and population genetics. One potentially fruitful direction lies in examining violations of neutrality in the underlying substitution process, as well as in treeshape. Efficient methods to detect substitution model violations by comparing the expected numbers of different classes of nucleotide substitutions have already been introduced [45]. This allows future work to combine appropriate summary statistics across the full model parameter space in order to maximize statistical power to detect non-neutral evolution.
Our reliance on posterior predictive simulation may raise the concern that the observed data Y for each example is "used" twice, first in generating the posterior distribution of model parameters and then in estimating the test-statistic employed to reject the null hypothesis. An alternative approach utilizing prior predictive simulation exists [46] and satisfies the above criticism. However, prior predictive simulation is undefined under improper prior distributions [47] and may not offer sufficient statistical power when vague priors are employed [48,49], such is the case Cumulative distribution of p B Figure 3 Cumulative distribution of p B . (Cumulative distribution of p B values (based on D F statistic) on 100 simulated data sets under a constant population (open circles) and an exponentially growing population (closed circles). The ideal behaviour for p B when applied to data simulated from the null distribution would be a uniform distribution (see main text for details). This plot shows that if the true demographic history is a constant population, then p B will be a good test of neutrality. However, if the true demographic history is exponential growth p B will be a conservative test, as can be seen by the lack of high p B values in the closed circles. θ are well-estimated given Y, then the posterior predictive p-value yields results similar to classical p-values (when available), while the prior predictive assessment is highly sensitive to prior distribution choice [47]. In addition to these fully Bayesian predictive methods, Bayes factors [50] are effective model selection tools in phylogenetics [51]. A Bayes factor measures the relative likelihood of two competing models. To compute the Bayes factor in favor of the null model of neutrality, one must specify an alternative model. Unless the researcher has firm a priori knowledge about how neutrality might be violated in their data, we recommend starting with rejecting the null through these predictive methods and only then attempting the difficult task of non-neutral model construction and fitting.
Although both dengue-4 and the human influenza A viruses exhibit very ladder-like trees that are highly imbalanced, our analysis suggests that this amount of imbalance is not much more than would be expected given the sampling scheme and estimated effective population sizes. However it can be argued that small effective population sizes, by themselves, are evidence for selection. This is because effective population size (N e ) is a measure of the number of productively replicating individuals, only when the population is evolving under conditions of neutrality. In the absence of any such prior assumptions, N e should be considered only as a surrogate measure of diversity in the population. Because diversity is reduced by selection, a low estimated N e could be a sign that the population process is being driven by natural selection. Nevertheless, the results presented here emphasize that ladder-like trees, by themselves, do not necessarily suggest selection. Consequently, interpretation of tree shape imbalance should not be made in the absence of an understanding of the expectations under the null model. Overall, for the example data sets chosen in this study, tree shape did not seem to be a powerful indicator of non-neutral evolution. Finally, by incorporating a demographic model into the test framework, we have ruled out exponential growth as the reason for significant predictive probabilities (p B ) in all data sets besides dengue-4. Nevertheless there remain a number of alternative explanations for neutrality being rejected.
Both human influenza A and dengue-4 viruses show a significant excess of mutations on terminal branches when compared to the predictions of the best fitting parameters of the neutral model. These departures from neutrality lend insight into the process of molecular evolution in RNA viruses, and suggest that new models that take into account these departures need to be developed to accurately model their genetic variation. In contrast, at least with respect to tree shape and genealogical distribution of mutations, neutrality seems to be an approximately adequate model for the G gene of HRSV and the d-loop of brown bears. We hope that further application of posterior predictive simulation will shed light on the pattern of within-population genetic variation in a wide range of species and genetic elements.

Methods
To assess selective neutrality in evolution, traditional test statistics summarize either the observed sequence data Y directly or the shape and inter-node distribution of a fixed gene genealogy G relating the sequences, where G is assumed known. In general, however, G is unknown a priori and must also be inferred from the sequence data with considerable uncertainty for measurably evolving populations [23]. This presents a difficulty for classical statistical tests. We overcome this short-fall in a Bayesian framework using posterior predictive assessment of model fit [33,47].
In this framework, we estimate G and its associated uncertainty from Y using a statistical model of molecular evolution and population demography and simultaneously compare a summary statistic of the random genealogy G to the statistic's expectation under neutrality. Our approach relies on assuming a statistical model for molecular evolution under neutrality. We employ a standard choice based on a continuous-time Markov chain process for nucleotide substitution [52] and an underlying coalescent process to generate the genealogy [53]. In particular, we assume the [37] (HKY85) substitution model with discrete -distributed rate heterogeneity across sites [38] parameterized by Ω = (μ, κ, σ). Parameter μ is the overall rate of mutation, κ is the transition/transversion bias and σ is the Gamma shape parameter. We assume a demographic coalescent process that allows for exponential population growth parameterized by θ = (N e τ, r). Parameter N e τ is the product of the effective population size and generation time and r is the exponential growth rate.
Restricting r = 0 results in a constant population-size model. After assuming a prior distribution over (Ω, θ), we can approximate the posterior distribution using Markov chain Monte Carlo (MCMC) techniques [54,55]. We refer interested readers to [22] for further details on prior choices and our MCMC approach. Simulation of (1) is readily available using the software BEAST [36].
With the tools to infer the random genealogy G and model parameters given sequence data in hand, we now consider summary statistics to assess the neutral model fit.
Consider a vector of test statistics T(G) = [T 1 (G), ..., T K (G)] that summarize the shape of the genealogy G. Each element T k (G) for k = 1, ..., K serves as a unique mapping between G and the real numbers and generally returns a small value if G were generated by a neutral process and a large value otherwise. One such example for T k (G) is D F . Different T k (G) serve to detect different types of departures from the neutral tree form.
It is important to note that T(G) depends on an unknown model parameter in contrast to a classical test statistic that depends only on fixed quantities, such as the observed data Y or a fixed estimate of the genealogy . In the Bayesian literature, test statistics that depend on unknown model parameters (and also sometimes the data directly) are generally referred to as "discrepancy values" [48] to help differentiate them from classical measures. To simplify notation, we continue to refer to T(G) as a summary statistic with the implicit understanding that it is random and not directly observable. The advantage afforded by leaving T(G) a random variable is that we are now able to compare the discrepancy between the observed data Y and the posited neutral model as a whole, instead of between the data and the best fit of the model. To use T(G) to assess the model fit of neutrality, we consider the following thought experiment. Suppose we randomly simulate under a neutral model a genealogy G rep from a replicated population almost identical to the population yielding the sequence data Y, where both populations share the same unknown demographic parameters θ, number of tips and tip-dates. Then, we compare quantities T(G rep ) and T(G) given Y. Disparate values signify model misspecification caused by non-neutral evolutionary forces.
We recall that T(G rep ) and T(G) given Y are not fixed values, but are random variables represented by probability distributions. As a consequence, we must integrate over all possible realizations weighed by their posterior probabilities to generate a test based on T(·). This process is called posterior predictive simulation [46][47][48]56]. Model selection and critique using posterior predictive simulation has had a successful history in phylogenetics [33,49,[57][58][59].
The central distribution that we require is the posterior predictive distribution of the test statistic In practice, one approximates the predictive distribution in (2) by first generating a posterior sample {G (p) , Ω (p) , θ (p) } for p = 1, ..., P from P(G, Ω, θ|Y). Then, for each p, one draws G rep,(p) ~ P(·|θ (p) ), where P(G rep |θ) describes a selectively neutral coalescent process. Finally, one tabulates T(G rep,(p) ). We interpret this predictive distribution as a description of the values that T(·) generates when applied to genealogies from selectively neutral populations. To assess neutrality in the observed data, we compare the predictive distribution to the posterior distribution of the test statistic approximated by tabulating T(G (p) ) for p = 1, ..., P.
When the test statistic T(·) is univariate [33], assessing differences between predictive and posterior distributions can be done in two ways [47]. The first method is graphical, generating a scatterplot of {[T(G rep,(p) , T(G (p) )], p = 1, ..., P}. The second method is more formal, employing tailarea probabilities.
Let the posterior predictive p-value [48] then p B remains well-defined even though T(G rep ) and T(G) given Y are not directly observable [47]. Probability p B shares many characteristics with a classical p-value; for example, p B can be viewed as its posterior mean and, under the null hypothesis of neutrality, p B is approximately distributed as a Uniform [0, 1) random variable [48]. Given these properties, we reject the selectively neutral model for extreme values of p B , say p B <α = 0.05 for strictly non-negative T(·) or p B <α/2 or p B > 1 -α/2 otherwise.
To calculate p B , a consistent estimator is where 1{·} is the indicator function, returning 1 if its argument is true and 0 otherwise.
When the test statistic T(·) is multivariate, we are able to detect a greater variety of departures from selective neutrality simultaneously, but a single tail-area probability becomes more troublesome to calculate. In this situation, we first standardize individual elements T k (·) such that var [T k (G)|Y] = 1 for all k. This places all measures on a common scale. We then generate scatterplots of the multivariate distributions. We agree with [47] in that comparing the posterior and predictive distributions graphicallŷ can provide more information than reporting a single pvalue. For example, we can identify which components T k (·) in T(·) contribute greatest to the discrepancy between the data and a selectively neutral model.
To calculate a tail-area probability in the multivariate setting, we turn to the (squared) Mahalanobis distance in constructing a posterior predictive test [60]. Let be an estimate of the predictive mean of T(G rep ) and be an estimate of its variance-covariance matrix, such that for p = 1, ..., P. Then, we define the (squared) Mahalanobis distance where we substitute T(G) for x when considering the distance's posterior distribution and T(G rep ) for x when considering its predictive distribution. Mahalanobis distances are commonly used in discrimination analysis and classification. The metric of the Mahalanobis distance M(·) is the inverse of the variance-covariance matrix of the predictive distribution and, as such, returns distances normalized relative to the multidimensional spread of the data under selective neutrality. Following in the light of Equation (5), we define the multivariate posterior predictive p-value A consistent estimator of the multivariate p B is readily available in the vain of Equation (6).
When it is unclear a priori which elements T k (·) provide the most power to reject selective neutrality, the multivariate approach side-steps the multiple testing problem inherent in examining each element independently. In these situations, we consider first using (9) as a global test with a fixed Type I Error rate α and then sub-selecting a small number of individual T k (·) for further univariate analysis. For researchers who begin by examining the K univariate analyses separately, we recommend applying a Bonferroni correction by decreasing the critical value cutoff from α to α/K per test. For large K, a Bonferrioni cor-rection is overly conservative, especially when considering the potentially high correlation between T k (·). At this point, monitoring the false discovery rate [61] becomes more practical. V