Consider the table F of observed haplotype counts, where the Bi represent alleles at marker B and the Ai alleles at marker A. Let Ω be the set of all tables T with row and column sums equal to r1, ..., rI and c1, ..., cJ, respectively. Given a criterion to quantify the discrepancy between F and the table expected under independence (linkage equilibrium), a volume measure is defined as the proportion of tables T ∈ Ω that lead to a smaller discrepancy value. If the recorded discrepancy is the largest possible, the volume measure will have a value close to 1 (the exact value 1 will be attained as the sample size increases to ∞). Conversely, if all other tables lead to larger discrepancies, the volume measure will be zero.
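In symbols, writing d(T) for the discrepancy between a table T and the table expected under independence, and #A for the number of tables in a set A, the definition above can be summarized as

vol(F) = #{T ∈ Ω : d(T) < d(F)} / #Ω,

with the specific measures introduced below differing in the choice of d(·) and, in the case of Dvol, in the reference set of tables.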
One may notice that this definition of a volume measure is similar to one minus the p-value of a test of independence. Indeed, volume measures are related to the "volume test," a notion originally introduced by Hotelling [8], and the effect of sample size on the measures is very much the same as its effect on a p-value. The key difference between volume measures and variants of the commonly used Fisher's exact test for independence is that, in the case of volume measures, the relevant proportion of tables is evaluated assuming that all tables with the same margins are equally probable, while in the case of Fisher's exact test tables are generated under the hypothesis of independence. Because of this, volume measures and Fisher's exact tests answer two very different questions: the first compares the observed table to all tables with the same margins, while the second assesses the likelihood of the observed table under independence. A thorough discussion of the different interpretations and uses of these two approaches can be found in [7]. To evaluate volume measures concretely, one has to choose a criterion for the discrepancy and be able to explore the space of tables with fixed margins in order to evaluate the required proportions. We start by illustrating the first point, focusing on three specific measures: a) Dvol, which is defined only on 2 × 2 tables and coincides with D' when the population haplotype distribution is known; b) Mvol, which is a generalization of Dvol to multiallelic markers; c) Hvol, which is based on expected homozygosity and captures information close to that described by R2, although it can be defined on tables with any number of entries.
When I = J = 2, let Ω1 = {T : ti+ = ri, t+j = cj, (t11 - r1c1/n)(f11 - r1c1/n) > 0}. We then define Dvol as

Dvol(F) = #{T ∈ Ω1 : d(T) < d(F)} / #Ω1,

where d(T) = |t11 - r1c1/n|.
For general I × J tables, recall that Ω denotes the set of all contingency tables with the same row and column sums as F: Ω = {T : ti+ = ri, t+j = cj}. Then, we define

Mvol(F) = #{T ∈ Ω : m(T) < m(F)} / #Ω,

where m(T) quantifies the discrepancy between T and the table of counts expected under independence.
The definition above should clarify how Mvol is closely related to Dvol, and the difference between the two is that Mvol does not consider the "sign" of the association, a notion that is undefined in generic I × J tables.
Letting h(T) denote the expected homozygosity associated with a table T, we can define the measure Hvol:

Hvol(F) = #{T ∈ Ω : h(T) < h(F)} / #Ω.
We have mentioned how Hvol captures information closely related to that of R2. A careful discussion of the interpretation of LD measures based on homozygosity can be found in [9]. Here it suffices to recall that joint homozygosity relates to a measure of agreement between the two markers and excess in homozygosity indicates that knowledge of the allele value at one marker increases predictive accuracy of the allele values at the other marker. The results of a recent empirical study conducted using homozygosity-based measures are documented in [10].
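To make the homozygosity statistic concrete, the following minimal Python sketch computes the expected joint homozygosity of the haplotype distribution implied by a table of counts, assuming the standard definition as the sum of squared haplotype frequencies (the exact statistic h used in the definition of Hvol may differ in detail; the function name is ours):

def joint_homozygosity(table):
    # Sketch: expected joint homozygosity of the haplotype distribution implied by a
    # table of counts, computed as the sum of squared haplotype frequencies.
    n = sum(sum(row) for row in table)
    return sum((t / n) ** 2 for row in table for t in row)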
Note that all the above definitions use the strict inequality sign. The choice of < over ≤ is irrelevant for large n, but it makes a difference for small n, where the strict inequality allows us to better discriminate against apparent association due to small samples.
To evaluate these measures, we need to explore the space of all tables with the same margins. In the case of I = J = 2, this can be done by simple enumeration. For multiallelic tables, enumeration is impractical. An obvious alternative is to restrict one's attention to a sample of possible tables. However, obtaining a sample of tables according to the uniform distribution over all tables with fixed margins (as opposed to the Fisher-Yates distribution) is not easy. It is indeed the computational difficulty associated with volume tests [7] and measures [6] that has substantially hindered their widespread application. Previous solutions have relied on Markov chain Monte Carlo algorithms [11] as well as rejection sampling (see [12] for a review). The main contribution of this paper is that we have successfully implemented a sequential importance sampling (SIS) algorithm, originally introduced in [12], to evaluate volume measures accurately and in a timely manner. This implementation makes volume measures applicable to high-throughput analysis.
To enumerate all tables in Ω1 (I = J = 2), it is useful to notice that t11 must satisfy
max(0, r1 + c1 - n) ≤ t11 ≤ min(r1, c1), (2)
and after t11 is chosen, we can fill in the other entries of the 2 × 2 table using the marginal sum constraints. Therefore we can enumerate the tables in Ω1 by assigning to t11 all possible integers satisfying (2), and keeping those tables for which (t11 - r1c1/n) has the same sign as in F.
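The following minimal Python sketch implements this enumeration for a 2 × 2 table and returns Dvol, using |t11 - r1c1/n| as the discrepancy, consistent with the definition given above (the function name and the example counts are ours):

def dvol_2x2(f11, f12, f21, f22):
    # Sketch: Dvol for a 2 x 2 table of haplotype counts, by full enumeration.
    r1 = f11 + f12                         # row sums
    r2 = f21 + f22
    c1 = f11 + f21                         # first column sum
    n = r1 + r2
    d_obs = f11 - r1 * c1 / n              # observed deviation of t11 from its expected value
    n_omega1 = 0                           # number of tables in Omega_1 (same sign as F)
    n_smaller = 0                          # tables with strictly smaller |deviation|
    for t11 in range(max(0, r1 + c1 - n), min(r1, c1) + 1):
        d = t11 - r1 * c1 / n
        if d * d_obs > 0:                  # same "sign" of association as the observed table
            n_omega1 += 1
            if abs(d) < abs(d_obs):
                n_smaller += 1
    return n_smaller / n_omega1 if n_omega1 else 0.0

# Example with hypothetical counts: dvol_2x2(40, 10, 10, 40)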
We now consider the SIS procedure for I × J tables. Let u(T) be the uniform distribution over all tables in Ω. Then Mvol(F) can be treated as the expectation of the indicator function 1{m(T) < m(F)} with respect to u(T). It is hard to sample directly from u(T). The idea of importance sampling is to sample tables from another proposal distribution g(T), and then estimate Mvol(F) by

[Σl 1{m(Tl) < m(F)} / g(Tl)] / [Σl 1 / g(Tl)],

where the sums run over l = 1, ..., L, each table is weighted by the importance weight 1/g(Tl), and T1, ..., TL are L independent and identically distributed (i.i.d.) samples from g(T). SIS generates a table cell by cell by decomposing the proposal distribution g(T) as
g(T) = g(t11) g(t21 | t11) ... g(tIJ | tI-1,J, ..., t11).
Notice that the support of the first entry t11 is max(0, r1 + c1 - n) ≤ t11 ≤ min(r1, c1). We sample an integer uniformly from this range for t11, i.e., g(t11) is the uniform distribution on the support of t11.
Recursively, suppose we have sampled ti1 for i = 1, ..., k - 1. Then the support for tk1 is

max(0, c1 - (t11 + ... + tk-1,1) - (rk+1 + ... + rI)) ≤ tk1 ≤ min(rk, c1 - (t11 + ... + tk-1,1)).

We sample an integer uniformly from this range for tk1. The procedure continues until all the entries in the first column have been considered. Then we update the row sums by subtracting the realization of the first column from the original row sums, and sample the second column of the table in the same way.
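As an illustration of this column-by-column construction, the following Python sketch draws a single table with the given margins and returns it together with its proposal probability g(T), using uniform proposals on the supports described above (all names are ours; this is a sketch of the sampling step, not necessarily the implementation used for the analyses):

import random

def sis_sample(row_sums, col_sums):
    # Sketch: draw one table with the given margins, column by column,
    # and return it together with its proposal probability g(T).
    rows = list(row_sums)                  # row sums still to be allocated
    I, J = len(rows), len(col_sums)
    table = [[0] * J for _ in range(I)]
    g = 1.0
    for j in range(J):
        remaining = col_sums[j]            # part of column j still to be allocated
        for i in range(I):
            if i < I - 1:
                lo = max(0, remaining - sum(rows[i + 1:]))   # rows below must absorb the rest
                hi = min(rows[i], remaining)
                t = random.randint(lo, hi)                   # uniform proposal on the support
                g *= 1.0 / (hi - lo + 1)
            else:
                t = remaining              # the last entry of the column is forced
            table[i][j] = t
            rows[i] -= t
            remaining -= t
    return table, g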
The computing time and precision of the algorithm differ between 2 × 2 and larger tables. For 2 × 2 tables, our algorithm simply lists all possible tables with fixed margins; CPU time is then proportional to the total number of tables, and their enumeration usually takes a fraction of a second. The algorithm is exact, so there is no approximation error in the output. For the general case of I × J tables, the CPU time depends on the number of uniform random variables generated: I × J × L for L Monte Carlo samples. It is important to keep in mind that the output of the algorithm is not exact but an estimate of the true volume measure (so different runs will give slightly different results). The precision of the final estimate depends on the number L of Monte Carlo samples and on how well the proposal distribution in the SIS algorithm approximates the target distribution for a given table. The value of the parameter L has to be specified by the user; it is advisable to conduct multiple trial runs to estimate the precision of the estimate and select a value of L that ensures acceptable precision.
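Putting the pieces together, the sketch below shows how the output of the sampler above could be turned into the estimate of Mvol described earlier, with the discrepancy m(·) supplied as a function and the weights 1/g(T) targeting the uniform distribution on Ω (names and the default value of L are ours):

def mvol_estimate(obs_table, m, L=10000):
    # Sketch: SIS estimate of Mvol as an importance-weighted proportion of tables
    # with m(T) < m(F); the weights 1/g(T) target the uniform distribution on Omega.
    row_sums = [sum(row) for row in obs_table]
    col_sums = [sum(col) for col in zip(*obs_table)]
    m_obs = m(obs_table)
    num, den = 0.0, 0.0
    for _ in range(L):
        table, g = sis_sample(row_sums, col_sums)
        w = 1.0 / g                        # for very large tables, log-weights avoid overflow
        den += w
        if m(table) < m_obs:
            num += w
    return num / den

# Repeating the call a few times for the same table and the same L gives a direct
# estimate of the Monte Carlo error and helps choose L.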