In this section we describe the mathematical details of the iXora haplotype inference algorithm, the measures used to quantify the precision of the output, and the different downstream processing of the output. We conclude the section with the description of the measures used to compare the results from different phasing algorithms.
iXora algorithm: haplotype inferencing
The outline of the three phases in the iXora algorithm is shown in Figure 1. In this section what follows is a more precise mathematical description of the steps which is presented as a sequence of key observations: thus this describes as well as provides the rationale for each of the steps. Note that Figure 1, annotated by equation and observation numbers, is a road map for the exposition in this section to help the reader understand the description.
Notation
Let be an n × m matrix that encodes n (> 1) progeny with m (> 1) markers. Each row i represents a progeny and each column j represents a marker. The order of the markers is also captured in the matrix, i.e., if and only if marker j is located between markers j′ and j′′ on the chromosome. The j th marker of the i th progeny, denoted or simply 〈i j〉, also referred to as the position 〈i j〉, is a pair of alleles: each individual allele can be accessed as and . Thus if or simply written as AC, A and C. The two observed alleles at marker j (across all individuals) will be denoted as Z and Z′. A marker j is polymorphic if Z ≠ Z′. We make the assumption that all the markers in are polymorphic. When or the j th marker at individual i, or position 〈i j〉, is said to be homozygous. Similarly, when , position 〈i j〉 is said to be heterozygous.
We next introduce a definition and notation for conjugacy. Let conjugate of z be written as , where z is a matrix or a discrete value as defined below. Note that the conjugate of conjugate of z is z, i.e., always holds. For parents p = a,b, if p = a, then and vice-versa. Similarly, if k = 0, then and vice-versa. Thus (and ). Also, and for the parents p = a,b.
Solvability of a given genotype matrix
Assume that there is no more than one crossover, in an individual, between two adjacent positions. If d
t
is the true number of such crossovers, then 0≤d
t
≤ 2(m - 1)n. Let the estimate of d
t
be d
c
≥ 0. Consider a scenario where each position in is heterozygous. Then there exist an exponentially large number (2mn) of distinct and equally-likely haplotype configurations of the progeny, each with no crossovers. While informative markers in general are chosen for their heterozygosity in data sets, it is observed that the same marker is also homozygous in many a progeny. Thus, in practice, due to Mendelian inheritance [16] it is very unlikely to have a run of markers that are heterozygous in all the progeny. It is this random sprinkling of homozygous positions in , that makes d
c
estimation possible. We conducted simulations to study the relationship between d
c
and the extent of homozygosity in in realistic data scenarios.
Let e denote the mean number of crossovers in a progeny. We used simulations with values 2 ≤ e ≤ 15. We observed that for this wide range of crossover profiles, the required fraction of homozygous sites in to get a good estimate of d
t
(i.e., within 5% of the true value) was bound. The empirical observation is summarized below.
Observation 1. When at least 28% of each subsample of randomly chosen positions inis homozygous, the estimated d
c
is within 5% of the exact value d
t
, i.e., d
c
≥ 0.95 d
t
.
In practice, we have encountered values of n in the range of 50 to 400 and m in the range of 30 to 600. We observe that consistently, at least 50% positions are homozygous and the solutions, obtained using methods of the following sections, displayed e ≈ 2. With decrease in cost and in increase in accessibility of sequencing and genotyping technologies, m could be orders of magnitude larger in the coming years. Note that e is not expected to increase with m. Thus the lower bound estimation scales well with m, since the observation above is independent of m and the algorithm discussed in the later sections is linear in m.
Phase I: Preprocessing
Since all the individuals have the same two parents, let the two parents be a and b. Let Va(0) encode haplotype 0 of parent a and Va(1) encode haplotype 1 of parent a. Similarly, Vb(0) and Vb(1). The two distinct allele values of parent p at marker j are represented by and .
At Step 1, Ma and Mb are initialized, for each i and j and p = a,b, as follows. Let the two alleles at marker j be written as Z
j
and .
(1)
Also, Va and Vb is initialized as and , for each j and p = a,b.
If marker j is homozygous in both parents, then position 〈i,j〉 is heterozygous, for all progeny i. If marker j is not homozygous in both parents, then possible progeny values are Z
j
Z
j
, , and . When both parents are homozygous, it may not be apparent what the allelic values of the individual parents are, but as is seen in the running example of the Overview section, this does not affect the solution. Such loci are marked as "-” in the parents, i.e., "-” for p = a,b. Also the column j of the matrices are updated as , for each i. If exactly one parent (either a or b) is homozygous, then only Z
j
Z
j
, but not , can be observed in some progeny, while the rest are . In Step 2, we identify all such markers.
Note that while it is easy to estimate if a marker is homozygous in both parents or heterozygous in both, it is not obvious to estimate the heterozygous parent when exactly one of the parents is so. In Step 3 we identify markers which are homozygous in exactly one parent (i.e., either a or b). The execution of this process is illustrated in the Overview section through the running example and is described in detail here.
Recall that a marker j is homozygous in parent p if all the entries in Mp are H. For a fixed p, for a pair of non-homozygous markers j and j′ four counts, are computed for all combinations of k1,k2 ∈ {0,1} as:
In words, c01 is the count of individuals where marker at j is inherited from haplotype 0 of the parent p and the marker at j′ is inherited from haplotype 1 of the same parent. Similarly, c10, c11 and c00. Let
Let t = c00 + c11 and t′ = c01 + c10. Also, let tmax = max{t,t′} and tmin = min{t,t′}. Then this is interpreted as tmax individuals with no crossovers while tmin individuals have a crossover between locations j and j′. In practice, we use a stricter condition where the values tmax and tmin need to be well separated. We use two threshold fractions: 0 < α1,α2 ≤ 1. is said to be polarized if and only if the following hold:
In our implementation we use α1 = 0.3 and α2 = 0.15, which we have observed to yield accurate phasing results on real data, confirmed by the precision measures discussed later. When the polarization condition is violated, it is considered to be an error in data . In practice, we observed that in all such cases this was due to experimental errors at one of the markers. Hence, we make the following assumption. For parent p and for the pair of non-homozygous markers j and j′, must be polarized. In practice, when this condition is violated, we flag the marker for closer scrutiny in the experiments. A consequence of this assumption is the following.
Observation 2. Let J
a
(resp J
b
) be the set of markers with heterozygous parent a (resp b ). Ifis not polarized, then j ∈ J
p
and.
This observation states that it is possible to computationally obtain two non-overlapping sets of markers, where one set represents the markers that are homozygous in only parent a and the other set represents the markers that are homozygous in only parent b. Thus, when the parent genotypes are not available, this observation is utilized in Step 3 to resolve the markers that are homozygous in exactly in one parent.
Let marker j be homozygous only in parent a based on the above process, then and column j in Ma and Mb are updated (Step 4) as follows: , and . Next, column j in Mp is updated as follows:
(2)
Note that the above is equivalent to: For a marker j that is homozygous only in parent a: is set to H for all i and if was X, then it is updated to 1. Similarly, for a marker j that is homozygous only in parent b: is set to H for all i and if was X, then it is updated to 1.
Rationale. Note parent a is homozygous at j (but parent b is not). The reason for switching X to 1 in column j of Mb is that no matter what the value of is, or will be eventually updated to, 1 encodes for the allele Z’.
When marker j is homozygous only in parent b, a similar update as above is performed. Note that in the implementation, as a marker j is resolved (Step 3), the marker is immediately updated in Vp and Mp (Step 4).
Phase II: Detecting crossovers
Without loss of generality, for each non-homozygous j in Mp, let j
n
x
t
be the adjacent non-homozygous marker in Mp. Substituting j′ with j
n
x
t
in the definitions from the last section, let tmax and tmin be redefined. Then in Step 5, Vp is phased as follows: For parent p, if , then the values of and are interchanged. Further, column j in Mp is updated by replacing an existing entry of 0 with 1 and an entry of 1 with 0, to reflect the updated .
Let . Then the minimal number of crossovers (or recombinations) between j and j
n
x
t
in haplotypes inherited from parent p is . Thus:
Observation 3. If d
c
is the total number of estimated crossovers, then
Note that d
c
could be larger than the sum on the left, since some further crossovers may be introduced in the next phase. Since only adjacent markers need to be considered for detecting crossovers, the following holds.
Observation 4. Given, the haplotype matrices Ma,Mb and the parent haplotypes Va,Vb can be constructed intime.
Missing values There are often missing values in the data, sometimes as high as 20%. When the value is missing at position 〈i j〉, we make the assignment , for p = a,b. However, if after the computation of V, marker j is homozygous at parent p, then and is appropriately updated. It can be verified that this treatment does not introduce any extra (spurious) crossovers.
Selfed progeny For selfed progeny, not only are the parents the same but even the haplotypes of the parents are deemed identical. Then one of the following conditions holds for p = a,b and a numeric k ∈ {0,1}:
-
(1)
V a=V b and if then , for all i and j, or,
-
(2)
V a ≠ V b and if then , for all i and j.
Thus Mb is defined completely by Ma and thus the task is to estimate only Ma.
Monotonic state transitions Next, the matrices Mp are transformed as follows. A position 〈i j〉 is a heterozygous trio, if the two parents at j are heterozygous and so is .
Let k represent a numeric value. Define two functions Lt(·) and Rt(·) on an element of the matrix as follows:
Note that a marker j can never be both leftmost and rightmost since the number of markers is at least two, hence both Lt(·) and Rt(·) are well defined. Let M
i
j
= x. Then the transition S
x
→Sy is applied to assign the value y to M
i
j
(written as M
i
j
←y) as follows.
o
o If Lt(M
i
j
) = Rt(M
i
j
) = k, then M
i
j
← e
k
.
o
o If Lt(M
i
j
) ≠ Rt(M
i
j
) with Lt(M
i
j
) = k1 and Rt(M
i
j
) = k2, then
o , (note that k1 ≠ k2)
o If Lt(M
i
j
) = Rt(M
i
j
) = k, then M
i
j
← e
k
.
o
o M
i
j
← k.
o
o If is numeric, then
To estimate the running time of the algorithm, we classify the transitions as intra-transitions and inter-transitions, depending on whether M or is used respectively. Thus is the only inter-transition. The above transitions are applied to the elements of Ma and Mb till no more transition is applicable. This final form of matrix is written as F(Ma) or simply Fa (similarly, F(Mb) or simply Fb).
The permissible transitions are captured in the transition diagram in Figure 2. The three possible start states are S
H
, S
X
and S
k
and the three possible final states are shown boxed. The curly arrow represents the inter-transition while the straight arrows represent the intra-transitions. Note that each M
i
j
undergoes no more than three transitions since there are no cycles in the state transition diagram. Also, each state with multiple outgoing edges have non-overlapping conditions, leading to a unique choice of transition.
Each element of Fp is in the final state, i.e., for each position 〈i j〉, , p = a,b. A numeric value of -1 indicates that no information regarding the haplotypes can be deduced. It can be verified that these state transitions induce the following properties on F and V.
Observation 5. For a fixed i and p = a,b, the following hold:
-
(1)
If is not numeric but is, then must hold.
-
(2)
If and are both not numeric, then 〈i j〉, must be a heterozygous trio. The converse however is not true.
-
(3)
If -1, for some j, then -1, for all j.
Observation 6.
-
(1)
Given M a and M b, F a and F b are unique.
-
(2)
F a and F b can be constructed from M a and M b in time.
To show that Fa and Fb are unique, the iterations can be viewed as of two types: one where only inter-transitions and the other where only intra-transitions are applicable. In the very first iteration, due to the possible start states, no inter-transition can be applied. Thus using uniquely applicable intra-transitions, each and each is transformed. When no intra-transition can be applied, the uniquely applicable inter-transitions are applied. Thus the iterations alternate between intra- and inter-transitions. Hence the final forms Fa and Fb are unique. Next, since each entry can go through no more than three transitions, it is possible to obtain Fa and Fb from Ma and Mb in time using suitable list data structures.
Phase III: Staging output
An optimization problem (e.g., minimizing an appropriate error function) whose solution is associated with an output configuration, such as alignment of multiple sequences or a phylogeny topology or landscape of crossovers in chromosomes, has the added burden of proving stability in the solution. In other words, how distinct in configuration are the next closest solutions? This is usually very difficult to answer, and most methods are inadequate in addressing this. However, due to the very specific nature of our problem, we provide an agglomerate of "best” solutions, so that its stability can be gauged. Our focus is not just on a maximum likely or a most parsimonious solution, but on an "agglomerate” of all (equally-likely) feasible solutions. This is a characteristic not just of the method but, in a sense, that of the data as well.
Suppose there are L > 0 feasible distinct solutions for a given . Is it possible to generate an agglomerate in a single pair of matrices Ra and Rb that captures all the L solutions ? In the following paragraphs we describe how to construct the agglomerate (Ra and Rb) based on the encodings in Fa and Fb.
Let the conjugate of F be defined as and .d that encodes all the possible solutions, is constructed from , p = a,b, as follows. We use new symbols ∗, q and Q in addition to the numeric values (0 and 1) in Rp. For each progeny i and p = a,b:
-
1.
If for some j, then without loss of generality and , for all j.
-
2.
If then
-
3.
For each j, if is numeric and is not, then .
-
4.
For each j, where and are both not numeric, then
(3)
This is illustrated in the following example.
Example 1. Letand, for some j. Further letandfor some i. Then by Equation 4,. However if, then.
The non-numeric values in Rp encode multiple solutions, possibly with multiple crossovers. Hence we call the contiguous blocks of non-numeric values as dispersion intervals. The details of interpreting these intervals is discussed in the following sections.
Precision measures of iXora output
Note that d
c
is the number crossovers seen in the result matrices, thus an average number of crossovers for the individuals is d
c
/n. But these also correspond to an output configuration. Then how precise is the output of iXora for an input genotype matrix ?
The matrices Ra and Rb enable the elicitation of the parameters for the various haplotype distributions across the markers as well as the precision measures. The agglomerate structure aids in defining a stability measure summarized as the metric (Equation 11). The other inadequacies, such as insufficient information and errors due to imperfect experiments or data-acquisition, are evaluated by the methodology and summarized as and respectively (Equations 12-13). The former is the extent of dispersion in position of each crossover over the agglomerate, and the latter is the observed error in the R matrices. The trio define the haplotype precision in the given genotypes.
Stability measure,
Let r be a run of contiguous values in a row (progeny) of Rp. A dispersion interval is a non-numeric run, sandwiched by the chromosome boundary or numeric value. Let Ra and Rb be runs in the two haplotypes of the progeny and are aligned in the examples discussed below. It is possible that only one of Ra or Rb is a numeric run. Then it is called a asymmetric. But if both are non-numeric, then by construction (Equation 4) ra = rb and the run is symmetric. A switch is defined to occur between q and Q or between Q and the numeric value that sandwiches the run. The number of switches is written as s w(r). The value ∗ is a wild card and can be treated either as q or Q. The switch detection is succinctly described by the following two examples. In the illustrations the dispersion interval is shown in green sandwiched between numeric values (in black) and each switch is marked as a numbered (red) down-arrow.
Example 2. Consider the following run.
The number of switches is 4 as shown.
The wild card may result in different positions of the same switch corresponding to whether it was interpreted as a q or a Q. These distinct positions of the switch is termed the wild card count of a switch. If a switch position is not affected by any wild card, its count is 1.
Example 3. The following run has three wild cards.
The first is treated as a q, the second as a Q and the third can be treated as a q or Q with two possible positions for the third switch.
The wild card counts of the four switches are 1, 1, 2 and 1 respectively.
When s w(r) > 0, the location of the switches may define additional positions 〈i j〉 in R that can updated from non-numeric to numeric. These are the positions that are to the left of the leftmost switch and to the right of the rightmost switch positions. Thus in the run of Example 2, the two q’s on the left cannot take the value 1, hence they can be assigned 0, thus the length of the dispersion interval shrinks from 8 to 6.
Similarly the length of the dispersion interval of Example 3 shrinks from 11 to 9.
The transformed values are shown in bold above. The same process is applied to every dispersion interval to transform the matrices Ra and Rb. The observations from the transformations above is summarized as:
Observation 7. R is such that for each dispersion interval r, if s w(r) > 0, then (1) r begins and ends with Q and (2) the positions of the first and the last switch sandwich the dispersion interval.
In fact, the following is easily verified:
Observation 8. Let r be a dispersion interval and let s = sw(r). Then (1) the total number of crossovers in the interval, across both the haplotypes, is exactly s and each crossover in each haplotype of the configuration is at a position of the switch.(2) If r is asymmetric, then s is zero. In general, s is even and the number of crossovers in each haplotype of each distinct configuration is odd.
In practice, we observed that in all data sets all the dispersion intervals had no switches. There was exactly one instance where s w(r) was 2. Also, the following is easily verified.
Observation 9. Let s = s w(r) for a dispersion interval r. Then if s ≤ 2, there is no additional crossover introduced by r. If s > 2, the number of additional crossovers is s-2.
Let U be the set of all dispersion intervals in R. Then
(4)
Dispersion index,
The dispersion index, , is a measure of the uncertainty in the position of the crossovers. This is computed as an average over all predicted crossovers. Thus
(5)
with
where c
l
is the wild card count of each switch l in r. When the location of each crossover is known precisely, this index is 0, while value 1 indicates maximum uncertainty in the location.
Error estimate,
If then the position 〈i j〉 has a mismatch if Note that the mismatch at this position could be flagged as an error or one of and can be modified to potentially introduce additional crossover(s). These are undetectable during the lower bound computation and do not overlap with the dispersion intervals. Let N be the number of such mismatches. Then, , the error estimate in is defined as
Also, in our experiments these mismatches were extremely low (less than 0.01) and when followed up turned out to be experimental errors. Hence we have followed the convention that such a mismatch be flagged as a potential error. Then an error, if any, is at 〈i j〉 in F that underwent the transitions . Note that the converse is not true. Also, an additional crossover, if any, is at 〈i j〉 in F that underwent the transitions . Again, the converse is not true.
To summarize, the trio () define the haplotype precision in the given genotypes . Across our experiments (30 distinct data sets) with real data [15], the mean precision trio were observed to be (0,0.002165,0.000299).
Downstream processing of iXora output
Since iXora associates the parent haplotypes (not just the inherited alleles) to each marker in a progeny, it is possible to study the distributions of the inherited parent haplotypes independent of or in association with a phenotype. The details of these and other related postprocessing available in the iXora framework are described here.
Haplotype frequency distributions
One of the important consequences of haplotype inferencing is obtaining the haplotype frequency distribution across the chromosomes. A marker j of progeny i has two alleles, one derived either from haplotype 0 or haplotype 1 of parent a and the other either from haplotype 0 or haplotype 1 of parent b. Since R is an agglomerate, it also contains the non-numeric values that encode multiple configurations. Based on the encoding in R we estimate the expected value of the frequency count and its variance. First, we enumerate the feasible solutions encoded by a non-numeric run r.
Example 4. In each of the following the length of the dispersion interval is 3 and the number of additional crossovers is zero. However, the number of configurations are different based on the structure of the switches. Two runs with zero switches in each:
A run with s w(r) = 2:
The 8 distinct configurations for Example 2 are:
Let w t(r) be the number of distinct solutions encoded by r.
Observation 10. Let s = sw(r). Let c1,c2,…c
s
be the wild card counts of the switch positions. Let wt(r) be the number of distinct feasible solutions due to r. Then
(1) If < 2, wt(r) = len(r) + 1.
(2) If s ≥ 2,
, where
(7)
-
(1)
above is easily verified and for (2), Equation 17 is explained through the following example.
Example 5. Consider the following r with sw(r) = 6.
Note that the 6 switches can be shared by the two haplotypes as 1 and 5 (the first column), or as 3 and 3 (second column) as follows.
There are six distinct configurations for the first case and since the two haplotypes can be switched it gives 2 × 6 distinct configurations. For the second there aredistinct configuration, giving a total of 12+20 distinct solutions due to r.
Expected frequency and variance
Based on R, we estimate the expected value of the frequency count of the haplotype pairs and its variance. If R
i
j
is non-numeric, let R
i
j
= α hold. Thus for each marker j consider the following, for x,y = 0,1,α,
In the absence of any other external information, each of the alternative solutions in R is equally likely. Under this assumption, it is easy to verify the following:
Observation 11. The expected count,, of each haplotype pair, k1,k2 = 0,1, is:
The varianceof the count of the haplotype pair is approximated as.
Haplotype-phenotype association analysis
Below, we describe the iXora methodology for associating discrete traits with genomic locations using haplotypes. The same approach can be used for continuous traits, using different statistical tests and randomizations. In general, the phasing output can be used in other types of statistical tests, for example to test for associations between a pair of markers and a phenotype. In the following, let L the number of discrete phenotype values.
Combination of parents We can test the effect of each haplotype pair at a marker with a phenotype as follows. In the case of discrete phenotype with L possible values, iXora constructs for each marker a 4 × L contingency table of haplotype pair & phenotype counts. In the current iXora implementation, we use Fisher’s exact test fisher.test in R [17] to test for association between haplotype pairs & phenotype. The test outputs a p-value for each marker, denoting the significance of the association.
Individual parents We can also investigate the contribution to phenotype of each parent individually. The contingency table in this case is a 2 × L table of haplotype & phenotype counts (a separate table for each parent at each marker). Fisher’s Exact test is used to test for association between haplotypes & phenotype. The test outputs a p-value for each marker and for each parent, denoting the significance of the association.
Significance thresholds via randomization We include a method for directly estimating the background distribution of p-values in the haplotype-phenotype data by randomizing the phenotype labels. If p-values observed in the randomized data are always larger than in the real data, then the findings on real data may indeed be significant. When working with categorical traits, we take into account the size (number of individuals) of each phenotype category in the permutation test. Let S be the size of the smallest category. For the permutation test, we randomly select S individuals from each category and permute their phenotype values. The permutation test is repeated T times, running the same statistical test on the randomized data as on the real data, for each marker. The smallest p-value observed in the randomized data is generated as output and becomes the threshold for significance in the real results.
iXora output visualization
The agglomerate solution from the phasing algorithm can be directly visualized to detect distortions in the data, with or without using phenotypic information. Approaches for visualizing the phasing solution are demonstrated in the following two paragraphs, while the third paragraph describes visualization of haplotype-phenotype associations. The figures shown here as examples stem from the use case described in detail in the Section "Using iXora” in Additional file 1. Therein is detailed the phasing an trait association process for locating the genomic region and specific haplotype that determines the simulated phenotype (height).
Individual haplotypes The individual haplotypes can be directly visualized, for example as the colored haplotype blocks shown in Additional file 1: Figure A3. The visualization indicates the locations of crossovers in each individual, including uncertainty in the crossover locations. As an alternative visualization, Additional file 1: Figure A4 shows iXora output on the frequencies of individual haplotypes per parent. In this case, the individuals are divided into two groups by phenotype, and clear distortion is observed regarding frequency of haplotypes in each group.
Haplotype pairs The agglomerate structure capturing all the equally-likely solutions enables estimation of the possible dispersion of the crossover locations. iXora visualization of the expected frequency distributions of the progeny haplotype pairs is shown in Figure 3 for two phenotype groups. Clear under-representation of two distinct haplotype pairs is observed for each group. This visualization can be used to spot unexpected distortions, whose significance can be further evaluated using statistical tests.
Phenotype association Phenotype association for each parent individually is shown in Figure 4. The p-value threshold obtained via randomizations is also shown. In this case it is evident that only one parent is associated with the phenotype, in one genomic region. An example of iXora results from statistical testing for haplotype pairs’ association with phenotype is shown in Additional file 1: Figure A5, and a comparison between genotype and haplotype association results is shown in Additional file 1: Figure A6.
Comparison with related methods
Here we first elaborate on the three distinct categories of population models for phasing, and then give details on the comparison of iXora with related phasing methods in literature. Technical details on the settings for each compared method can be found in Additional file 1.
Unrelated individuals (no parent information) These methods treat the input genotypes as samples from a population of unrelated individuals, and do not assign the progeny to parental haplotypes. While they may be applicable to human population data, they are not suitable for our purposes. fastPHASE can be adapted to our problem setting by treating the input as n + 2 individuals from a population originating from four founder haplotype clusters. We adapt MACH similarly, though it has a much larger default value on the number of haplotypes. FMPH (Integer Programming Formulations To Solve Maximum Parsimony Haplotyping) [12] is closest in spirit to iXora, but is computationally very expensive and suitable only for small data sets (up to ≈50×30), although a hybrid approach is suitable for slightly larger data sets (up to ≈150×100). While the limitation on number of individuals can be tackled by splitting the data (as we do for Merlin, HAPI, SHAPEIT2 below), the limitation on the number of markers is debilitating, so we were unable to run FMPH on our data sets.
Unrelated trios These methods allow the definition of family relationships between parents and progeny in the input, with the limitation that each parent has only one progeny. BEAGLE and HAPI-UR (HAPlotype Inference for UnRelated samples) are two such methods. The methods phase the progeny individually in terms of sequences that were transmitted from each parent. Therefore the progeny are not necessarily assigned to a consistent set of parent haplotypes.
Related trios These methods allow defining several progeny originating from the same two parents, thus the underlying algorithms utilize the full sibling information. However, unlike iXora, none of the existing methods was able to use the entire set of progeny per two parents. In our application this size is in hundreds. HAPI and Merlin produce results only on families of about 10 progeny while SHAPEIT2 can only process sizes up to 50. We therefore randomly divided the progeny into sets of appropriate sizes and phased the sets independently. However, we observed that the phasing results for the parents between sets were often inconsistent, affecting the overall accuracy. HAPI and Merlin produce an assignment of progeny to parental haplotypes while SHAPEIT2 does not.
Comparison measures Here we describe the measures that we used to compare the different methods. Since existing phasing methods generate various types of output, we use two different measures so that all the methods are comparable on at least one measure. Our interest was not simply restricted to phasing the progeny genotypes by assigning each allele to the correct parent (PA), but also assigning them to the correct haplotype of that parent (PHA).
First, the phasing accuracy (PA) of progeny is examined, on a marker-by-marker basis, of only the heterozygous positions. We report the number of correctly assigned and the unassigned (unknown) positions as a percentage. BEAGLE, HAPI, HAPI-UR, Merlin, SHAPEIT2 and iXora can be directly compared on this measure for progeny, because they report the parental origin (maternal, paternal) of each allele. To incorporate fastPHASE and MACH also in this comparison, we post-processed their output as follows: progeny haplotypes are labeled as ‘maternal’ and ‘paternal’, using the labeling that minimizes mismatches compared to true maternal and paternal haplotypes. After the post-processing, all methods can be compared on this measure for progeny. The same accuracy evaluation is used to report imputation accuracy, by examining only the phasing for the missing input values.
Second, the accuracy of assigning the correct parental haplotype (PHA) for each progeny allele is examined, again on a marker-by-marker basis. All markers, including homozygous positions are used. For the output of programs where the input had to be split into smaller families, we consider only those subsets of progeny whose parents’ phasing are consistent with the majority parents (see Additional file 1 for details). Again, we report the number of correctly assigned and unassigned (unknown) positions, deeming a progeny position to be correct only when both alleles of the genotype are assigned to the correct parental haplotype. Only HAPI, Merlin, and iXora can be directly compared on this measure for the progeny.
All the simulated data sets are available at the iXora website http://researcher.ibm.com/project/3430.