We measured marker efficiency by the metric δ for each marker. (Note that δ as defined here is different from that defined in Rosenberg et al. 2003 [25].) We designated δstudy-AA-EA as the measure of marker efficiency between EA and AA in our study populations, and δreference-study-EA or δreference-study-AA as the quantitative difference in efficiency between marker characteristics as they were reported previously and as we observed them in the study populations. We observed that the maximum δstudy-AA-EA was 0.82, for the marker FY, and the minimum δstudy-AA-EA was 0.15. The mean was 0.32 and the median was 0.28. Larger observed δstudy-AA-EA corresponded to greater marker efficiency for differentiating the EA and AA study populations. Furthermore, smaller values of δreference-study (including δreference-study-EA or δreference-study-AA) indicate that the marker as observed is more similar to the marker as described in the reference (and therefore that the reported allele frequencies were relatively accurate for LBM training). Markers with higher values of this measure matched the training frequencies less well, so their utility in practice was reduced. An efficient classification marker would be one with larger δstudy-AA-EA and smaller δreference-study when the reference allele frequencies are used for training the LBM. Figure 1 shows the relationship of these three δ measures; the straight line in Figure 1(1) indicates the equality of δstudy-AA-EA and δreference-study. Thus, Figure 1(1) illustrates that the majority of the markers have δstudy-AA-EA > δreference-study, and Figure 1(2) shows the ratio of δreference-study-AA to δreference-study-EA with a horizontal line specifying δreference-study-AA = δreference-study-EA. (Twenty-two of 36 markers studied (61%) are above the horizontal line, which indicates that they are less representative (of prior reports) for AAs than for EAs.
This poorer correspondence between observed and previously reported allele frequencies in AAs than in EAs also decreases assignment accuracy in AAs compared to EAs; in fact, the assignment accuracy in AAs never reaches 100%. Even with imperfect training frequencies, the LBM using the selected markers to classify individuals into subpopulations still performed very well, with average assignment accuracies of 96.8% and 99.9% for AAs and EAs, respectively.) These results further illustrate that the selected marker panel is relatively informative for differentiating between EAs and AAs.
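The three δ measures can be illustrated with a short sketch. The definition of δ is given in the Methods and is not reproduced in this section, so the form used below (half the summed absolute allele-frequency difference between two populations) is an assumption made only for illustration; the function name and all frequencies are hypothetical and are not data from the study.

```python
def delta(freqs_a, freqs_b):
    """Assumed form of delta: half the summed absolute allele-frequency difference.

    Each argument maps allele name -> frequency for one marker in one population.
    """
    alleles = set(freqs_a) | set(freqs_b)
    return 0.5 * sum(abs(freqs_a.get(a, 0.0) - freqs_b.get(a, 0.0)) for a in alleles)


# Hypothetical frequencies for a single biallelic marker (not study data).
study_ea = {"A": 0.95, "B": 0.05}
study_aa = {"A": 0.15, "B": 0.85}
ref_ea = {"A": 0.99, "B": 0.01}
ref_aa = {"A": 0.05, "B": 0.95}

d_study = delta(study_aa, study_ea)   # efficiency for separating the study groups
d_ref_ea = delta(ref_ea, study_ea)    # mismatch between reference and study EAs
d_ref_aa = delta(ref_aa, study_aa)    # mismatch between reference and study AAs

# A ratio above 1 corresponds to a point above the horizontal line in Figure 1(2).
ratio = d_ref_aa / d_ref_ea
```

Under this assumed definition, a marker is useful for reference-trained assignment when `d_study` is large and both mismatch values are small.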
Assignment accuracy
In order to ascertain the smallest sufficient marker set and identify how many markers are needed to reach reasonable assignment accuracy, we took the approach of selecting markers by marker efficiency, as we did previously in evaluating the Bayesian method [1]. The relative assignment accuracy was evaluated by adding markers one by one up to 36 markers, in either descending or ascending order of δ; the results are shown in Figure 2. (This result by LBM can be compared with results from STRUCTURE in Yang et al. 2005 [1]; cf. Figure 3, p. 308.) FY was the most informative marker, and due to its unique value in distinguishing the EA and AA populations under study, we performed analyses separately either including or excluding this marker.
In EAs (Figure 2, (1)), the assignment accuracy by LBM exceeded 99% using the most efficient marker, FY, and reached 100% using the 10 most efficient markers excluding FY (when FY was excluded, the assignment accuracy using the next most efficient marker, D11S936, dropped by 9%). In contrast, it took 29 markers to reach >99% assignment accuracy when the least efficient markers were selected or the seven most efficient markers were omitted. In AAs (Figure 2, (2)), the assignment accuracy reached 96.4% using FY, then changed inconsistently as more markers were added up to 21 markers, at which point it stabilized at 97.6%, and reached its maximum of 98.8% when all 36 markers were used. Overall, assignment accuracy by LBM exceeded 95% when at least the 9 most efficient markers were used. When FY was excluded, the assignment accuracy dropped by 38%.
This 38% drop, which reflects the difference in accuracy between the most efficient marker, FY, and the second most efficient one, D11S936, was further investigated by a corresponding analysis in which the study sample was randomly split into two groups and one group was treated as a reference sample. The drop declined to 6%, which was more comparable to the 9% in EAs. Thus, this reduced accuracy was in large part attributable to mismatch between the reported training allele frequencies and frequencies that are more representative of our Northeastern US AA population. LBM never reached perfect assignment accuracy for AAs in this sample, even when all 36 markers were used, but accuracy did reach 98.8%.
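The marker-by-marker evaluation described in this section can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes that each marker's contribution to the log likelihood ratio (EA versus AA) has already been computed for every individual, and the function name, the tie rule at zero, and the toy numbers are all hypothetical.

```python
from itertools import accumulate


def accuracy_curve(llr_by_marker, deltas, true_groups, descending=True):
    """Assignment accuracy after adding markers one at a time in order of delta.

    llr_by_marker[i][m] is the log-likelihood-ratio contribution (EA vs AA)
    of marker m for individual i; deltas[m] is the efficiency of marker m.
    """
    order = sorted(range(len(deltas)), key=lambda m: deltas[m], reverse=descending)
    # Running total of the log likelihood ratio for each individual.
    running = [list(accumulate(row[m] for m in order)) for row in llr_by_marker]
    curve = []
    for k in range(len(order)):
        # Assumption: a total above zero assigns to EA, otherwise to AA.
        correct = sum(
            ("EA" if totals[k] > 0 else "AA") == group
            for totals, group in zip(running, true_groups)
        )
        curve.append(correct / len(true_groups))
    return curve


# Toy example: 4 individuals, 3 markers (values are not study data).
deltas = [0.2, 0.8, 0.4]
llr = [
    [-0.1, 2.0, 0.3],
    [0.5, -0.4, 0.6],
    [-0.2, -1.5, -0.3],
    [-0.6, 0.3, -0.2],
]
groups = ["EA", "EA", "AA", "AA"]
curve = accuracy_curve(llr, deltas, groups)
```

To reproduce the layout of Figure 2, the function would be called separately on the EA and AA individuals, once with `descending=True` and once with `descending=False`, and with the FY column either included or dropped.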
Comparison of observed and reference allele frequencies
The high assignment accuracy by LBMs was observed notwithstanding the deviation between our observed allele frequencies and the reference frequencies described above. We further compared our observed allele frequencies with published reference allele frequencies using the χ2 test. In EAs, after adjusting for sample size, there were 19 markers that differed at p < 0.05, while in AAs, the corresponding number of markers was 29. In other words, allele frequencies observed in EAs matched the reference group more closely than did allele frequencies observed in AAs. As a result, the LBM performed better in EAs than AAs, as might be expected given the dependence of LBMs on prior knowledge of allele frequencies.
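A per-marker comparison of this kind can be sketched as a χ2 goodness-of-fit test of observed allele counts against the published reference frequencies. The sketch below is an assumption about the general form of the test: it treats the reference frequencies as fixed, does not include the sample-size adjustment mentioned above (which is not specified here), and uses hypothetical function names and counts.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square statistic with df degrees of freedom."""
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)              # closed form for df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))   # closed form for df = 1
    # Recurrence Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def compare_to_reference(observed_counts, reference_freqs):
    """Goodness-of-fit of observed allele counts to reference frequencies.

    Assumes every reference frequency is positive and every observed allele
    also appears in the reference.
    """
    n = sum(observed_counts.values())
    stat = 0.0
    for allele, freq in reference_freqs.items():
        expected = n * freq
        stat += (observed_counts.get(allele, 0) - expected) ** 2 / expected
    df = len(reference_freqs) - 1
    return stat, df, chi2_sf(stat, df)


# Hypothetical counts for one biallelic marker (not study data).
stat, df, p_value = compare_to_reference({"A": 60, "B": 140}, {"A": 0.25, "B": 0.75})
differs = p_value < 0.05
```

Counting the markers for which `differs` is true, separately in EAs and in AAs, gives tallies of the kind reported above.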
Evaluation of the influence of mismatched reference allele frequencies on assignment accuracy by means of split samples
As noted above, in many cases our observed allele frequencies showed nominally significant differences from population reference frequencies. This could reflect, for example, sampling error, or differences in allele frequency between population groups with similar self-identified ethnicity that are assessed at different geographic locations. To further assess the impact of the reference group on the assignment accuracy for LBM, we randomly split our EA and AA study datasets each into two equal-sized samples, treating one as the study group and the other as the reference group. Thus, we were able to model geographically appropriate allele frequencies for each group, at the expense of reducing the analysis sample size by a factor of two. The allele frequency distributions did not differ significantly between the two split samples for any marker in either EAs or AAs (χ2 test; p-values ranged from > 0.57 to 1). The results for AAs using internal split samples (Figure 3) improved dramatically compared to the results using the external reference group (Figure 2). These results illustrate that, when the parental population is known, the performance of the LBM depends greatly on how representative the reference allele frequencies are of those of the population being assigned.
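The split-sample design can be sketched as a random partition of one group into two halves, with allele frequencies estimated from the half treated as the reference. This is a minimal illustration under stated assumptions: the function names, the fixed seed, and the genotypes are hypothetical, and a single marker is shown for brevity.

```python
import random
from collections import Counter


def split_in_half(individuals, seed=0):
    """Randomly split a list into a study half and a reference half."""
    rng = random.Random(seed)
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def allele_frequencies(genotypes):
    """Estimate allele frequencies from diploid genotypes at one marker."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}


# Hypothetical genotypes at one marker for eight individuals (not study data).
genotypes = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B"),
             ("A", "A"), ("B", "B"), ("A", "B"), ("A", "A")]

study_half, reference_half = split_in_half(genotypes)
training_freqs = allele_frequencies(reference_half)
```

The frequencies in `training_freqs` would then replace the published reference frequencies when assigning the individuals in `study_half`.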
Logarithm likelihood ratio
We also calculated the logarithm of the likelihood ratio, comparing the probability of being in the EA group with that of being in the AA group, based on formula (2) (Methods section), and generated a visual display of correct or misplaced group assignment for each individual, adding the markers one by one in descending order of δ. Figure 4 shows the 12 most efficient markers. The horizontal line represents a log likelihood ratio of zero; individuals above zero are allocated to EA, and those below zero to AA (refer to equation (2)). The vertical line separates the groups. Therefore, those in the upper right and lower left quadrants are misclassified based on self-identified race. The first graph represents the allocation of each individual using only the most efficient marker, FY. As markers are added to the analyses, the magnitudes of the log likelihood ratios increase and the separation between clusters becomes more and more marked. (Note that the Y-axis scale is not constant.)
One individual in the AA series appeared to be misclassified; see Figure 4 with 9 to 12 markers. Based on this observation, we examined the phenotypic information for this subject, and determined that, although self-identified as AA, the subject had one AA and one EA parent.
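The log likelihood ratio used for these displays can be sketched as below. Formula (2) is given in the Methods and is not reproduced in this section, so this sketch assumes the common form in which genotype probabilities follow Hardy-Weinberg proportions and markers are independent; the function names, the frequency floor that avoids taking the logarithm of zero, the marker names, and all frequencies are hypothetical.

```python
import math


def genotype_prob(genotype, freqs, floor=1e-3):
    """Probability of a diploid genotype under Hardy-Weinberg proportions.

    The floor is an assumption that keeps unseen alleles from giving log(0).
    """
    a, b = genotype
    pa = max(freqs.get(a, 0.0), floor)
    pb = max(freqs.get(b, 0.0), floor)
    return pa * pa if a == b else 2.0 * pa * pb


def log_likelihood_ratio(genotypes, ea_freqs, aa_freqs):
    """Sum over markers of log[P(genotype | EA) / P(genotype | AA)]."""
    return sum(
        math.log(genotype_prob(g, ea_freqs[m]) / genotype_prob(g, aa_freqs[m]))
        for m, g in genotypes.items()
    )


# Hypothetical training frequencies for two markers (not study data).
ea_freqs = {"M1": {"A": 0.9, "B": 0.1}, "M2": {"A": 0.6, "B": 0.4}}
aa_freqs = {"M1": {"A": 0.2, "B": 0.8}, "M2": {"A": 0.5, "B": 0.5}}

llr_first = log_likelihood_ratio({"M1": ("A", "A"), "M2": ("A", "B")}, ea_freqs, aa_freqs)
llr_second = log_likelihood_ratio({"M1": ("B", "B"), "M2": ("B", "B")}, ea_freqs, aa_freqs)

# Values above zero are allocated to EA, values below zero to AA.
assigned = ["EA" if value > 0 else "AA" for value in (llr_first, llr_second)]
```

Plotting each individual's value against self-identified group, with one panel per number of markers, would give a display of the kind shown in Figure 4.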
Comparison of LBM results with Bayesian results obtained using STRUCTURE
We compared the performance of LBM with results obtained using STRUCTURE and the same panel of markers by Yang et al. 2005 [1] (Figure 5); the samples used for Figure 5 are exactly the same as those for Figure 3 in Yang et al. 2005 [1] (cf. Figure 3, p. 308). In EAs (Figure 5 – (1)), the LBM provided more accurate group assignment than STRUCTURE, with the FY locus included or excluded. In AAs (Figure 5 – (2)), the relative performance of STRUCTURE and LBM was mixed.