Measuring Uncertainty in Ancestry

In my paper, A New Model of Computational Genomics [1], I presented an algorithm that can test whether one mtDNA genome is the common ancestor of two other mtDNA genomes. The basic theory underlying the algorithm is straightforward, and cannot be argued with:

Given genomes A, B, and C, if genome A is the ancestor of genomes B and C, then it must be the case that genomes A and B, and A and C, have more bases in common than genomes B and C. This is a relatively simple fact of mathematics that you can find in Footnote 16 of [1]. However, you can appreciate the intuition right away: imagine two people tossing coins simultaneously and writing down the outcomes. Whatever outcomes they have in common (e.g., both throwing heads) will be the result of chance. For the same reason, if you start with genome A and allow it to mutate over time, producing genomes B and C, whatever bases B and C have in common beyond those inherited from A will be the result of chance. As such, B and C should both mutate away from genome A, rather than developing more bases in common with each other. This will produce the inequalities |AB| > |BC| and |AC| > |BC|, where |AB| denotes the number of bases genomes A and B have in common.
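As a minimal sketch of the genome-level test, assume the three genomes are already aligned to the same length and stored as plain strings. The function names below are hypothetical and are not taken from the software that accompanies [1]; the toy sequences are made up for illustration.

```python
def shared_bases(x: str, y: str) -> int:
    """Count the positions at which two aligned genomes carry the same base."""
    if len(x) != len(y):
        raise ValueError("genomes must be aligned to the same length")
    return sum(a == b for a, b in zip(x, y))


def is_common_ancestor(a: str, b: str, c: str) -> bool:
    """Check the inequalities |AB| > |BC| and |AC| > |BC|."""
    ab = shared_bases(a, b)
    ac = shared_bases(a, c)
    bc = shared_bases(b, c)
    return ab > bc and ac > bc


# Toy data: B and C each differ from A at two (different) positions.
genome_a = "ACGTACGTAC"
genome_b = "TTGTACGTAC"  # mutated at positions 0 and 1
genome_c = "ACGTACGTGG"  # mutated at positions 8 and 9

a_is_ancestor = is_common_ancestor(genome_a, genome_b, genome_c)
b_is_ancestor = is_common_ancestor(genome_b, genome_a, genome_c)
```

Here |AB| = |AC| = 8 and |BC| = 6, so the test passes with A in the ancestor role and fails with B in that role.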

By the same reasoning, if you count the number of matches between two populations at a fixed percentage of the genome, the match counts between populations A, B, and C should satisfy the same inequalities. For example, fix the matching threshold at 30% of the full genome, and then count the number of genomes between populations A and B that are at least a 30% match to each other. Do the same for A and C, and B and C. However, you'll have to normalize these counts to a [0,1] scale, otherwise your calculations will be skewed by population size. My software already does this, so there's nothing to do on that front.
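The population-level test can be sketched as follows. This is an illustration under stated assumptions, not the implementation in my software: it assumes a "match" is a pair of genomes (one from each population) sharing at least M bases, and it assumes the normalization is division by the total number of pairs. All names and the toy populations are hypothetical.

```python
from itertools import product


def shared_bases(x: str, y: str) -> int:
    """Count the positions at which two aligned genomes carry the same base."""
    return sum(a == b for a, b in zip(x, y))


def match_rate(pop_x: list[str], pop_y: list[str], m: int) -> float:
    """Fraction of cross-population pairs sharing at least m bases.

    Dividing by the number of pairs puts the count on a [0,1] scale
    (assumed normalization), so population size does not skew the result.
    """
    total_pairs = len(pop_x) * len(pop_y)
    hits = sum(shared_bases(x, y) >= m for x, y in product(pop_x, pop_y))
    return hits / total_pairs


def population_test(pop_a, pop_b, pop_c, m: int) -> bool:
    """Check the ancestry inequalities on normalized match rates at threshold m."""
    ab = match_rate(pop_a, pop_b, m)
    ac = match_rate(pop_a, pop_c, m)
    bc = match_rate(pop_b, pop_c, m)
    return ab > bc and ac > bc


# Toy populations of length-10 genomes (made up for illustration).
pop_a = ["ACGTACGTAC", "ACGTACGTAA"]
pop_b = ["TTGTACGTAC", "TTGTACGTAA"]
pop_c = ["ACGTACGTGG", "ACGTACGAGG"]

passes_at_8 = population_test(pop_a, pop_b, pop_c, m=8)
passes_at_5 = population_test(pop_a, pop_b, pop_c, m=5)
```

With these toy populations, the test passes at a threshold of 8 bases, where no B to C pair matches, and fails at a threshold of 5, where every pair in all three comparisons matches and the rates tie.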

By iteratively applying the population-level test for different values of the matching threshold M, we can not only test whether the inequalities are satisfied, but also generate a measure of uncertainty associated with that test.

Specifically, fix M to some minimum value, which we select as 30% of the full genome size N, given that 25% is the expected matching percentage produced by chance, and 30% is meaningfully far from chance (again, see Footnote 16 of [1]). Further, note that as M increases, our confidence that the matches between A and B, and between A and C, are not the result of chance increases. For intuition, note that as we increase M, the set of matching genomes can only grow smaller. Similarly, our confidence that the non-matching genomes between B and C are not the result of chance decreases as a function of M. For intuition, note that as we increase M, the set of non-matching genomes can only grow larger.

As a result, the minimum value of M for which the inequalities are satisfied informs our confidence in the B to C test, and the maximum value of M for which the inequalities are satisfied informs our confidence in the A to B and A to C tests. Specifically, the probability that the B to C test is the result of chance is informed by the difference M_min - 0.25N, where M_min is the minimum such M and 0.25N is the number of matching bases expected by chance, whereas the A to B and A to C tests are informed by the difference N - M_max, where M_max is the maximum such M. Note that each difference is literally some number of bases, which is in turn associated with a probability (see again Footnote 16 of [1]) and a measure of Uncertainty (see Section 3.1 of [1]). This allows us to first test whether or not a given population is the common ancestor of two other populations, and then assign a value of Uncertainty to that test.
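The sweep over M described above can be sketched as follows. This is a rough illustration only: `passes` stands in for whatever population-level test is being applied at a given threshold, the names are hypothetical, and the conversion of each base-count difference into a probability or a value of Uncertainty is left to [1] and is not reproduced here.

```python
from typing import Callable, Optional


def sweep_thresholds(
    passes: Callable[[int], bool], n: int, m_floor: int
) -> Optional[dict]:
    """Apply the test at every threshold from m_floor to n and summarize.

    passes  -- hypothetical callback: True if the inequalities hold at threshold m
    n       -- full genome size N, in bases
    m_floor -- smallest threshold to try (30% of N in the text)
    """
    passing = [m for m in range(m_floor, n + 1) if passes(m)]
    if not passing:
        return None  # the inequalities are never satisfied

    # Smallest and largest passing thresholds; the passing set is not
    # assumed to be contiguous.
    m_min, m_max = passing[0], passing[-1]
    return {
        "m_min": m_min,
        "m_max": m_max,
        # Distance, in bases, of the minimum passing M from chance (25% of N).
        "bc_difference": m_min - 0.25 * n,
        # Distance, in bases, of the maximum passing M from a full match.
        "a_difference": n - m_max,
    }


# Made-up example: a genome of 100 bases where the test passes for 40 <= M <= 90.
summary = sweep_thresholds(lambda m: 40 <= m <= 90, n=100, m_floor=30)
```

In this made-up example the minimum passing threshold is 40 and the maximum is 90, giving differences of 15 bases and 10 bases respectively.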
