THIS WAS THE RESULT OF BAD DATA, DISREGARD. THE ALGORITHMS ARE HOWEVER GOOD IN CONCEPT.
I’ve assembled a dataset of complete mtDNA genomes from the NIH, with 10 individuals from each of five populations (Kazakh, Nepalese, Iberian Roma, Japanese, and Italian), for a total of 50 complete mtDNA genomes. Using Nearest Neighbor alone on the raw sequence data, the accuracy is about 80%, and basic filtering of the predictions by the number of matching bases brings the accuracy up to 100%. This is empirical evidence for the claim that heritage can be predicted using mtDNA alone. One interesting result, which could simply be bad data, is that the Japanese population (classifier 4 in the dataset) contains three anomalous genomes that have an extremely low number of matching bases with their Nearest Neighbors. What’s truly bizarre, however, is that whether or not you include these individuals in the dataset (the attached code contains a segment that removes them), generating clusters using matching bases suggests an affinity between Japanese and Italian mtDNA. This could be known, but it struck me as very strange. Note that because matching bases is plainly indicative of common heritage, this simply cannot be dismissed.
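To make the two steps above concrete, here is a minimal Python sketch of Nearest Neighbor by matching bases, followed by filtering on the match count. It is not the attached code: the function names, the toy sequences, and the toy labels are all hypothetical, and it assumes the genomes are already aligned strings of equal length.

```python
from typing import List, Tuple


def matching_bases(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def nearest_neighbor(index: int, genomes: List[str]) -> Tuple[int, int]:
    """Return (neighbor index, match count) for genomes[index], excluding itself."""
    best_j, best_m = -1, -1
    for j, other in enumerate(genomes):
        if j == index:
            continue
        m = matching_bases(genomes[index], other)
        if m > best_m:  # ties keep the earliest genome (an arbitrary choice)
            best_j, best_m = j, m
    return best_j, best_m


def accuracy_by_confidence(genomes: List[str], labels: List[int]) -> List[Tuple[int, float]]:
    """For each observed match count t (ascending), report the accuracy of the
    predictions whose match count is at least t."""
    preds = []
    for i in range(len(genomes)):
        j, m = nearest_neighbor(i, genomes)
        preds.append((m, labels[j] == labels[i]))
    curve = []
    for t in sorted({m for m, _ in preds}):
        kept = [ok for m, ok in preds if m >= t]
        curve.append((t, sum(kept) / len(kept)))
    return curve


# Toy data (hypothetical, not taken from the NIH dataset).
genomes = ["ACGTACGT", "ACGTACGA", "TTGTACCC", "TTGTACCA", "AAGAACGT"]
labels = [1, 1, 2, 2, 2]
curve = accuracy_by_confidence(genomes, labels)
print(curve)
```

On this toy input, the one wrong prediction is also the one with the lowest match count, so raising the threshold removes it. That is the same mechanism as the filtering described above, at a much smaller scale.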
The chart on the left shows accuracy as a function of confidence, which in this case is simply the number of matching bases between an input and its Nearest Neighbor. Note that the x-axis on the left does not show the number of matching bases, and instead shows the ordinal index of the number of matches (i.e., an x value of 25 corresponds to the maximum number of matching bases, which is approximately 17,000).
The chart on the right shows the distribution of classes in the clusters for the Japanese genomes, after removing the three anomalous genomes. Clusters are generated by fixing a minimum number of matching bases, which in this case is the minimum match count over all Japanese genomes and their respective Nearest Neighbors. Any other genome that meets or exceeds this minimum is then included in the cluster for a given genome. Note that the totals can exceed the size of the dataset, since the clusters are not mutually exclusive: the clusters for two Japanese genomes can overlap, so the same genome can be counted more than once. As you can see, the chart shows a strong affinity between Japanese and Italian mtDNA. The analogous chart for the Italian population shows a similar affinity for Japanese mtDNA. No other groups show any comparable affinity for Japanese mtDNA.
Because DNA is finite, and mtDNA in particular has a well-defined sequence length, the number of possible sequences is fixed. As a consequence, as you increase the minimum number of matching bases required for inclusion in a cluster, the number of possible sequences that satisfy that minimum decreases exponentially as a function of the minimum. Therefore, if you increase that minimum, groups that do not actually belong should drop off exponentially. Groups that remain, at a rate that is not exponentially decaying, are more likely to be bona fide members of the cluster.
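The cluster construction described above can be sketched in a few lines of Python. Again, this is an illustration rather than the attached code: the function names and the toy data are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
from collections import Counter
from typing import List, Tuple


def matching_bases(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences carry the same base."""
    return sum(x == y for x, y in zip(a, b))


def class_distribution(genomes: List[str], labels: List[int], target: int) -> Tuple[int, Counter]:
    """Build one cluster per genome of class `target` and tally the classes
    of all cluster members across those clusters."""
    members = [i for i, label in enumerate(labels) if label == target]

    # Threshold: the smallest Nearest Neighbor match count over the target class.
    nn_counts = [
        max(matching_bases(genomes[i], g) for j, g in enumerate(genomes) if j != i)
        for i in members
    ]
    threshold = min(nn_counts)

    # A genome joins the cluster of genome i if it meets or exceeds the threshold.
    # Clusters can overlap, so the same genome may be counted more than once.
    counts: Counter = Counter()
    for i in members:
        for j, g in enumerate(genomes):
            if j != i and matching_bases(genomes[i], g) >= threshold:
                counts[labels[j]] += 1
    return threshold, counts


# Toy data (hypothetical, not taken from the NIH dataset).
genomes = ["ACGTACGT", "ACGTACGA", "TTGTACCC", "TTGTACCA", "AAGAACGT"]
labels = [1, 1, 2, 2, 2]
threshold, counts = class_distribution(genomes, labels, target=2)
print(threshold, dict(counts))
```

Raising `threshold` above the minimum and watching how quickly each class's count falls is the drop-off test described in the last paragraph above.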
“ddbj_embl_genbank[filter] AND txid9606[orgn:noexp] AND complete-genome[title] AND mitochondrion[filter]”
Here’s the code: