In my paper, A New Model of Computational Genomics [1], I introduced an algorithm that can predict ethnicity given mtDNA alone with about 80% accuracy (see Section 5 of [1]). This is shocking, because mtDNA is inherited directly from the mother, with no obvious means of transmitting information about paternal lineage, and ethnicity is of course a combination of both maternal and paternal lineage. That said, I realized today that I introduced information about the hidden class labels of the testing rows during the training step: I construct the profile of a testing row using its true class, comparing the row only to its own population. In the real world, you’re not going to know the ethnicity of a testing row, so the original result is academically interesting, but not practical for actually predicting ethnicity.
As a workaround, I revised the algorithm to generate a profile of a given testing row against every class. The current dataset has 50 classes, so this produces 50 profiles for each testing row. I then compute the Euclidean distance between each of those 50 profiles and each profile in the training dataset, and select the pair that minimizes the distance. The predicted class of the testing row is the class of the training row it matches (i.e., the class of the training profile that minimizes the distance). As a consequence, the class label of the testing row is truly hidden. This seems to have increased accuracy, to about 85%.
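To make the selection step concrete, here is a minimal sketch in Python. It is not the code from [1] or the linked MATLAB file. It assumes the profiles have already been built as equal-length numeric vectors, and it leaves the profile construction itself to [1]. The function name `predict_from_profiles` and the toy numbers are hypothetical, chosen only for illustration.

```python
import math


def predict_from_profiles(candidate_profiles, train_profiles, train_labels):
    """Return (predicted_label, candidate_index, distance).

    candidate_profiles: one profile of the testing row per class
        (50 of them in the dataset described above).
    train_profiles: one profile per training row.
    train_labels: the known class of each training row.

    Assumption: every profile is a numeric vector of the same length.
    """
    if not candidate_profiles or not train_profiles:
        raise ValueError("need at least one candidate and one training profile")
    if len(train_profiles) != len(train_labels):
        raise ValueError("train_profiles and train_labels must align")

    best = None  # (distance, candidate index, training index)
    for c_idx, cand in enumerate(candidate_profiles):
        for t_idx, train in enumerate(train_profiles):
            d = math.dist(cand, train)  # Euclidean distance
            if best is None or d < best[0]:
                best = (d, c_idx, t_idx)

    d, c_idx, t_idx = best
    # The prediction is the class of the matched training row; the true
    # class of the testing row is never consulted.
    return train_labels[t_idx], c_idx, d


# Toy example with three classes and two-dimensional profiles.
train_profiles = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
train_labels = ["A", "B", "C"]
candidates = [[0.0, 0.0], [0.25, 0.8], [1.0, 1.0]]
label, which_candidate, distance = predict_from_profiles(
    candidates, train_profiles, train_labels
)
```

The double loop is written for clarity rather than speed. With 50 candidate profiles per testing row and a large training set, a vectorized distance computation would likely be preferable.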
The bottom line is, mtDNA carries information about paternal lineage.
The updated code is linked below; the dataset and the rest of the code can be found in [1]:
https://www.dropbox.com/s/fq5mkk2j7qh7vn6/Build_Profile_UPDATED.m?dl=0