Predicting Nationality Using mtDNA

I noticed that when you compare a population to itself, at least using mtDNA, it produces a characteristic profile that is unique to the population, as shown in the graph below that plots the results of comparing the mtDNA (full genome) of 10 Swiss people to each other.

Specifically, when counting matching bases between a given row (i.e., individual), and the rest of its population, the average (A), standard deviation (S), minimum (m), and maximum (M) of the number of matching bases, viewed as a vector P = (A,S,m,M), forms a unique profile for each population. Intuitively, each genome in a population has a roughly similar relationship to all the other genomes in the population, producing a signature profile pattern of the form in the graph above. However, sometimes you see multiple patterns within a given population, which also works for purposes of ML, since all you need is one pattern that pops up more than once, as shown in this graph comparing Ashkenazi Jews to each other, where genomes 1, 3, 4, 5, 8, 9; genomes 2, 7; and genomes 6, 10, plainly form three distinct profiles.

The idea is you do this for every genome in a given population individually, and this will construct a dataset of vectors in the form of P, one for each genome in the population (i.e., a number of rows equal to the number of genomes, each row in the form of P, together forming a matrix with four columns). Note however, you’re comparing genomes to other genomes in the same population (e.g., German mtDNA compared to German mtDNA). You then do this for every population, separately, constructing unique matrices for each population (i.e., one matrix for German, Italian, etc.).

Now combine all of the matrices into a single matrix dataset, and treat the known classifiers as unknown, and try to predict the classifier of a given profile (in this case nationality, e.g., German).

Simply run Nearest Neighbor, mapping P_i to P_j, for which the Euclidean norm of the difference ||Pi - Pj|| is minimum, treating the classifier of P_j as the predicted classifier of P_i. You’ve now converted DNA into a real number dataset, with just 4 columns, as opposed to a full genome, which in this case consists of about 17,000 columns.

I did exactly this over a dataset of 172 full genomes, from 18 populations, and the accuracy was 82.56%, without any other refinement. The total runtime was just 0.390 seconds, running on a MacBook Air. There is no way you’ll achieve this kind of runtime using Neural Networks. Each of the populations has only 10 full genomes, suggesting higher accuracy could be possible by simply increasing the size of the dataset. Other techniques can likely also improve accuracy.

Because mtDNA is inherited directly from the maternal line, if it weren’t for mutations, a perfect copy would be passed on from mother to daughter, etc, with the male line’s mtDNA simply vanishing. Nonetheless, we know that there is significant variation in mtDNA, which implies significant mutation. However, this work shows unambiguously that there is local variation, in that people that occupy the same present geographies, have similar mtDNA. This simply doesn’t make sense, unless there is some environmental impact on the mutations on mtDNA, causing people in similar environments to have similar mutations.

I think instead, it’s far more reasonable to assume that the male line actually does impact mtDNA indirectly, through the genetic machinery, which must be at least partly inherited from the paternal line. That is, the mechanisms that read and replicate DNA, and produce proteins, are not to my knowledge inherited from either sex exclusively. This implies that common paternal lines could produce similar mutations, which would explain the local similarity of mtDNA, and still allow for mutation. This implies that the paternal line could be discoverable through mtDNA alone, through the analysis of similar mutations on the same maternal line. For the same reasons, it implies that people with highly similar mtDNA would have similar paternal and maternal lines.

Here’s the dataset:

Here’s the code:


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s