Selection and the Vanishing of Traits

I think I just figured out why human beings lost basically all of their hair (compared with other primates), and the answer is that we stopped selecting for it. That alone shouldn’t matter, but if you add the hypothesis that mutation happens at a more or less constant rate, then traits that are not actively selected for will eventually vanish. This is basically an entropy of genetics: maintaining the traits of a species requires constant effort, or environmental pressure. In the case of body hair for humans, we stopped selecting for it because we developed the ability to use animal pelts, and as a consequence, both the environment, and possibly the individuals in question, stopped selecting for body hair, and presumably started selecting for other things.

Given that people still have hair on their heads, and to some extent on their bodies, it must have some utility, even if it’s just aesthetic. This doesn’t undermine the more general thesis, that traits simply vanish if not selected for, which is superficially impossible to argue with, for the simple reason that mutation is real, and as a result, all traits will be subject to what is basically erosion. If that erosion is significant, the trait in question could dwindle and vanish.
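The erosion argument can be made concrete with a toy calculation. The Python sketch below is only an illustration under strong assumptions that are not in the text above: every site underlying a trait mutates independently at a fixed per-generation rate, there is no selection at all, and mutated sites never revert. The function names and the rates are hypothetical.

```python
import math

def fraction_intact(mutation_rate, generations):
    """Expected fraction of a trait's sites still unmutated after the given
    number of generations, assuming independent mutation and no selection."""
    return (1.0 - mutation_rate) ** generations

def generations_until(mutation_rate, remaining):
    """Smallest number of generations after which the expected intact
    fraction has fallen to `remaining` or below (same toy assumptions)."""
    return math.ceil(math.log(remaining) / math.log(1.0 - mutation_rate))
```

Under these assumptions the intact fraction decays geometrically, so it approaches zero over enough generations unless something (selection, in the argument above) pushes back. The rate itself is left as an input here.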


A New Model of Computational Genomics

I’ve updated my formal paper on genetics, A New Model of Computational Genomics, which now includes more theory, and experimental data regarding imputation. The most important improvement is in the discussion of the predictive power of the software, which allows ethnicity to be predicted with about 80% accuracy. In contrast, simulating a haplogroup, by identifying all bases common to a population (which would therefore include all genes common to that population), and using that to predict, had no predictive power at all, producing an accuracy consistent with chance. The bare minimum interpretation is that haplogroups are not precise enough to predict ethnicity at the level of a nationality. It’s also possible that they simply lack predictive power, which would at least call contemporary genetics into question. This doesn’t mean that they’re insignificant. It could, however, imply that they lack the significance necessary to make accurate and narrow predictions given a genome of unknown provenance.



Predicting Ethnicity and Haplogroups

I noted in my paper, A New Model of Computational Genomics [1], that imputation using sequential bases is categorically inferior to using random bases, in several experiments testing the extent of imputation. That is, if you select K sequential bases (e.g., a particular gene), and attempt to predict the remainder of the genome using only that sequence, you underperform when compared to selecting K random bases in the genome. Because genes are sequential within a genome, it suggests that analyzing genes, and therefore haplogroups, might not be the best way to predict ethnicity, and therefore ancestry. This seems to be the case empirically.
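To make the distinction concrete, here is a minimal Python sketch of the two ways of choosing K positions. It is not the code from [1]. The function names are hypothetical, and it assumes each genome is stored as an aligned string of bases.

```python
import random

def sequential_indices(genome_length, k, start=0):
    """K consecutive positions beginning at `start` (e.g., a single gene)."""
    if start + k > genome_length:
        raise ValueError("window runs past the end of the genome")
    return list(range(start, start + k))

def random_indices(genome_length, k, seed=None):
    """K distinct positions drawn uniformly at random, returned in order."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(genome_length), k))

def restrict(genome, indices):
    """The bases of `genome` at the chosen positions, as a string."""
    return "".join(genome[i] for i in indices)
```

Either index set can then be handed to the same imputation routine, so that any difference in performance comes only from how the K positions were chosen.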

Specifically, the attached code generates a set of bases in common for every population in a dataset of human mtDNA genomes. For example, the algorithm finds all bases that are common to Chinese individuals, and stores that as what is in essence a reference genome for the Chinese population. If a gene is common to all Chinese individuals, then it must be included in this reference genome, since the reference genome contains all bases common to the Chinese, and therefore, all genes common to the Chinese, in addition to any other bases they share as a population. All of the genomes are complete genomes taken from the NIH Database, and include provenance files with links to the NIH Database.

The next step is to predict the ethnicity of an individual using those reference genomes. Specifically, the algorithm takes a given testing genome, and finds the reference genome to which it is most similar. This process has an accuracy of approximately 1.2%. There are 56 ethnicities in the dataset, and therefore this process performs about as well as chance, which is 1/56 ≈ 1.8%. The total runtime is a few minutes.
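The two steps described above can be sketched in a few lines of Python. This is a simplified illustration, not the attached code. It assumes aligned genomes of equal length stored as strings, uses a hypothetical "-" placeholder for positions where a population does not agree, and measures similarity as the number of agreed positions at which the testing genome matches the reference.

```python
def consensus_reference(genomes, gap="-"):
    """Keep a base wherever every genome in the population agrees,
    and write `gap` everywhere else."""
    length = len(genomes[0])
    if any(len(g) != length for g in genomes):
        raise ValueError("genomes must be aligned to the same length")
    return "".join(
        column[0] if len(set(column)) == 1 else gap
        for column in zip(*genomes)
    )

def match_count(genome, reference, gap="-"):
    """Number of non-gap reference positions where the genome agrees."""
    return sum(1 for a, b in zip(genome, reference) if b != gap and a == b)

def predict_ethnicity(genome, references):
    """Label of the reference with the highest match count.
    Ties go to whichever label comes first in `references`."""
    return max(references, key=lambda label: match_count(genome, references[label]))
```

For example, with `references = {label: consensus_reference(genomes) for label, genomes in populations.items()}`, calling `predict_ethnicity(test_genome, references)` returns the population whose common bases the testing genome matches best.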


Haplogroups are plainly not precise, which you can see in the map above, that shows haplogroups crossing national boundaries. Moreover, discovering haplogroups requires a lot of work. In contrast, the software in [1] is capable of predicting ethnicity at the national level, with no human analysis ex ante, with an accuracy of about 80%. For example, the algorithms in [1] can discern between Swedes and Norwegians, whereas the haplogroups shown above plainly cannot, and instead Swedes and Norwegians are grouped together, though both are distinguished from Finns. Moreover, the attached code casts serious doubt on using genes and haplogroups for analyzing ancestry, since they’re apparently incapable of predicting ethnicity, which should be easier. That is, ancestry posits something in addition to ethnicity, which is that one ethnicity is the ancestor of another, and therefore, ancestry should be more difficult to predict than ethnicity alone.

My opinion is that these results suggest circular reasoning in the construction of haplogroups, where national, geographic, and language groups are used to define populations, and then common genes are identified, rather than allowing the genomes themselves to define groups of people, without reference to anything exogenous to the genomes. Moreover, this software shows that common genes do not allow you to predict ethnicity. In contrast, the software in [1] learns from a dataset of stated ethnicities, and is then able to predict the ethnicity of other genomes, without any human analysis at all. And again, the software in [1] is plainly more precise than haplogroups, in any case. Therefore, taken as a whole, [1] appears to present a superior method of analyzing ethnicity and ancestry, which is to use whole-genomes, treat the stated national / linguistic ethnicities as bona fide, and allow software to identify any relevant features. Moreover, the software in [1] also allows for the construction of populations that are based solely upon the genomes themselves, thereby allowing for the mechanistic, and therefore objective, construction of genetic groups, independent of national, geographic, and language groups.

Here’s the code and the dataset, and any missing code is linked to in [1]:

Javanese mtDNA

Again mostly due to chance, I found a Javanese genome (modern) in the NIH Database, and it is notable because at even 30% of the genome, there is no match to Heidelbergensis. This is not true for many of the populations in the dataset, which at this point contains 58 global ethnicities. The logical conclusion, is that the Javanese people are an isolated, modern population, that are closely related to very early humans, and no one else, save for the Neanderthals and Denisovans. This is really interesting, because e.g., the Norwegians, who are plainly geographically isolated, are related to basically everyone at 30% of their genome, which you can see below. In contrast, this Javanese genome produces a very thin distribution at even 30%, which is only 5% above chance. All of the code can be found in my paper, A New Model of Computational Genomics, and the dataset is linked to below.

Ancient Khoisan mtDNA

I’m working on something completely different related to ancient mtDNA, and I happened to find an ancient Khoisan genome in the NIH database. I also noticed earlier today, again working on something different, that both the Nigerians and Kenyans seem to have a relationship to the Denisovans. I already knew that the Kenyans were related to Denisovans, whereas I had never noticed any connection between the Nigerians and Denisovans. This prompted me to ask whether they had at least something more than chance in common with Denisovans, and the answer is yes. Specifically, the Nigerians start to match with Denisovans at about 30% of their genome. This is 5% above chance (25%, given four possible bases), and as a consequence, it is not possible that it is the result of chance. See, A New Model of Computational Genomics [1], specifically, footnote 16, which goes through the math.
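As a rough check on the scale involved, the Python sketch below computes a binomial tail probability. It is not the calculation in footnote 16 of [1]. It assumes each position matches independently with probability 1/4 (four equally likely bases), and the genome length of 16,569 bases used in the usage note is an assumption for illustration; the aligned genomes in the dataset may have a different length.

```python
import math

def log_binom_pmf(n, k, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log10_tail_probability(n, k_min, p):
    """log10 of P(X >= k_min), summed in log space to avoid underflow."""
    logs = [log_binom_pmf(n, k, p) for k in range(k_min, n + 1)]
    peak = max(logs)
    total = peak + math.log(sum(math.exp(v - peak) for v in logs))
    return total / math.log(10)
```

Calling `log10_tail_probability(16569, math.ceil(0.3 * 16569), 0.25)` gives the base-10 logarithm of the probability that two random genomes of that length agree on at least 30% of positions. Under these assumptions it is a very large negative number, so the probability is vanishingly small.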

There are two possibilities: one is that the Nigerians had a fleeting relationship with Denisovans, which caused only subtle changes to their mtDNA (see Section 5 of [1]). The other possibility is that they have an ancient, and possibly archaic connection to Denisovans. There is an ongoing search for so-called “Southern Denisovans”, since Denisovan fossils are typically found in Asia, not Africa. If Denisovans are actually from Asia, then we should not find ancient Denisovans in Africa. As it turns out, this particular genome is closely related to both Denisovans and Neanderthals, and is much closer to Denisovans than Neanderthals. You’ll also note that this genome is related to the Nigerians, again suggesting an ancient connection between the Denisovans and Nigerians. Though this is not an archaic genome, since it’s only about 3,000 years old, it is ancient, and therefore consistent with the hypothesis that all hominins, i.e., Denisovans, Homo Sapiens, Neanderthals, and Heidelbergensis, come from Africa. Below is the normalized match count for the Ancient Khoisan genome, at 50% of the genome. All of the code you need to run this analysis is in [1], and the dataset can be found here.

Ukrainian mtDNA

I hypothesized that many Ukrainians would be related to the Vikings, because of my admittedly loose understanding of the history, and it seems that I was correct. Specifically, the Ukrainians appear to be a mix of both Russian (not surprising) and Scandinavian heritage. What is surprising, is that they are also closely related to the Pashtuns of Afghanistan and Pakistan, who were also subjected to genocide by the Russians. This might be a coincidence, but I doubt it at this point, and I suspect instead, that this group of people (which includes many Jews, both Sephardic and Ashkenazi) has been the target of genocide for at least a century at this point, and that many Communist states deliberately exterminated exactly this bloodline of people. The chart below shows the distribution of ethnicities that are a 99% match to the Ukrainians.


Note that it must be the case that mtDNA contains information about paternal lineage, since my software can predict ethnicity, using mtDNA alone, with an accuracy of about 80%. This would be impossible if mtDNA did not contain information about paternal lineage, and I’ve shown experimentally that the mtDNA of two populations does converge to a single, new set of genomes, almost certainly due to paternal selection. Further, note that PT stands for Pashtun, IL stands for Icelandic, and UK stands for Ukrainian (EN stands for English). The complete list of acronyms can be found at the end of my paper, A New Model of Computational Genomics [1].

Here’s the updated dataset, and any code required to generate the chart above can be found in [1].

On the Origins of Humanity

There’s apparently some debate about whether humans come from Africa, or from Asia, and after not reviewing much of the literature (being honest), and instead conducting my own research in genetics, I’ve concluded that we all come from Africa, and that many of us migrated to Asia, possibly Central Asia, and then some of us migrated from Asia, back to Africa and Northern Europe, and the Pacific. See, A New Model of Computational Genomics, generally. Specifically, it looks like some Scandinavians, Thai, Japanese, Khoisan, and Nigerian people are all very closely related to each other, to the point of 90% plus matches on the maternal line. I’ve shown that mtDNA must carry information about the paternal line as well, since my software can predict ethnicity with about 80% accuracy. As a consequence, it follows, that some Scandinavians, Thai, Japanese, Khoisan, and Nigerian people are all very closely related to each other, as a general matter. This is not to the exclusion of other people, it’s just most obvious in these populations. Therefore, I am of the belief that humanity began in Africa, which is in my opinion based in archeology, and not genetics. Specifically, that archeological evidence of early humans is most prevalent in Africa. Below is a plot from Wikipedia that shows the global distribution of tools associated with archaic humans from about one-million years ago, to about one-hundred-thousand years ago.


In contrast, the migration-back hypothesis is, in my opinion, rooted in genetics. Specifically, you find simply inexplicable connections between global populations, in particular, certain Northern Europeans, Africans, and Asians. These relationships make no sense in the context of known history, and instead, make perfect sense in the context of genetics, and common sense. Why did the early Egyptians appear to be Asian? Why do the Khoisan to this day appear to be Asian? Why are Stave Churches plainly reminiscent of Thai temples? One simple solution is that all of these people are part of a single group of people that migrated back to the West, from Asia. On the left is a Norwegian Stave Church, to its right is a Thai Temple, after that is Menkaure and Queen Khamerernebty II (c. 2,530 BCE), courtesy of MFA Boston, after that Nefertiti (c. 1,370 BCE), courtesy of Wikipedia, and on the bottom right is Cleopatra (c. 50 BCE), courtesy of Wikipedia, who plainly looks nothing like the rest of them.

Convergence of mtDNA

I’ve noted before that mtDNA must provide information about the paternal line, since I’ve written software that can predict ethnicity with about 80% accuracy, without any filtering for confidence, using mtDNA alone. See, A New Model of Computational Genomics [1], generally. Because ethnicity is a combination of both paternal and maternal ethnicity, there’s just no argument to the contrary – the accuracy would otherwise be horrible. I’ve developed reasonable hypotheses to explain this, specifically, the selection of particular maternal lines is probably a decent explanation for the fact that mtDNA must carry information about paternal ethnicity. That is, males in a given geography prefer particular females, for whatever reason, and that produces a unique distribution of maternal lines, which in turn, identifies the paternal lines.

However, some of my results suggest more direct influence from the paternal line. Specifically, it seems at least plausible that males select females that have mtDNA bases in common with them, which would over many generations cause the two maternal lines to fuse into one. For example, a Norwegian individual, when selecting among mates in Sweden, will select the mate that has the maximum number of mtDNA bases in common. This behavior would, over time, cause both Norwegian and Swedish mtDNA to combine, since each generation would mate on the basis of the maximum number of bases in common. This is of course a random example, but I saw some evidence of this in the Danes, who seemed to be a mix between Swedes and Norwegians.

I’ve developed an experiment and software to test this hypothesis. Specifically, some populations are mixes between modern and archaic humans, and I’ve tested whether the introduction of archaic mtDNA impacts the modern mtDNA of the population in question. The experiment I’ve come up with is to test which Mongolians are at least a 60% match to Denisovans. There are 19 complete Mongolian genomes in the dataset, 8 Denisovan genomes, and 1 Heidelbergensis genome. All genomes are complete mtDNA genomes taken from the NIH Database, complete with provenance files for each genome linking to the genome descriptions. This gives each of the Mongolian genomes 8 chances to match with a Denisovan, and if a single match occurs, it is included in a list of genomes that are treated as, in essence, Denisovan. Of the 19 Mongolian genomes, 4 were a match to at least 1 Denisovan. This leaves 15 genomes that did not match. The question is then, do the remaining 15 genomes have more in common with the Denisovans than a population that has no clear relationship to the Denisovans?

This is superficially impossible, because mtDNA is inherited directly from the mother to the child, typically with no mutations at all. However, my hypothesis is that males select females on the basis of genetic similarity. Specifically, that males attempt to maximize the number of bases in common with their female mate. This will, after generations, cause the mtDNA of the paternal line to converge with the mtDNA of the maternal line. Specifically for this experiment, it should be the case that the non-Denisovan Mongolian genomes have more bases in common with Denisovans than some other population that has no clear relationship to Denisovans. As a reference population with no clear relationship to either Denisovan or Heidelbergensis, I selected the English, and there are 9 English genomes in the dataset. The results suggest that I’m correct, since the average match count between a non-Denisovan Mongolian genome and the Denisovans is 4,957.9 bases, whereas the average match count between the English and the Denisovans is 4,673.2 bases. Applying the same methods to Heidelbergensis, we have 5,003.6 matching bases for the non-Heidelbergensis Mongolians, and 4,767.4 bases for the English. The same is true of the Ashkenazi Jews, Kenyans, and Finns, all of whom have a similarly close relationship to the Denisovans. All of this is plainly consistent with the hypothesis that selection can alter mtDNA, specifically, selection by the paternal line.
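The structure of the experiment can be sketched in Python as follows. This is a simplified illustration rather than the attached code. It assumes aligned genomes of equal length stored as strings, takes the match count between two genomes to be the number of positions where they agree, and averages over every pairing of a genome with an archaic genome. The function names are hypothetical.

```python
def match_count(a, b):
    """Number of aligned positions at which two genomes agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def split_by_archaic_match(population, archaic, threshold=0.6):
    """Split a population into genomes that match at least one archaic
    genome on `threshold` of their bases or more, and those that do not."""
    matched, unmatched = [], []
    for genome in population:
        cutoff = threshold * len(genome)
        if any(match_count(genome, ref) >= cutoff for ref in archaic):
            matched.append(genome)
        else:
            unmatched.append(genome)
    return matched, unmatched

def mean_match_count(population, archaic):
    """Average match count over every (genome, archaic genome) pair."""
    counts = [match_count(g, ref) for g in population for ref in archaic]
    return sum(counts) / len(counts)
```

The comparison is then between `mean_match_count(unmatched_mongolians, denisovans)` and `mean_match_count(english, denisovans)`, where the first argument in each case is a list of genomes and the second is the list of archaic genomes.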

Attached is the code and the dataset. Any missing code can be found in [1].

On Norwegian Ancestry

I wrote a short script (attached below) that allows you to quickly compare the distribution of ethnicities associated with a given ethnicity. Out of curiosity, I applied it to Norwegians and Swedes, and as I noted in, A New Model of Computational Genomics [1], they’re different people, that are of course nonetheless closely related. However, Norwegians are much closer to the people of the Pacific, specifically, the Thai. This is obvious when you look at Stave Churches, which are almost identical in structure and aesthetic to Thai temples, and moreover, don’t look anything like a normal church. On the left is a Norwegian Stave Church, and on the right, is a Thai temple, both courtesy of Wikipedia.

In fact, it turns out the distribution of Stave Churches is concentrated almost exclusively in Norway, at least according to Google. There are others elsewhere in Europe, but it seems the Stave Churches in Scandinavia are generally limited to Norway. The map below is obviously courtesy of Google, and you can generate it yourself by simply typing in, “Stave Churches near Scandinavia”.


If you actually compare the distribution of associated ethnicities between Norwegians and Swedes, you get the chart below, which plainly shows that the Norwegians are much closer to the people of the Pacific, indigenous peoples generally, and some Africans. Specifically, TH stands for Thai (obviously in the Pacific), SI stands for the Solomon Islands (islands in the Pacific), SQ stands for the Saqqaq (indigenous people of Greenland), SM stands for Sami (indigenous peoples of Scandinavia and Russia), NG stands for Nigeria, and KH stands for Khoisan (a people in Southern Africa). The complete list of acronyms can be found at the end of [1]. The chart below shows a normalized rank for the Norwegians (i.e., a scale from 0 to 1), minus that same rank for the Swedes. This causes the values to range from 1 to -1. The rank for a given column is the normalized number of matches at 90% of a given genome, and all genomes are complete mtDNA genomes taken from the NIH database. That is, the algorithm first counts the number of Norwegians that are e.g., a 90% match to the Nigerians, and then normalizes that number from 0 to 1, with 0 being none of them, 1 being all of them. Then, that same data is produced for the Swedes, and the chart below shows the difference between the two. Informally, this is Norway minus Sweden, and so, e.g., column 1 shows that the Swedes are closer to the Kazakhs than the Norwegians (note that KZ stands for Kazakh), since it is a negative number.
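The calculation behind the chart can be sketched in Python as follows. This is an illustration, not the attached script. It assumes aligned genomes of equal length stored as strings, and it treats a genome as a 90% match to a target population if it agrees with at least one genome of that population on 90% or more of its positions. That reading of "match", and the function names, are assumptions.

```python
def match_count(a, b):
    """Number of aligned positions at which two genomes agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def match_fraction(population, target, threshold=0.9):
    """Fraction of `population` (0 to 1) matching at least one genome
    in `target` on `threshold` of its bases or more."""
    hits = sum(
        1 for genome in population
        if any(match_count(genome, t) >= threshold * len(genome) for t in target)
    )
    return hits / len(population)

def rank_difference(pop_a, pop_b, targets, threshold=0.9):
    """Per-target normalized rank of pop_a minus that of pop_b,
    so every value lies between -1 and 1."""
    return {
        label: match_fraction(pop_a, genomes, threshold)
               - match_fraction(pop_b, genomes, threshold)
        for label, genomes in targets.items()
    }
```

With Norwegians as `pop_a` and Swedes as `pop_b`, a positive value for a label means a larger share of Norwegians than Swedes match that population, and a negative value means the reverse.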

All of the genomes in the dataset can be found in [1], and come from the NIH Database. Moreover, all of the genomes have been diligenced to ensure that the ethnicity classifier is in fact the ethnicity of the person in question. So e.g., if a genome is classified as Norwegian in my dataset, then the notes associated with the genome either explicitly state that the person is Norwegian, or plainly indicate that the person is Norwegian (as opposed to a Swede living in Norway). The dataset contains a link to the NIH Database for every genome, where you can review the notes yourself.

Here’s the code, whereas the dataset (and any missing code), is linked to in [1].

Denisovan as Common Ancestor

I’ve noted many times that some Finns and Ashkenazi Jews are very closely related to Denisovans, with about 70% of their mtDNA genomes matching to Denisovans. See A New Model of Computational Genomics [1], generally. I’ve also noted that the Iberian Roma and Papuans are even closer to Heidelbergensis, with about 95% of their mtDNA genomes matching Heidelbergensis. I occasionally chip away at this work in my free time, and so I tested Sephardic mtDNA tonight, honestly expecting to find something different from Ashkenazi Jews, given that they really are different people, with different histories. It turns out, instead, that the three Sephardic genomes I found on the NIH website were also related to Denisovans, though they are closer to the Munda people of India than the Ashkenazi genomes are. This is surprising but not impossible. It just means that Jews really are a genetically distinct group of people, despite being a diaspora.

However, this got me thinking, that perhaps many people descend from Denisovans, since I was totally unable to find an archaic species that had an analogous relationship to the bulk of the dataset. That is, the Roma and Papuans (and many others scattered about the world) are extremely close to Heidelbergensis, suggesting they descend from Heidelbergensis. Similarly, some Finns and Ashkenazi and Sephardic Jews (and some others, again scattered about the world) are closely related to Denisovans, suggesting they descend from Denisovans. Many people are also a 95% match to Neanderthals, and they seem to be roughly the same group of ethnicities that are close to Heidelbergensis. The obvious question is, where do the rest of us come from?

To test this, I lowered the minimum match count to 30% of the genome, and compared the full dataset of genomes to Denisovans, Heidelbergensis, and Neanderthals. The results are plotted below, where the x-axis shows the acronym of the population in question, and the y-axis shows a normalized count of matching genomes, where a given genome constitutes a match if it has at least 30% in common with the applicable archaic genomes. The table of acronyms can be found at the end of [1], and there are links to the dataset I used in [1] as well. All genomes are complete mtDNA genomes taken from the NIH website.

There is at least one Neanderthal genome that has no connection to Heidelbergensis at all. In contrast, the Denisovans are related to both the Neanderthals and Heidelbergensis. This implies that the Denisovans are the common ancestor of both Heidelbergensis and at least that particular Neanderthal. See Section 6.1 of [1]. As a consequence, it seems we all descend from Denisovans. Also, out of curiosity, I lowered the match threshold for the Ancient Romans as well, since I’ve been otherwise unable to find a match, and it seems they have basically the same distribution as the Basque, Igbo, Munda, and Northern Europeans, all of whom are closely related, despite being totally different people.

The Ancient Egyptians were plainly wiped out, given that their appearance changed drastically over a very short period of time, and this is actually reflected in their genetics as well, with Pre-Roman Egyptians forming a distinct genetic group from Egyptians during the time of Rome. These things of course do happen in history, but the shift is drastic, from what are plainly Asian people (genetically and morphologically) to European people. That’s just not normal, and there’s no history to my knowledge that explains it. On the left is Menkaure and Queen Khamerernebty II (c. 2,530 BCE), courtesy of MFA Boston, in the center is Nefertiti (c. 1,370 BCE), courtesy of Wikipedia, and on the right is Cleopatra (c. 50 BCE), courtesy of Wikipedia, who plainly looks nothing like the rest of them.


Moreover, there are literally no living people that are a 90% match to the Ancient Romans. This is simply impossible, without deliberate genocide, as plenty of people are a 99% match to a far older Ancient Egyptian genome, and plenty of other ancient peoples. Keep in mind, Rome was an enormous empire at its height, far larger than Ancient Egypt. The logical conclusion is that the Ancient Egyptians and Ancient Romans were related people, and both were subjected to comprehensive genocide.