Global Alignments for Heidelbergensis

I ran an algorithm on the full dataset that finds the best global alignment between two genomes. I applied it to a complete Heidelbergensis mtDNA genome, comparing it to every other mtDNA genome in the dataset below (405 complete genomes), and it turns out you get exactly the same population as you do using the default NIH alignment. See Section 1.3 of Vectorized Computational Genomics [1] for a discussion of the default NIH alignment. Note that the acronyms for the population names in the graphs below can also be found at the end of that paper. Specifically, on the left below is the distribution of genomes that are at least a 96% match with Heidelbergensis using the default NIH alignment, and on the right is the distribution of genomes that are at least a 96% match with Heidelbergensis using the global alignment that maximizes the number of matching bases. The latter is found by shifting one genome one index at a time and counting the matching bases at each shift. Because mtDNA is circular, the bases that go past the end of the genome wrap around to the beginning. The obvious conclusion is that these populations really are anomalously closely related to Heidelbergensis.
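To make the shifting procedure concrete, here is a minimal Python sketch of a circular-shift search. It is not the code linked in this post. The function names are made up for illustration, and it assumes both genomes are plain strings of bases, compared position by position over the length of the shorter one.

```python
def match_count(a: str, b: str) -> int:
    """Count positions where the two sequences carry the same base.

    zip() stops at the shorter sequence, so any trailing bases of the
    longer one are ignored (an assumption of this sketch).
    """
    return sum(x == y for x, y in zip(a, b))


def best_circular_shift(query: str, reference: str) -> tuple[int, int]:
    """Return (shift, matches) for the rotation of `query` that best fits `reference`.

    A shift of k moves every base k positions to the right; the k bases
    that fall off the end wrap around to the beginning, which models the
    circular mtDNA genome. Ties go to the smallest shift.
    """
    best_shift, best_matches = 0, -1
    for shift in range(len(query)):
        rotated = query[-shift:] + query[:-shift] if shift else query
        matches = match_count(rotated, reference)
        if matches > best_matches:
            best_shift, best_matches = shift, matches
    return best_shift, best_matches


# Toy example: `query` is `reference` rotated three positions to the left,
# so shifting it three positions to the right recovers a perfect match.
reference = "ACGTACGGTA"
query = reference[3:] + reference[:3]
shift, matches = best_circular_shift(query, reference)
match_fraction = matches / len(reference)
```

Shift 0 corresponds to comparing the two sequences exactly as given, so in this sketch a fixed alignment is just the special case where no search is done.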

However, this does not hold for thresholds below 96%, as the global alignment algorithm quickly produces a much denser distribution for all populations. For example, below are the same two distributions produced using a minimum 80% match to Heidelbergensis. As you can plainly see, the global alignment (right) is much denser, with nearly 100% of every population at least an 80% match to Heidelbergensis. The plain takeaway is that the default NIH alignment is much more meaningful, because it filters the results, forcing acknowledgment of insertions and deletions, which again can cause drastic changes to morphology and behavior.
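As a rough illustration of how the threshold changes the picture, the sketch below computes, for each population, the share of genomes whose match fraction meets a given minimum. It assumes that is what the graphs plot. The population labels and match fractions are invented for the example, and `share_above_threshold` is a hypothetical helper, not part of the linked code.

```python
from collections import defaultdict


def share_above_threshold(records, threshold):
    """Return {population: share of its genomes with match fraction >= threshold}.

    `records` is an iterable of (population_label, match_fraction) pairs.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for population, fraction in records:
        totals[population] += 1
        if fraction >= threshold:
            hits[population] += 1
    return {population: hits[population] / totals[population] for population in totals}


# Invented data: two populations, two genomes each.
records = [("A", 0.97), ("A", 0.85), ("B", 0.81), ("B", 0.82)]

strict = share_above_threshold(records, 0.96)  # only one genome in A qualifies
loose = share_above_threshold(records, 0.80)   # every genome qualifies
```

In this toy data, lowering the threshold from 96% to 80% makes every population look fully matched, which is the kind of flattening described above.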

Also, the nearest neighbor of the Heidelbergensis genome is unchanged whether you use the default NIH alignment or search for the globally best alignment. This again suggests that searching for a globally best alignment is more trouble than it’s worth, unless you’re deliberately searching for insertions and deletions within a pair of genomes. In particular, it takes about an hour to find the nearest neighbor of every genome using the globally best alignment, whereas it takes about 25 seconds using the single default NIH alignment. Finally, I’ll note that using the default NIH alignment allows you to reliably predict ethnicity using mtDNA alone (i.e., only the maternal line). See [1] generally. This is actually astonishing, and though I haven’t tested the question, given the distribution above on the right, I would wager you’re not going to get good results using the best global alignment, since it causes all genomes to be roughly the same, precisely because it ignores insertions and deletions.
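For reference, a nearest-neighbor search under a single fixed alignment can be sketched as follows. This is an illustrative Python version, not the code behind the timings above. It assumes the genomes are already aligned strings, and `nearest_neighbor` is a hypothetical name.

```python
def nearest_neighbor(index, genomes):
    """Return (neighbor_index, matches) for the genome sharing the most bases
    with genomes[index], comparing position by position with no shifting.

    Ties go to the genome that appears first in the list.
    """
    target = genomes[index]
    best_index, best_matches = -1, -1
    for j, candidate in enumerate(genomes):
        if j == index:
            continue  # a genome is not its own neighbor
        matches = sum(x == y for x, y in zip(target, candidate))
        if matches > best_matches:
            best_index, best_matches = j, matches
    return best_index, best_matches


# Toy sequences standing in for aligned genomes.
genomes = ["ACGTAC", "ACGTAA", "TTTTTT", "ACGGGG"]
neighbor, matches = nearest_neighbor(0, genomes)
```

Each pair is compared once here. A search for the globally best alignment would repeat that comparison for every shift of every pair, which is consistent with the gap in running time reported above.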

Here’s the dataset and the code:

