A while ago, I wrote an algorithm that calculates the distribution of bases at each index of a genome, over a given population. For mtDNA, there are 16,579 bases in each genome (in the dataset attached below), and the algorithm calculates the density of Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), at each index, over an entire dataset. This associates each of the 16,579 indexes with 4 real numbers, each in [0, 1], that sum to 1, for the simple reason that each base must be one of the four. However, note there are minor deviations due to missing bases. The distribution of bases over the entire dataset is unequal, and that’s interesting in and of itself. However, I just tested the distribution of bases within each population (i.e., each ethnicity), to see if there’s meaningful variation in the distribution of bases. It turns out there is meaningful variation, suggesting at least the possibility that different populations select for different distributions of bases.
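The per-index calculation described above can be sketched in a few lines of Python. This is only an illustration, not the attached MATLAB code: the function name `index_densities`, the use of `N` for a missing base, and the toy genomes are all assumptions made for the example.

```python
from collections import Counter

BASES = "ATGC"  # column order used throughout this sketch

def index_densities(genomes):
    """Return one [A, T, G, C] density row per index, taken over all genomes.

    Assumes every genome string has the same length. Any character that is
    not one of the four bases (e.g., 'N' for a missing base) is simply not
    counted, so a row can sum to slightly less than 1.
    """
    n = len(genomes)
    length = len(genomes[0])
    rows = []
    for i in range(length):
        counts = Counter(g[i] for g in genomes)
        rows.append([counts[b] / n for b in BASES])
    return rows

# Hypothetical toy data: four genomes of length 4, one missing base.
toy = ["ATGC", "ATGA", "ANGC", "TTGC"]
for i, row in enumerate(index_densities(toy)):
    print(i, row, sum(row))
```

In the toy data, the second index has one missing base, so its four densities sum to 0.75 rather than 1, which is the kind of minor deviation noted above.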
Specifically, the method I employed first calculates the distribution of bases at each index in a given population. It then calculates the overall distribution of bases within that population, producing four numbers that represent the densities of the four bases within that population, and that should generally sum to 1, save for missing bases. This is done for each population, and there are 76 populations in this dataset, producing a 76 x 4 matrix. I then calculated the standard deviation of each column, as a measure of the variation in the density of each base across the 76 populations. The standard deviations of the densities of A, T, G, and C, across all populations, are 0.0464%, 0.0367%, 0.0295%, and 0.0914%, respectively. These are small percentages, but they’re not uniform, and moreover, they’re not totally negligible quantities.
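As a rough sketch of the steps above, the following Python builds the populations-by-4 matrix and takes the standard deviation of each column. The names `population_density` and `density_spread` and the toy populations are hypothetical, and the choice of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which normalization was used.

```python
from collections import Counter
from statistics import stdev

BASES = "ATGC"  # column order of the resulting matrix

def population_density(genomes):
    """Overall [A, T, G, C] densities over every base of every genome."""
    counts = Counter()
    total = 0
    for g in genomes:
        counts.update(g)   # count each character in the genome string
        total += len(g)    # missing bases still count toward the total
    return [counts[b] / total for b in BASES]

def density_spread(populations):
    """Column-wise sample standard deviation of the populations x 4 matrix."""
    matrix = [population_density(g) for g in populations.values()]
    return [stdev(col) for col in zip(*matrix)]

# Hypothetical toy data: three populations of two short genomes each.
toy_populations = {
    "pop_a": ["ATGC", "ATGC"],
    "pop_b": ["ATGC", "ATCC"],
    "pop_c": ["AAGC", "ATGC"],
}
for base, s in zip(BASES, density_spread(toy_populations)):
    print(base, round(100 * s, 4), "%")
```

With the real dataset, `populations` would hold 76 entries, and the four printed values would correspond to the four standard deviations reported above.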
One initial observation: the Guanine-Cytosine bond is the strongest, and C is plainly more variable than the other bases. This is consistent with selection around the density of Guanine and Cytosine. Why would the distribution of bases matter? Well, the cytoplasm contains at any given moment a fixed distribution of bases, upon which all genomes draw to effectuate replication. As a consequence, there is by definition competition among genomes, and perhaps other parts of the cell, for the free-floating bases in the cytoplasm, and so the distribution of bases within a genome could impact cellular function and health. If the supply of bases in the cytoplasm is adequate to effectuate replication without any shortfalls, then a genome that draws disproportionately on a particular base would shift the distribution of the bases left over in the cytoplasm. This could in turn affect cellular function and health. If the supply of bases in the cytoplasm is not adequate to effectuate replication without any shortfalls, in light of an excess of a particular base in a given genome, then this could again impact cellular function and health. As a general matter, intuition suggests a parity between the supply of bases in the cytoplasm and the demand for bases implied by the distribution of bases in the genomes generally. However, as noted, even in the absence of such a parity, the distribution of bases in the genomes and the cytoplasm could impact cellular function and health.
This however does not explain why there would be greater selection for Guanine-Cytosine bonds (i.e., the standard deviation is significantly higher for C), and moreover, why the standard deviation for G is the lowest of them all. One simple explanation is that there’s a handedness to the genome, and that the side being read by the sequencer is at least partially determinative of the distribution. Note that the opening sequence of bases is the same across the genomes in the dataset below, and as a consequence, the sequencer is always reading the same side of the genome. One simple theory to explain the handedness of mtDNA is that it’s circular, and as a consequence, the genome has an “inside” and an “outside”. Now of course the other side of the genome must have a corresponding distribution (i.e., the other side of the genome must have a highly variable density of Guanine). However, the point is, the variability of Cytosine is much higher than that of the other bases, and as a consequence, it’s at least consistent with greater selection for Guanine-Cytosine bonds. The Guanine-Cytosine bonds happen to be the strongest (i.e., they are stronger than Adenine-Thymine bonds), and so it’s perfectly reasonable that they impact, e.g., the ease of replication and the integrity of the genome generally, which could present both advantages and disadvantages.
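The point about the other side of the genome follows directly from base pairing, and a small Python sketch makes it concrete. The helper names `complement_strand` and `density` and the toy sequence are assumptions made for the example. The complement is not reversed here, since only the counts of each base matter for the densities.

```python
# Watson-Crick pairing: A pairs with T, and G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(seq):
    """Return the bases on the opposite strand (unknown characters kept as-is)."""
    return "".join(COMPLEMENT.get(b, b) for b in seq)

def density(seq):
    """Fraction of the sequence made up of each of the four bases."""
    return {b: seq.count(b) / len(seq) for b in "ATGC"}

# Hypothetical toy strand, as read by the sequencer.
read_side = "AACCCGTT"
other_side = complement_strand(read_side)

print(density(read_side))
print(density(other_side))
```

The density of C on the side that is read equals the density of G on the other side, and likewise for A and T, so any variability in C across populations on one strand appears as variability in G on the opposite strand.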
Note that the distribution at a given index in the genome will not directly determine the total distribution of bases in the genome. That is, if you, e.g., swap two indexes in the genome, the overall distribution is unchanged, and therefore, all of the arguments presented above regarding the distribution in the cytoplasm are still true. However, this does not rule out selection at each index, since as noted, Guanine-Cytosine bonds are stronger than Adenine-Thymine bonds, and as a consequence, there could be advantages and disadvantages with respect to the integrity of the genome, based upon the distribution of bases along the genome indexes. This is a fascinating and, to my knowledge, unexplored corner of genetics that relates to selection based upon the overall distribution of bases (for purposes of cellular function), and their location along the genome (for purposes of genome integrity). This could explain the empirical fact that statistical imputation for mtDNA is categorically superior to sequential imputation. See Section 7 of A New Model of Computational Genomes. That is, there is significant selection taking place based upon whole-genome functions (i.e., random bases), beyond protein production (i.e., sequential bases). This suggests at least the possibility that whole-genome functions are more aggressively selected for than protein production. At first this sounds counter-intuitive, but it makes perfect sense:
If the cell can’t function, and the genome’s not stable, then it doesn’t matter what proteins are coded for, because that’s like writing on a burning piece of paper.
Putting it all together, once a population-specific mutation occurs, the genome’s integrity as a whole is subject to the possibility of deterioration, for the simple reason that a complex set of bonds has by definition been disrupted. This should therefore cause intense selection in the rest of the genome, so that the mutation can exist in a structurally sound context. This would cause whole-genome selection, and explain the empirical fact that statistical imputation is stronger than sequential imputation, at least with respect to mtDNA. Moreover, the arguments above regarding the cytoplasm support the same hypothesis. If, e.g., a mutation drastically changes the distribution of bases in a genome, then replication will draw differently on the bases floating in the cytoplasm, potentially causing disruptions to cellular function and health. As a consequence, whole-genome selection could remedy any drastic changes to the draw on the distribution of bases in the cytoplasm. However, unlike the structural integrity of a given genome, which requires whole-genome selection, the overall distribution of bases in the cytoplasm could itself adjust to the mutation, or other genomes could adjust to the mutation.
Attached is the code and the dataset:
https://www.dropbox.com/s/lrgw7wn3im0zcri/Calc_Base_Density_CMNDLINE.m?dl=0
https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0