I introduced an algorithm that can predict ethnicity using mtDNA alone, with an accuracy of about 80%. See Section 5 of A New Model of Computational Genomics [1]. This is already really impressive, because mtDNA is inherited strictly from the mother, with basically no changes from one generation to the next. So how could it be that you can predict ethnicity using mtDNA alone, when ethnicity is the product of both maternal and paternal lineage? I have some ideas, and you can see [1] generally for a discussion. However, I realized that I was accidentally leaking information about the testing dataset into the training dataset, which is obviously not good practice in Machine Learning. I corrected this oversight, and the accuracy consistently increased, from about 80% to about 85%. This might seem small, but it’s significant, and it’s consistent, suggesting that removing information about the testing dataset from the training dataset actually improved accuracy. I’m not going to waste time figuring out the specifics of what happened, but it is worth noting that, in the abstract, this is an instance of an algorithm that is better at predicting exogenous data than endogenous data. This is really strange, and counterintuitive, because it must be the case that including a given row actually harms predictions about that row, and improves predictions about other rows. Said otherwise, rows contribute positively to predictions about other rows, and negatively to predictions about themselves. To make the math work, the net contribution of a given row to the predictive power of the algorithm overall must be positive, but that does not preclude the possibility of a given row subtracting from the accuracy of predictions about itself.
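To make the general point concrete, here is a generic, hypothetical illustration of this kind of leakage, and not the actual pipeline from [1]: any statistic the model depends on should be computed from the training rows only, and then applied unchanged to the test rows.

```python
# A generic, hypothetical illustration of train/test leakage, not the actual
# pipeline from [1]: any statistic the model depends on (here, the per-feature
# means used for centering) should be computed from the training rows only,
# and then applied unchanged to the test rows.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))     # stand-in for encoded mtDNA rows
train, test = X[:80], X[80:]

# Leaky: the centering statistic "sees" the test rows.
mu_leaky = X.mean(axis=0)
train_leaky, test_leaky = train - mu_leaky, test - mu_leaky

# Corrected: the statistic is computed on the training rows alone.
mu_clean = train.mean(axis=0)
train_clean, test_clean = train - mu_clean, test - mu_clean
```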
As a general matter, this suggests a framework in which each observation contributes information to a model, and that information relates either to the observation itself or to other observations. This would then be an instance of the case where each observation carries negative information about itself, and positive information about other observations. Upon reflection, this is what mtDNA must do: it must provide information about paternal genes, which are most certainly not contained in mtDNA. That is, mtDNA carries information about an exogenous set of observations, specifically, where the father’s line is from. Similarly, each genome in the dataset carries information about the other genomes, and its inclusion somehow damages predictions about itself.
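One hedged way to test this claim empirically, which is not a procedure from [1], is a leave-one-out comparison: for each row, build the model with and without that row, and measure the effect on predictions about the row itself versus predictions about the remaining rows. In the sketch below, fit and predict are placeholders for whatever model-building and prediction functions you are using.

```python
# Hedged empirical check of the claim above (not from [1]): if the claim
# holds, including row i should hurt accuracy on row i itself and help
# accuracy on the other rows, on average. `fit(X, y) -> model` and
# `predict(model, X) -> labels` are placeholders.
import numpy as np

def row_contributions(X, y, fit, predict):
    n = len(y)
    full_model = fit(X, y)                      # trained on every row
    self_delta, other_delta = [], []
    for i in range(n):
        mask = np.arange(n) != i
        held_out_model = fit(X[mask], y[mask])  # trained without row i
        # (a) effect of including row i on predictions about row i
        self_delta.append(
            int(predict(full_model, X[i:i + 1])[0] == y[i])
            - int(predict(held_out_model, X[i:i + 1])[0] == y[i]))
        # (b) effect of including row i on predictions about the other rows
        # (the other rows remain in both training pools, so the difference
        # isolates row i's contribution)
        other_delta.append(
            np.mean(predict(full_model, X[mask]) == y[mask])
            - np.mean(predict(held_out_model, X[mask]) == y[mask]))
    return float(np.mean(self_delta)), float(np.mean(other_delta))
```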
This is counterintuitive, until you accept the abstract idea that an observation conveys information, and then there’s a secondary question about whether that information relates to the observation itself, or to other observations. Statements can of course be self-referential, or not, since, e.g., I can tell you that, “John is wearing a raincoat, and I am simply carrying an umbrella.” That statement conveys information about John and about me, and so it provides both exogenous and endogenous information. What’s counterintuitive in the case of this genetics algorithm is that, viewing each observation as a statement, it provides reliable exogenous information and only noise as endogenous information, thereby contributing positively to the whole, yet subtracting from the knowledge already stored about itself. This is consistent with the fact that the Nearest Neighbor algorithm performs terribly compared to the method I introduce in Section 5 of [1]. See Section 3 of [1], where accuracy is around 39%. Specifically, this seems to be a result of the fact that each row contributes negative information about itself; and though the Nearest Neighbor algorithm relies somewhat upon context, since, e.g., inserting a new row could change the nearest neighbors, the prediction is ultimately limited to a single row. In contrast, the method introduced in Section 5 of [1] first processes the dataset, producing a new dataset based upon averages taken over the entire dataset. As a consequence, each row directly contributes to a new whole, whereas in the Nearest Neighbor method, there is no new product based upon the whole, as the sketch below illustrates.
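Here is a minimal sketch of that structural difference, with placeholder implementations; this is not the code from Section 5 of [1], and the per-class centroids below merely stand in for “averages taken over the entire dataset.” The point is simply that a 1-NN prediction is determined by exactly one training row, whereas every training row contributes to the centroids.

```python
# Minimal sketch, placeholder implementations only (not the code from [1]):
# contrast a prediction determined by a single row with a prediction made
# from averages to which every training row contributes.
import numpy as np

def nn_predict(x, X_train, y_train):
    """1-Nearest-Neighbor: the label comes from exactly one training row."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

def centroid_predict(x, X_train, y_train):
    """Predict from per-class averages, a product of the whole training set."""
    labels = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in labels])
    return labels[np.argmin(np.linalg.norm(centroids - x, axis=1))]
```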
Applied physically, you arrive at something like Quantum Entanglement, coupled with Quantum Uncertainty. Specifically, making an observation would in this case reduce your knowledge about the system you’re observing, thereby increasing your uncertainty, and at the same time, provide information about some exogenous, and presumably entangled, system. All of this suggests at least the possibility of some fundamental accounting of the sort presented in Information, Knowledge, and Uncertainty [2]. Specifically, it seems at least possible that the less information a given observation conveys about itself, the more it conveys about some other system. If this is physically true, then the consequences would be astonishing, in that measurement of X literally destroys information about X, yet reveals information about Y. This could cause, e.g., systems to actually vanish from observation, and thereby reveal others. Applying this to Quantum Entanglement, measurement of one entangled particle causes the other entangled particle to change, because of exactly this accounting: that is, the second particle changes, to destroy the information you had prior to the measurement of the first particle, thereby preserving a zero-sum accounting. If the particles happen to have a mathematical relationship, then you might still know the state of both, through the change made to one. The point, however, is that this theory implies the possibility of a change to one entangled particle causing the other to change to an unknown state. So it’s not concerned with your subjective knowledge; instead, the mechanism involved is one that relates to the destruction of information.
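One way to write down the accounting I have in mind, purely as an illustration and not as an equation taken from [2], is that a measurement m of X trades knowledge about X for knowledge about an entangled system Y, with the total conserved:

```latex
% Illustrative formalization only, not an equation from [2]:
% K(S) is the observer's knowledge of system S, and \Delta K(S \mid m) is the
% change in that knowledge caused by a measurement m performed on X.
\Delta K(X \mid m) + \Delta K(Y \mid m) = 0,
\qquad \Delta K(X \mid m) < 0,
\qquad \Delta K(Y \mid m) > 0.
```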
Obviously, this is abstract, and not supported by this one limited example, but this type of expanded Ramsey Theory, which I’m starting to think of as the physical meaning of mathematics, has in my experience produced practical, physical results. Specifically, [2] is a very serious paper that allowed me to write A.I. software that runs so much faster than anything else I’ve heard of, that I’m not sure anything else is useful.