Unsupervised Classification and Knowledge

I’ve never been able to prove formally why my unsupervised classification algorithm works; in fact, I’ve only been able to provide a loose intuition, rooted in how I discovered it: as you tighten the focus of a camera lens, the changes near the correct focus are non-linear, in that the object quickly comes into focus. And so I searched for the greatest change in the structure of a dataset as a function of discernment, which works incredibly well, especially for an unsupervised algorithm. In contrast, the supervised version of that algorithm has a simple proof, which you can find in my paper Analyzing Dataset Consistency. However, it just dawned on me that I explained why it works, though not as a formal proof, in my other paper, Information, Knowledge and Uncertainty. Specifically, the opening example I give is a set of boxes, one of which contains a pebble, where the task is to guess which box the pebble is in. If someone tells you that the pebble is not in the i-th box, then your uncertainty is reduced. But the reason it’s reduced is that the system is now equivalent to a system with one less box. In contrast, the rest of the examples I give in that paper deal with static observations that have some fixed uncertainty.
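To make the box example concrete, here is a minimal Python sketch (my own illustration, not code from either paper) that treats the Uncertainty of the pebble-in-a-box system as log2(N) over N equally likely boxes, and shows how being told the pebble is not in a given box reduces it:

```python
import math

def box_uncertainty(n_boxes: int) -> float:
    # Uncertainty of a uniform distribution over n_boxes equally likely
    # boxes, in bits: U = log2(N).
    return math.log2(n_boxes)

# Start with 8 boxes, one hidden pebble.
u_before = box_uncertainty(8)

# Being told "the pebble is not in box i" leaves a system equivalent
# to one with 7 boxes, so the uncertainty drops to log2(7).
u_after = box_uncertainty(7)

# The Knowledge gained is the reduction in Uncertainty.
knowledge_gained = u_before - u_after
print(round(u_before, 4), round(u_after, 4), round(knowledge_gained, 4))
# → 3.0 2.8074 0.1926
```

The reduction per eliminated box grows as the box count shrinks, which is the non-static behavior the example is meant to highlight.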

Applying this to my unsupervised clustering algorithm, the point at which the entropy (i.e., Uncertainty) changes the most is also the point at which your Knowledge changes the most, due to the simple equation I = K + U. As a consequence, my unsupervised clustering algorithm finds the point at which your knowledge changes the most as a function of the structure of the dataset. All points past that reduce the size, and therefore the information content, of the clusters, without materially adding to Knowledge. Specifically, if you unpack the equation a bit more, I = N log(N), where N is the number of states of the system. In the case of the box example, N is the number of boxes when there’s one pebble. In the case of a distribution, it’s the total number of elements in the distribution. And as you increase the threshold for inclusion in a cluster, the cluster size shrinks, thereby decreasing N. If it turns out that the size of the problem space generally decreases faster than the entropy (i.e., Uncertainty U), then your Knowledge actually decreases as the problem space decreases in size. As a consequence, the unsupervised algorithm finds the point where the entropy of the problem space changes the most as a function of the threshold for inclusion, which is the point where you get the most Knowledge per unit of change. I suppose, upon reflection, the correct method is to find the point where the entropy of the problem space changes the most as a function of the size of the problem space. That said, my software plainly works, so there’s that.
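As a rough illustration of the idea (not my actual software; the one-dimensional data, the anchor point, and the U = log2(N) uniformity assumption are all simplifications for this sketch), you can sweep the inclusion threshold and pick the value at which the cluster entropy changes the most:

```python
import math

def cluster_entropy(data, anchor, delta):
    # Treat every point within delta of the anchor as in the cluster,
    # and assume a uniform distribution over the N in-cluster points,
    # so that the Uncertainty is U = log2(N) (a simplifying assumption).
    n = sum(1 for x in data if abs(x - anchor) <= delta)
    return math.log2(n) if n > 0 else 0.0

def best_threshold(data, anchor, deltas):
    # Sweep the inclusion threshold and return the delta at which the
    # entropy changes the most between consecutive threshold values.
    entropies = [cluster_entropy(data, anchor, d) for d in deltas]
    changes = [abs(entropies[i + 1] - entropies[i])
               for i in range(len(entropies) - 1)]
    i = max(range(len(changes)), key=lambda j: changes[j])
    return deltas[i + 1]

# Toy data: a tight cluster near 0 plus two distant outliers.
data = [0.0, 0.1, 0.2, 0.3, 5.0, 9.0]
deltas = [0.5 * k for k in range(1, 21)]  # thresholds 0.5 .. 10.0
print(best_threshold(data, anchor=0.0, deltas=deltas))
# → 5.0
```

On this toy data the entropy is flat while the threshold sweeps over empty space, then jumps when the first outlier enters the cluster, and that discontinuity is what the sweep picks out.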

In any case, this is not a proof, but it is a mathematical explanation. What I’m starting to come around to is the idea that some phenomena, perhaps even some algorithms, function as a consequence of epistemological truths of reality itself. You can definitely accuse me of laziness, in that I can’t formally prove why the algorithm works, but that dismisses the possibility that some things might be true from first principles that defy any further logical justification, in that they form axioms consistent with reality itself. In that case, there is no proof beyond the empirical fact that Knowledge changes sub-optimally past the point identified by the algorithm. The reason I believe this is possible is that the equation I = K + U follows solely from the tautology that all things are either in a set or not, and there is, as far as I know, no other proof that this is true. Moreover, the equation works, empirically, so in this view it is an equation that has no further logical justification, one that operates like an equation of physics. The more general premise at least suggests the possibility of algorithms that defy further logical justification beyond empiricism.

The reason I thought of this is that I was working on clustering populations on the basis of mtDNA, and I noticed the same thing that happened when I first started my work in A.I. –

There was a massive discontinuous change in cluster entropy as a function of the inclusion threshold. When I looked at the results, it produced meaningful population clusters, where, e.g., both Japanese and Finnish people are treated as homogeneous, and basically everyone else is heterogeneous. This was totally unsupervised, with no information at all other than raw mtDNA, and it’s obviously correct. Moreover, Sweden and Norway produced basically the same heritage profile, and even terminate at exactly the same iterator value –

This is consistent with the fact that Norwegians and Swedes are genetically closer to each other than they are to the Finns. The Finns also speak a totally different, Uralic language, whereas Norwegian and Swedish are both Germanic, and so in this case heritage follows language, which is not necessarily always true, for the simple reason of conquest. For example, the Swedes and Norwegians had their own alphabet, the Runic Scripts, and now they don’t; they use the Latin alphabet like everyone else, because of what is basically conquest.

Above are the plots for the Finnish, Japanese, Swedish, and Norwegian heritage profiles I mentioned, and below is the code and a link to the dataset. You might be wondering how it is that of the few populations that map to Finland, Nigeria is among them. Well, it turns out, 87% of the complete Finnish mtDNA genome maps to a 2,000 year old Ancient Egyptian genome. They also map basically just as closely to modern Egyptians. All of this data is from the National Institute of Health, and you can find all of it by entering the following search query into the NIH Database:

ETHNICITY AND ddbj_embl_genbank[filter] AND txid9606[orgn:noexp] AND complete-genome[title] AND mitochondrion[filter]

Just replace ETHNICITY with Norway, Egypt, etc. Isn’t life something when you actually do the work?
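If you want to generate the queries programmatically, a trivial helper (hypothetical, just string substitution into the query above) looks like this:

```python
def nih_query(ethnicity: str) -> str:
    # Build the NIH database search query from the text, substituting
    # the ethnicity keyword (e.g., "Norway", "Egypt") for ETHNICITY.
    return (f"{ethnicity} AND ddbj_embl_genbank[filter] "
            "AND txid9606[orgn:noexp] AND complete-genome[title] "
            "AND mitochondrion[filter]")

print(nih_query("Norway"))
```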

Here’s the code:


Here’s the dataset:


The dataset now includes 19 ethnicities (listed below), and it’s simply fascinating to dig into, and there’s a bunch of software in the previous posts you can use to probe it.

Kazakh, Nepalese, Iberian Roma, Japanese, Italian, Finnish, Hungarian, Norwegian, Swedish, Chinese, Ashkenazi Jewish, German, Indian, Swiss, Nigerian, Egyptian, Turkish, English, Russian.

