# Variance-Based Clustering

I’ve long noted the connections between standard deviation, entropy, and complexity, and in fact, my first image clustering algorithm was based upon the connections between these measures (see Section 2 of my paper, “A New Model of Artificial Intelligence”). However, the focus of my work shifted to entropy-based clustering, simply because it is so effective. I’ve since discovered an even simpler, third model of clustering, which is based on counting the number of points in a cluster as you iterate through cluster sizes. It is incredibly efficient, since it requires only basic operations to work.

However, in order to operate, it requires a number that expresses distance in the dataset. Specifically, the algorithm operates by beginning at a given point in the dataset, and moving out in quantized distances, counting all the points that are within that distance of the original point. Any sufficiently small value will eventually work, but it will obviously affect the number of iterations necessary. That is, if your quantized distance is too small for the context of the dataset, then your number of iterations will be extremely large, causing the algorithm to be slow. If your quantized distance is too big, then you’re going to jump past all the data, and the algorithm simply won’t work.
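The counting step described above can be sketched as follows. This is a minimal Python illustration, not the Octave implementation: the function name `count_profile`, its parameters, and the toy points are all hypothetical, and it deliberately leaves out any rule for deciding when a cluster is complete, since that rule isn't described here.

```python
import math

def count_profile(points, anchor, step, n_steps):
    """Count the points within radius k * step of `anchor`, for k = 1..n_steps.

    Hypothetical helper for illustration only; it records the counts and
    does not decide where a cluster ends.
    """
    # Distance from the starting point to every point, computed once.
    dists = [math.dist(anchor, p) for p in points]
    counts = []
    for k in range(1, n_steps + 1):
        radius = k * step  # move out by one quantized distance per iteration
        counts.append(sum(d <= radius for d in dists))
    return counts

# Toy data (made up): three points near the origin, two far away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 10.0), (11.0, 10.0)]

profile = count_profile(points, anchor=(0.0, 0.0), step=1.0, n_steps=4)
too_big = count_profile(points, anchor=(0.0, 0.0), step=20.0, n_steps=1)
print(profile)  # counts level off once the nearby group is covered
print(too_big)  # a single oversized step takes in every point at once
```

With a step of 1.0, the count stops growing once the three nearby points are covered. With a step of 20.0, the very first step jumps past all the data, which is the failure mode described above.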

As noted in this article on identifying macroscopic objects, I initially used my value “delta”, which you can read about in my original paper linked to above. It is reasonably fast to calculate, but can nonetheless start to take time once your dataset has hundreds of thousands, or millions, of vectors. Since the goal of this round of articles is to build up a model of thermodynamics, I need to be able to quickly process millions of vectors, preferably tens of millions, to get a meaningful snapshot of the microstates of a thermodynamic system.

What I realized this morning is that you can take the difference between adjacent entries in the dataset, after it’s been sorted, and this will give you a meaningful measure of how far apart items in the dataset really are. More importantly, this is a basically instantaneous calculation, which in turn allows my new clustering algorithm to run with basically no preprocessing.

Five statistical spheres, after being clustered, colored by cluster.

The results are simply astonishing:

Using a dataset of 2,619,033 Euclidean 3-vectors, that together comprise 5 statistical spheres, the clustering algorithm took only 16.5 seconds to cluster the dataset into exactly 5 clusters, with absolutely no errors at all, running on an iMac.

## Complexity of the Algorithm

Sort the dataset by row values, and let $X_{min}$ be the minimum element, $X_{max}$ be the maximum element, and let $N$ be the number of elements. Then take the norm of the difference between adjacent entries, $Norm(i) = ||X(i) - X(i+1)||$ for $i = 1, \ldots, N-1$, and let $\mu$ be the average over that set of norms.
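As a small illustration of how $\mu$ could be computed, here is a Python sketch. It assumes that "sort by row values" means a lexicographic sort of the rows, which may differ from what the Octave code does, and the name `mean_adjacent_gap` and the toy points are hypothetical.

```python
import math

def mean_adjacent_gap(points):
    """Average Euclidean distance between adjacent rows of the sorted dataset.

    Assumption: rows are sorted lexicographically (first coordinate, then
    second, and so on). Hypothetical helper for illustration only.
    """
    if len(points) < 2:
        raise ValueError("need at least two points to form an adjacent pair")
    rows = sorted(points)
    # Norm(i) = ||X(i) - X(i+1)|| for each adjacent pair of sorted rows.
    norms = [math.dist(a, b) for a, b in zip(rows, rows[1:])]
    return sum(norms) / len(norms)

# Toy data (made up): after sorting, the adjacent gaps are 1, 2, and 4.
points = [(3.0, 0.0), (0.0, 0.0), (1.0, 0.0), (7.0, 0.0)]
mu = mean_adjacent_gap(points)
print(mu)  # the average of the three gaps
```

The cost is one sort plus a single pass over the rows, so it stays cheap even as the number of vectors grows.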

In the worst case, the complexity is $O(N \frac{||X_{min} - X_{max}||}{\mu})$. However, if the dataset consists of $K$ clearly defined objects, then its complexity is $O(K \frac{||X_{min} - X_{max}||}{\mu})$, and is therefore independent of the number of vectors in the dataset.
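One way to read these bounds, under the assumptions that the quantized distance is $\mu$, that each cluster is grown from a single starting point, and that each step counts as one vectorized operation (the symbols $S$, $T_{worst}$, and $T_{K}$ are introduced here purely for illustration):

```latex
\begin{align*}
S &= \left\lceil \frac{\lVert X_{min} - X_{max} \rVert}{\mu} \right\rceil
  && \text{(most steps needed to move out from one starting point)} \\
T_{worst} &= O(N \cdot S)
  && \text{(every vector ends up as its own starting point)} \\
T_{K} &= O(K \cdot S)
  && \text{($K$ clearly defined objects, one starting point each)}
\end{align*}
```

Under this reading, $N$ drops out of the second bound only because counting the points within a given distance is treated as a single operation, regardless of how many vectors are being compared.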

This assumes that all vectorized operations are truly parallel, which is probably not the case for extremely large datasets run on a home computer. However, while I don’t know the particulars of the implementation, it is clear, based upon actual performance, that languages such as MATLAB and Octave successfully implement vectorized operations in a parallel manner, even on a home computer.

Octave Algorithms:

8_9CMNDLINE

EMC_anchor_clustering

generate_sphere

Full A.I. Library:

ResearchGate