Using Sorting in Clustering

I use sorting all the time in my algorithms to simulate running nearest neighbor (see Section 2.4 of Vectorized Deep Learning), and it just dawned on me that I actually proved formally that a list is sorted if and only if its adjacent entries have the minimum possible distance (see Theorem 2.1 of Sorting, Information, and Recursion). This implies that the resultant sorted list provides you with the nearest neighbors of each element in the list. This in turn allows for a trivial adaptation of my core algorithms: rather than take the norm of the distance between a given vector and all others, you simply take the norm of the difference between a given vector and the vectors in the order in which they're sorted in the list. The advantage is that if you're not running the algorithms truly in parallel (which is the case on consumer devices when you have too many rows), then you're performing only one operation per comparison. Attached is an example using my supervised clustering algorithm, which increases the radius of a sphere until it hits its first error, which in this case means simply increasing the index into the sorted list until you encounter a classifier that is unequal to the classifier in question (i.e., the classifier at the origin of the sphere). This produces really fast runtimes, about 10 seconds given 100,000 rows with 15 columns. This is pretty serious stuff, and will be included in the Massive Version of Black Tree AutoML, for just $999. A mutually exclusive version (i.e., non-intersecting clusters) would typically produce even faster runtimes, since the size of the effective dataset can shrink with each iteration.
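To make the idea concrete, here is a minimal sketch of the "first error" clustering step in Python. It assumes the rows are sorted by their Euclidean norms and that the scan only moves upward through the sorted list; the sort key, the function name, and those details are illustrative assumptions, not the attached code itself.

```python
import numpy as np

def first_error_clusters(X, y):
    """Sketch: sort rows (here, by Euclidean norm), then for each row expand
    along the sorted order, taking the norm of the difference with each
    successive row, until the first row with a different class label is hit
    (the "first error"). Returns cluster member indices and the radius
    reached for each row. Assumes integer class labels in y."""
    order = np.argsort(np.linalg.norm(X, axis=1))  # assumed sort key
    Xs, ys = X[order], y[order]
    n = len(ys)
    clusters, radii = [], []
    for i in range(n):
        members = [order[i]]
        radius = 0.0
        j = i + 1
        # increase the index until the classifier no longer matches
        while j < n and ys[j] == ys[i]:
            radius = np.linalg.norm(Xs[j] - Xs[i])
            members.append(order[j])
            j += 1
        clusters.append(members)
        radii.append(radius)
    return clusters, radii
```

Note that each comparison is a single difference-and-norm against an adjacent entry in the sorted order, rather than a pass over the entire dataset, which is where the speedup comes from when the work isn't truly parallel.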

For a testing dataset, you could simply combine the training and testing datasets, store the indexes of the testing rows, and then go out some radius from each testing row by checking the classifiers of the rows to the left and right of that testing row in the sorted order. Applying the attached approach (i.e., first error), you would proceed until you encountered more than one class. You could instead proceed by no more than some fixed distance, or some fixed number of entries. You could report the modal class, or simply report the entire cluster of classes as a prediction. This will be extremely fast, since you're operating only on the testing rows and the adjacent training rows, rather than the entire training dataset (save for the sorting step). I've attached code that implements this method, which seems to work really well, though more testing is required. I've also included a basic confidence metric that seems to work, in that accuracy increases as a function of confidence. This code is applied to the MNIST Fashion Dataset and makes use of image preprocessing algorithms you can find in my A.I. Library on ResearchGate, but you can also simply ignore the preprocessing, as everything past the heading "Runs Prediction" is generalized and requires only a dataset.
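Here is a minimal sketch of this prediction scheme in Python, again assuming rows are sorted by their Euclidean norms, and using the fraction of scanned neighbors that agree with the modal class as the confidence score. The sort key, the confidence definition, and the function name are illustrative assumptions rather than the attached code.

```python
import numpy as np
from collections import Counter

def predict_by_sorted_neighbors(X_train, y_train, X_test):
    """Sketch: combine train and test rows, sort by norm, and for each test
    row scan left and right through the sorted order until more than one
    training class has been seen (the "first error" stopping rule).
    Predict the modal class; report the fraction of scanned training
    neighbors agreeing with it as a crude confidence score."""
    n_train, n_test = len(X_train), len(X_test)
    X_all = np.vstack([X_train, X_test])
    is_test = np.arange(len(X_all)) >= n_train
    order = np.argsort(np.linalg.norm(X_all, axis=1))  # assumed sort key
    Xs, test_mask = X_all[order], is_test[order]
    labels = np.full(len(X_all), -1)        # assumes integer class labels
    labels[:n_train] = y_train
    ys = labels[order]

    preds = np.empty(n_test, dtype=int)
    conf = np.empty(n_test)
    test_positions = np.where(test_mask)[0]
    original_test_index = order[test_positions] - n_train
    for pos, orig in zip(test_positions, original_test_index):
        seen, classes = [], set()
        left, right = pos - 1, pos + 1
        # expand outward until a second class appears or the list is exhausted
        while (left >= 0 or right < len(ys)) and len(classes) <= 1:
            for k in (left, right):
                if 0 <= k < len(ys) and not test_mask[k]:
                    seen.append(ys[k])
                    classes.add(ys[k])
            left -= 1
            right += 1
        pred, count = Counter(seen).most_common(1)[0]
        preds[orig] = pred
        conf[orig] = count / len(seen)  # fraction agreeing with the modal class
    return preds, conf
```

Because each test row only touches the handful of training rows adjacent to it in the sorted order, the per-row cost is small; the dominant cost is the single sort over the combined dataset.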

Here is a plot of accuracy as a function of confidence over the MNIST Fashion Dataset:
