# Nearest-Neighbor on Massive Datasets

In a previous article, I introduced an algorithm that can cluster a few hundred thousand N-dimensional vectors in about a minute or two, depending upon the dataset, by first compressing the data down to a single dimension. The impetus for that algorithm was thermodynamics, specifically, clustering data expanding about a point, e.g., a gas expanding in a volume. That algorithm doesn’t work for all datasets, but it is useful in thermodynamics, and probably object tracking as well, since it lets you easily identify the perimeter of a set of points.

Below is a full-blown clustering algorithm that can nonetheless handle enormous datasets efficiently. Specifically, attached is a simple classification example consisting of two classes of 10,000, 10-dimensional vectors each, for a total of 20,000 vectors.

The classification task takes about 14 seconds, running on an iMac, with 100% accuracy.

I’ve also tested it on two UCI datasets (the “Wine” and “Iris” datasets), and the accuracy was comparable, around 100%, with equally impressive efficiency.

In addition to clustering the data, a compressed representation of the dataset is generated by the classification algorithm, that in turn allows for the utilization of the nearest-neighbor method, which is an incredibly efficient method for prediction, that is in many real world cases, mathematically impossible to beat, in terms of accuracy.

Said otherwise, even though nearest-neighbor is extremely efficient, with a dataset of this size, it could easily start to get slow, since you are still comparing an input vector to the entire dataset. As a result, this method of clustering allows you to utilize nearest-neighbor on enormous datasets, simply because the classification process generates a compressed representation of the entire dataset.

In the specific case attached below, the dataset consists of 20,000 vectors, and the compressed dataset fed to the nearest-neighbor algorithm consists of just 4 vectors.

Predictions occurred at a rate of about 8,000 predictions per second, with absolutely no errors at all, over all 20,000 vectors.

Octave Code:

8-1CMNDLINE

vectorized_EMC_clustering_N

Note that you’ll need to download my full archive to run this code, which you can find on ResearchGate.