# Updated Mass-Scale Classification Algorithms

I’ve updated my mass-scale classification algorithms to be easier to read, and more consistent with my prior algorithms. The runtime seems to be about the same as the other algorithms I introduced in my other recent articles, and the accuracy seems about the same as well –

The difference is that it’s much easier to read this code, and that the dataset doesn’t need to be sorted beforehand.

Two statistical spheres, colored by clusters.

Attached is a simple command line script that calls the new algorithms, using a dataset of 104,866 Euclidean 3-vectors, that together comprise two statistical spheres. The classification task is to correctly cluster the points within the spheres, which in this case took 4.0479 minutes, with an accuracy of 100%, though as you can see, there is significant local clustering, in that the spheres are plainly subdivided into smaller clusters. There are in this case 254 clusters. Note that points in the same cluster are colored the same in the image above.

8_8CMNDLINE

genereate_EMC_cluster

optimize_EMC_cluster

# Identifying Macroscopic Objects in Massive Datasets

Introduction

Following up on yesterday’s article, in which I introduced an efficient method for clustering points in massive datasets, below is an algorithm that can actually identify macroscopic objects in massive datasets, with perfect precision. That is, it can cluster points into objects such that the clusters correspond perfectly to the actual objects in the scene, with no classification errors, and exactly the correct number of objects.

The goal of this work is to build up to an object identification and tracking algorithm, that can operate with an enormous number of observations, as an analog of my earlier object tracking algorithm, but extended to massive datasets, making use of my latest work.

The first step is to cluster the data using the algorithm I introduced yesterday, which will produce not only an initial set of clusters, but also a value called “delta”, that is the in-context minimum difference necessary to justify distinguishing between to two vectors. There’s a lot of thinking behind this, and it is part of the core of my work in information theory, but the big picture idea is, if we have two vectors $x$ and $y$, that are part of some dataset $A$, my algorithms will generate a value associated with $A$, $\delta_A$, such that if $||x - y|| > \delta$, then $x$ and $y$ are distinct vectors, in the context of the dataset $A$.

If you’re interested, you can read a full explanation of the theory and mathematics of this process in my original article on A.I., “A New Model of Artificial Intelligence“.

Another property of the value delta, is that it quantizes distance in a dataset. That is, we can express the distance between two vectors as an integer multiple of delta, using the ceiling or floor operator. Now you could of course say, “so what”, since you can do that with any number, but the difference is, delta tells you how far apart things need to be in order to say they’re different in a particular context, which makes it a natural unit of measure for the dataset that generated the value of delta.

The algorithm below makes use of this insight, by doing the following:

First, we generate a value of delta by simply clustering the dataset using the method I introduced yesterday, which will produce additional local clustering. That is, the clusters are accurate, in that points clustered together are actually part of the same object, but the object is overly subdivided. Since the goal is to do object tracking, this is not ideal, and we’d like instead to have clean, macroscopic objects clustered together, that we can then track over some number of frames.

In order to accomplish this, as a second step, we begin with the first vector in the first cluster, which is the minimal element of that cluster. Then, we count the number of vectors that are within $1 \times \delta$ of this first, minimal element. After that, we count the number of vectors that are within $2 \times \delta$, $3 \times \delta$, and so on, up to a number of iterations that is calculated in the code, and likely to be modest in number.

If the dataset is comprised of actual objects, then as you get further away from the minimal element of the object, the number of new points should drop off suddenly at the boundary of the object –

Just imagine beginning in the center of a sphere, and proceeding outwards towards the perimeter, counting the number of points along the way. As you do this, the number of new points will increase at a reasonably steady rate, until you reach the perimeter, where the number of new points will drop off to basically zero.

This algorithm tracks the rate of change in new points as you increment the integer multiple of delta, and then finds the point at which the rate of increase drops off, and that’s the boundary of the object.

This is then repeated until the dataset is exhausted.

This could of course work using some other measure of distance, rather than delta, but the utility of this approach is that delta can be calculated quickly for any dataset, so unless you have a different number that you can quickly produce, that works for this purpose, you can’t use this algorithm, despite the fact that it will obviously work.

Application to Data

In this case, the dataset consists of 39,314 Euclidean 3-vectors, containing 5 statistical spheres.

A dataset of five statistical spheres after being clustered.

The initial clustering took 37.2266 seconds, running on an iMac, with no errors, in that if two points are clustered together, then they are part of the same sphere.

However, as noted above, this process produces additional local clustering that we’d like to eliminate, since the goal is to identify and track macroscopic objects. You can visualize this using the image above, in which points colored the same are part of the same cluster, plainly showing multiple colors within the same sphere. The number of clusters produced in this first process is 1,660, suggesting significant local clustering, since there are only 5 spheres.

The dataset of five spheres after being subject to macro-clustering.

The next step is to use the value of delta generated in the first step, to try and consolidate these clusters into macroscopic objects.

This next round of clustering took 0.0396061 seconds, running on an iMac, again with perfect accuracy. Moreover, the number of clusters produced is 5, exactly the number of spheres in the dataset, implying that the performance is perfect.

Octave Code:

generate_sphere

vectorized_EMC_clustering_N

EMC_anchor_clustering

Macro_Clustering_CMNDLINE

Full Library:

ResearchGate

# Clustering Objects in Massive Datasets

In a previous article, I introduced an algorithm that can quickly find the perimeter of a statistical object that consists of several hundred thousand observations. The goal of this work is to build up to an object identification and tracking algorithm that can operate with an enormous number of observations, as an analog of my earlier object tracking algorithm, but extended to massive datasets, making use of my latest algorithms.

Eventually, I plan to make use of natural language object identification and tracking. For example, the algorithm I’m working on will scan the first scene in a sequence and say, “this is a sphere of radius r”, and then track the scenes going forward accordingly, using the kind of analysis a human being would. That is, rather than cluster points, once all of the objects have been described using human labels in the first scene, those objects will then be searched for in each following scene, using those same human labels, by, for example, looking for a sphere with radius r in each scene, and if there’s more than one such sphere, taking the one that is closest to the original as the next instance of that sphere in the next scene.

The actual identification process will be done using the extremal points of the shape, which can be identified quickly using the algorithm I introduced in my previous article, and then building up a dataset of objects, and the properties of their extremal points. This avoids having to rotate objects for purposes of comparison, and instead, tests for statistical properties that define the shape. For example, the extremal set of a sphere consists of a set of points that are all roughly equidistant from the origin – i.e., the surface of the sphere. You don’t need to rotate the extremal set to figure this out, and as a result, this process should increase efficiency.

I’m probably about a week away from fully developing this algorithm, but as a next step in that process, below is code that clusters objects in a scene, given an enormous number of points in Euclidean 3-space.

Specifically, in this case, the dataset consists of 78,215 Euclidean 3-vectors, that together produce 15 statistical spheres. The classification task is to then cluster the data, so that points within each sphere are clustered together, without any overlap between spheres. This took about 5.4500 minutes, running on an iMac, and you can visualize the resultant clustering below, where points in the same cluster have the same color.

The data colored by cluster.

There is some sub-clustering within each sphere, which is a consequence of the algorithm, which generates a large number of small clusters, because the data is sorted beforehand for efficiency. However, the actual number of classification errors is exactly zero, in that if two points are clustered together, then they are part of the same sphere.

The next step is to produce follow up code that then clusters the anchors for each cluster, which should produce clusters based upon actual macroscopic objects, rather than the underlying point data, which as you can see, produces some additional local clustering.

Octave Scripts:

vectorized_EMC_clustering_N

generate_sphere

8_4CMNDLINE

Full Library:

ResearchGate

# Finding the Perimeter of an Object

In a previous article, I introduced a clustering algorithm capable of quickly handling hundreds of thousands of vectors, that first compresses the vectors to a single dimension using the norm operator.

Though originally intended to cluster velocity scalars, I’ve also used this algorithm to find the perimeter of a statistical sphere, that isn’t smooth or completely solid, which is what you’ll likely end up with using real-world sensor data. In this case, the sphere consists of 327,627 Euclidean 3-vectors. Clustering the data took 4.6819 minutes, running on an iMac, which is all that’s necessary to identify the perimeter of the sphere, since you can then simply select the last cluster in the list of clusters, and color those points in the display. In the image below, the maximum cluster is colored blue, the rest colored gray.

A statistical sphere, with the perimeter colored blue.

Octave Code:

8_2_CMNDLINE

Full Library:

ResearchGate

# Nearest-Neighbor on Massive Datasets

In a previous article, I introduced an algorithm that can cluster a few hundred thousand N-dimensional vectors in about a minute or two, depending upon the dataset, by first compressing the data down to a single dimension. The impetus for that algorithm was thermodynamics, specifically, clustering data expanding about a point, e.g., a gas expanding in a volume. That algorithm doesn’t work for all datasets, but it is useful in thermodynamics, and probably object tracking as well, since it lets you easily identify the perimeter of a set of points.

Below is a full-blown clustering algorithm that can nonetheless handle enormous datasets efficiently. Specifically, attached is a simple classification example consisting of two classes of 10,000, 10-dimensional vectors each, for a total of 20,000 vectors.

The classification task takes about 14 seconds, running on an iMac, with 100% accuracy.

I’ve also tested it on two UCI datasets (the “Wine” and “Iris” datasets), and the accuracy was comparable, around 100%, with equally impressive efficiency.

In addition to clustering the data, a compressed representation of the dataset is generated by the classification algorithm, that in turn allows for the utilization of the nearest-neighbor method, which is an incredibly efficient method for prediction, that is in many real world cases, mathematically impossible to beat, in terms of accuracy.

Said otherwise, even though nearest-neighbor is extremely efficient, with a dataset of this size, it could easily start to get slow, since you are still comparing an input vector to the entire dataset. As a result, this method of clustering allows you to utilize nearest-neighbor on enormous datasets, simply because the classification process generates a compressed representation of the entire dataset.

In the specific case attached below, the dataset consists of 20,000 vectors, and the compressed dataset fed to the nearest-neighbor algorithm consists of just 4 vectors.

Predictions occurred at a rate of about 8,000 predictions per second, with absolutely no errors at all, over all 20,000 vectors.

Octave Code:

8-1CMNDLINE

vectorized_EMC_clustering_N

Note that you’ll need to download my full archive to run this code, which you can find on ResearchGate.