# Clustering Objects in Massive Datasets

In a previous article, I introduced an algorithm that can quickly find the perimeter of a statistical object that consists of several hundred thousand observations. The goal of this work is to build up to an object identification and tracking algorithm that can operate with an enormous number of observations, as an analog of my earlier object tracking algorithm, but extended to massive datasets, making use of my latest algorithms.

Eventually, I plan to make use of natural language object identification and tracking. For example, the algorithm I’m working on will scan the first scene in a sequence and say, “this is a sphere of radius r”, and then track the scenes going forward accordingly, using the kind of analysis a human being would. That is, rather than cluster points, once all of the objects have been described using human labels in the first scene, those objects will then be searched for in each following scene, using those same human labels, by, for example, looking for a sphere with radius r in each scene, and if there’s more than one such sphere, taking the one that is closest to the original as the next instance of that sphere in the next scene.

The actual identification process will be done using the extremal points of the shape, which can be identified quickly using the algorithm I introduced in my previous article, and then building up a dataset of objects, and the properties of their extremal points. This avoids having to rotate objects for purposes of comparison, and instead, tests for statistical properties that define the shape. For example, the extremal set of a sphere consists of a set of points that are all roughly equidistant from the origin – i.e., the surface of the sphere. You don’t need to rotate the extremal set to figure this out, and as a result, this process should increase efficiency.

I’m probably about a week away from fully developing this algorithm, but as a next step in that process, below is code that clusters objects in a scene, given an enormous number of points in Euclidean 3-space.

Specifically, in this case, the dataset consists of 78,215 Euclidean 3-vectors, that together produce 15 statistical spheres. The classification task is to then cluster the data, so that points within each sphere are clustered together, without any overlap between spheres. This took about 5.4500 minutes, running on an iMac, and you can visualize the resultant clustering below, where points in the same cluster have the same color. The data colored by cluster.

There is some sub-clustering within each sphere, which is a consequence of the algorithm, which generates a large number of small clusters, because the data is sorted beforehand for efficiency. However, the actual number of classification errors is exactly zero, in that if two points are clustered together, then they are part of the same sphere.

The next step is to produce follow up code that then clusters the anchors for each cluster, which should produce clusters based upon actual macroscopic objects, rather than the underlying point data, which as you can see, produces some additional local clustering.

Octave Scripts:

vectorized_EMC_clustering_N

generate_sphere

8_4CMNDLINE

Full Library:

ResearchGate