I introduced an abstraction algorithm a while back that can take a dataset, and generate a representative abstraction of the dataset using a nearest neighbor technique. We can take this process a step further, to generate a set of representations for a given dataset. Specifically, if any of the elements are not within delta of the abstraction, then we gather those elements up into a new dataset. We then generate a new abstraction for that leftover set. We repeat this process until every element of the original dataset is within delta of an abstraction.

The number of abstractions needed to do this can serve as a measure of the complexity of the dataset, since the more abstractions you need, the more unique elements there are.

It’s similar to the idea of a basis in linear algebra –

A minimum set of representative elements that allow me to express the fundamental elements of a dataset.

We could also calculate the value of delta by iterating through different values, repeating this process, and selecting the value that causes the entropy of the categorization structure to change the most, treating each element associated with a given abstraction as a category.

Below is some updated code, implementing the algorithm described above:

You can place this code in a loop through values of delta to use this approach to generate an independent value of delta, though this will be more time consuming than the approach I’ve used, which is to use my categorization algorithm to generate delta, and then apply the abstraction algorithm.

I’m still tweaking this, as it’s still a bit inefficient – the better way to do it is to store each abstraction, count the number of matches generated by that abstraction.

Then, sort the abstractions in order of the number of matches.

Then, start with the abstraction that generated the greatest number of matches, and take all of the elements in that category as fixed in that category.

Then, check the category with the second greatest number of matches, and take the intersection of the top category. Any elements not in that intersection go in the second category.

Continue this until all elements are in some category.

The end result is the same, but this is more efficient, since you don’t repeatedly call the categorization algorithm, and instead, you run an intersection test until all elements have been assigned to a category. The intersection operation can be implemented efficiently in this case by using the integer indices for the elements in the dataset, rather than the actual vectors. That is, rather than test for the intersection over a set of vectors, test for the intersection over a set of integer indices that correspond to where the vectors are in the dataset.