In a previous post entitled “Using Information Theory to Create Categories with No Prior Information”, I presented an algorithm that can quickly construct categories given a linear dataset with no other prior information. In this post, I’ll present a generalization of that algorithm that can be applied to a dataset of n-dimensional vectors, again with no prior information. Like all of the other work I’ve presented as part of this project, this algorithm is rooted in information theory and computer theory. For an explanation as to why this algorithm works, you should see the post below, as this is really just a generalization of the previous algorithm applied to higher-dimensional data, with no meaningfully new theory.

In this post, I’m going to focus on the results of this algorithm, which are quite good. I’ll also discuss the runtime complexity of the algorithm, which I believe to be worst-case $O(\log(n)\, n^{D+1})$, where $n$ is the number of items in the dataset and $D$ is the dimension of the dataset, making this algorithm exceptionally fast for what it is, which is a data categorization algorithm.

As a general matter, the only thing that this algorithm requires is a dataset that has a measure function that maps to a real number. In simpler terms, all we need in order to make use of this algorithm is a function that can compare the “distance” between any two data points in the dataset. This means that we can apply this algorithm to any n-dimensional dataset simply by making use of the norm of the difference between any two vectors in the dataset, and that we can also apply this algorithm to non-Euclidean spaces, so long as we have a well-defined measure function on the space.
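As a minimal sketch of the kind of measure function described above, the following compares two points in 3-space using the norm of their difference, and then swaps in a different measure. The names `euclid` and `taxicab` and the sample points are illustrative only, and are not part of the attached scripts.

```octave
% Two hypothetical data points in 3-space (values chosen for illustration).
a = [1 2 3];
b = [4 6 3];

% Euclidean distance: the norm of the difference between the two vectors.
euclid = @(u, v) norm(u - v);

% Any other well-defined measure could be swapped in, e.g. the taxicab (L1) distance.
taxicab = @(u, v) norm(u - v, 1);

d_euclid = euclid(a, b);     % 5
d_taxicab = taxicab(a, b);   % 7
```

Because the algorithm only ever needs the distance between two data points, changing the measure should only require changing this one function.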

In a follow up post, I’ll show how we can actually use this algorithm to produce a well-defined measure in classification problems by iterating through different measures until we produce the “right” measure that results in the maximum change in the structure of our classification.

**Runtime Complexity**

The first thing that this algorithm does is test the symmetry of the dataset using a function called “partition_array”, which is attached together with all of the other scripts you need to run this algorithm. Partition_array repeatedly subdivides the space in which the dataset exists into smaller and smaller “chunks”, until each chunk contains a maximally different amount of information. In this case, just imagine breaking a cube of 3-space into smaller and smaller equally sized subcubes. As we do this, the number of data points within each subcube will vary as a function of the number of subdivisions, with the number of data points per subcube generally decreasing as a function of the number of subdivisions, simply because each subcube will get smaller as we increase the number of subdivisions.
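The subdivision step can be sketched roughly as follows. This is not the attached partition_array script, just an illustration of splitting the bounding box of a small, made-up dataset into N equally sized pieces per dimension and counting the points that land in each subcube. All variable names are hypothetical.

```octave
% Hypothetical dataset: each row is a point in 3-space.
data = [0.1 0.1 0.1;
        0.2 0.1 0.3;
        0.9 0.8 0.7;
        0.6 0.9 0.8;
        0.4 0.6 0.2];

N = 2;                       % subdivisions per dimension
D = size(data, 2);           % dimension of the space (3 here)

% Rescale each dimension to [0, 1] so the bounding box becomes the unit cube.
lo = min(data);
span = max(data) - lo;
span(span == 0) = 1;         % guard against a dimension with no variation
scaled = (data - lo) ./ span;

% Index of the subcube along each dimension, from 1 to N.
idx = min(floor(scaled * N) + 1, N);

% Collapse the D per-dimension indices into one subcube number in 1..N^D.
cube = 1 + (idx - 1) * (N .^ (0:D-1))';

% Number of data points in each of the N^D subcubes.
counts = accumarray(cube, ones(size(data, 1), 1), [N^D, 1]);
```

Rerunning this with larger values of N shows the effect described above: the counts per subcube generally shrink as the subcubes get smaller.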

For reasons explained below, this algorithm stops once it finds the partition size that maximizes the standard deviation of the information contained within each subcube. This will produce a number N, which is how many times we’ve subdivided each dimension in order to achieve this maximization. It turns out that N is a measure of how symmetrically the data is distributed within its own space. If N is large, then the data is symmetrically distributed. If N is small, then the data is asymmetrically distributed.

The partition_array algorithm begins by making an initial guess as to the value of N, which it sets to the log of the number of items in the dataset. Then, depending upon the characteristics of the dataset, it will either increase its guess by multiplying it by 2, or decrease its guess by dividing it by 2. It will continue doing this until it finds the value of N that maximizes the standard deviation of the information contained in each subcube, so the value of N increases, or decreases, exponentially. However, the algorithm prevents N from increasing past the number of items in the dataset, and also prevents it from decreasing below 1. As a result, if n is the number of items in the dataset, then the number of values of N tested by partition_array is at most on the order of log(n).
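To see why the number of guesses stays logarithmic, here is a rough sketch of a doubling and halving search bounded to the range from 1 to n. It makes several assumptions that are not taken from the attached scripts: the `score` function is a made-up stand-in for the standard deviation of the information at a given N, the log is taken to be the natural log, and the direction is chosen by testing whether doubling improves the score.

```octave
n = 500;                             % number of items in the dataset

% Hypothetical stand-in for "standard deviation of the information at N".
% It peaks at N = 50 purely for illustration.
score = @(N) -abs(log2(N) - log2(50));

N = max(1, round(log(n)));           % initial guess: log of the number of items
steps = 0;

% Assumed rule: double if that improves the score, otherwise halve.
if score(min(2 * N, n)) > score(N)
  factor = 2;
else
  factor = 0.5;
end

% Keep moving while the score improves and N stays within [1, n].
while true
  candidate = min(max(N * factor, 1), n);
  if candidate == N || score(candidate) <= score(N)
    break;
  end
  N = candidate;
  steps = steps + 1;
end
```

With these made-up inputs the search moves from 6 to 12, 24, and 48, and stops when 96 scores worse. Since each step doubles or halves N within the range from 1 to n, the number of steps cannot exceed roughly log2(n).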

For each value of N generated by partition_array, another function called “test_entropy” is called. This is the function that actually tests the information content of each subcube. As part of this function, we will have to test each item in the dataset to determine which subcube it belongs to. Because we divide each dimension N times, which generates $N^D$ subcubes, and N is necessarily less than or equal to n, it follows that the number of subcubes cannot exceed $n^D$. As a result, we’ll have to do at most $n \cdot n^D = n^{D+1}$ comparisons each time we call test_entropy.
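As a hedged illustration of the quantity being tested, the sketch below starts from the number of points in each subcube and computes the standard deviation of the information across subcubes. The definition of information used here, namely -log2 of the fraction of points in a subcube, is an assumption made for the sake of the example, and may differ from what the attached test_entropy script actually does.

```octave
% Hypothetical counts of data points in each of N^D = 8 subcubes.
counts = [2; 0; 1; 0; 0; 0; 0; 2];
n = sum(counts);

% Assumption: treat the fraction of points in a subcube as a probability,
% and measure its information as -log2(p). Empty subcubes are skipped.
p = counts(counts > 0) / n;
info = -log2(p);

% The quantity compared across different values of N.
info_std = std(info);
```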

As a general matter, this implies that the runtime complexity of this phase of the algorithm is necessarily less than

$\log(n)\, n^{D+1}$,

where $D$ is the dimension of the space. In this case, $D = 3$, and therefore, the worst-case runtime complexity is $O(\log(n)\, n^{4})$.

Though this is only the preprocessing phase of the algorithm, it is actually the part of the algorithm with the highest worst-case runtime complexity. As a result, this algorithm has an overall worst-case runtime complexity of $O(\log(n)\, n^{D+1})$, making it extremely fast for what it does.

In a follow up post, I’ll present a vectorized categorization algorithm that omits this step, allowing for extremely fast categorization of high-dimensional data.

Anecdotally, I’ve noticed that the algorithm makes short work of data that actually has structure, and struggles with data that is truly randomized. This is not surprising, since truly randomized data should be highly symmetrical, forcing the preprocessing stage into the worst-case runtime (i.e., producing a very large value of N).

**Application to Data**

I’ve attached a set of scripts that make it easy to test the algorithm, which should be called from the Octave command line as follows:

**pkg load image**

**pkg load communications**

**data_array = generate_n_random_categories_3D(base, adjustment, min_spread, max_spread, num_items, num_categories);**

**data_categories_array = optimize_categories_3D(data_array);**

**[X Y Z S C] = display_categorization_3D(data_categories_array,num_items);**

**figure, scatter3(X,Y,Z,S,C);**

The generate_n_random_categories_3D script does exactly what its name suggests, which is to generate categories of random vectors in 3-space. The script creates what we know to be categories of data by randomly generating a set of seed values, around which it will generate some random data points.

So, for example, if num_categories = 2, and num_items = 1000, then it will create 2 categories of data that each contain 500 data points. This is accomplished by generating 2 seed locations, about which each category is centered. These seed locations can be anywhere from [0 0 0] to [base base base]. As a result, the greater the value of base is, the greater the space in which the data exists. Therefore, a small number of categories and a large value of base increases the probability that our categories will be very far apart.

The min_spread and max_spread control how diffuse the data is around each category’s central point. The first category generated has the minimum diffusion, and the last category generated has the maximum diffusion, with diffusion increasing linearly with each category generated. A high value of adjustment in essence forces each category’s central point to be generated along the same line, and a zero value of adjustment allows the central points to be generated randomly throughout the entire space from [0 0 0] to [base base base].
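The behavior described above can be approximated with a short sketch. This is not the attached generate_n_random_categories_3D script: it ignores the adjustment parameter entirely (so it corresponds to the zero-adjustment case), and it assumes points are scattered uniformly around each seed, which may not match the actual generator.

```octave
% Illustrative parameters, mirroring the names used in this post.
base = 10; min_spread = 1; max_spread = 16;
num_items = 500; num_categories = 50;

per_category = num_items / num_categories;   % assumes this divides evenly

% Spread grows linearly from min_spread (first category) to max_spread (last).
spreads = linspace(min_spread, max_spread, num_categories);

data = zeros(num_items, 3);
labels = zeros(num_items, 1);
for k = 1:num_categories
  seed = base * rand(1, 3);                  % center anywhere in [0, base]^3
  rows_k = (k - 1) * per_category + (1:per_category);
  % Assumption: points scatter uniformly within +/- spread of the seed.
  data(rows_k, :) = seed + spreads(k) * (2 * rand(per_category, 3) - 1);
  labels(rows_k) = k;
end
```

Here `labels` records which category each point was generated from, which is useful for comparing against the categories the algorithm finds.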

The “optimize_categories_3D” algorithm takes the data, and sorts it into categories, using the process I described above, which are then formatted for display by the “display_categorization_3D” function.

Data points are represented by the little ‘o’ rings in the attached images. When two data points are part of the same category, the display algorithm paints them with the same color. However, because category colors are randomly generated, it’s possible for two different categories of data to be assigned approximately the same color, and, though extremely unlikely, it’s even possible they’re assigned the exact same color. As a result, local colors are what to look for if you want to determine what data points are part of the same category.

The size of the ring representing a data point is determined by the size of the category to which the data point belongs. If the data point is part of a big category, then the ring will be big. If the data point is part of a small category, then the ring will be small.
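A minimal sketch of how the size and color inputs to scatter3 could be built from category labels is shown below. It is only an illustration of the idea, not the attached display_categorization_3D function, and the labels and the scale factor of 20 are made up.

```octave
% Hypothetical category label for each of six data points.
labels = [1; 1; 1; 2; 2; 3];
num_categories = max(labels);

% Number of points in each category.
sizes = accumarray(labels, ones(numel(labels), 1));

% Marker size grows with category size (the scale factor 20 is arbitrary).
S = 20 * sizes(labels);

% One random RGB color per category, shared by all of its points.
palette = rand(num_categories, 3);
C = palette(labels, :);

% With X, Y, Z holding the point coordinates, the plot would be:
% scatter3(X, Y, Z, S, C);
```

Because each color is drawn at random, nothing in this approach prevents two categories from receiving nearly identical colors, which is the caveat noted above.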

I’ve attached graphs from two series of datasets, one for which the value of adjustment is high, forcing categories to be generated along a roughly straight line, and another for which the value of adjustment is zero, allowing categories of data to be spontaneously generated anywhere within the space bounded by [0 0 0] to [base base base].

The first series consists of 11 datasets, with all of the inputs to the random data generator remaining constant for each dataset, except the value of max_spread, which increases from the first dataset to the last dataset. Each dataset in the first series consists of 50 categories, with 10 points in each category, for a total of 500 data points per dataset. The first dataset in this series looks like a series of tightly packed rings, since for the first dataset, the min_spread equals the max_spread, meaning that each of the categories in this dataset should be roughly identically distributed, and just seeded at different locations in the space. As we move through this series of datasets, max_spread increases, meaning that the first category in each dataset will be more tightly packed than the last category in that same dataset. This is why the last dataset in this series looks a lot like a confetti cannon, since the first category in that dataset is tightly packed, and the last category is extremely diffuse, representing the maximum possible spread for that entire dataset.

Looking at the results, you can clearly see that the algorithm generates intuitively correct categorizations, grouping tightly packed clusters into a small number of relatively large categories, and diffuse clusters into a large number of relatively small categories. The last two charts show how many categories the algorithm actually generates for each series. Specifically, as max_spread increases, the number of categories generated for each of the 11 datasets in this series is as follows:

46 42 46 66 73 78 88 86 106 85 126

Note that 50 categories is in some sense the “correct” answer, since that is the number of clusters generated by the underlying random data generator. However, as max_spread increases, each cluster loses its structure, and neighboring clusters begin to overlap. As a result, we should expect the number of categories identified by the algorithm to increase as a function of diffusion, which is exactly what happens.

The next series of datasets is generated using exactly the same variables, except the value of adjustment is set to 0, and the difference between the min and max spread is smaller to account for the smaller space the data occupies (this is a consequence of adjustment being set to 0). As a result, in this case, the seed values for the clusters propagate randomly throughout the entire space. This causes the datasets to start out in little tightly packed clusters that are randomly distributed, and eventually diffuse into generally unstructured data with a few small local clusters that the algorithm does a great job of categorizing together, when appropriate.

As max_spread increases, the number of categories generated for each of the 11 datasets is as follows:

46 180 265 321 261 297 331 329 365 375 246

Again, 50 is arguably the “correct” answer. However, in this case, the algorithm is clearly far more sensitive to diffusion in the data, which isn’t surprising, since in this case, we haven’t constrained the diffusion at all, but instead, have allowed the data to propagate randomly throughout the entire space.

To generate the “confetti cannon” data, use the following:

**base = 10;**

**adjustment = 50;**

**min_spread = 1;**

**num_items = 500;**

**num_categories = 50;**

**for i = 0 : 10**

**max_spread = 1 + 15*i;**

**data_array = generate_n_random_categories_3D(base, adjustment, min_spread, max_spread, num_items, num_categories);**

**data_categories_array = optimize_categories_3D(data_array);**

**[X Y Z S C] = display_categorization_3D(data_categories_array,num_items);**

**figure, scatter3(X,Y,Z,S,C)**

**endfor**

To generate the randomly clustered data, use the following:

**base = 25;**

**adjustment = 0;**

**min_spread = .1;**

**num_items = 500;**

**num_clusters = 50;**

**for i = 0 : 10**

**clear data_categories_array**

**max_spread = .1 + .45*i;**

**data_array = generate_n_random_categories_3D(base,adjustment, min_spread, max_spread, num_items, num_clusters);**

**tic; data_categories_array = optimize_categories_3D(data_array);toc**

**[X Y Z S C] = display_categorization_3D(data_categories_array, 500);**

**figure, scatter3(X,Y,Z,S,C)**

**endfor**

In a follow up post, I’ll show how we can categorize new data, and predict to which category new data fits best, using the opposite approach: that is, we find the dataset of best-fit for new data by including the new data into a series of datasets, and the dataset that changes the least in terms of the structure of its categories is the dataset to which the new data fits best. We measure the change in structure of the categories by using the entropy of the categorization, just as we did above. In short, new data fits best where it disturbs category structure the least, and we measure change in category structure using the entropy of the categorization.
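The best-fit idea can be sketched as follows. The definition used here is an assumption on my part for the sake of illustration: the entropy of a categorization is taken to be the Shannon entropy of the distribution of category sizes. The category sizes themselves are made up, and stand in for the result of running the categorization before and after adding one new data point to each of two datasets.

```octave
% Shannon entropy (in bits) of a vector of category sizes.
% Assumes no category is empty.
H = @(sizes) -sum((sizes / sum(sizes)) .* log2(sizes / sum(sizes)));

% Hypothetical category sizes for two datasets, before and after
% adding the same new data point to each.
before = {[5 5], [4 4 2]};
after  = {[6 5], [4 4 2 1]};

% Change in entropy caused by the new point in each dataset.
delta = cellfun(@(b, a) abs(H(a) - H(b)), before, after);

% The new point fits best where it disturbs the category structure least.
[~, best] = min(delta);
```

In this example the new point joins an existing category in the first dataset, but forms a new category of its own in the second, so the first dataset shows the smaller change in entropy and is selected.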

The relevant Octave scripts are available here:

generate_n_random_categories_3D