Sort-Based Classification

I’ve introduced three algorithms that will form the basis of Black Tree Massive. One is a more efficient but mathematically identical version of my Supervised Delta Algorithm, and the other two are sort-based classification algorithms with runtimes that completely eclipse even my prior work, which already has unprecedented runtimes, and are themselves almost certainly the fastest algorithms on the market. The Supervised Delta Algorithm is rooted in my paper, Analyzing Dataset Consistency, in which I prove that polynomial-time algorithms can produce literally perfect clustering and classification under certain reasonable assumptions. This typically translates into very high, and at times perfect, accuracy on benchmark datasets. In another paper, Sorting, Information, and Recursion, I proved a surprising connection between sorting and the nearest neighbor algorithm: specifically, that if a list of vectors is sorted, then adjacent entries in the sorted list are nearest neighbors of each other (see Theorem 2.1). As a consequence, if you begin at a given entry in a sorted list and proceed to adjacent entries in both directions, you will encounter the set of points that are included in some sphere in the Euclidean space of the dataset. This must be true because, by the triangle inequality, the straight-line distance between two vectors in a sorted list can never exceed the sum of the distances over the adjacent pairs between them. As a result, you can cluster data by simply collecting vectors from a sorted list, proceeding in both directions from a given index. Using this method, you don’t have to calculate the norm of the difference between a given vector and every other vector in order to find the vectors that are within a given sphere (i.e., within a given radius). This saves significant time during processing, leading to algorithms that can classify approximately 4,500 Testing Rows over 25,500 Training Rows in roughly 3 seconds (running on a MacBook Air).
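As a rough illustration of the collection step (this is a sketch, not the attached code, and it assumes the rows have already been sorted, e.g., with sortrows), the following Octave function walks outward from a given index, accumulating the distances between adjacent rows, and stops once the running total exceeds a radius delta. Because that running total bounds the straight-line distance, every collected row is guaranteed to be within delta of the row at the given index.

% Sketch: collect the rows of a sorted dataset that are guaranteed to lie
% within a sphere of radius delta around row idx, by walking outward in
% both directions and accumulating the distances between adjacent rows.
% Assumes data_sorted has already been sorted (e.g., with sortrows).
function cluster = sorted_sphere(data_sorted, idx, delta)
  n = rows(data_sorted);
  cluster = idx;
  total = 0;
  for i = idx : -1 : 2                       % walk downward in the sorted list
    total = total + norm(data_sorted(i, :) - data_sorted(i - 1, :));
    if total > delta
      break;
    end
    cluster(end + 1) = i - 1;
  end
  total = 0;
  for i = idx : n - 1                        % walk upward in the sorted list
    total = total + norm(data_sorted(i + 1, :) - data_sorted(i, :));
    if total > delta
      break;
    end
    cluster(end + 1) = i + 1;
  end
end

For example, cluster = sorted_sphere(sortrows(data), 100, 2.5); returns indices into the sorted dataset without ever computing the full set of pairwise distances.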

Because the runtimes are so fast, you can run the algorithms hundreds of times and still produce a practical, very fast overall runtime, which in turn allows you to calculate a meaningful measure of confidence that I introduced in yet another paper, Information, Knowledge, and Uncertainty. This process requires running classification a few hundred times in order to generate a distribution of confidence intervals, which is then applied to a final prediction step. Ideally, accuracy should increase as a function of confidence, and this is empirically what happens. This allows you to take the solid raw accuracies produced by the initial sort-based classification algorithms (typically about 80%), and filter predictions, increasing the required confidence until you achieve a maximum (or desired) accuracy (typically maximized between 90% and 100%). The net result is very high accuracy, making use of algorithms that can handle simply enormous datasets, running on a consumer device.
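As a minimal sketch of the filtering step (hypothetical names; predictions, labels, and confidence scores assumed to be numeric vectors of the same length), the following Octave function sweeps the required confidence upward and reports the accuracy and the surviving percentage of predictions at each threshold:

% Sketch: given predictions, true labels, and a confidence score for each
% prediction, sweep the required confidence upward and report the accuracy
% and the surviving fraction of predictions at each threshold.
function [thresholds, accuracy, surviving] = accuracy_vs_confidence(preds, labels, conf)
  thresholds = unique(conf(:))';             % candidate minimum confidence values
  accuracy = zeros(size(thresholds));
  surviving = zeros(size(thresholds));
  for k = 1 : numel(thresholds)
    keep = conf(:) >= thresholds(k);         % predictions that meet the threshold
    accuracy(k) = mean(preds(keep) == labels(keep));
    surviving(k) = mean(keep);               % fraction of predictions that survive
  end
end

You can then read off the maximum accuracy and the threshold at which it occurs, which corresponds to increasing required confidence until accuracy is maximized.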

This is likely a fundamental breakthrough in computer science, and A.I.

Here are links to notes regarding the algorithms in question, each of which includes the relevant Octave code:

Massive Supervised Delta Classification

Massive Supervised Classification

Massive Unsupervised Modal Classification

Updated Unsupervised Sort-Based Classification

Attached is the code for the unsupervised analog of the algorithm I presented yesterday. The accuracy seems solid, though it needs more testing, and the runtimes are in the same ballpark (i.e., about 10 seconds for 4,500 Testing Rows classified over 25,500 Training Rows).

Updated Supervised Classification (Sort-Based)

Attached is the full command line code for the algorithm I mentioned yesterday, which now includes confidence calculations, allowing for higher accuracy. The method is analogous to the ideas discussed in my paper, Information, Knowledge, and Uncertainty [1], in that a confidence metric is assigned to every prediction, and accuracy is then calculated as a function of confidence, though this entire series of algorithms also makes use of sorting as a faster way to implement a pseudo-nearest neighbor algorithm (see Theorem 2.1 [2]). What’s interesting is that a much simpler confidence metric, which is simply the number of elements in the cluster associated with a prediction, works really well, despite not having the same rigorous theoretical basis as the information-based confidence metric I introduce in [1]. This could be because this algorithm is supervised, producing homogeneous clusters (i.e., every cluster consists of a single class), so you could argue the only relevant factor is the size of the cluster. If this is true, the equation I presented in [1] is wrong, despite the fact that it works, in that there would be another equation that ignores the dataset as a whole and looks only to the cluster in question. I can’t say whether or not I tested this possibility in the past, and it’s not in my interest to test it now, because I have software that works, and so the academics will have to wait until Black Tree Massive is complete.

As I noted, confidence is calculated for every prediction, in this case twice: once based upon the information metric I introduce in [1], and again using only the size of the cluster. As you increase the required confidence, in both cases, you eliminate predictions, leaving some surviving percentage, which is also listed below. The Raw Accuracy is the accuracy prior to filtering based upon confidence, but it includes “rejections”, a concept from my original algorithms that is carried over here. The runtime of this particular algorithm is simply astonishing, classifying 4,500 testing rows over 25,500 training rows in about 3 seconds, running on a MacBook Air, which totally demolishes even my prior work, and basically makes a joke of everyone else’s.

Dataset | Raw Accuracy | Max Inf.-Based Accuracy | Surviving Percentage (Inf.) | Max Size-Based Accuracy | Surviving Percentage (Size)
UCI Credit | 74.85% | 81.01% | 0.47% | 83.33% | 0.60%
UCI Ionosphere | 81.32% | 85.12% | 5.50% | 100.0% | 0.18%
UCI Iris | 97.98% | 100.0% | 1.36% | 100.0% | 0.44%
UCI Parkinsons | 80.70% | 85.16% | 24.0% | 83.66% | 7.30%
UCI Spam | 79.00% | 80.00% | 34.0% | 100.0% | 0.13%
UCI Wine | 80.64% | 85.71% | 0.50% | 96.00% | 1.30%

A Note on Normalization and Averaging

I’m working on Black Tree Massive, and was trying to improve my Supervised Classification Algorithm, and it turns out the accuracies are much better than I initially reported, but that’s not why I’m writing. I am writing because I noticed something remarkable in the process: in some cases, my normalization algorithm causes the average value over all columns / dimensions of a dataset to be exactly the same. This at first caused me to think that there was something wrong with my command line code, or something wrong with my normalization algorithm, but neither is the case. It instead makes perfect sense upon reflection, because my normalization algorithm forces all dimensions to have exactly the same number of digits. As a consequence, if all dimensions are drawn from the same distribution, they will produce the same average value after normalization, and this is exactly what happens, in at least some cases.
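For illustration only, here is a minimal Octave sketch of the observation. The normalization below, which rescales each column by a power of ten so that its largest absolute value lands in [0.1, 1), is an assumed stand-in for the actual normalization algorithm, which is not reproduced here; the point is simply that after a digit-matching rescaling, the per-column averages can be compared directly.

% Hedged sketch: rescale each column by a power of ten so that its largest
% absolute value lands in [0.1, 1), i.e., every column has the same number
% of digits, then compare the per-column averages. This is an assumed
% stand-in for the actual normalization algorithm.
function [normalized, column_averages] = normalize_and_average(data)
  normalized = data;
  for j = 1 : columns(data)
    m = max(abs(data(:, j)));
    if m > 0
      k = floor(log10(m)) + 1;               % power of ten such that m / 10^k lies in [0.1, 1)
      normalized(:, j) = data(:, j) ./ 10^k;
    end
  end
  column_averages = mean(normalized);        % one average per column / dimension
end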

This is rather remarkable, since independent dimensions of a dataset typically report independent aspects of a system, yet as a matter of practice, despite reporting different aspects, the underlying distribution appears to be, at least at times, the same. Specifically, for the UCI Parkinson’s Dataset the average value is, astonishingly, .1776 in all cases except one, for which it is -.1776. This at first had me convinced that I had made a mistake, since I have no idea what Independence Day has to do with Parkinson’s Disease, but that’s probably not what’s driving the behavior: it is instead likely due to each dimension of the dataset being drawn from the same underlying distribution, which is astonishing in its own right. Ironically, the proposed improvement that led to this discovery, though theoretically quite elegant (specifically, sort the dataset using the dimension with the largest average value, with descending averages as tie-breakers), turned out to be terrible, and did not improve performance, at least not consistently. But discovering things almost invariably (in my experience) involves being wrong.

I’ve updated the actual classification algorithm, which ended up somewhere theoretically elegant, reasonably accurate, and incredibly fast, classifying 4,500 testing rows over 25,500 training rows in an average of three seconds. The tie-breakers for the sorting step are ordered according to the total difference between adjacent terms in each dimension, when that dimension is sorted. In other words, you sort each dimension individually, take the absolute difference |a_i - a_{i+1}| for all i, and take the sum over those absolute differences, doing exactly that for each dimension. The greater the total difference for a given dimension, the higher its priority in the sorting tie-breaker, and the reason is that you want to get as close as possible to sorting the vectors in the order of their nearest neighbors (see Theorem 2.1). This is nonetheless not true sorting, and I’m doing exactly this for reasons of efficiency, since actually sorting is much slower than this approximation. All the code you need to run this classification algorithm is attached.
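Here is a short Octave sketch of that ordering (my reading of the description above, not the attached code): each column is sorted individually, the adjacent absolute differences are summed, and the columns are then used as sortrows keys in descending order of that total.

% Sketch: order the sorting tie-breakers by the total difference between
% adjacent terms in each dimension, when that dimension is sorted, with
% the largest total receiving the highest priority.
function [sorted_data, column_order] = sort_by_adjacent_difference(data)
  num_cols = columns(data);
  totals = zeros(1, num_cols);
  for j = 1 : num_cols
    col = sort(data(:, j));                  % sort each dimension individually
    totals(j) = sum(abs(diff(col)));         % sum of |a_i - a_{i+1}| over all i
  end
  [~, column_order] = sort(totals, "descend");
  sorted_data = sortrows(data, column_order);  % highest total = highest priority
end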

Here’s a screen shot of the command line code that produces the average values (not attached, it’s quite simple):

Here’s the code:

Updated Label-Based Classification

This is a faster version of an algorithm I introduced a while back, but it turns out it doesn’t help solve anything new, and it’s not as fast as my sort-based algorithms, so I’m not including it in Black Tree Massive. The idea is that human-generated classifier labels might be to some extent arbitrary, even though they define real-world classes. Specifically, individual classes may actually have sub-classes that share a top-level, human-generated classifier label. This algorithm finds the “natural” classes on an unsupervised basis, and uses those for prediction. The prediction step is in the previous article.

Complex Number Automata

I’ve seen some research online about using complex numbers in automata, and a while back in Stockholm, I noticed that you can form a closed set over S = \{1,-1,i,-i\} by simply multiplying any element of the set by any other, and I haven’t seen this used in automata anywhere else. I’ve obviously been a bit busy working on my A.I. software, Black Tree AutoML, but today I took some time out to put together a simple script that generates automata using this set. There’s only one rule, which is to simply take the product over a subset of S, which means the behavior is determined entirely by the initial conditions and the boundary conditions (i.e., the top and bottom rows, which are therefore immutable). This produces 16 possible outcomes using uniform values for the initial conditions and the boundary conditions (i.e., the initial conditions and boundary conditions are independent of each other, and are each elements of S). Here are the 16 possible outputs, with 250 rows, and 500 iterations (i.e., columns):

Each output, again, has two variables: the initial row, set to a single value taken from S, and the value of the top and bottom rows, also taken from S, which produces 16 possible outputs. Unfortunately, the file names, which had the initial and boundary values in them, got screwed up during upload, and I’m not redoing it, but you can produce all of these using the code attached below, as there are only 16 of them. You’ll note CA (1,1) and CA (-1,-1) are opposites (i.e., initial row = 1, boundary rows = 1; and initial row = -1, boundary rows = -1), in that one is totally white, and one is totally black. This follows from basic arithmetic, since if all initial cells are 1 (white), then any product among them will be 1. Similarly, if all initial cells are -1 (black), then any three cells will produce a product of -1 (the algorithm takes the product over exactly 3 cells). I also noticed what seem to be other algebraic relationships among them, where cells flip from green to blue, and black to white.
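For illustration, here is a minimal Octave sketch (not the attached script), assuming each interior cell in a column is the product of the three vertically adjacent cells in the previous column, with the top and bottom rows held fixed; the color assignment for the four values is likewise an assumption.

% Sketch: a complex-number automaton over S = {1, -1, i, -i}, with one rule,
% assumed here to be the product of the three vertically adjacent cells in
% the previous column. The top and bottom rows are immutable boundaries.
num_rows = 250;                    % cells per column
num_cols = 500;                    % iterations (columns)
init_val = 1i;                     % initial value of the interior cells (element of S)
bound_val = -1;                    % value of the top and bottom rows (element of S)

ca = zeros(num_rows, num_cols);
ca(:, 1) = init_val;               % initial column
ca(1, :) = bound_val;              % top boundary row
ca(num_rows, :) = bound_val;       % bottom boundary row

for t = 1 : num_cols - 1
  for r = 2 : num_rows - 1
    ca(r, t + 1) = ca(r - 1, t) * ca(r, t) * ca(r + 1, t);   % the single rule
  end
end

% map the four values to colors, e.g., 1 = white, -1 = black, i = green, -i = blue
codes = zeros(num_rows, num_cols);
codes(ca == 1) = 1; codes(ca == -1) = 2; codes(ca == 1i) = 3; codes(ca == -1i) = 4;
imagesc(codes); colormap([1 1 1; 0 0 0; 0 1 0; 0 0 1]);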

The shapes are simply beautiful, achieved using nothing other than multiplication, but beyond aesthetics, the notion more closely tracks an intuitive model of physics, in that there should be only one rule, which is then applied to some set of conditions, and that is exactly what you get using this model. In contrast, typical cellular automata have many rules (e.g., for a three-cell binary neighborhood, there are eight possible input combinations, producing 2^8 = 256 possible rules). As a consequence, there’s a lot of interplay between the initial conditions and the rule when using typical automata. In contrast, in this model there’s exactly one rule, and so the behavior is determined entirely by the initial conditions, which is exactly what you would expect in the real world, where I think it’s fair to say we assume the existence of one, comprehensive rule of physics (see, e.g., Section 1.4 of my paper, A Computational Model of Time-Dilation). I’m not suggesting this set of automata defines the fundamental rule of physics. The point is instead that if you want to model physics using automata, and you’re being intellectually honest, you should probably use something like this, unless of course you’re just trying to be practical, which is a different goal.

A Note on Language / Compiler Theory

It dawned on me tonight, while working on some code, that there are statements that will compile in a language, but cannot be expressed without running the compiler twice: once to generate code, and a second time to run the generated code. I’ll simply provide a more academic version of the exact problem I had, which was specifying a statement that could be generated using the text processing functions of a language, but cannot, as a general matter, be directly expressed in the language itself. To make things concrete, consider a set S of natural numbers, and assume that S is an infinite, non-computable subset of the natural numbers, which must exist, since the set of all subsets of the natural numbers is uncountable, whereas the set of all programs is of course countable. Now assume you want to generate exactly the following statement:

x = [S_1, S_2, \ldots, S_k];,

where S_i is the i-th number in the set S. Because S is a non-computable subset, there is no general equation that will allow you to generate the entries of S, and so you cannot, as a general matter, generate the statement above without first generating the first k entries of S. It follows that, in general, you must run the compiler twice in order to generate a statement that is nonetheless accepted in the language of the compiler. This might be a known result, but I’ve never heard of it before, and as a general matter, it suggests a useful method of programming that uses the text processing functions of a language to generate statements that will compile, but would otherwise be impossible to generate. And though it’s not impossible to generate the code in my A.I. software, it turns out this is exactly how the front-end GUI for Black Tree works, in that the user makes a series of selections, which causes the GUI to generate the relevant code.
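Here is a toy Octave sketch of that two-pass pattern (the entries below are placeholders, not an actual non-computable set, which of course cannot be produced by any program): the first pass uses text processing to generate the statement, and the second pass runs the generated code.

% Sketch of the two-pass pattern: generate a statement as text, write it to
% a script file, then run the generated script. The entries are placeholders
% standing in for values that would otherwise have to be produced first.
entries = [1 4 9 16];                                    % stand-in for S_1, ..., S_k
stmt = ["x = [" sprintf("%d, ", entries(1:end-1)) sprintf("%d", entries(end)) "];"];

fid = fopen("generated_statement.m", "w");               % first pass: generate the code
fprintf(fid, "%s\n", stmt);
fclose(fid);

run("generated_statement.m");                            % second pass: run the generated code
disp(x);                                                 % x = [1, 4, 9, 16]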

Massive Unsupervised Modal Classification

This is a sort-based version of the algorithm I discuss in Information, Knowledge, and Uncertainty, which uses the modal class of a cluster to predict the class of its geometric origin. I’m still testing it, but the accuracy seems excellent so far. It’s the exact same technique as the other sort-based algorithms, using sorting as a substitute for Nearest Neighbor. I proved that sorting has a deep connection to the Nearest Neighbor method in Sorting, Information, and Recursion, which forms the theoretical basis for these algorithms. The accuracies and runtimes shown below are averages over 100 iterations. The testing percentage is set to 15% for all datasets (i.e., 100 rows produce 85 training rows and 15 testing rows). Accuracy generally increases as a function of confidence, and there are two measures: one information-based, using the equations I presented in the paper Information, Knowledge, and Uncertainty, and the other size-based, which simply treats the cluster size itself as a measure of confidence (i.e., the measure of confidence is literally the cluster size, causing larger clusters to be treated as more reliable than smaller clusters).
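As a small sketch of the prediction step (hypothetical names, not the attached code), the cluster could be produced by walking a sorted training list, as in the earlier sketch, after which the prediction is the modal class of the cluster and the size-based confidence is just the number of elements in the cluster:

% Sketch: predict the class of a cluster's geometric origin using the modal
% class of the cluster, with the cluster size as the size-based confidence.
function [prediction, confidence] = modal_class_prediction(cluster_idx, training_labels)
  cluster_labels = training_labels(cluster_idx);
  prediction = mode(cluster_labels);         % modal class of the cluster
  confidence = numel(cluster_idx);           % size-based measure of confidence
end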

Dataset | Raw Accuracy | Max Accuracy (Information-Based Conf.) | Max Accuracy (Size-Based Conf.) | No. Rows | Runtime (Seconds)
UCI Abalone | 53.88% | 100.0% | 80.72% | 4177 | 0.460
UCI Credit | 72.12% | 100.0% | 79.69% | 2500 | 0.081
UCI Ion | 98.21% | 100.0% | 100.0% | 351 | 0.016
UCI Iris | 93.66% | 100.0% | 100.0% | 150 | 0.009
UCI Parkinsons | 97.78% | 100.0% | 100.0% | 195 | 0.006
UCI Spam | 80.18% | 100.0% | 100.0% | 4601 | 0.370
UCI Wine | 100% | 100.0% | 100.0% | 178 | 0.004

Here’s the code; any missing functions can be found in my library on ResearchGate.