Sort-Based Classification

I’ve introduced three algorithms that will form the basis of Black Tree Massive: one is a more efficient but mathematically identical version of my Supervised Delta Algorithm, and the other two are sort-based classification algorithms whose runtimes completely eclipse even my prior work, which already has unprecedented runtimes, and which are themselves almost certainly the fastest algorithms on the market. The Supervised Delta Algorithm is rooted in my paper, Analyzing Dataset Consistency, in which I prove that polynomial-time algorithms can produce literally perfect clustering and classification under certain reasonable assumptions. This typically translates into very high, and at times perfect, accuracy on benchmark datasets. In another paper, Sorting, Information, and Recursion, I proved a surprising connection between sorting and the nearest neighbor algorithm: specifically, that if a list of vectors is sorted, then adjacent entries in the sorted list are nearest neighbors of each other (see Theorem 2.1). As a consequence, if you begin at a given entry in a sorted list and proceed to adjacent entries in both directions, you will encounter a set of points that is contained in some sphere in the Euclidean space of the dataset. This must be true because, by the triangle inequality, the straight-line distance between two vectors in a sorted list can never exceed the sum of the distances over the adjacent pairs between them. As a result, you can cluster data by simply collecting vectors from a sorted list, proceeding in both directions from a given index. Using this method, you don’t have to calculate the norm of the difference between a given vector and every other vector in order to find the vectors that fall within a given sphere (i.e., within a given radius). This saves significant time during processing, leading to algorithms that can classify approximately 4,500 Testing Rows over 25,500 Training Rows in roughly 3 seconds (running on a MacBook Air).
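To make the clustering step concrete, here is a minimal Octave sketch of the idea, not the implementation from the notes linked below: it assumes the rows of X have already been sorted into the required order, and it walks outward from an anchor row, using the running sum of distances over adjacent pairs as an upper bound on the straight-line distance back to the anchor (the function name and parameters are illustrative).

```octave
% Minimal illustrative sketch: collect rows of a sorted dataset that are
% guaranteed to lie within a sphere of radius delta around an anchor row,
% by walking outward in both directions and bounding the straight-line
% distance with the running sum of distances over adjacent pairs.
function cluster_idx = sphere_cluster(X, anchor, delta)
  N = size(X, 1);
  cluster_idx = anchor;

  % Walk forward through the sorted list.
  total = 0;
  for i = anchor : N - 1
    total = total + norm(X(i + 1, :) - X(i, :));
    if (total > delta)
      break;
    endif
    cluster_idx(end + 1) = i + 1;
  endfor

  % Walk backward through the sorted list.
  total = 0;
  for i = anchor : -1 : 2
    total = total + norm(X(i, :) - X(i - 1, :));
    if (total > delta)
      break;
    endif
    cluster_idx(end + 1) = i - 1;
  endfor
endfunction
```

Every row collected this way is guaranteed to lie within the sphere of radius delta around the anchor, and the distance from the anchor to an arbitrary row is never computed.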

Because the runtimes are so fast, you can run the algorithms hundreds of times and still produce a practical, very fast overall runtime, which in turn allows you to calculate a meaningful measure of confidence that I introduced in yet another paper, Information, Knowledge, and Uncertainty. This process requires running classification a few hundred times in order to generate a distribution of confidence intervals that are then applied to a final prediction step. Ideally, accuracy should increase as a function of confidence, and this is empirically what happens. This allows you to take the solid raw accuracies produced by the initial sort-based classification algorithms (typically about 80%) and filter predictions, increasing the required confidence until you achieve a maximum (or desired) accuracy (typically maximized between 90% and 100%). The net result is very high accuracy, making use of algorithms that can handle simply enormous datasets, running on a consumer device.
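As a minimal sketch of the filtering step, assuming each prediction simply carries a scalar confidence score (the vectors preds, labels, and conf below are hypothetical, not taken from the attached code), accuracy and the surviving percentage can be swept over a minimum confidence threshold as follows:

```octave
% Illustrative sketch: accuracy and surviving percentage as a function of a
% minimum confidence threshold. preds, labels, and conf are assumed to be
% vectors with one entry per Testing Row.
thresholds = linspace(min(conf), max(conf), 100);
accuracy = nan(1, numel(thresholds));
surviving_pct = zeros(1, numel(thresholds));
for k = 1 : numel(thresholds)
  keep = (conf >= thresholds(k));                     % predictions that survive the filter
  surviving_pct(k) = 100 * mean(keep);
  if (any(keep))
    accuracy(k) = 100 * mean(preds(keep) == labels(keep));
  endif
endfor
% Ideally, accuracy increases as the required confidence increases.
```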

This is likely a fundamental breakthrough in computer science and A.I.

Here are links to notes regarding the algorithms in question, each of which includes the relevant Octave code:

Massive Supervised Delta Classification

Massive Supervised Classification

Massive Unsupervised Modal Classification

Updated Unsupervised Sort-Based Classification

Attached is the code for the unsupervised analog of the algorithm I presented yesterday. The accuracy seems solid, though it needs more testing, and the runtimes are ballpark the same (i.e., about 10 seconds for 4,500 Testing Rows classified over 25,500 Training Rows).

Updated Supervised Classification (Sort-Based)

Attached is the full command line code for the algorithm I mentioned yesterday, which now includes confidence calculations, allowing for higher accuracy. The method is analogous to the ideas discussed in my paper, Information, Knowledge, and Uncertainty [1], in that a confidence metric is assigned to every prediction, and accuracy is then calculated as a function of confidence, though this entire series of algorithms also makes use of sorting as a faster way to implement a pseudo-nearest neighbor algorithm (see Theorem 2.1 [2]). What’s interesting is that a much simpler confidence metric, namely the number of elements in the cluster associated with a prediction, works really well, despite not having the same rigorous theoretical basis as the information-based confidence metric I introduce in [1]. This could be because this algorithm is supervised, producing homogeneous clusters (i.e., every cluster consists of a single class), so you could argue the only relevant factor is the size of the cluster. If this is true, then the equation I presented in [1] is wrong, despite the fact that it works, in the sense that there would be another equation that ignores the dataset as a whole and looks only to the cluster in question. I can’t say whether or not I tested this possibility in the past, and it’s not in my interest to test it now, because I have software that works, so the academics will have to wait until Black Tree Massive is complete.
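For concreteness, here is a small Octave sketch of the size-based confidence metric (the representation is hypothetical: clusters is assumed to be a cell array, with clusters{i} holding the rows of the cluster used to generate prediction i):

```octave
% Illustrative sketch: size-based confidence is just the number of elements in
% the cluster associated with each prediction.
num_preds = numel(clusters);
size_conf = zeros(1, num_preds);
for i = 1 : num_preds
  size_conf(i) = numel(clusters{i});
endfor

% Filtering then works exactly as with the information-based metric: raise the
% minimum required cluster size and keep only the surviving predictions.
min_cluster_size = 10;                    % hypothetical threshold
keep = (size_conf >= min_cluster_size);
```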

As I noted, confidence is calculated for every prediction, in this case twice: once based upon the information metric I introduce in [1], and again using only the size of the cluster. As you increase the required confidence, in both cases you eliminate predictions, leaving some surviving percentage, which is also listed below. The Raw Accuracy is the accuracy prior to filtering based upon confidence, though it includes “rejections”, a concept from my original algorithms that is carried over here. The runtime of this particular algorithm is simply astonishing, classifying 4,500 Testing Rows over 25,500 Training Rows in about 3 seconds, running on a MacBook Air, which totally demolishes even my prior work, and basically makes a joke of everyone else’s.

Dataset          Raw Accuracy   Max Inf.-Based Accuracy   Surviving Percentage (Inf.)   Max Size-Based Accuracy   Surviving Percentage (Size)
UCI Credit       74.85%         81.01%                    0.47%                         83.33%                    0.60%
UCI Ionosphere   81.32%         85.12%                    5.50%                         100.0%                    0.18%
UCI Iris         97.98%         100.0%                    1.36%                         100.0%                    0.44%
UCI Parkinsons   80.70%         85.16%                    24.0%                         83.66%                    7.30%
UCI Spam         79.00%         80.00%                    34.0%                         100.0%                    0.13%
UCI Wine         80.64%         85.71%                    0.50%                         96.00%                    1.30%

A Note on Normalization and Averaging

I’m working on Black Tree Massive, and was trying to improve my Supervised Classification Algorithm, and it turns out the accuracies are much better than I initially reported, but that’s not why I’m writing. I am writing because I noticed something remarkable in the process, which is that my normalization algorithm causes, in some cases, every column / dimension of a dataset to have exactly the same average value. This at first caused me to think that there was something wrong with my command line code, or with my normalization algorithm, but neither is the case. It instead makes perfect sense upon reflection, because my normalization algorithm forces all dimensions to have exactly the same number of digits. As a consequence, if all dimensions are drawn from the same distribution, they will produce the same average value after normalization, and this is exactly what happens, in at least some cases.
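The sketch below is only a stand-in, written to illustrate the point, and is not the normalization algorithm used in Black Tree Massive: each column is rescaled by a power of ten and rounded, so that every column is expressed with the same number of digits.

```octave
% Illustrative stand-in only (not the Black Tree Massive normalization):
% rescale each column by a power of ten so its largest absolute value is at
% most 1, then round to a fixed number of digits, so that every column ends
% up expressed with the same number of digits.
function X_norm = normalize_digits(X, num_digits)
  X_norm = zeros(size(X));
  for j = 1 : columns(X)
    col = X(:, j);
    m = max(abs(col));
    if (m == 0)
      scale = 1;                            % avoid division by zero for an all-zero column
    else
      scale = 10 ^ (floor(log10(m)) + 1);   % smallest power of ten strictly above max |value|
    endif
    X_norm(:, j) = round((col / scale) * 10 ^ num_digits) / 10 ^ num_digits;
  endfor
endfunction
```

Under this stand-in, two columns whose values differ only by a power of ten are mapped onto the same scale and rounded to the same number of digits, so if their underlying distributions are otherwise the same, their post-normalization averages will coincide, up to rounding and sampling noise.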

This is rather remarkable, since independent dimensions of a dataset typically report independent aspects of a system, but it turns out that, as a matter of practice, despite reporting different aspects, the underlying distribution appears to be, at least at times, the same. Specifically, for the UCI Parkinson’s Dataset the average value is, astonishingly, .1776 in all cases except one, for which it is -.1776. This at first had me convinced that I had made a mistake, since I have no idea what Independence Day has to do with Parkinson’s Disease, but it turns out that’s probably not what’s driving the behavior; it is instead likely due to each dimension of the dataset being drawn from the same underlying distribution, which is astonishing in its own right. Ironically, the proposed improvement that led to this discovery, which was itself theoretically quite elegant (specifically, sort the dataset using the dimension with the largest average value as the primary key, with descending averages as tie-breakers), turned out to be terrible, and did not improve performance, at least not consistently. But discovering things almost invariably (in my experience) involves being wrong.

I’ve updated the actual classification algorithm, which ended up somewhere theoretically elegant, reasonably accurate, and incredibly fast, classifying 4,500 Testing Rows over 25,500 Training Rows in an average of three seconds. The tie-breakers for the sorting step are ordered according to the total difference between adjacent terms in each dimension, when that dimension is sorted. In other words, you sort each dimension individually, take the absolute difference |a_i - a_{i+1}| for all i, and sum those differences, i.e., the sum over i of |a_i - a_{i+1}|, doing exactly that for each dimension. The greater the total difference for a given dimension, the higher its priority in the sorting tie-breakers, the reason being that you want to get as close as possible to sorting the vectors in the order of their nearest neighbors (see Theorem 2.1). This is nonetheless not true sorting, and I’m doing exactly this for reasons of efficiency, since actually sorting is much slower than this approximation. All the code you need to run this classification algorithm is attached.
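For illustration, here is a small Octave sketch of just the ordering step described above, not the attached code, and it assumes the primary key and the tie-breakers are all taken in descending order of the total adjacent difference:

```octave
% Illustrative sketch: order the columns of X by the total difference between
% adjacent terms of each individually sorted column, then use that ordering as
% the keys for sortrows (the first key is the primary sort, the rest are
% tie-breakers). This approximates, but is not, a true nearest neighbor sort.
function [X_sorted, col_order] = sort_by_adjacent_diff(X)
  d = columns(X);
  total_diff = zeros(1, d);
  for j = 1 : d
    col = sort(X(:, j));
    total_diff(j) = sum(abs(diff(col)));    % sum over i of |a_i - a_{i+1}|
  endfor
  [~, col_order] = sort(total_diff, 'descend');
  X_sorted = sortrows(X, col_order);
endfunction
```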

Here’s a screen shot of the command line code that produces the average values (not attached, it’s quite simple):

Here’s the code:

Updated Label-Based Classification

This is a faster version of an algorithm I introduced a while back, but it turns out it doesn’t help solve anything new, and it’s not as fast as my sort-based algorithms, so I’m not including it in Black Tree Massive. The idea is that human-generated classifier labels might be to some extent arbitrary, even though they define real-world classes: specifically, individual classes may actually contain sub-classes that share a top-level human-generated classifier label. This algorithm finds the “natural” classes on an unsupervised basis, and uses those for prediction. The prediction step is in the previous article.
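As a rough illustration of the idea only, the sketch below uses k-means from the Octave statistics package as a stand-in for the actual unsupervised clustering step, and X_train, y_train, and X_test are hypothetical variables: it discovers “natural” sub-classes without using the labels, assigns each sub-class the modal training label of its members, and predicts by nearest sub-class centroid.

```octave
% Rough illustration only: k-means is a stand-in for the unsupervised
% clustering step, not the algorithm referenced above.
pkg load statistics;

k = 10;                                   % hypothetical number of natural sub-classes
[cluster_idx, centers] = kmeans(X_train, k);

% Label each sub-class with the most common training label among its members.
cluster_label = zeros(k, 1);
for c = 1 : k
  cluster_label(c) = mode(y_train(cluster_idx == c));
endfor

% Predict each Testing Row using the label of the nearest sub-class centroid.
preds = zeros(rows(X_test), 1);
for i = 1 : rows(X_test)
  dists = sqrt(sum((centers - X_test(i, :)) .^ 2, 2));
  [~, nearest] = min(dists);
  preds(i) = cluster_label(nearest);
endfor
```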