I’ve introduced three algorithms that will form the basis of Black Tree Massive, one is a more efficient but mathematically identical version of my Supervised Delta Algorithm, and the other two are sort-based classification algorithms that have runtimes that completely eclipse even my prior work, which already has unprecedented runtimes, and are themselves almost certainly the fastest algorithms on the market. The Supervised Delta Algorithm is rooted in my paper, Analyzing Dataset Consistency, in which I prove that polynomial time algorithms can produce literally perfect clustering and classification, under certain reasonable assumptions. This typically translates into very high and at times perfect accuracy for benchmark datasets. In another paper, Sorting, Information, and Recursion, I proved a surprising connection between sorting and the nearest neighbor algorithm, specifically, that if a list of vectors is sorted, then the adjacent entries in the sorted list are nearest neighbors of each other (see, Theorem 2.1). As a consequence, if you begin at a given entry in a sorted list, and proceed to adjacent entries in both directions, you will encounter the set of points that are included in some sphere in the Euclidean space of the dataset. This must be true, since the distance between two vectors in a sorted list, can never exceed the sum of the distances over the adjacent pairs that are between them, since this sum is by definition either equal to or greater than the straight-line distance between the two vectors. As a result, you can cluster data by simply collecting vectors in a sorted list, proceeding in both directions, from a given index in the list. Using this method, you don’t have to calculate the norm of the difference between a given vector, and all other vectors in order to find the vectors that are within a given sphere (i.e., within a given radius). This saves significant time during processing, leading to algorithms that can classify at a rate of approximately 4,500 Testing Rows over 25,500 Training rows, in roughly 3 seconds (running on a MacBook Air).
Because the runtimes are so fast, you can therefore run the algorithms hundreds of times, and still produce a practical, and very fast runtime, which in turn allows you to calculate a meaningful measure of confidence that I introduced in yet another paper, Information, Knowledge, and Uncertainty. This process requires running classification a few hundred times in order to generate a distribution of confidence intervals, that are then applied to a final prediction step. Ideally, accuracy should increase as a function of confidence, and this is empirically what happens. This allows you to take the solid raw accuracies produced by the initial sort-based classification algorithms (typically about 80%), and filter predictions, increasing required confidence until you achieve a maximum (or desired) accuracy (typically maximized between 90% and 100%). The net result is very high accuracy, making use of algorithms that can handle simply enormous datasets, running on a consumer device.
This is likely a fundamental breakthrough in computer science, and A.I.
Here are links to notes regarding the algorithms in question, which in each case, includes the relevant Octave code:
Massive Supervised Delta Classification