Attached is the full command-line code for the algorithm I mentioned yesterday, which now includes confidence calculations, allowing for higher accuracy. The method is analogous to the ideas discussed in my paper, Information, Knowledge, and Uncertainty [1], in that a confidence metric is assigned to every prediction, and accuracy is then calculated as a function of confidence. This entire series of algorithms also makes use of sorting, as a faster way to implement a pseudo-nearest neighbor algorithm (see Theorem 2.1 of [2]).

What’s interesting is that a much simpler confidence metric, namely the number of elements in the cluster associated with a prediction, works really well, despite lacking the rigorous theoretical basis of the information-based confidence metric I introduce in [1]. This could be because the algorithm is supervised and produces homogeneous clusters (i.e., every cluster consists of a single class), so you could argue that the only relevant factor is the size of the cluster. If this is true, the equation I presented in [1] is wrong, despite the fact that it works, in that there would be another equation that ignores the dataset as a whole and looks only to the cluster in question. I can’t say whether or not I tested this possibility in the past, and it’s not in my interests to test it now, because I have software that works, and so the academics will have to wait until Black Tree Massive is complete.
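To make the two ideas above concrete, here is a minimal Python sketch of a sort-based neighbor lookup that also reports the size of the homogeneous cluster around the match. It is not the attached code. The scalar sort key (the sum of a row's features), the function names `build_index` and `predict_with_cluster_size`, and the definition of a cluster as a contiguous run of same-class rows in sorted order are all assumptions made for illustration.

```python
import bisect


def build_index(rows, labels):
    """Sort the training rows by a scalar key and return the parallel lists.

    Assumption: the key is the sum of the row's features. Any scalar
    summary of a row could be substituted here.
    """
    order = sorted(range(len(rows)), key=lambda i: sum(rows[i]))
    keys = [sum(rows[i]) for i in order]
    sorted_labels = [labels[i] for i in order]
    return keys, sorted_labels


def predict_with_cluster_size(keys, sorted_labels, row):
    """Predict a label for `row` and return (label, cluster_size).

    The nearest training row is found by binary search over the sorted
    keys, so no pairwise distance scan is needed. The cluster is taken to
    be the contiguous run of rows sharing the predicted label (an
    assumption), and its length serves as the size-based confidence.
    """
    if not keys:
        raise ValueError("the training index is empty")

    key = sum(row)
    i = bisect.bisect_left(keys, key)

    # Only the two entries adjacent to the insertion point can be nearest.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(keys)]
    nearest = min(candidates, key=lambda j: abs(keys[j] - key))
    label = sorted_labels[nearest]

    # Expand left and right while the class stays the same.
    lo = nearest
    while lo > 0 and sorted_labels[lo - 1] == label:
        lo -= 1
    hi = nearest
    while hi < len(keys) - 1 and sorted_labels[hi + 1] == label:
        hi += 1

    return label, hi - lo + 1


train_rows = [[0.0, 0.1], [0.2, 0.1], [0.3, 0.3], [5.0, 5.1], [5.2, 4.9]]
train_labels = ["a", "a", "a", "b", "b"]
keys, sorted_labels = build_index(train_rows, train_labels)
print(predict_with_cluster_size(keys, sorted_labels, [0.1, 0.1]))
print(predict_with_cluster_size(keys, sorted_labels, [5.0, 5.0]))
```

The index is built with one sort, and each lookup costs a binary search plus the cluster expansion. Collapsing a row to a single scalar loses information, which is one reason such a lookup is only a pseudo-nearest neighbor.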
As I noted, confidence is calculated for every prediction, in this case twice: once using the information metric I introduce in [1], and again using only the size of the cluster. As you increase the required confidence, in both cases, you eliminate predictions, leaving some surviving percentage, which is also listed below. The Raw Accuracy is the accuracy prior to filtering based upon confidence, but it includes “rejections”, a concept carried over from my original algorithms. The runtime of this particular algorithm is simply astonishing: it classifies 4,500 testing rows over 25,500 training rows in about 3 seconds on a MacBook Air, which totally demolishes even my prior work, and basically makes a joke of everyone else’s.
Dataset | Raw Accuracy | Max Information-Based Accuracy | Surviving Percentage (Information) | Max Size-Based Accuracy | Surviving Percentage (Size) |
UCI Credit | 74.85% | 81.01% | 0.47% | 83.33% | 0.60% |
UCI Ionosphere | 81.32% | 85.12% | 5.50% | 100.0% | 0.18% |
UCI Iris | 97.98% | 100.0% | 1.36% | 100.0% | 0.44% |
UCI Parkinsons | 80.70% | 85.16% | 24.0% | 83.66% | 7.30% |
UCI Spam | 79.00% | 80.00% | 34.0% | 100.0% | 0.13% |
UCI Wine | 80.64% | 85.71% | 0.50% | 96.00% | 1.30% |
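The relationship between accuracy, confidence, and the surviving percentage can be sketched in a few lines of Python. This is an illustration rather than the attached code: the function name `accuracy_by_confidence` and the toy inputs are hypothetical, the confidence scores could come from either metric, and “rejections” are not modeled.

```python
def accuracy_by_confidence(predictions, truths, confidences):
    """Return (threshold, accuracy, surviving_fraction) for each threshold.

    Every distinct confidence value is tried as a threshold. Predictions
    with confidence below the threshold are discarded, and accuracy is
    measured over the predictions that survive.
    """
    total = len(predictions)
    results = []
    for threshold in sorted(set(confidences)):
        # True/False correctness flags for the surviving predictions only.
        kept = [
            p == t
            for p, t, c in zip(predictions, truths, confidences)
            if c >= threshold
        ]
        accuracy = sum(kept) / len(kept)
        results.append((threshold, accuracy, len(kept) / total))
    return results


# Toy data: confidence here plays the role of cluster size.
predictions = [1, 0, 1, 1, 0]
truths = [1, 1, 1, 0, 0]
confidences = [5, 1, 3, 1, 4]

for threshold, accuracy, surviving in accuracy_by_confidence(
    predictions, truths, confidences
):
    print(f"confidence >= {threshold}: accuracy {accuracy:.2%}, surviving {surviving:.2%}")
```

The lowest threshold keeps every prediction, so its accuracy is the unfiltered figure. A "max accuracy" column like the ones above would correspond to the best accuracy across thresholds, reported with the fraction of predictions that survive at that threshold.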