This is a sort-based version of the algorithm I discuss in Information, Knowledge, and Uncertainty, which uses the modal class of a cluster to predict the class of its geometric origin. I’m still testing it, but the accuracy seems excellent so far. It’s exactly the same technique as my other sort-based algorithms, which use sorting as a substitute for Nearest Neighbor. In Sorting, Information, and Recursion, I proved that sorting has a deep connection to the Nearest Neighbor method, and that result forms the theoretical basis for these algorithms. The accuracies and runtimes shown below are averages over 100 iterations. The testing percentage is set to 15% for all datasets (e.g., 100 rows produce 85 training rows and 15 testing rows). Accuracy generally increases as a function of confidence, and there are two measures of confidence. The first is information-based, using the equations I presented in Information, Knowledge, and Uncertainty. The second is size-based, and simply treats the cluster size itself as the measure of confidence, so that larger clusters are treated as more reliable than smaller clusters.
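To make the idea concrete, here is a minimal Python sketch of modal-class prediction over a sorted training set, with size-based confidence. It is not my actual implementation, and several details are assumptions made purely for illustration: the sort key (the Euclidean norm of each row), the fixed radius `delta` that defines a cluster, and all of the function names. The information-based measure is not shown, because it depends on the equations in the paper.

```python
from bisect import bisect_left, bisect_right
from collections import Counter
import math


def sort_key(row):
    """Hypothetical scalar key (an assumption): the Euclidean norm of the row."""
    return math.sqrt(sum(x * x for x in row))


def fit(train_rows, train_labels):
    """Sort the training set once by key; return keys and labels in sorted order."""
    pairs = sorted(
        zip((sort_key(r) for r in train_rows), train_labels),
        key=lambda p: p[0],
    )
    keys = [k for k, _ in pairs]
    labels = [c for _, c in pairs]
    return keys, labels


def predict(keys, labels, row, delta):
    """Return (modal class, cluster size) for the cluster around `row`.

    The cluster is every training row whose key lies within `delta` of the
    key of `row`. Binary search on the sorted keys stands in for a
    nearest-neighbor scan. An empty cluster yields (None, 0).
    """
    k = sort_key(row)
    lo = bisect_left(keys, k - delta)
    hi = bisect_right(keys, k + delta)
    cluster = labels[lo:hi]
    if not cluster:
        return None, 0
    modal_class, _ = Counter(cluster).most_common(1)[0]
    return modal_class, len(cluster)


def accuracy_at_confidence(predictions, truths, min_size):
    """Accuracy over predictions whose cluster size is at least `min_size`.

    Returns None if no prediction meets the threshold.
    """
    kept = [
        (p, t)
        for (p, size), t in zip(predictions, truths)
        if p is not None and size >= min_size
    ]
    if not kept:
        return None
    return sum(p == t for p, t in kept) / len(kept)
```

Sweeping `min_size` upward and recording the accuracy at each threshold gives accuracy as a function of size-based confidence. Note that the sort happens once, in `fit`, so each prediction costs only two binary searches plus a count over the cluster.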
| Dataset | Raw Accuracy | Max Accuracy (Information-Based Conf.) | Max Accuracy (Size-Based Conf.) | No. Rows | Runtime (Seconds) |
Here’s the code; any missing functions can be found in my library on ResearchGate.