Variation on Supervised Delta

This is somewhere in my code library, but here it is again, as I might use it for the Massive Version of Black Tree, which should be ready early to mid-June. All it does is test accuracy as a function of delta, rather than run the original supervision step. The idea is to produce the single value of delta that generates the highest accuracy, or the desired number of rejections.
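As a rough illustration of the idea, here's a minimal Python sketch (not the actual library code) that sweeps a set of candidate delta values and keeps the one that maximizes accuracy, or that comes closest to a desired number of rejections; classify_with_delta is a hypothetical stand-in for the underlying classifier.

# Minimal sketch: sweep delta and keep the best value.
# classify_with_delta is a hypothetical stand-in, assumed to return
# (accuracy, num_rejections) for a given delta.

def best_delta(deltas, classify_with_delta, target_rejections=None):
    """Return the delta that maximizes accuracy, or, if target_rejections
    is given, the delta whose rejection count is closest to that target."""
    results = [(d, *classify_with_delta(d)) for d in deltas]   # (delta, accuracy, rejections)
    if target_rejections is None:
        return max(results, key=lambda r: r[1])[0]             # highest accuracy
    return min(results, key=lambda r: abs(r[2] - target_rejections))[0]

# Example: sweep 50 evenly spaced delta values in [0, 1].
# deltas = [i / 50 for i in range(51)]
# d_star = best_delta(deltas, classify_with_delta)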

Updated Sort-Based Classification

This is exactly the same technique I posted the other day, which uses sorting to achieve an approximation of the nearest neighbor method. Clusters are then generated by beginning at a given index in the sorted list and including rows to the left and right of that index, iterating through increasing boundary indexes, which simulates expanding the radius of a sphere (i.e., as you include more rows to the left and right of a given index, you are effectively including all rows within a sphere of increasing radius centered on the origin / index). The only substantive change in this case is the use of volume in calculating confidence (the other changes were corrections). Specifically, uncertainty increases as a function of volume, for the simple reason that a cluster with few rows spread over a large volume is more uncertain than a cluster with a large number of rows packed into a small volume. In my original confidence-based classification algorithm, the spheres are all of fixed size, so you could argue volume didn't matter in that case (especially since it works). I've also come up with yet another method, which makes use of a combinatorial measure of uncertainty mentioned in Footnote 14 of Information, Knowledge, and Uncertainty, and which should also work for variable-sized clusters; I should be done with that soon.
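Here's an illustrative Python sketch of the expansion step and a volume-discounted confidence score; the norm-based sort and the bounding-box volume proxy are my own simplifications for illustration, not the exact formulas used in the actual code.

# Illustrative sketch (not the author's exact code): build a cluster by
# expanding left/right around an index in a norm-sorted list, and score it
# with a confidence that is discounted by a simple volume proxy.
import numpy as np

def expand_cluster(X, idx, k):
    """Return the row indices within k positions of idx in the norm-sorted
    order, simulating a sphere of increasing radius around X[idx]."""
    order = np.argsort(np.linalg.norm(X, axis=1))      # sort rows by their norms
    pos = int(np.where(order == idx)[0][0])            # position of idx in the sorted list
    lo, hi = max(0, pos - k), min(len(order), pos + k + 1)
    return order[lo:hi]

def volume_confidence(X, members):
    """Toy confidence: cluster size divided by a bounding-box volume proxy,
    so that (few rows, large volume) scores lower than (many rows, small volume)."""
    ranges = X[members].max(axis=0) - X[members].min(axis=0)
    volume = np.prod(ranges + 1e-9)                    # avoid zero volume for tiny clusters
    return len(members) / volume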

This particular algorithm is incredibly fast, classifying 4,500 testing rows against 25,500 training rows in about two seconds, with accuracy that often peaks at close to perfect (i.e., accuracy increases as a function of confidence). The table below shows accuracies and runtimes for this method, averaged over 100 iterations. All of the testing percentages are fixed at 15% (e.g., 30,000 rows implies 4,500 Testing Rows and 25,500 Training Rows). The Raw Accuracy is the accuracy of the method on its own, with no confidence filtering, which is almost always terrible. The Max Accuracies are given for both the information-based confidence metric, which diminishes as a function of volume, and the size-based confidence metric, which does not, and simply looks to the number of elements in each cluster (i.e., the more cluster elements, the higher the confidence). Confidence filtering is achieved by simply ignoring all predictions that carry a confidence of less than x, which generally causes accuracy to increase as a function of x (a sketch of this filtering step follows the table). All of the max accuracies are excellent, except UCI Sonar, which is a dataset that's given me problems in the past. I plan to address UCI Sonar separately with an analog to my Supervised Delta Algorithm, which will also make use of sorting.

Dataset          Raw Acc.   Max Acc. (Inf.)   Max Acc. (Size)   Runtime (Seconds)   No. Rows
UCI Credit       38.07%     78.30%            100.0%            3.290               30,000
UCI Abalone      17.37%     83.50%            71.30%            0.322               4,177
UCI Spam         40.83%     98.61%            100.0%            0.469               4,601
UCI Ion          53.00%     94.30%            100.0%            0.057               351
UCI Iris         32.50%     100.0%            86.20%            0.024               150
UCI Parkinsons   47.20%     95.60%            100.0%            0.035               195
UCI Wine         37.50%     94.20%            100.0%            0.029               178
UCI Sonar        20.16%     58.11%            34.80%            0.048               208
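For concreteness, here's a small Python sketch of the confidence-filtering step described above: predictions with confidence below a threshold x are ignored, and accuracy is reported on the surviving predictions as x increases. The inputs are assumed to be arrays of predictions, true labels, and per-prediction confidences.

# Sketch of confidence filtering: ignore all predictions whose confidence
# is below a threshold x, and report accuracy on the survivors.
import numpy as np

def accuracy_vs_confidence(preds, labels, confidences, thresholds):
    out = []
    for x in thresholds:
        keep = confidences >= x
        acc = np.mean(preds[keep] == labels[keep]) if keep.any() else np.nan
        out.append((x, acc, int(keep.sum())))          # (threshold, accuracy, rows kept)
    return out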

There’s more to come, with simple variations on this overall method that just reapply my core work using a sorted list. Once that’s complete, I’ll write something more formal about these methods.

Here’s the code; you can find the normalization algorithm a few posts below. The Command Line is set up as a function (MASSTempFunction) for convenience, so just change the dataset path in that file, and you’re good to go.

 

Chaos and Information

Chaos is a word that gets thrown around a lot, and while Chaos Theory is at this point probably a serious branch of mathematics, I know nothing about it. However, it dawned on me that you can think about chaos in terms of information, using Cellular Automata as the intuition. Specifically, consider the initial conditions of a binary automaton, which together form a single vector of N binary digits. Then you have the rule, R, which takes those initial conditions and mechanistically generates the rows that follow. Let X be a particular set of initial conditions (i.e., a particular binary vector), so that M = R(X,K) is the result of applying R to the initial conditions X for some fixed number of K iterations. Now change exactly one bit of X, producing \bar{X}, which in turn produces \bar{M} = R(\bar{X},K). Now count the number of unequal bits between \bar{M} and M, which must be at least one, since we changed exactly one bit of X to produce \bar{X} (treating the initial conditions as the first row of M). Let a be the number of unequal bits between \bar{M} and M, and let b be the number of bits changed in X to produce \bar{X}. We can measure how chaotic R is, in bits, as the ratio,

\Theta = \frac{a}{b}.

Said in words, \Theta is the ratio of the total number of output bits that change to the number of bits changed in the initial conditions.
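Here's a small Python sketch that computes \Theta for an elementary cellular automaton (Rule 110, chosen arbitrarily), flipping a single bit of the initial conditions; the periodic boundary condition is an assumption for illustration.

# Illustrative computation of Theta = a / b for an elementary cellular
# automaton, flipping one bit of the initial conditions and counting how
# many output bits change.
import numpy as np

def run_ca(x, rule, k):
    """Apply an elementary CA rule to binary vector x for k steps,
    returning all k+1 rows (initial conditions included)."""
    table = [(rule >> i) & 1 for i in range(8)]        # rule lookup table
    rows = [x.copy()]
    for _ in range(k):
        left, right = np.roll(x, 1), np.roll(x, -1)    # periodic boundary
        x = np.array([table[4 * l + 2 * c + r] for l, c, r in zip(left, x, right)])
        rows.append(x)
    return np.array(rows)

def theta(x, rule, k, flip_index=0):
    x_bar = x.copy()
    x_bar[flip_index] ^= 1                             # b = 1 bit changed in the initial conditions
    m, m_bar = run_ca(x, rule, k), run_ca(x_bar, rule, k)
    a = int((m != m_bar).sum())                        # unequal bits between M and M-bar
    b = 1
    return a / b

# x0 = np.random.randint(0, 2, size=64)
# print(theta(x0, rule=110, k=64))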

Using Sorting in Clustering

I use sorting all the time in my algorithms to simulate running nearest neighbor (see Section 2.4 of Vectorized Deep Learning), and what just dawned on me is that I actually proved formally that a list is sorted if and only if its adjacent entries have the minimum possible distance (see Theorem 2.1 of Sorting, Information, and Recursion). This implies that the resultant sorted list provides you with the nearest neighbors of each element in the list. This in turn allows for a trivial adaptation of my core algorithms, where rather than take the norm of the distance between a given vector and all others, you simply take the norm of the difference between a given vector and its neighbors in the order in which they’re sorted in the list. The advantage in that case is that if you’re not running the algorithms truly in parallel (which is the case on consumer devices when you have too many rows), then you’re only performing one operation per comparison. Attached is an example using my supervised clustering algorithm, which increases the radius of a sphere until it hits its first error, which in this case means simply increasing the index into the sorted list until you encounter a classifier that is unequal to the classifier in question (i.e., that of the origin of the sphere). This produces really fast runtimes, running in about 10 seconds given 100,000 rows with 15 columns. This is pretty serious stuff, and will be included in the Massive Version of Black Tree AutoML, for just $999. A mutually exclusive version (i.e., non-intersecting clusters) would typically produce even faster runtimes, since the size of the effective dataset can shrink with each iteration.
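Here's a rough Python sketch of that first-error expansion over a sorted list; it is only an illustration, with a norm-based sort standing in for the actual sorting step, and it is not the attached code.

# Sketch of the sorted-list version of the supervised clustering step:
# sort rows by norm, then expand left and right from a given row's position
# until the first row with a different class label is encountered.
import numpy as np

def sorted_first_error_cluster(X, y, idx):
    """Expand left and right from idx's position in the norm-sorted list,
    stopping as soon as a row with a different class label is encountered."""
    order = np.argsort(np.linalg.norm(X, axis=1))      # approximate nearest-neighbor order
    pos = int(np.where(order == idx)[0][0])
    label = y[idx]
    members = [idx]
    left, right = pos - 1, pos + 1
    while left >= 0 or right < len(order):
        if left >= 0 and y[order[left]] != label:
            break                                      # first error: stop expanding
        if right < len(order) and y[order[right]] != label:
            break
        if left >= 0:
            members.append(order[left])
        if right < len(order):
            members.append(order[right])
        left, right = left - 1, right + 1
    return members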

For a testing dataset, you could simply combine the training and testing datasets, store the positions of the testing rows, and then go out some radius from each testing row by checking the classifiers of the rows to the left and right of that testing row. Applying the attached approach (i.e., first error), you would proceed until you encountered more than one class. You could instead proceed by no more than some fixed distance, or some fixed number of entries. You could report the modal class, or simply report the entire cluster of classes as a prediction. This will be extremely fast, since you’re operating only on the testing rows and the adjacent training rows, rather than the entire training dataset (save for the sorting step). I’ve attached code that implements this method, which seems to work really well, though more testing is required. I’ve included a basic confidence metric that also seems to work, in that accuracy increases as a function of confidence. This code is applied to the MNIST Fashion Dataset, and makes use of image preprocessing algorithms you can find in my A.I. Library on ResearchGate, but you can also simply ignore the preprocessing, as everything past the heading “Runs Prediction” is generalized and requires only a dataset.
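Here's an illustrative Python sketch of that prediction step, using a fixed number of adjacent entries, the modal class, and the modal share as a simple confidence score; the window width and the norm-based sort are assumptions for illustration, not the attached code.

# Sketch of prediction via the sorted combined dataset: classify each
# testing row from the training rows adjacent to it in the sorted order,
# reporting the modal class and its share as a confidence score.
import numpy as np
from collections import Counter

def predict_from_sorted(X_train, y_train, X_test, width=5):
    X_all = np.vstack([X_train, X_test])
    is_test = np.arange(len(X_all)) >= len(X_train)
    order = np.argsort(np.linalg.norm(X_all, axis=1))  # sort the combined dataset
    pos_of = np.empty(len(X_all), dtype=int)
    pos_of[order] = np.arange(len(X_all))
    preds, confs = [], []
    for i in range(len(X_train), len(X_all)):          # each testing row
        p = pos_of[i]
        lo, hi = max(0, p - width), min(len(order), p + width + 1)
        neighbors = [j for j in order[lo:hi] if not is_test[j]]  # adjacent training rows only
        if not neighbors:                              # no training rows in the window
            preds.append(None)
            confs.append(0.0)
            continue
        label, count = Counter(y_train[j] for j in neighbors).most_common(1)[0]
        preds.append(label)
        confs.append(count / len(neighbors))           # share of the modal class
    return np.array(preds), np.array(confs)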

Here is a plot of accuracy as a function of confidence over the MNIST Fashion Dataset: