Updated Sort-Based Classification

This is exactly the same technique I posted the other day, which uses sorting to approximate the nearest neighbor method. Clusters are generated by beginning at a given index in the sorted list and including rows to the left and right of that index, iterating through increasing boundary indexes. This simulates expanding the radius of a sphere: as you include more rows to the left and right of a given index, you are effectively including all rows within a sphere of increasing radius about the origin index. The only substantive change in this case is the use of volume in calculating confidence (the other changes were corrections). Specifically, uncertainty increases as a function of volume, for the simple reason that a cluster containing few rows spread over a large volume carries more uncertainty than a large number of rows contained in a small volume. In my original confidence-based classification algorithm, the spheres are all of fixed size, so you could argue volume didn't matter in that case (especially since that algorithm works). I've also come up with yet another method, which makes use of a combinatorial measure of uncertainty that I mentioned in Footnote 14 of Information, Knowledge, and Uncertainty; it should likewise work for variable-sized clusters, and I should be done with it soon.
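The idea above can be sketched in a few lines of Python. This is not the author's implementation; it is a minimal illustration assuming rows are sorted by their Euclidean norm (one possible 1-D sort key), a cluster is grown symmetrically about the test row's insertion point, and confidence is penalized by the volume of the cluster's bounding box. All function and variable names are illustrative.

```python
import numpy as np

def sort_based_cluster(train_X, train_y, test_row, k):
    # Sort the training rows by their distance from the origin,
    # which serves as the 1-D sort key (an assumption for this sketch).
    norms = np.linalg.norm(train_X, axis=1)
    order = np.argsort(norms)
    sorted_X, sorted_y = train_X[order], train_y[order]
    sorted_norms = norms[order]

    # Locate the test row's position in the sorted list.
    idx = int(np.searchsorted(sorted_norms, np.linalg.norm(test_row)))
    idx = min(idx, len(sorted_norms) - 1)

    # Include k rows to the left and right of the index, approximating
    # all rows within a sphere of increasing radius about the test row.
    lo, hi = max(0, idx - k), min(len(sorted_y), idx + k + 1)
    cluster_X, cluster_y = sorted_X[lo:hi], sorted_y[lo:hi]

    # Predict the modal label of the cluster.
    labels, counts = np.unique(cluster_y, return_counts=True)
    prediction = labels[np.argmax(counts)]

    # Bounding-box volume of the cluster: few rows in a large volume
    # should produce low confidence, so confidence shrinks with volume.
    extent = cluster_X.max(axis=0) - cluster_X.min(axis=0)
    volume = float(np.prod(extent))
    confidence = len(cluster_y) / (1.0 + volume)
    return prediction, confidence
```

Because the expensive step is a single sort plus a binary search per test row, the whole process is nearly linear in the number of rows, which is consistent with the runtimes reported below.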

This particular algorithm is incredibly fast, classifying 4,500 testing rows against 25,500 training rows in about two seconds, with accuracy that often peaks at close to perfect (i.e., accuracy increases as a function of confidence). The table below shows accuracies and runtimes for this method, averaged over 100 iterations. The testing percentage is fixed at 15% in all cases (e.g., 30,000 rows implies 4,500 testing rows and 25,500 training rows). The Raw Accuracy is the accuracy of the method on its own, with no confidence filtering, which is almost always terrible. The Max Accuracies are given for both the information-based confidence metric, which diminishes as a function of volume, and the size-based confidence metric, which does not, and simply looks to the number of elements in each cluster (i.e., the more cluster elements, the higher the confidence). Confidence filtering is achieved by simply ignoring all predictions that carry a confidence of less than x, which generally causes accuracy to increase as a function of x. All of the max accuracies are excellent, except on UCI Sonar, a dataset that has given me problems in the past. I plan to address UCI Sonar separately with an analog of my Supervised Delta Algorithm that will also make use of sorting.
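The confidence-filtering step described above is simple enough to state directly: discard every prediction whose confidence falls below a threshold x, and measure accuracy on whatever survives. A minimal sketch (names illustrative, not from the original code):

```python
import numpy as np

def filtered_accuracy(predictions, confidences, true_labels, x):
    predictions = np.asarray(predictions)
    confidences = np.asarray(confidences)
    true_labels = np.asarray(true_labels)

    # Keep only predictions with confidence of at least x.
    keep = confidences >= x
    if not keep.any():
        return None, 0  # nothing survives the filter at this threshold

    # Accuracy over the surviving predictions, plus how many survived.
    acc = float(np.mean(predictions[keep] == true_labels[keep]))
    return acc, int(keep.sum())
```

Sweeping x upward trades coverage for accuracy: fewer predictions survive, but those that do are the high-confidence ones, which is why accuracy generally increases as a function of x.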

| Dataset | Raw Acc. | Max Acc. (Inf.) | Max Acc. (Size) | Runtime (Seconds) | No. Rows |
|---|---|---|---|---|---|
| UCI Credit | 38.07% | 78.30% | 100.0% | 3.290 | 30,000 |
| UCI Abalone | 17.37% | 83.50% | 71.30% | 0.322 | 4,177 |
| UCI Spam | 40.83% | 98.61% | 100.0% | 0.469 | 4,601 |
| UCI Ion | 53.00% | 94.30% | 100.0% | 0.057 | 351 |
| UCI Iris | 32.50% | 100.0% | 86.20% | 0.024 | 150 |
| UCI Parkinsons | 47.20% | 95.60% | 100.0% | 0.035 | 195 |
| UCI Wine | 37.50% | 94.20% | 100.0% | 0.029 | 178 |
| UCI Sonar | 20.16% | 58.11% | 34.80% | 0.048 | 208 |

There’s more to come, with simple variations on this overall method that reapply my core work using a sorted list. Once that’s complete, I’ll write something more formal about these methods.

Here’s the code; you can find the normalization algorithm a few posts below. The command line is set up as a function (MASSTempFunction) for convenience, so just change the dataset path in that file and you’re good to go.
