A Note on Normalization and Averaging

I’m working on Black Tree Massive, and was trying to improve my Supervised Classification Algorithm. It turns out the accuracies are much better than I initially reported, but that’s not why I’m writing. I’m writing because I noticed something remarkable in the process: in some cases, my normalization algorithm causes the average value of every column / dimension of a dataset to be exactly the same. At first this caused me to think that something was wrong with my command line code, or with my normalization algorithm itself, but neither is the case. It instead makes perfect sense upon reflection, because my normalization algorithm forces all dimensions to have exactly the same number of digits. As a consequence, if all dimensions are drawn from the same underlying distribution, they will produce the same average value after normalization, and this is exactly what happens, in at least some cases.
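
For readers who want to experiment, here is a minimal sketch, in Python, of one way to read the “same number of digits” idea: rescale each column by a power of ten so its largest absolute value lands on the same scale, then round to a fixed number of decimal places. The function name and the three-decimal rounding are assumptions for illustration, not the actual Black Tree Massive normalization.

```python
import numpy as np

def normalize_digits(X, decimals=3):
    """Sketch of a 'same number of digits' normalization (an assumption,
    not the actual Black Tree Massive code): scale each column by a power
    of ten so its largest absolute value falls in (0.1, 1], then round to
    a fixed number of decimal places."""
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        m = np.max(np.abs(col))
        if m == 0:
            out[:, j] = 0.0
            continue
        scale = 10.0 ** np.ceil(np.log10(m))  # power of ten covering the column
        out[:, j] = np.round(col / scale, decimals)
    return out
```

Under that reading, two columns drawn from the same distribution end up on the same scale with the same precision, so their averages coincide, which is the effect described above.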

This is rather remarkable, since the different dimensions of a dataset typically report different aspects of a system, yet in practice the underlying distribution appears to be, at least at times, the same. Specifically, for the UCI Parkinson’s Dataset the average value is, astonishingly, .1776 for every dimension except one, for which it is -.1776. This at first had me convinced that I had made a mistake, since I have no idea what Independence Day has to do with Parkinson’s Disease, but that’s probably not what’s driving the behavior: it is instead likely due to each dimension of the dataset being drawn from the same underlying distribution, which is astonishing in its own right. Ironically, the proposed improvement that led to this discovery, which was theoretically quite elegant (specifically, sort the dataset on the dimension with the largest average value, using descending averages as tie-breakers), turned out to be terrible, and did not improve performance, at least not consistently. But discovering things almost invariably (in my experience) involves being wrong.

I’ve updated the actual classification algorithm, which ended up theoretically elegant, reasonably accurate, and incredibly fast, classifying 4,500 testing rows against 25,500 training rows in an average of three seconds. The tie-breakers for the sorting step are ordered according to the total difference between adjacent terms in each dimension, once that dimension is sorted. In other words, you sort each dimension individually, take the absolute difference |a_i - a_{i+1}| between adjacent entries for all i, and sum those differences, doing exactly that for each dimension. The greater the total difference for a given dimension, the higher its priority as a sorting tie-breaker, the reason being that you want to get as close as possible to sorting the vectors into the order of their nearest neighbors (see Theorem 2.1). This is nonetheless not true sorting, and I’m doing exactly this for reasons of efficiency, since actually sorting is much slower than this approximation. All the code you need to run this classification algorithm is attached.
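
Since the attached code isn’t reproduced in the post itself, here is a minimal sketch of the tie-breaker ordering as described above: score each dimension by the sum of absolute adjacent differences once that dimension is sorted, then sort the rows using the columns in descending order of that score. The function names and the final lexicographic sort are one plausible reading, not necessarily the actual Black Tree Massive implementation.

```python
import numpy as np

def adjacent_difference_scores(X):
    """For each column, sort its values and sum |a_i - a_{i+1}| over all
    adjacent pairs. A larger total gives the column higher priority as a
    sorting tie-breaker."""
    S = np.sort(X, axis=0)                       # sort each dimension individually
    return np.sum(np.abs(np.diff(S, axis=0)), axis=0)

def sort_rows_by_priority(X):
    """Hypothetical sketch: order the columns by descending adjacent-difference
    score, then sort the rows lexicographically using that column order, as an
    approximation to arranging vectors near their nearest neighbors."""
    scores = adjacent_difference_scores(X)
    priority = np.argsort(-scores)               # highest-scoring dimension first
    # np.lexsort treats the last key as the primary key, so reverse the priority
    order = np.lexsort(tuple(X[:, j] for j in priority[::-1]))
    return X[order], priority
```

Calling sort_rows_by_priority(training_data) would then return the rows arranged by the highest-scoring dimension, with the remaining dimensions breaking ties in order of decreasing score.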

Here’s a screen shot of the command line code that produces the average values (not attached, it’s quite simple):
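
The screenshot doesn’t survive in this copy of the post, so here is a rough stand-in for that computation: print the column-wise average of the normalized dataset. The file name is a placeholder, and the output of roughly .1776 per column is what the post reports for the normalized Parkinson’s data.

```python
import numpy as np

# Rough stand-in for the command line in the missing screenshot: print the
# average value of each column of an already-normalized dataset. The file
# name is a placeholder, not the actual path used in the post.
X_norm = np.loadtxt("parkinsons_normalized.csv", delimiter=",")
print(np.round(X_norm.mean(axis=0), 4))
```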

Here’s the code:
