How I Think About Machine Learning

I’m obviously working on the Pro Version of my AutoML software, Black Tree AutoML, and most of the work is simply restating what I’ve done in a manner that interacts with the GUI I’ve developed (pictured below). This in turn causes me to reconsider work I’ve already done, and so I thought it worthwhile to present again my basic views on Machine Learning, nearly six years after I started working on A.I. and Physics full-time, as they’ve plainly evolved quite a bit, and have been distilled into what is, as far as I know, humanity’s fastest Deep Learning software.

Basic Principles

In a series of Lemmas (see, Analyzing Dataset Consistency [1]), I proved that under certain reasonable assumptions, you can classify and cluster datasets with literally perfect accuracy (see Lemma 1.1). Of course, real world datasets don’t perfectly conform to the assumptions, but my work nonetheless shows, that worst-case polynomial runtime algorithms can produce astonishingly high accuracies:

DatasetAccuracyTotal Runtime (Seconds)
UCI Credit (2,500 rows)98.46%217.23
UCI Ionosphere (351 rows)100.0%0.7551
UCI Iris (150 rows)100.0%0.2379
UCI Parksinsons (195 rows)100.0%0.3798
UCI Sonar (208 rows)100.0%0.5131
UCI Wine (178 rows)100.0%0.3100
UCI Abalone (2,500 rows)100.0%10.597
MNIST Fashion (2,500 images)97.76%25.562
MRI Dataset (1,879 images)98.425%36.754
Table 1. The results above were generated using Black Tree’s “Supervised Delta Classification” algorithm, running on an iMac 3.2 GHz Core i5, and all datasets can be downloaded through blacktreeautoml.com, together with a Free Version of Black Tree capable of producing these results on any Mac with OS 10.10 or later.

This informs my work generally, which seeks to make maximum use of data compression, and parallel computing, taking worst-case polynomial runtime algorithms, producing, at times, best-case constant runtime algorithms, that also, at times, run on a small fraction of the input data. The net result is astonishingly accurate and efficient Deep Learning software, that is so simple and universal, it can run in a point-and-click GUI:

The GUI for the Free Version of Black Tree AutoML, available at, blacktreeautoml.com

Even when running on consumer devices, Black Tree’s runtimes are simply incomparable to typical Deep Learning techniques, such as Neural Networks, and the charts below show the runtimes (in seconds) of Black Tree’s fully vectorized “Delta Clustering” algorithm, running on a MacBook Air 1.3 GHz Intel Core i5, as a function of the number of rows, given datasets with 10 columns (left) and 15 columns (right), respectively. In the worst case (i.e., with no parallel computing), Black Tree’s algorithms are all polynomial in runtime as a function of the number of rows and columns.

Data Clustering

Spherical Clustering

Almost all of my classification algorithms first make use of clustering, and so I’ll spend some time describing that process. My basic clustering method is in essence really simple:

1. Fix a radius of delta (I’ll describe the calculation below);

2. Then, for each row of the dataset (treated as a vector), find all other rows (again, treated as vectors), that are within delta of that row (i.e., contained in the sphere of radius delta, with an origin of the vector in question). This will generate a spherical cluster, for each row of the dataset, and therefore, a distribution of classes within each such cluster.

The iterative version of this method has a linear runtime as a function of (M - 1) \times N, where M is the number of rows and N is the number of columns (note, we simply take the norm of the difference between a given vector, and all other vectors in the dataset). The fully vectorized version of this algorithm has a constant runtime, because all rows are independent of each other, and all columns are independent of each other, and you can, therefore, take the norm of the difference between a given row and all other rows, simultaneously. As a result, the parallel runtime is constant.

Calculating Delta

The simplest clustering method is the supervised calculation of delta: you simply increase delta, some fixed number of times, using a fixed increment, until you encounter your first error, which is defined by the cluster in question containing a vector that is not of the same class as the origin vector. This will of course produce clusters that contain a single class of data, though it could be the case, that you have a cluster of one for a given row (i.e., the cluster contains only the origin). This requires running the Spherical Clustering algorithm above, some fixed number of K times, and then searching in order through the K results for the first error. And so for a given row, the complexity of this process is, in parallel, O(K \times C + K) = O(C), again producing a constant runtime. Because all rows are independent of each other, we can run this process on each row simultaneously, and so you can cluster an entire dataset in constant time.

What’s notable, is that even consumer devices have some capacity for parallel computing, and languages like Matlab and Octave, are apparently capable of utilizing this, producing the astonishing runtimes above. These same processes run on a truly parallel machine would reduce Machine Learning to something that can be accomplished in constant time, as a general matter.

Supervised Classification

Now posit a Testing Dataset, and assume we’ve first run the process above on a Training Dataset, thereby calculating a supervised value of delta. My simplest prediction method first applies the Nearest Neighbor method to every row of the Testing Dataset, returning the nearest neighbor of a given vector in the Testing Dataset, from the Training Dataset. That is, for an input vector A in the Testing Dataset, the Nearest Neighbor method will return B in the Training Dataset, for which z = norm(A - B) is minimum over the Training Dataset (i.e., B is the Training Dataset vector that is closest to A).

The next step asks whether z \leq delta. If that is the case, then the vector A would have been included in the cluster for vector B, and because the clustering algorithm is supervised, I assume that the classifiers of A and B are the same. If z > delta, then that prediction is “rejected”, and is treated as unknown, and therefore not considered for purposes of calculating accuracy. This results in a prediction method that can determine ex ante, whether or not a prediction should be treated as “good” or “bad”.

I proved in [1], that if a dataset is what I call “locally consistent”, then this algorithm will produce perfect accuracy (see Lemma 2.3 of [1]). In practice, this is in fact the case, and as you can see, this algorithm produces perfect or nearly perfect accuracies for all of the datasets in Table 1 above.

Confidence-Based Prediction

Modal Prediction

In the example above, we first clustered the dataset, and then used the resultant value of delta to answer the question, of whether or not a prediction should be “rejected”, and therefore treated as an unknown value. We could instead look at the contents of the cluster in question, for information regarding the hidden classifier of the input vector. In the case of the supervised clustering algorithm above, the class of all of the vectors included in the cluster for B are the same, since the algorithm terminates upon discovering an error, thereby generating clusters that contain vectors of internally consistent classes. We could instead use an unsupervised process, that utilizes a fixed, sufficiently large value of delta, to ensure that as a general matter, clusters are large enough to generate a representative distribution of the classes near a given vector. We could then look to the modal class (i.e., the most frequent class), in a cluster, as the predicted class for any input vector that would have been included in that cluster.

This is the process utilized in my Confidence-Based Prediction algorithm, though there is an additional step that calculates the value of Confidence for a given prediction, which requires a bit of review of very basic information theory. Ultimately, this process allows us to place a given prediction on a continuous [0,1] scale of Confidence, unlike the method above, which is binary (i.e., predictions are either accepted or rejected). This in turn allows a user, not the algorithm, to say what predictions they would like to accept or reject.

Calculating Confidence

In another paper (see, Information, Knowledge, and Uncertainty [2]), I introduced a fundamental equation of epistemology, that connects Information (I), Knowledge (K), and Uncertainty (U), using the following equation:

I = K + U.

This simply must be the case, since all information about a given system is either known to you (your Knowledge), or unknown to you (your Uncertainty). It turns out that using the work of Claude Shannon, we can calculate U exactly, which he proved to be proportional to the Shannon Entropy of the distribution of states of a system (see Section 6 of, A Mathematical Theory of Communication [3]). Using my work in information theory, you can calculate I, which is given by the logarithm of the number of states of a system (see Section 2 of [2]). This in turn allows us to calculate Knowledge as the difference between these two quantities,

K = I - U.

I’ve shown that we can apply this equation to Euclidean vectors that contain integer values, in particular, an observed distribution of classes within a cluster (i.e., a vector of class labels) (see Section 3 of [2]). Since the modal prediction process above maps each row of a dataset to a cluster, we can then consider the related vector of classes (i.e., the classes of each row in the cluster), and therefore, assign a value of Knowledge to each row of a dataset.

Probabilities of course exist over the [0,1] interval of the Reals, whereas the formula for Knowledge above is unbounded over the Reals. We can, however, normalize the values of Knowledge for a given dataset, by simply dividing by the maximum value of Knowledge over the rows of the dataset, which will of course, normalize these values of Knowledge to the [0,1] interval.

I’ve also shown that modal prediction is more reliable as a function of increasing Knowledge and increasing modal probability (see Graph 1 below). That is, as both the modal probability of a given prediction, and the Knowledge associated with the underlying vector of classes, increase, the accuracy of the prediction also increases. This in turn implies the following equation for Confidence,

C = Minimum\{K,P\},

Where P is the modal probability of a prediction, and K is the normalized value of Knowledge of the prediction.

We can then plot accuracy, as a function of C, and as noted, I’ve shown that, as a general matter, accuracy increases as a function of Confidence.

Accuracy as a function of Confidence (not normalized, the x-axis shows the iteration number, with Confidence increasing each iteration) for the UCI Credit Dataset.
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s