Cluster-Based Classifier Labels

February 26, 2022 / erdosfan

This algorithm first clusters the entire dataset, generating mutually exclusive clusters. It then adds an additional classifier to each row. This classifier is not hidden; it is generated by the algorithm itself, and so the algorithm is unsupervised overall. Specifically, if row i is placed in the cluster for row j, then the additional classifier for row i is the number j. Because the clusters are mutually exclusive, each row receives exactly one additional classifier.
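As a rough illustration of the labeling step, here is a minimal Python sketch. It is not the downloadable code, and the clustering rule is an assumption made purely for the example: rows are scanned in order, each still-unlabeled row j becomes the anchor of a new cluster, and every unlabeled row within Euclidean distance `delta` of row j receives the label j. The function name `exclusive_cluster_labels` and the parameter `delta` are hypothetical.

```python
import math

def exclusive_cluster_labels(rows, delta):
    """Greedy, mutually exclusive clustering (an assumed stand-in scheme).

    Returns a list where labels[i] == j means row i was placed in the
    cluster anchored at row j.
    """
    labels = [None] * len(rows)
    for j, anchor in enumerate(rows):
        if labels[j] is not None:
            continue  # row j already belongs to an earlier cluster
        for i, row in enumerate(rows):
            # Only unlabeled rows can join, so clusters never overlap.
            if labels[i] is None and math.dist(row, anchor) <= delta:
                labels[i] = j
    return labels

# Toy data: two well-separated groups of points.
rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels = exclusive_cluster_labels(rows, delta=0.5)
print(labels)
```

Since a row can only join a cluster while it is still unlabeled, every row ends up with exactly one label, which is the property the paragraph above relies on.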

The idea is that you assign labels to rows using the actual structure of the dataset. That is, these labels are defined by the clusters to which each row belongs, and nothing else. No human error is possible, because the algorithm clusters the dataset on its own, on an unsupervised basis, which in turn defines the labels.

Then prediction is run: a new cluster is pulled for each row of the dataset. Unlike the first round, these clusters are not mutually exclusive. The modal additional class label within a row’s cluster is treated as the best prediction for that row. So, e.g., if the modal additional class for row i is j (i.e., among the rows in the cluster for row i, the most frequent additional classifier is j), then the predicted class for row i is the class for row j. In this second round of clustering, rows are never contained in their own clusters, and so the algorithm is totally unsupervised.
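The prediction step can be sketched along the same lines. Again, this is only an illustration under stated assumptions: the second-round cluster for row i is taken to be every other row within Euclidean distance `delta` of row i, ties for the modal label go to whichever label is seen first, and a row with an empty cluster simply gets `None`. The function name `predict_from_modal_label` is hypothetical, and `labels` is a hard-coded example of first-round labels.

```python
import math
from collections import Counter

def predict_from_modal_label(rows, labels, classes, delta):
    """Predict each row's class from the modal first-round label in its cluster.

    labels[k] is the additional classifier of row k (an anchor row index),
    and classes[j] is the class of row j.
    """
    predictions = []
    for i, row in enumerate(rows):
        # The cluster for row i overlaps freely with others, but never holds row i.
        cluster = [k for k, other in enumerate(rows)
                   if k != i and math.dist(row, other) <= delta]
        if not cluster:
            predictions.append(None)  # assumption: no cluster, no prediction
            continue
        modal_label, _ = Counter(labels[k] for k in cluster).most_common(1)[0]
        predictions.append(classes[modal_label])
    return predictions

# Toy data with example first-round labels and known classes.
rows = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
labels = [0, 0, 2, 2, 0]
classes = ["a", "a", "b", "b", "a"]
predictions = predict_from_modal_label(rows, labels, classes, delta=0.5)
print(predictions)
```

Note that each row is handled in its own pass of a plain loop, so this sketch is not vectorized either.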

The accuracy is excellent, and apparently perfect for the following datasets:

UCI Credit

UCI Ionosphere

UCI Iris

UCI Parkinsons

UCI Sonar

UCI Wine

You can download all of these datasets through the Black Tree website.

Like many of my algorithms, this one flags bad predictions ex ante and “rejects” them, though this is of course also unsupervised, and you can read about the process generally in my paper, Vectorized Deep Learning. Finally, this algorithm is much slower than my core algorithms, which take small fractions of a second per row of data to run. This is because the process is not vectorized, and cannot be fully vectorized: the clusters are not mutually exclusive, which means there has to be an order in which clusters are assembled. Any code that isn’t below can be found in my previous post on the topic.

cluster-based-labels Download

iterative_clustering_unsup Download
