Skin Cancer Classification

I’m going to write something formal over the coming week, but in the meantime, here are the runtime and accuracy results for the methods introduced in a prior article on medical imaging classification using my software, applied to a skin cancer dataset from Harvard.

Summary of Results

Original Dataset Size: 7470 RGB images of various dimensions;

Compressed Dataset Size: 7470 x 198;

Preprocessing time: 37.9 seconds;

Supervision Training time: 55.9 seconds;

Prediction time: 13 seconds, on average (run 25 times);

Prediction accuracy:

Worst case, 85.542% (no supervision);

Best case, 95.8337% (highest level supervision, rejecting all but 62 rows).

Bottom line: Reliable diagnosis for over 7,000 patients, on a home computer, in about 2 minutes.
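To make the figures above easier to read, here is a minimal Python sketch. It does two things: it sums the three reported timings, and it shows what accuracy "with rejections" means, namely accuracy measured only over the rows the algorithm did not reject. The function name `accuracy_on_accepted`, the use of `None` to mark a rejection, and the toy labels are all assumptions made for illustration. None of this is the attached code.

```python
# Timings reported in the summary above, in seconds.
preprocessing, training, prediction = 37.9, 55.9, 13.0
total_seconds = preprocessing + training + prediction
print(f"{total_seconds:.1f} s = {total_seconds / 60:.2f} min")


def accuracy_on_accepted(predictions, truth):
    """Accuracy over rows that were not rejected.

    A prediction of None marks a rejected row (an assumption of this sketch).
    Returns (accuracy, number_of_accepted_rows).
    """
    kept = [(p, t) for p, t in zip(predictions, truth) if p is not None]
    if not kept:
        return float("nan"), 0
    correct = sum(p == t for p, t in kept)
    return correct / len(kept), len(kept)


# Made-up toy labels, not data from the post: 1 = malignant, 0 = benign.
acc, n_kept = accuracy_on_accepted([1, None, 0, 0, None], [1, 0, 0, 1, 1])
print(acc, n_kept)
```

The trade-off is visible even in the toy case: rejecting more rows leaves fewer predictions to score, so higher accuracy comes at the cost of coverage.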

Summary of Process

The dataset consists of just over 10,000 images of lesions. Each lesion belongs to one of seven classes of lesions, three of which are malignant. The algorithm consolidates all malignant classes into one, and consolidates all benign classes into one. It removes all duplicate images, leaving only one image per patient. All images are then compressed, and fed to a supervised algorithm that finds the minimum and maximum distances over which classification labels are consistent within the dataset. Then prediction is applied using decreasingly sensitive criteria for flagging predictions as outside the scope of the training dataset.
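The steps above can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the attached code. The class names (`m1`, `b1`, and so on) are placeholders. Plain nearest-neighbor lookup with a distance cutoff stands in for the supervised algorithm. The `max_dist` thresholds are invented values, where the actual method learns its distances from the data.

```python
import math

# Placeholder names: the post only says three of the seven classes are malignant.
MALIGNANT = {"m1", "m2", "m3"}


def consolidate(label):
    """Collapse a seven-class label to binary: 1 = malignant, 0 = benign."""
    return 1 if label in MALIGNANT else 0


def one_per_patient(rows):
    """Keep the first (features, binary_label) row seen for each patient id."""
    seen, kept = set(), []
    for patient_id, features, label in rows:
        if patient_id not in seen:
            seen.add(patient_id)
            kept.append((features, consolidate(label)))
    return kept


def predict(train, x, max_dist):
    """Nearest-neighbor prediction with rejection.

    Returns None when the nearest training row is farther than max_dist,
    i.e. the input is flagged as outside the scope of the training data.
    """
    features, label = min(train, key=lambda row: math.dist(row[0], x))
    return label if math.dist(features, x) <= max_dist else None


# Toy rows: (patient id, compressed feature vector, original class).
rows = [
    ("p1", (0.0, 0.0), "b1"),
    ("p1", (0.1, 0.0), "b1"),  # second image of the same patient, dropped
    ("p2", (1.0, 1.0), "m1"),
    ("p3", (0.0, 1.0), "b2"),
]
train = one_per_patient(rows)

# Decreasingly sensitive rejection: larger cutoffs reject fewer inputs.
for max_dist in (0.1, 0.5, 10.0):
    print(max_dist, predict(train, (0.9, 1.1), max_dist))
```

Sweeping the cutoff from strict to loose mirrors the "decreasingly sensitive criteria" described above: the strictest setting rejects the most rows, and the loosest rejects none.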

I’ve also attached a “STAT SUPERVISION” script that can be applied without consolidating classes, and it produces about 80% accuracy (also using rejections). This is the same algorithm I introduced in Section 1.4 of this paper for the “Statistical Spheres” dataset. The only difference here is that the clusters don’t have the same classifier, but the algorithm is exactly the same.

There’s another method called “Isolate Classes”, the code for which is also attached. It doesn’t quite work yet, and I’ll explain it fully in a separate post. This was actually the original approach, which is to isolate a single class, and try to identify which rows are in that class. This works out nicely on parallel machines, because you can run the tests for each class simultaneously, but this is not something you can do on a PC.
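As a rough illustration of the idea only, and not the attached “Isolate Classes” code, isolating a class can be read as relabeling the dataset so that one class is marked 1 and every other class is marked 0. The function name `isolate` and the class names below are assumptions of this sketch.

```python
def isolate(labels, target):
    """Binary relabeling: 1 where the row belongs to `target`, else 0."""
    return [1 if lab == target else 0 for lab in labels]


# Placeholder class names, one label per row of the dataset.
labels = ["c1", "c2", "c1", "c3"]

# One independent binary problem per class.
tasks = {c: isolate(labels, c) for c in sorted(set(labels))}
for name, binary in tasks.items():
    print(name, binary)
```

Each entry in `tasks` depends only on the original labels and not on the other entries, which is why the per-class tests can in principle be run at the same time on a parallel machine.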


