Visualizing Datasets (Algorithm)

As it turns out, my original idea for visualizing datasets in the plane doesn’t quite work if implemented literally, but after some minor tweaks it works great (so far). I’ve included code that displays the result in either the plane or three dimensions, though I think it’s much easier to read in the plane. The code also generates all kinds of data that I haven’t unpacked yet, which can answer questions like: which classes are the most similar in terms of their positions in the original Euclidean space? I’m going to add to the code below to allow for more detailed analysis along these lines, and I’m also going to apply it to more datasets, in particular the MNIST datasets, in the hope that visually similar classes (e.g., ‘1’ and ‘7’) are mapped closer together than those that aren’t (e.g., ‘2’ and ‘6’).
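To make the class-similarity question concrete, here is a minimal Python sketch (not the MATLAB code from this post). It assumes that "most similar" means "smallest Euclidean distance between class average vectors"; the function names `class_means` and `most_similar_classes` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def class_means(points, labels):
    """Return {label: average vector} for each class."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    # zip(*rows) iterates over the columns (dimensions) of a class.
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def most_similar_classes(points, labels):
    """Return the pair of labels whose average vectors are closest.

    Assumption: similarity is measured by Euclidean distance between
    class means in the original space. Needs at least two classes.
    """
    means = class_means(points, labels)
    return min(
        combinations(sorted(means), 2),
        key=lambda pair: math.dist(means[pair[0]], means[pair[1]]),
    )
```

For example, with classes centered near (0, 1), (10, 1), and (1, 1), the first and third would be reported as the most similar pair.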

[Figure: embedding2D]

This is a supervised algorithm (it needs the class labels up front), so I don’t think it has any substantive value for prediction or classification, but it’s useful because it gives you an intuitive sense of how the classes in a dataset are organized, by visualizing the dataset in two or three dimensions.

As an example, above is the output of the algorithm as applied to the UCI Iris Dataset, which has three classes, embedded in the plane. The number of points in the embedding equals the number of points in the dataset, and each class in the embedding has the same number of points as the corresponding underlying class (each class is colored differently). For the classes in the embedding, (a) their relative distances and (b) their spreads are determined by (a) the relative distances between the average vectors of the underlying classes and (b) the standard deviations of the underlying classes, respectively. Note that it is mathematically impossible to preserve the relative distances between points as a general matter, since we are reducing dimensions (e.g., there is no way to embed four mutually equidistant points in the plane). But the gist is that you use Monte Carlo style evaluation to come up with a best-fit embedding in the plane or three-space, as applicable.
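The general idea can be sketched in a few lines of Python. This is not the MATLAB code from this post, only a rough illustration under several assumptions: candidate class centers are drawn uniformly from the unit square (or cube), the fit is scored by the squared mismatch between normalized center distances, each class's spread is summarized as a single root-mean-square distance from its mean, and the `0.1` scaling of that spread is arbitrary. All function names and parameters (`class_stats`, `stress`, `embed`, `trials`, `seed`) are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def class_stats(points, labels):
    """Return {label: (mean_vector, scalar_spread, count)} per class."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    stats = {}
    for label, rows in groups.items():
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        # Assumed spread measure: RMS distance of the class from its mean.
        spread = math.sqrt(sum(math.dist(r, mean) ** 2 for r in rows) / len(rows))
        stats[label] = (mean, spread, len(rows))
    return stats


def stress(centers, target):
    """Squared mismatch between normalized embedded and original distances."""
    emb = {pq: math.dist(centers[pq[0]], centers[pq[1]]) for pq in target}
    scale = max(emb.values(), default=0.0) or 1.0
    return sum((emb[pq] / scale - target[pq]) ** 2 for pq in target)


def embed(points, labels, dim=2, trials=2000, seed=0):
    """Monte Carlo search for class centers, then scatter points around them."""
    rng = random.Random(seed)
    stats = class_stats(points, labels)
    classes = sorted(stats)

    # Original distances between class means, normalized so the largest is 1.
    orig = {
        (a, b): math.dist(stats[a][0], stats[b][0])
        for a, b in combinations(classes, 2)
    }
    top = max(orig.values(), default=0.0) or 1.0
    target = {pq: d / top for pq, d in orig.items()}

    # Keep the random candidate layout with the lowest stress.
    best, best_err = None, math.inf
    for _ in range(trials):
        cand = {c: [rng.random() for _ in range(dim)] for c in classes}
        err = stress(cand, target)
        if err < best_err:
            best, best_err = cand, err

    # Scatter each class around its center; sigma tracks the class spread.
    max_spread = max(s for _, s, _ in stats.values()) or 1.0
    embedded = []
    for c in classes:
        _, spread, count = stats[c]
        sigma = 0.1 * spread / max_spread  # 0.1 is an arbitrary plot scale
        for _ in range(count):
            embedded.append((c, [rng.gauss(m, sigma) for m in best[c]]))
    return embedded, best_err
```

Calling `embed(points, labels, dim=3)` would produce the three-space variant. Pure random search is the simplest possible Monte Carlo approach; it is adequate for a handful of classes, but a dataset with many classes would likely need more trials or a local refinement step.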

MATLAB CODE:
