Visualizing Datasets

I just came up with what I hope is a great algorithm for visualizing datasets in the plane. Datasets typically have dimensions that exceed three-space, often in the hundreds, if not thousands, which makes visualization difficult and likely arbitrary. Because you’re often starting with such a large dimension, you can’t be sure that such an embedding even exists in the plane. For example, just consider four mutually equidistant points:

There’s no such configuration in the plane, though you can of course have one in three dimensions, where the four points form a regular tetrahedron (a triangular pyramid).

This implies that any embedding of a dataset in the plane necessarily carries some error, but that’s fine if what you’re trying to do is get a visual intuition for what your data actually looks like, in terms of how it’s arranged in its native space. The overall idea is to preserve relative distances as closely as possible, knowing that in some cases it’s literally impossible to get them exactly right, which is fine, because, again, this is a visualization tool, not an instrument of measurement.

So here’s the algorithm, which I’ll actually code tonight.

(1) Compute all pairwise distances between the vectors in the dataset.

This step can be fully vectorized using my implementation of the nearest neighbor algorithm.
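
That implementation isn’t reproduced here, but something like the following sketch should do the job in Octave or Matlab (X is just a placeholder name for the dataset matrix, one vector per row; the broadcasting requires Octave or Matlab R2016b+):

% X is the dataset, one vector per row (num_rows by num_cols)
sq_norms = sum(X.^2, 2);                      % squared norm of each row
% ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a*b', applied to all pairs at once
D_sq = sq_norms + sq_norms' - 2 * (X * X');
D = sqrt(max(D_sq, 0));                       % clamp tiny negatives from round-off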

(2) Find the maximum distance in that set, i.e., the largest distance between any two vectors.

(3) Divide all such distances by that maximum, which will generate a set of distances over [0,1].
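
Assuming the distance matrix D from the sketch above, steps (2) and (3) reduce to a couple of lines:

max_dist = max(D(:));      % step (2): largest pairwise distance
D_norm = D / max_dist;     % step (3): all distances now lie in [0, 1]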

(4) Now generate random matrices of (x, y) pairs, one pair for each row of the dataset.

This step is trivial to vectorize in Matlab and Octave: simply create some number of pages of matrices as follows:

Test_Matrix = rand(num_rows, 2, num_simulations);

Where num_simulations is the number of pages, which is, plainly, the number of Monte Carlo simulations.

(5) Select the matrix of (x, y) pairs that produces distances with the least error when compared to the corresponding normalized distances generated in Step (3).

To be clear, each page of Test_Matrix is a matrix of dimension (num_rows, 2), and the idea is to map each of its rows to the corresponding row of the original dataset, which presumably has a much higher number of columns.

This can be implemented by running the nearest neighbor algorithm on each page of Test_Matrix, which will produce a set of pairwise distances for the (x, y) coordinates in that page. However, you will have to run the nearest neighbor algorithm on each page individually, which will require a for loop with a number of iterations equal to num_simulations.
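
As a rough sketch of that loop, using the sum of squared differences as the error measure, and rescaling each candidate’s distances to [0, 1] so the comparison with D_norm is apples to apples (D_norm and Test_Matrix are as above):

best_err = Inf;
best_idx = 1;
for k = 1:num_simulations
    XY = Test_Matrix(:, :, k);                        % one candidate set of (x, y) points
    sq = sum(XY.^2, 2);
    Dk = sqrt(max(sq + sq' - 2 * (XY * XY'), 0));     % pairwise distances in the plane
    Dk = Dk / max(Dk(:));                             % rescale to [0, 1] for comparability
    err = sum((Dk(:) - D_norm(:)).^2);                % sum of squared distance errors
    if err < best_err
        best_err = err;
        best_idx = k;
    end
end
best_XY = Test_Matrix(:, :, best_idx);                % the selected embedding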

And that’s it: this is your best embedding, and obviously, the more simulations you run, the better your embedding will be. You could of course also use this to visualize graphs on datasets, which could be particularly useful if you want to get a visual sense of connectivity among vectors in the dataset.
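
To actually look at the result, a scatter plot is enough, and if you happen to have an adjacency matrix for a graph on the rows (call it A), the edges can be drawn as segments between the embedded endpoints:

scatter(best_XY(:, 1), best_XY(:, 2), 'filled');   % the selected 2D embedding

% if a graph on the rows is given by an adjacency matrix A, its edges
% could be drawn on top of the scatter plot like so:
% [i, j] = find(triu(A));
% line([best_XY(i, 1) best_XY(j, 1)]', [best_XY(i, 2) best_XY(j, 2)]');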
