Partitioning Using Spatial Entropy

In a previous article, I introduced an efficient way to calculate spatial entropy, which is otherwise difficult to calculate (as far as I can tell). The main reason I use spatial entropy is to partition images, and you can use this method to do exactly that, or more generally to partition a space, by simply calculating the spatial entropy of your dataset first (using the method from my previous article; code below), which will be some value H. Then take the inverse log of H, and take the floor or ceiling of that value. Note that the code I’ve attached automatically returns an integer value. You can then use this number as the number of regions you partition your space into. So, for example, if K = log^{-1}(H), then you would partition the space in question into K equally sized regions, which requires K^{1/N} regions per dimension, where N is the dimension of the space in question.
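As a concrete illustration, here’s a minimal sketch in MATLAB / Octave. The base of the log is an assumption (base 2 here; if the entropy code uses the natural log, exp(H) is the inverse instead), and spatial_entropy is a hypothetical function standing in for the code from the previous article:

H = spatial_entropy(dataset);   % spatial entropy of the dataset (hypothetical function)
K = floor(2^H);                 % inverse log of H, assuming base 2; floor or ceiling
N = size(dataset, 2);           % dimension of the space
per_dim = ceil(K^(1/N));        % number of regions along each dimension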


Visualizing Datasets (Algorithm)

As it turns out, the original idea I had for visualizing datasets in the plane doesn’t quite work (as implemented literally), but after some minor tweaks, it actually works great (so far). I’ve included code that allows for display in either the plane or in three dimensions, but I think it’s much better to look at in the plane. The code itself generates all kinds of data that I haven’t unpacked yet, which allows for answering questions like: which classes are the most similar in terms of their positions in their original Euclidean space? I’m going to add to the code below to allow for more detailed analysis that includes things like this, and I’m also going to apply it to more datasets, in particular the MNIST dataset, in the hope that visually similar classes (e.g., ‘1’ and ‘7’) are mapped closer together than those that aren’t (e.g., ‘2’ and ‘6’).

[Figure: embedding2D, the algorithm’s two-dimensional embedding of the UCI Iris Dataset, one color per class.]

This is a supervised algorithm, so I don’t think it has any substantive value for prediction or classification, but it’s useful because it allows you to get an intuitive sense of how the classes are organized in the dataset, by visualizing the dataset in a two- or three-dimensional space.

As an example, above is the output of the algorithm as applied to the UCI Iris Dataset, which has three classes, embedded in the plane. The number of points in the plane equals the number of points in the dataset, and the number of points in each class equals the number of points in the underlying class (each colored differently). Between classes in the embedding, (a) their relative distances are determined by the relative distances of the average vectors of the underlying classes, and (b) their spreads are determined by the standard deviations of the underlying classes. Note that it is mathematically impossible to preserve the relative distances between points as a general matter, since we are reducing dimensions (e.g., there is no way to embed four equidistant points in the plane). But the gist is, you use Monte Carlo style evaluation to come up with a best-fit embedding in the plane or three-space, as applicable. A sketch of the class statistics involved follows.
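Here’s a minimal sketch of those per-class statistics in MATLAB / Octave, assuming a dataset matrix data with one row per point and a label vector labels (both names are illustrative):

classes = unique(labels);
k = numel(classes);
class_means = zeros(k, size(data, 2));
class_spreads = zeros(k, 1);
for i = 1:k
    rows = data(labels == classes(i), :);
    class_means(i, :) = mean(rows, 1);          % relative distances between these drive class placement
    class_spreads(i) = mean(std(rows, 0, 1));   % this drives how widely each class is scattered
end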

MATLAB CODE:

Calculation of Error in my Deck

I just realized it’s probably more fair to report the error for my core clustering algorithm (not the others) differently than I do in my deck, while working on my embedding algorithm. I disclosed exactly how I calculate error, so it’s not a lie, but it’s not right, in the sense that even though the clusters are not mutually exclusive, the better answer is to say that the error is given as follows:

1 - (numerrors/clustersize),

whereas in the deck, I say the accuracy is given by,

1 - (numerrors/numrows).

The latter measure does capture something, in that the number of rows is the number of opportunities to make errors, but that’s not what you want in this context, which is the accuracy of the cluster itself, regardless of how many rows there are.
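Here’s a toy worked example of the difference, with purely hypothetical counts:

num_errors = 5; cluster_size = 40; num_rows = 200;   % illustrative values only
acc_cluster = 1 - num_errors / cluster_size          % 0.875: accuracy of the cluster itself
acc_deck    = 1 - num_errors / num_rows              % 0.975: the deck's measure, inflated by row count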

Using A.I. to Generate Controls

You can probably use A.I. to tune (A) the behavior of a physical system to (B) some other system you are observing, using the environmental controls you have over (A). Under this view, you would let the machine figure out how to set the environmental controls over (A) to model the behavior of (B), given analogous inputs and a set of analogous measurements.

The rational path seems to be to give the machine an enormous set of measurements over both, deliver simultaneous signals to both (A) and (B), and allow the machine to discover correlations on its own. This will eventually allow the machine to discover which signals that control (A) correspond to analogous signals that control (B). You can then use this to figure out how to achieve a desired end state of (B) given (A), or more generally, to model the behavior of (B) given (A), without in any way disrupting the status of (B). This is a long-winded way of saying that we probably don’t need to test anything on human beings anymore, until we’re well into the realm of confidence in a particular solution, and in that case, you get their consent.

This could allow you to build a small model of a massive system that behaves in a scaled manner.

Visualizing Datasets

I just came up with what I hope to be a great algorithm for visualizing datasets in the plane. Datasets typically have dimensions that exceed three-space, often in the hundreds, if not thousands, which makes visualization difficult and likely arbitrary. Because you’re often starting with such a large dimension, you can’t be sure that such an embedding even exists in the plane. For example, just consider four equidistant points –

There’s no such thing in the plane, though you could of course have such a thing in four-dimensions, where you have a pyramid.

This implies that you necessarily have some error associated with any embedding of a dataset in the plane, but that’s fine, if what you’re trying to do is get a visual intuition for what your data actually looks like, in terms of how it’s arranged in its native space. The overall idea is to preserve relative distances as closely as possible, knowing that in some cases, it’s literally impossible to get it right, which is fine, because, again, this is a visualization tool, not an instrument of measurement.

So here’s the algorithm, which I’ll actually code tonight.

(1) Take the distances between all pairs of vectors in the dataset.

This step can be vectorized fully using my implementation of the nearest neighbor algorithm.

(2) Find the maximum distance between any pair of vectors among that set of all distances.

(3) Divide all such distances by that maximum, which will generate a set of distances over [0,1].

(4) Now generate random matrices of (x,y) pairs, each containing one pair for each row of the dataset.

This step is trivial to vectorize in MATLAB and Octave: simply create some number of pages of matrices as follows:

Test_Matrix = rand(num_rows, 2, num_simulations);

where num_simulations is the number of pages, which is, plainly, the number of Monte Carlo simulations.

(5) Select the matrix of (x,y) pairs that produces distances that have the least error when compared to the related distances generated above in Step (3).

To be clear, the idea is to map each row of a given page in Test_Matrix, each of which is a matrix of dimension (num_rows, 2), to a row of the original dataset, which presumably has a much higher number of columns.

This can be implemented by running the nearest neighbor algorithm on each page of Test_Matrix, which will produce a set of distances for each pair of (x,y) coordinates in that matrix. However, you will have to run the nearest neighbor algorithm on each page in Test_Matrix individually, which will require a for loop with a number of iterations equal to num_simulations.

And that’s it, this is your best embedding, and obviously, the more simulations you can run, the better your embedding will be. You could of course also use this to visualize graphs on datasets, which could be particularly useful if you want to get a visual sense of connectivity among vectors in the dataset.

Measuring Dataset Consistency

In a previous article, I showed that you can use the nearest neighbor algorithm to build a simple graph based upon which vectors in a dataset are nearest neighbors of one another. In simple terms, if two vectors are nearest neighbors, then you connect them with an edge. Once you do this for all vectors in the dataset, you will have a graph. It dawned on me earlier today that you can define a measure of dataset consistency by analyzing the properties of the resultant graph. Specifically, take the set S of maximal paths in the graph, and label the vertices with their classifiers. Now, for each such path in S, simply take the entropy of the distribution of classifier labels along the path (a sketch follows below). This is exactly the technique I introduced in another previous article, but I suppose I didn’t make it plain that it can apply to any dataset. You can also perhaps use this to measure whether any errors are due to truly similar classes, by first converting the labels to measure-based classifiers, using the method I introduced in the article before this one. That is, if the errors only appear where two classes are truly similar, then maybe that’s just the way it is, and that’s the best you can do; or, you can somehow tighten your lens on the dataset, which is something you can actually do in my model of machine learning.
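Here’s a minimal sketch of the per-path entropy in MATLAB / Octave, where path_labels is an illustrative vector of classifier labels along one maximal path:

path_labels = [1 1 2 1 3];                             % hypothetical labels along a path
classes = unique(path_labels);
p = arrayfun(@(c) mean(path_labels == c), classes);    % distribution of labels on the path
H = -sum(p .* log2(p))                                 % entropy: 0 if and only if the path is pure

Note that p has no zero entries by construction (classes comes from unique), so the log is safe.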

To be perfectly clear, if any given path in S contains more than one classifier label, then that path by definition implies that the nearest neighbor algorithm will generate an error, and therefore, the dataset is what I would call not locally consistent. Here’s a brief proof:

Assume that v_i and v_j are both vertices in some path P within the set S. Because they are both part of the same path, there must be a subpath of P that connects v_i to v_j, which we’ll call P_{i,j}. Assume that the classifier of v_i is m and that the classifier of v_j is k, for m \neq k. It follows that as we traverse P_{i,j}, beginning at v_i, the classifiers of the vertices along that path must at some point change from m to k; let e = (v_p, v_l) be the edge where this change occurs. Note that it could be the case that v_l = v_j. Because v_l is the nearest neighbor of v_p, and by definition the classifiers of v_p and v_l are unequal, it is the case that the nearest neighbor algorithm generated an error when given the input v_p.

On Measure-Based Classifiers

Classifier labels are typically arbitrary, even when they describe the thing at issue. So, for example, using the classifier label “1” for the number 1 in the MNIST Dataset doesn’t tell you how different an image of a 1 is from an image of a 2. This is generally not a question worth answering for prediction and clustering, since you don’t really care about the relative values of labels; in particular, in this case, it’s probably more important that you report back that you’ve observed, e.g., a 1, rather than some label based upon the properties of the class of images in question.

You could, however, imagine other cases where you are interested in measuring how different the classes are from one another, for whatever reason. One simple way to do this is to take the average vector for each class (or otherwise create an abstraction for each class), sort those average vectors, and then take the differences between adjacent entries in the sorted list. Assign a label of zero to the first vector in the list, and use the differences between adjacent entries to generate labels going up through the sorted list, adding each difference to the previous label in the list. So the first average vector has a label of zero, the next has a label equal to the difference between the first vector and the second vector, and so on, which is exactly what I did in a previous algorithm you can read about in Section 1.3 (“Expanding Gas Dataset”) of my paper Vectorized Deep Learning.
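Here’s a minimal sketch in MATLAB / Octave, assuming each class average has already been reduced to a scalar summary (the scalar sort key is an assumption, and the values below are hypothetical):

class_summaries = [1.5; 0.2; 1.1];        % one illustrative scalar per class average
sorted_vals = sort(class_summaries);
labels = [0; cumsum(diff(sorted_vals))]   % 0, then each label adds the gap to the previous one

So the labels themselves encode how far apart adjacent classes are.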

Parody of the Tech Market

This is almost certainly the funniest thing I’ve ever written, and is intended as a parody of the tech market, which everyone is quickly realizing is run by seriously evil people, as are most business lines these days. Though I’ve made jokes about the market, the reality is not funny at all, as major brands like Apple likely make use of something close to slave labor, and so, I don’t feel bad about saying awful things about awful people, and neither should you –

Who accepts money in exchange for forcing people to work without compensation, which sometimes includes children?

It’s beyond morality, we simply don’t need people that think this is OK anymore.

“How to Succeed In Business (Using Technology)”

For context, this is plainly inspired by David Oreilly’s speech about advertising: