Following up on my note from yesterday, I’ve tweaked the interpolation algorithm so that it now uses a simple optimizer: it takes the best answer from a given iteration and narrows the range of the Monte Carlo simulations around it based upon the accuracy, iteratively winnowing the range down and zooming in on a best solution. It works well and it’s fast, achieving 97.3% accuracy in 0.34 seconds on the UCI Iris dataset. You can tweak the parameters, going for deeper simulations, more simulations per iteration, and so on, but it already plainly works. I’ll follow up with one more article on this method as applied to the MNIST Fashion Dataset, since that’s what prompted my curiosity on the matter, at least recently.
Here’s a plot of the accuracy as a function of simulation depth (i.e., the number of times the underlying Monte Carlo simulator is called). This version of the optimizer simply allocates all of the parallel capacity to the best answer, and adjusts the width of the range in which it searches based upon accuracy. However, you could of course also imagine a version where you allocate capacity in proportion to accuracy. That is, rather than compute around only the best answer, compute around all answers, with the amount of computation around an answer determined by the accuracy of the answer.
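To make the narrowing scheme concrete, here’s a minimal sketch of a one-parameter version: each iteration samples candidates uniformly in the current range, keeps the best-scoring one, and shrinks the range around it in proportion to the accuracy. The function names, shrink rule, and toy scoring function are illustrative assumptions, not the exact implementation described above.

```python
import random

def narrowing_monte_carlo(score, low, high, iterations=20, samples_per_iter=50):
    """Iteratively sample a parameter range, keep the best candidate,
    and shrink the search range around it based on its accuracy.

    score(x) should return an accuracy in [0, 1]; higher is better.
    """
    best_x, best_acc = None, -1.0
    width = high - low
    center = (low + high) / 2.0

    for _ in range(iterations):
        # Monte Carlo step: sample candidates uniformly in the current range.
        candidates = [center + width * (random.random() - 0.5)
                      for _ in range(samples_per_iter)]
        for x in candidates:
            acc = score(x)
            if acc > best_acc:
                best_x, best_acc = x, acc
        # Narrow the range around the best answer; the worse the accuracy,
        # the wider the range stays (1 - accuracy acts as the shrink factor).
        center = best_x
        width = max(width * (1.0 - best_acc), 1e-6)

    return best_x, best_acc

# Example usage with a toy objective (peak accuracy at x = 3.7):
if __name__ == "__main__":
    toy_score = lambda x: max(0.0, 1.0 - abs(x - 3.7) / 10.0)
    x, acc = narrowing_monte_carlo(toy_score, low=-10.0, high=10.0)
    print(x, acc)
```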
UPDATE:
I’ve been playing with these algorithms, and it turns out they’re terrible on certain datasets, which led me to ask why, and the answer is rather obvious:
Because the underlying interpolation is linear, assume that you have two points in the dataset, (x, y₁) and (x, y₂), for which the correct classifications are the same. If the coefficient for the y-dimension is non-zero, then we can fix x and increase y₂ arbitrarily, and at some point we’re going to change the resultant classification given by the function f(x, y) = ax + by.
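To make the failure mode concrete, here’s a tiny illustration, under the assumption that the interpolated classifier thresholds a linear function; the coefficients and threshold below are made up for the example.

```python
# Toy illustration: a linear classifier f(x, y) = a*x + b*y with a nonzero
# y-coefficient. Holding x fixed and increasing y eventually pushes the
# output across any class threshold, changing the predicted label even
# though the two points were supposed to share a class.
a, b = 0.5, 0.25          # made-up coefficients
threshold = 2.0           # made-up decision boundary

def predict(x, y):
    return 1 if a * x + b * y > threshold else 0

x = 1.0
print(predict(x, 1.0))    # 0: below the threshold
print(predict(x, 100.0))  # 1: same x, larger y flips the predicted class
```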
I think the takeaway is that this could be useful for some datasets that are not locally consistent, but not for all of them: it isn’t useful for the entire complement of the set of locally consistent datasets, only for some of its members.
It just dawned on me that by mapping each vector to a single classification, you are destroying information. It could instead be better to map from one Euclidean space to another of the same or higher dimension.
For example, a linear function could map a vector in N dimensions to another vector in N dimensions, with the goal of mapping vectors of the same class to roughly the same region of the new, second space. This of course assumes the original dataset is not locally consistent; otherwise you could just use distance-based clustering.
Moreover, you could map pairs of feature values using a second-degree polynomial, and so on, which would take an input vector and map it to a higher-dimensional space, again with the goal of maximizing the classification consistency of that second Euclidean space.
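As a sketch of what such a feature map might look like, here’s a hypothetical second-degree expansion that sends an N-dimensional vector to a higher-dimensional space built from the original features, their squares, and their pairwise products; the exact set of terms is my assumption, not a fixed part of the idea.

```python
from itertools import combinations

def degree_two_map(v):
    """Map an N-dimensional vector to a higher-dimensional vector consisting
    of the original features, their squares, and all pairwise products."""
    out = list(v)
    out += [x * x for x in v]
    out += [v[i] * v[j] for i, j in combinations(range(len(v)), 2)]
    return out

# A 4-dimensional vector maps to 4 + 4 + 6 = 14 dimensions.
print(len(degree_two_map([1.0, 2.0, 3.0, 4.0])))  # 14
```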
For example, here’s an algorithm that I’m going to code over the next couple days:**
1. Count the number of classes in the dataset, assign each a label from 1 to K.
2. Randomly generate origins for each of the K classes of data in N-space, where N is the dimension of the dataset, such that none of the spheres of radius 1 from each such origin overlap (i.e., generate K non-intersecting spheres of radius 1 in N-space).*
3. Assign each class of vectors a unique origin / sphere.
4. Compute the Euclidean distance between each point in a given class and its assigned origin, and sum these distances over all points and all classes; do this for each randomly generated set of origins.
5. Select the set of origins that minimizes the total distance calculated in Step 4.
6. Now, given the set of origins that minimizes the distance calculated in Step 4, use a Monte Carlo simulation / optimizer to generate a function (e.g., linear, polynomial, etc.) that maps each point in a given class to a point in its associated sphere.
The net effect of this algorithm is to map possibly locally inconsistent data to a locally consistent space (i.e., a set of non-overlapping spheres that contain only one class of data).
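Here’s a rough sketch of Steps 1 through 5 in Python, with Step 6 left as a comment; the rejection-sampling placement of the spheres, the number of candidate origin sets, and the sampling box are all illustrative assumptions.

```python
import random
import math

def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def random_origins(num_classes, dim, attempts=1000, box=10.0):
    """Step 2: generate one origin per class in N-space such that the unit
    spheres around the origins do not overlap (centers at least 2 apart)."""
    for _ in range(attempts):
        origins = [[random.uniform(-box, box) for _ in range(dim)]
                   for _ in range(num_classes)]
        if all(distance(origins[i], origins[j]) > 2.0
               for i in range(num_classes) for j in range(i + 1, num_classes)):
            return origins
    raise RuntimeError("could not place non-overlapping unit spheres")

def total_distance(points_by_class, origins):
    """Step 4: total distance from each point to its class's assigned origin."""
    return sum(distance(p, origins[k])
               for k, points in enumerate(points_by_class)
               for p in points)

def best_origins(points_by_class, dim, num_candidate_sets=100):
    """Steps 2-5: generate candidate origin sets (Step 3's assignment is just
    the list index here, i.e., classes labeled 0..K-1) and keep the set that
    minimizes the total point-to-origin distance."""
    num_classes = len(points_by_class)  # Step 1: count the classes
    best, best_score = None, float("inf")
    for _ in range(num_candidate_sets):
        origins = random_origins(num_classes, dim)
        score = total_distance(points_by_class, origins)
        if score < best_score:
            best, best_score = origins, score
    return best

# Step 6 (not shown): fit a linear or polynomial map, e.g. with the same
# Monte Carlo narrowing optimizer sketched above, so that each point lands
# inside its class's sphere.
```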
*In some cases you might be able to use fewer dimensions, and in others you might have to use more, because, for example, you’re making use of combinations of feature values, which would require additional dimensions in the output space. We could also take a “center of mass” approach for each class, which would save time: simply assign each class an origin equal to the average value over the vectors in the class, and then size the radius so there’s no overlap.
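A minimal sketch of that center-of-mass variant, assuming the shared radius is set to just under half the smallest distance between any two class means so the spheres can’t overlap:

```python
import math

def center_of_mass_origins(points_by_class):
    """Assign each class the mean of its vectors as its origin, then choose a
    common radius small enough that no two spheres overlap."""
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    origins = []
    for points in points_by_class:
        dim = len(points[0])
        origins.append([sum(p[i] for p in points) / len(points) for i in range(dim)])
    min_gap = min(dist(origins[i], origins[j])
                  for i in range(len(origins)) for j in range(i + 1, len(origins)))
    radius = 0.49 * min_gap  # strictly under half the smallest gap between means
    return origins, radius
```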
** The more general idea here seems quite promising: a set of operations that brings you from an initial state to some desired goal state, perhaps quite different from the initial state, solved for using parallel operations. It is, in this view, essentially a vectorized state-space algorithm.