A Simple Model of Infinitesimals

I did some work while I was in Stockholm on distributions over countable sets, but stopped in the interest of finishing up what seemed to be, at the time, more important work in physics and A.I. But yesterday, I realized that you can construct a simple model of probability on countable sets using the work that I had abandoned.

Uniform Distributions on Countable Sets

The primary offender for probabilities on countable sets is the uniform distribution. This is because it is impossible to select a real number probability for a uniform distribution on a countable set, since there is no real number r that satisfies,

\aleph_0  r = 1.

So instead, my proposal is to drop the requirement that the sum over the probabilities equals 1, and instead to require only that the product of (x) the number of trials and (y) the pseudo-probability of the signal equals (z) the frequency of the signal in question.

So, if each element of a countable set of signals appears exactly once in a countably infinite number of observations, then our pseudo-probability p must satisfy:

\aleph_0  p = 1.

Obviously, p cannot be a real number, but this of course doesn’t prevent us from assuming that such a number exists.

There’s an additional wrinkle, which is that for countable sets, there are an infinite number of “uniform distributions”. For example, each element of a countable set could appear exactly twice over a countable number of observations, or three times, etc.

Using the natural numbers as our underlying set of signals, we could observe the uniform distributions,

1,2,3,4,5, \ldots,


1,1,2,2,3,3,4,4,5,5, \ldots.

In both cases, each signal appears an equal number of times over a countable number of observations.

So as a general matter, we can instead simply require that,

\aleph_0 p = K (Equation 1),

where K is the number of times each element appears over a countable number of observations.

So our pseudo-probability is a number, that is not a real number, that when multiplied by the number of observations, which is assumed to be \aleph_0, produces the frequency of the signal in question.

This definition implies some clean, intuitive arithmetic over these numbers. For example, if signals i and j have probabilities p_1 and p_2,  respectively, then their combined frequency in the distribution is given by,

\aleph_0 p_1 + \aleph_0 p_2 =  \aleph_0 (p_1 + p_2) = K_1 + K_2.

There is, however, a special case where each signal appears a countable number of times. That is, it is possible to have an infinite sequence of observations in which each signal appears a countable number of times.

For example, there are an infinite number of primes, and every prime number can be exponentiated a countably infinite number of times, producing a unique, countably infinite set associated with each prime number.

So let’s assume the primes are our underlying set, and our observations are generated by reading the set of integers in increasing order. If we spot a prime, or a power of a prime, we list the prime number itself as an observation. Otherwise, we write nothing. So if we see 2, we write 2; if we see 9, we write a 3; if we see 16, we write 2; if we see 10, we write nothing.

Reading the first 6 numbers, we would write,


Though we read 6 numbers, we made only 5 observations, since 6 is not prime or a power of a prime.

Reading the number line from left to right in this manner will cause us to list each prime number an infinite of number of times, and because there are an infinite number of primes, we will generate a set of observations that contains a countably infinite number of signals, each of which appears a countably infinite number of times.

Returning to Equation (1) above, we would have in this case,

\aleph_0 p = \aleph_0.

Any real number would satisfy this equation, but not only is it not intuitive to use a real number in this case, it also implies that there’s more than one solution to the equation, which is pushing the boundaries a bit far, even for a pseudo-probability concept. As a result, my proposition is to use the number,

\Omega = \log(\aleph_0).

Note that \Omega \neq \aleph_0, since 2^{\aleph_0} \neq \aleph_0, at least under normal assumptions. Also, it’s easy to show that \Omega \aleph_0 = \aleph_0, and therefore, we have a number that is not a real number, but nonetheless satisfies the equation above. This would imply that whenever you have a countably infinite number of observations, and a signal that occurs a countably infinite number of times within that set of observations, the pseudo-probability of that signal is \Omega.

Developing an Intuition for Probabilities on Countable Sets

Intuitively, this makes sense, since if you know that something occurs an infinite number of times in an infinite sequence of observations, the number that describes your expectations for each observation you make in that case should not be a real number probability, since, as shown above, this implies that each real number is just as good as any other real number.

At the same time, the number shouldn’t be an infinitesimal, since the portion of observations in which the signal occurs does not have the same relationship to the total number of observations that an infinitesimal does. That is, if a signal occurs once in a countably infinite sequence, intuitively, if that signal has a well-defined probability, then it should be an infinitesimal, since our expectation is that the event is extraordinarily unlikely to be the current observation, but not truly impossible, since it is actually certain to eventually occur. In contrast, if a signal occurs a countably infinite number of times within a countably infinite sequence of observations, then intuitively, we would not expect the pseudo-probability of that event to be an infinitesimal, since the event could happen often.

For example, if we consider the probability of observing an even number when reading a permutation of the natural numbers, it would be unreasonable to say that the probability is infinitesimal, since there are by definition an infinite number of even numbers in the set that we will eventually observe. At the same time, it’s not quite right to say that the probability is \frac{1}{2} either, because it is possible to observe an arbitrarily long sequence that contains no even numbers at all.

And unlike with a finite set of observations, there is no limiting argument that justifies our intuition that the probability of observing an even number increases with the number of observations. It’s simply not true, since for any given sequence of odd numbers, you can construct a permutation on the natural numbers that contains an even longer sequence of odd numbers. As a result, there are also what I would call philosophical reasons to believe that this probability should not be a real number, or an infinitesimal.

And \Omega is not an infinitesimal, nor is it a real number. It is instead an interesting number that has some of the same algebraic properties of the infinite cardinals, but some other unique properties that are not shared by the infinite cardinals. As a result, I think it’s a good candidate, and on balance, this model of probabilities on countable sets is algebraically clean, simple, and intuitive.

All of this suggests a different notion of expectation when reading an infinite string, versus reading a finite string. If you’re reading an infinite string, its structure is already determined, but unknown to you. In some sense, this might seem overly philosophical, but it changes the mathematics. When observing a source governed by ordinary probabilities, the assumption is that, eventually, the distribution of the observations will converge to some stable probability distribution. Perhaps you know what this distribution is beforehand, and perhaps you don’t. But in either case, your operating assumption is that within finite time, some distribution will become apparent, and roughly stable.

In contrast, when reading an infinite string, the structure of the string is already pre-determined. Moreover, the set of all infinite strings includes strings that have no stable distribution at all, which, therefore, cannot be described by ordinary probabilities. Instead, we can only count the number of instances of each signal within the string, as we have above, and form a different notion of expectation, that is nonetheless useful, but distinct from the notion of expectation used in ordinary probability. Also note that unless we know the distribution of signals within an infinite string ex ante, it would be impossible to have any expectations at all about any observations, other than knowing the set of possible signals (assuming we have that information ex ante).

We can also take the view that infinitesimals are no different than ordinary real numbers. In short, just like the actual number \pi is the result of a countable number of calculations that can be only approximated, and not truly realized on a Turing Machine, similarly, an infinitesimal is a quantity that which, if iteratively added to itself a countable number of times, would produce a real number. The difference being that computable real numbers converge to their ideals, whereas anything short of a countable number of operations on an infinitesimal produces another infinitesimal.


Vectorized Clustering Algorithm

Attached is a completely vectorized, highly efficient version of my original clustering algorithm. Its runtime is drastically shorter, and comparable to my real-time algorithms, though it is based upon my original technique of iterating through different levels of discernment until we find the level that generates the greatest change in the entropy of the categorization. Also attached is a command line script that demonstrates how to apply it to a dataset.





Tracking Moving Objects in 3-Space

Attached is a polynomial-time algorithm that can identify, and then track objects in three-space over time given 3-D point information from a sensor, or other input.

Because the algorithm tracks at the point level, rather than at the object level (i.e., it builds its own objects from point data), it can be used to track objects that have irregular, and even diffuse shapes, like smoke, or steam.

Running on a $200 dollar laptop, it can track 15 Newtonian projectiles at a rate of about 3 frames per second, with an accuracy of about 98% to 100%.

Here’s a plot of one of the sets of projectiles I applied it to:


All of the other functions you need can be found in my full library here.


Autonomous Noise Filtering

I’ve updated Prometheus to include autonomous noise filtering.

This means that you can give it data that has dimensions that you’re not certain contribute to the classification, and might instead be noise. This allows Prometheus to take datasets that might currently produce very low accuracy classifications, and autonomously eliminate dimensions until it produces accurate classifications.

It can handle significant amounts of noise:

I’ve given it datasets where 50% of the dimensions were noise, and it was able to uncover the actual dataset within a few minutes.

In short, you can give it garbage, and it will turn into gold, on its own.

Since I’ve proven that it’s basically mathematically impossible to beat nearest neighbor using real-world Euclidean data, and I’ve come up with a vectorized implementation of nearest neighbor, this version of Prometheus uses only nearest neighbor-based methods.

As a result, the speed is insane.

If you don’t use filtering, classifications occur basically instantaneously on a cheap laptop. If you do have some noise, it still takes only a few minutes for a dataset of a few hundred vectors to be processed, even on a cheap laptop.

Attached should be all the code you need to run it, together with a command line script that demonstrates how to use it. But, it’s probably easier to simply download the my full library as a zip file from my researchgate blog.

Enjoy, and if you’re interested in a commercial version, please let me know.
















On Nearest Neighbor

I’ve been using nearest neighbor quite a bit lately, and I’ve noticed that its accuracy is remarkable, and I’ve been trying to understand why.

It turns out that you can prove that for certain datasets, you actually can’t do better than the nearest neighbor algorithm:

Assume that for a given dataset, classifications are constant within a radius of R of any data point.

Also assume that each point has a neighbor that is within R.

That is, if x is a point in the dataset, then there is another point y in the dataset such that ||x-y|| < R.

In plain terms, classifications don’t change unless you travel further than R from a given data point, and every point in the dataset has a neighbor that is within R of that point.

Now let’s assume we’re given a new data point from the testing set that is within the boundaries of the training set.

By definition, the new point is within R of some point from the training set, which implies it has the same class as that point from the training set.

This proves the result.

This means that given a sufficiently dense, locally consistent dataset, it is mathematically impossible to make predictions that are more accurate than nearest neighbor, since it will be 100% accurate in this case.

Unless you’re doing discrete mathematics, which can be chaotic (e.g., nearest neighbor obviously won’t work for determining whether a number is prime) your dataset will probably be locally consistent for a small enough value of R.

And since we have the technology to make an enormous number of observations, we can probably produce datasets that satisfy the criteria above.

The inescapable conclusion is that if you have a sufficiently large number of observations, you can probably achieve a very high accuracy simply using nearest neighbor.

On Classifications

I’ve been doing some thinking about the mathematics and theory of classifications, as opposed to actually implementing particular classifications, and I’ve come to a few conclusions that I thought I’d share, in advance of a formal article.

The initial observation comes from the nearest neighbor technique that I’ve been making use of lately: given a dataset, and an input vector, the nearest neighbor algorithm uses the class of the vector that is most similar to the input vector as predictive of the class of the input vector. In short, if you want to predict what the class of a given input vector is, you use the class of the vector to which the input vector is most similar. This is generally measured using the Euclidean norm operator, but you could substitute other operators as well depending upon the context. This method works astonishingly well, and can be implemented very efficiently using vectorized languages like MATLAB and Octave.

The assumption that underlies this model is that the space in which the data exists is locally consistent. That is, there is some sufficiently small neighborhood around any given point, within which all points have the same classifier. The size of this neighborhood is arguably the central question answered by my initial paper on Artificial Intelligence, if we interpret the value of \delta as a radius. Because this method works so well, it must be the case that this assumption is in fact true for a lot of real world data. Common sense suggests that this is a reasonable assumption, since, for example, the color of an object is generally locally consistent; income levels exhibit local consistency by neighborhood; and temperature is generally locally consistent in every day objects. These examples provide some insight into why the nearest neighbor method works so well for a typical dataset.

However, there are very simple examples where this assumption fails, such as the distinction between odd and even numbers. In fact, using the nearest neighbor approach will by definition produce an incorrect answer every time, since the nearest neighbor of an even number is an odd number, and the nearest neighbor of an odd number is an even number. Music is another example where the space is simply not locally consistent, and as a result, substituting a note with its nearest neighbor can produce disastrous results.

We can instead add additional criteria that would allow us to describe these spaces. For example, if we want to determine whether a given input integer is an even or an odd number, we can use an algorithm that searches for numbers within the dataset that have a distance from the input vector of exactly two. Similarly, we could require some minimum distance, and set a maximum distance. That is, vectors x and y have the same classifier if ||x - y|| > \delta_i and ||x - y|| < \delta_j, for some values \delta_i,\delta_j that depend upon the space in question. This would be useful where the space is locally consistent, but only after some initial offset. As a general matter, we could in theory use any computable mathematical operators to construct a function that determines whether two points have the same classifier.

The question is whether we can infer the correct function given the dataset, which might not always be the case. That is, even if there is some computable function that allows us to say whether two points have the same classifier, we might not have enough data to determine whether or not this is the case. That is, the correct function might rely upon information that is exogenous to the dataset itself. Finally, this also suggests the question of whether it is decidable on a UTM (x) that we in fact have enough information to derive the correct function for the dataset in question, and if so, (y) whether or not there is a computable process that generates the correct function.

Discerning Objects in 3-Space

Below is an algorithm that can, in polynomial time, identify objects in a three-dimensional space given point information about the objects in the scene. In short, it takes a set of points in Euclidean 3-space, and clusters them into individual objects. The technique is essentially identical to that used by my original image partition algorithm, but rather than cluster regions in an image, this algorithm clusters points in 3-space.

The algorithm is intended to take in points from sensor data, and then assemble those points into objects that can then be tracked as a whole. The algorithm can be easily generalized to higher-dimensional spaces, but it’s meant to be a proper computer vision algorithm, so it’s written to work in 3-space. In a follow up article, I’ll present a similar algorithm that does the same over time, thereby tracking points over time as individual objects in 3-space. This can be done by taking each point in a given moment in time, and finding its nearest neighbor in the next moment in time.

Here’s the code, together with a command line script that runs a simple example:




Optimized Image Delimiter

Attached is another version of my image delimiter algorithm, that iterates through different grid sizes, selecting the grid size that generates the greatest change in the entropy of the resultant image partition.

It’s essentially a combination of my original image partition algorithm and my image delimiter algorithm. I’m still tweaking it, but it certainly works as is.