Vectorized Image Classification (RGB Version)

Following up on my article from yesterday, here’s a full-color version that can classify images using almost entirely vectorized processes. The runtimes are excellent, though I’m still testing it, and I’m going to apply it to a wider set of datasets –

As applied to a simple dataset I made of three grocery items (a lime, an avocado, and a grapefruit; 30 images total, 10 per class), it takes about one tenth of a second to identify an item, with perfect accuracy. That certainly makes it usable as a scanner at an actual grocery store, though again, I’m going to test it further, and write a more formal article on the technique.

Below are three sample images, together with representative super-pixel images actually fed to the classifier algorithm.

Here’s the command line code:

Here’s the code that displays the super-pixel images:

Here’s the full image dataset.

Vectorized Nearest Neighbor

I’m working on an image classification algorithm, and I found an extraneous line of code in my nearest neighbor algorithm, which I’ve now reduced to eleven lines of code:

function nearest_neighbors = NN_fully_vectorized(dataset, N)

dataset = dataset(:,1:N); %removes the hidden classifier column
num_rows = size(dataset,1);
temp_matrix = repmat(dataset', [1 1 num_rows]);
ref_dataset = shiftdim(temp_matrix,2); %num_rows x N x num_rows reference copy
diff_matrix = sum((dataset - ref_dataset).^2,2); %squared distances via broadcasting
zero_indices = 1:num_rows+1:num_rows*num_rows;
diff_matrix(zero_indices) = Inf; %sets the self-distance entries to infinity
[~, b] = min(diff_matrix);
nearest_neighbors = reshape(b,[num_rows 1]);

endfunction
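For readers who don’t use Octave, the same broadcasting trick translates directly to NumPy. This is a minimal sketch under names of my own choosing, not the function above verbatim:

```python
import numpy as np

def nearest_neighbors_vectorized(dataset, n_features):
    """1-nearest-neighbor index for every row, via broadcasting.
    'dataset' carries the class label in its trailing columns, mirroring
    the Octave function's hidden classifier column."""
    X = dataset[:, :n_features]              # drop the label column(s)
    diffs = X[:, None, :] - X[None, :, :]    # pairwise differences
    dists = (diffs ** 2).sum(axis=2)         # squared Euclidean distances
    np.fill_diagonal(dists, np.inf)          # exclude self-matches
    return dists.argmin(axis=1)              # index of nearest other row
```

The `(rows, 1, features)` minus `(1, rows, features)` broadcast plays the same role as the `repmat` / `shiftdim` pair in the Octave version.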

Vectorized Image Preprocessing

In a previous article, I introduced an algorithm that can quickly partition an image into rectangular regions, and calculate the average color in each region, which would then be used for image classification. When I applied it to the MNIST dataset, I got good results, but there was an apparent ceiling of about 87% accuracy that I couldn’t breach, regardless of how many rows of the dataset I used. My original MNIST preprocessing algorithm had much better results, and the prediction algorithm was the same, which suggests a question –

Why is this one not as good?

And I realized the answer is that I didn’t get rid of the background of the image. The background isn’t exactly the same in every image, yet it’s a substantial portion of each image, so in the aggregate it generates a significant amount of noise, which cumulatively interfered with the prediction algorithm.

As a solution, I simply nixed every dimension that has a standard deviation of less than the average standard deviation across all dimensions, by setting those dimensions to zero. What this says is: if a dimension doesn’t really move much across all rows of a dataset, then in the context of image processing, it’s probably background. This isn’t always correct, but for the MNIST dataset it’s obviously true, because the background is in roughly the same place in every image, and roughly equal in luminosity, which implies that the background dimensions should have significantly lower standard deviations than the average across all dimensions. It turns out this works quite well, producing accuracy on par with my previous preprocessing algorithm (93.35%, given 2,500 rows). The difference is that this is a general approach that should work for all simple image datasets with a roughly uniform background, unlike my previous preprocessing algorithm, which was specific to the MNIST dataset.

The region identified as background across the MNIST dataset in black (left), and a sample image from the dataset (right).
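The standard-deviation filter is a few lines in NumPy. A sketch under assumed names (the original is in Octave):

```python
import numpy as np

def zero_low_variance_dims(X):
    """Zero out every column whose standard deviation falls below the mean
    standard deviation across all columns -- the rough, dataset-agnostic
    background filter described above."""
    stds = X.std(axis=0)            # per-dimension standard deviation
    keep = stds >= stds.mean()      # high-variance dimensions survive
    return X * keep                 # broadcasting zeroes the rest
```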

A Note on the Logarithm of Infinite Cardinals

I thought I had posted something about the logarithm of \aleph_0, which is something I think about at times, but I can’t seem to find it, so I’ll begin from scratch:

Lemma 1. The logarithm of \aleph_0 is not equal to \aleph_0.

Proof. Assume that \aleph_0 = \log(\aleph_0). If we assume that \aleph_1 = |\mathbb{R}| = 2^{\aleph_0}, then it follows that \aleph_1 = 2^{\log(\aleph_0)} = \aleph_0, since 2^{\log(\aleph_0)} = \aleph_0. This contradicts \aleph_1 > \aleph_0.

Definitions. As a consequence, I’ve defined a new number, \Omega = \log(\aleph_0), that I’ve used in apparently unpublished research to think about the number of bits you’d need to represent a countably infinite system. However, it just dawned on me that you can construct a system that has a number of bits equal to \Omega as follows:

Assume that a system S is composed of a countable number of rows, each indexed by the natural numbers, starting with 1. Now assume that for each row, you have a number of bits equal to the index of the row. So for example, in row 1, you have 1 bit, which can be in exactly 2^1 states, and in row 2, 2 bits, which can be in exactly 2^2 = 4 states, etc. However, assume S has a tape head, like a UTM, that can move both up and down rows, and left and right, thereby reading or writing to the bits in the row at its current index. So for example, if the head is currently at row 2, it can move left or right, to access the 2 bits that are in that row. Further, assume that if the head moves either up or down, then all of the bits in every row reset to zero. So this means that once the tape head changes rows, all of the information in the entire system is effectively deleted.
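As a concrete illustration, the behavior of S can be simulated directly. This is a toy sketch (names are mine; rows are created lazily, standing in for the countable collection):

```python
class ResettingRowMachine:
    """Toy simulation of the system S described above: row i holds i bits,
    and moving the head to a different row resets every bit to zero, so
    information in distinct rows can never be used in combination."""

    def __init__(self):
        self.row = 1
        self.bits = [0]              # row 1 holds a single bit

    def move_to_row(self, i):
        if i != self.row:
            self.row = i
            self.bits = [0] * i      # changing rows deletes everything

    def write(self, position, value):
        self.bits[position] = value

    def read(self, position):
        return self.bits[position]
```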

The number of states that S can occupy is at least \aleph_0, since it can represent every natural number. However, S is distinct from a countably infinite binary string, since changing the row of the tape head causes the entire system to reset, which implies that only a finite number of bits, all within a single row, can ever be used in combination with one another. The number of bits in S is therefore unbounded, but not equal to \aleph_0, because they cannot be used in combination with one another, outside of a single row.

Lemma 2. The number of states of S cannot be greater than \aleph_0.

Proof. We can assign finite, mutually exclusive subsets of \mathbb{N} to each row i of S with a cardinality of at least 2^i (e.g., use mutually exclusive subsets of powers of primes). Therefore, the number of states of S cannot be greater than the number of elements in \mathbb{N}, which is equal to \aleph_0.

Corollary 3. The number of states of S is equal to \aleph_0.

Proof. Every natural number corresponds to at least one state of S, which implies that the number of states of S is at least \aleph_0. However, Lemma 2 implies that the number of states of S cannot be greater than \aleph_0, which implies that they are equal.

Assumption. The number of bits in a system is equal to the logarithm of the number of states of the system.

This is a standard assumption for finite systems, since it implies that a system with N states is equivalent to \log(N) bits. In this case, it implies that S is equivalent to \Omega = \log(\aleph_0) bits, since S has exactly \aleph_0 states, just like the set of natural numbers itself.

Lemma 4. \Omega + c = \Omega, for all c \in \mathbb{R}.

Proof. Assume \Omega + c = X. It follows that 2^{\Omega + c} = 2^{\Omega} 2^{c} = \aleph_0 2^c = \aleph_0, which implies X = \Omega.

Corollary 5. \Omega \notin \mathbb{N}.

Proof. This follows immediately from Lemma 4, since there is no such natural number.

Note that \Omega - 1 = \Omega, which in turn implies that we can remove any single row from S, without reducing the number of bits in S, which is clearly the case.

Corollary 6. \Omega < \aleph_0.

Proof. We can assign finite, mutually exclusive subsets of \mathbb{N} to each row i of S with a cardinality of at least i (e.g., use mutually exclusive subsets of powers of primes). Therefore, the number of bits in S cannot be greater than the number of elements in \mathbb{N}, which is equal to \aleph_0. However, Lemma 1 implies that \Omega \neq \aleph_0, and therefore, \Omega < \aleph_0.

Corollary 7. \Omega > n, for all n \in \mathbb{N}.

Proof. For any given n \in \mathbb{N}, we can always find a row i of S such that i > n. Therefore, the number of bits in S is greater than n, for all n \in \mathbb{N}.

Corollary 8. There is no set A \subseteq \mathbb{N} with a cardinality of \Omega.

Proof. Assume such a set A exists. Because \Omega > n, for all n \in \mathbb{N}, |A| cannot be finite. Because |A| = \Omega < \aleph_0 = |\mathbb{N}|, A cannot be countably infinite. But every subset of \mathbb{N} is either finite or countably infinite, which completes the proof.

So to summarize, the number of bits in S is sufficient to represent all natural numbers, and the number of states of S is equal to the cardinality of the natural numbers. If we assume as a general matter, that the number of bits in a system is equal to the logarithm of the number of states of the system, then the number of bits in S is \Omega = \log(\aleph_0).

Corollary 9. There is no number of bits X < \Omega, such that X > n, for all n \in \mathbb{N}.

Proof. Assume that such a number exists, and that for some system \bar{S}, the number of states of \bar{S} is 2^X. If 2^X = \aleph_0, then X = \Omega, which leads to a contradiction, so it must be the case that 2^X \neq \aleph_0. Because X < \Omega, the number of states of \bar{S} must be less than the number of states of S, and so 2^X < \aleph_0. This implies that we can assign each state of \bar{S} a unique natural number, which in turn implies the existence of a set of natural numbers A, such that |A| > n, for all n \in \mathbb{N}, and |A| < \aleph_0. Since the existence of \bar{S} implies the existence of A, which leads to a contradiction, \bar{S} does not exist.

Corollary 10. There is no set A such that |A| = \Omega.

Proof. Assume such a set A exists. Since \Omega < \aleph_0, we can assign each element of A a unique natural number, which will in the aggregate generate a set B \subset \mathbb{N}. The existence of B contradicts Corollary 8, which completes the proof.

This work implies that \Omega is the smallest non-finite number, and that it does not correspond to the cardinality of any set.

This in turn implies an elegant theory of number itself:

Begin with a system S_0, that is always in the same state s_0, and assume that the number of states of S_0 is 1, and that the number of bits in S_0 is 0 = \log(1). Now assume the existence of S_1, that is always in the same state s_1, and that this state is distinct from s_0. Further, assume the existence of some system \bar{S} that can be in either s_0 or s_1, and assume that the number of states of \bar{S} is 2, and that the number of bits in \bar{S} is 1 = \log(2). If we allow for unbounded instances of \bar{S} to exist in conjunction with one another, then basic counting principles imply the existence of all natural numbers, since a collection of systems, each equivalent to \bar{S}, is in turn equivalent to some binary string.

Note that in this view, the number 0 always has units of bits, and is never some number of states, just like \Omega must always have units of bits, since it does not correspond to the cardinality of any set, or number of states. This suggests the notion that some numbers must necessarily carry particular units, because they can only be derived from one class of mathematical object, which necessarily implies those units. For example, in this view, 0 must always carry units of bits, and can never carry units of states. In contrast, 1 can be either a number of bits, or a number of states.

Interestingly, Shannon’s equation implies that the logarithm of the reciprocal of a probability, \log(\frac{1}{p}) has units of bits. The work above implies that the logarithm of some number of states, \log(N) also has units of bits. Together, this implies that a probability has units of the reciprocal of some number of states, p = \frac{1}{N}. This is perfectly consistent with a frequentist view of a probability, which would be expressed as some portion of some number of states. Again, all of this work suggests the possibility that numbers have intrinsic units, apart from any physical measurement.

Two Lemmas on Prediction and Complexity

Introduction

In a previous article, I introduced a definition of a random variable that says, in simple terms, that a source is random, and that observing it generates a random variable, if the source generates a sequence of observations that is eventually, approximately, Kolmogorov-random.

Note that my definition of a random variable is close enough to Per Martin-Löf’s that it’s possible he, or someone else, already proved variations of these results –

I noticed these results in the context of my work on A.I. and physics, which are my primary areas of interest.

The Lemmas

For simplicity, let’s assume that observing the source generates truly Kolmogorov-random strings, since there is some nuance to the definition I offered in the previous article that adds only confusion, with no meaningful upside.

Recall that if a string x is Kolmogorov-random, then the Kolmogorov complexity of x, denoted K(x), which is the length of the shortest input y to a UTM that will generate x, is equal to |x| + c, for some constant c that does not depend upon x. For example, if c is the length of the “print” function, then because any string can be generated on a UTM by providing the string in question as an argument to the “print” function, it follows that for all strings x, K(x) \leq |x| + c.
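Kolmogorov complexity itself is uncomputable, but a lossless compressor gives a computable upper bound in the same spirit as the print-function bound. A quick illustration (not a measurement of K(x)):

```python
import random
import zlib

def compressed_length(s: bytes) -> int:
    """Length of a zlib encoding of s: a computable upper bound (up to
    decoder overhead) on the information content of s, analogous to
    feeding a short program, rather than the raw string, to a UTM."""
    return len(zlib.compress(s, 9))

structured = b"01" * 500                          # highly regular, 1000 bytes
rng = random.Random(0)                            # fixed seed, pseudo-random
noisy = bytes(rng.getrandbits(8) for _ in range(1000))
```

The regular string compresses to a small fraction of its length, while the pseudo-random string stays close to 1000 bytes, mirroring the gap between compressible and (approximately) Kolmogorov-random strings.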

Given a string x, let x_n denote the first n entries of x, and let U(x) denote the output of a UTM when given x as input.

Lemma 1. If a string x_n is Kolmogorov-random, for all n \in \mathbb{N}, then there is no computable function f that can generate the next n - k entries of x, given the first k entries of x, for all n,k \in \mathbb{N}.

Proof. Assume such a function f exists. It follows that, given the first k entries of x, namely, x_k =(x_1,x_2, \ldots, x_k), we can generate the next n - k entries (x_{k+1}, \ldots, x_n) on a UTM, using f, specifying n. Because x_n is Kolmogorov-random, and because we can specify n with at most \log(n) bits, it follows that K(x_n) \leq |f| + k + \log(n), which implies that, n - \log(n) \leq |f| - c + k. Note that the difference |f| - c is constant, and because we can, for any fixed value of k, choose an arbitrarily large value of n, there is, therefore, no such function f that exists for all n,k \in \mathbb{N}, which completes the proof by contradiction.

Lemma 2. If a string x_n is Kolmogorov-random, for all n \in \mathbb{N}, and there is some computable function f that can generate an unbounded number of entries of x, then f is subject to the inequality below.

Proof. Assume such a function f exists. It follows that f can generate some unbounded number of entries of x, namely S = \{x_{n_1}, x_{n_2}, \ldots\}, where n_i is the index of some entry within x, each listed in order. Let k \in \mathbb{N}. Because f can generate an unbounded number of entries of x, we can provide an argument z, such that U(f,z) will generate exactly k entries of x, in order. It follows that we can construct a substring of x on a UTM, that contains k entries of x, in order, with any missing entries represented by some special character. Call the resultant string s, and assume that n = |s|. Note that K(z) \leq \log(k). This implies that K(s) \leq |f| + \log(k).

Note that we can generate x_n given s, the list of indexes that correspond to the missing entries within s, and a string generated by concatenating the missing entries of s that are within x_n into a single string, y. Because the maximum index of any entry within y is n, it follows that we can specify any such individual index with at most \log(n) bits. Since there are n - k entries in y, it follows that we can specify all of the indexes within y with at most (n - k) \log(n) bits.

This implies that,

K(x_n) \leq K(s) + K(y) + (n - k) \log(n) \leq [|f| + \log(k)] + (n - k) + (n - k) \log(n).

This in turn implies that,

n + c \leq |f| + \log(k) + (n - k) + (n - k) \log(n),

and so,

c \leq |f| + \log(k) - k + (n - k) \log(n).

If such a function f exists, then this is a surprising result, since it implies that you can predict an unbounded number of non-sequential observations of a random variable, albeit subject to constraints. The question is, is there another reason why such a function cannot exist?

Vectorized Image Partitioning

I’ve implemented the vectorized rendition of my original image partition algorithm, and the results are excellent. I’ve yet to apply it to a formal classification problem, but the partitions seem to be just as good as those of the original algorithm, while the runtimes are orders of magnitude smaller. This implies that it should work just as well as the original as preprocessing for image classification problems, the only difference being that it’s much more efficient.

Here are some examples, which include the original image, the boundaries generated by the partition, and the average color of each partition region. The average color of each partition region would then be fed as a matrix / vector that represents the image (effectively a collection of super pixels) as the input to a classification algorithm.
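The super-pixel idea can be sketched in NumPy with a fixed rectangular grid. Note this is a stand-in under names of my own choosing, not the adaptive partition the algorithm actually computes (the real code is in Octave):

```python
import numpy as np

def block_average_colors(image, rows, cols):
    """Average color of each cell in a rows x cols rectangular grid over an
    H x W x 3 image, yielding a small matrix of super pixels suitable as
    input to a classifier."""
    H, W, C = image.shape
    r_edges = np.linspace(0, H, rows + 1).astype(int)
    c_edges = np.linspace(0, W, cols + 1).astype(int)
    out = np.zeros((rows, cols, C))
    for i in range(rows):
        for j in range(cols):
            cell = image[r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]]
            out[i, j] = cell.mean(axis=(0, 1))   # average color of the region
    return out
```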

Below each panel is the original image size, and the runtime of the partition algorithm, as applied to each image.

image size = 960 x 774 x 3
runtime = 0.61713 seconds

image size = 2048 x 1536 x 3
runtime = 3.1197 seconds

image size = 3722 x 1635 x 3
runtime = 7.2745 seconds

Octave Code:

Some Thoughts on Random Variables

I’ve been working on a new model of probability theory in my free time, and though I’m nowhere near done with it, I’ve come up with what I believe to be a fairly elegant definition of a random variable that I think captures what it is that I’m trying to describe, which is ultimately rooted in computer theory and information theory:

The basic idea is that if a source is truly random, then prior observations should have no impact on future observations. We can express this rigorously by saying that a source S generates signals over the alphabet \Sigma, and that for any N observations, regardless of any prior observations, the set of possible sequences of observations is given by the full set of |\Sigma|^N strings of length N over \Sigma. What this says is, the set of possible outcomes never changes, regardless of how many observations you’ve already made. One consequence of this is that you always have the same uncertainty with respect to your next observation, which could produce any one of the signals in \Sigma.
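The size of that set, |\Sigma|^N (the number of length-N strings over the alphabet), is easy to check by brute force. A small Python check, names mine:

```python
from itertools import product

sigma = ["0", "1", "2"]   # a three-symbol alphabet
N = 4                     # number of observations

# every possible length-N observation sequence over sigma
sequences = ["".join(seq) for seq in product(sigma, repeat=N)]
count = len(sequences)    # |sigma| ** N = 3 ** 4 = 81
```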

However, if the source is truly random, then it should eventually produce a roughly Kolmogorov-random string. If that’s not the case, then there will always be some computable process that can generate the observations in question. For example, the digits of a computable real number like \pi or e might seem superficially random, but they’re not: they’re entirely determined ex ante by a computable process. If a source is truly random, then intuition suggests that eventually it will evade modeling by a UTM, which implies that with a sufficiently large number of observations, the string of observations generated by the source should approach the complexity of a Kolmogorov-random string.

We can express this rigorously by saying that for every real number \delta, and every number of observations n, there exists a number of observations N > n, for which,

1 - \frac{K(x_N)}{|x_N|}  < \delta,

where x_N is the string generated by making N observations of the source. What this says is, we can always make a number of observations that brings the resultant string of observations arbitrarily close to a Kolmogorov-random string, but this does not require actual convergence in the limit. Note that this definition does not require convergence to any particular distribution either, which is certainly possible for some physical systems, which could, for example, simply change distributions as a function of time, or never have a stable distribution of states at all.