# Unsupervised 3D Feature Extraction and Edge Detection Algorithm

In this note, I’ll present an unsupervised algorithm that can extract three-dimensional features from an ordinary two-dimensional image, and detect edges within the image (thereby extracting two-dimensional shape information), in each case in polynomial time.

A research note explaining the algorithm, together with the code, is available on my researchgate homepage here, and I’ve also attached a PDF.

The relevant scripts are also attached as PDFs below. Any missing scripts can be found in previous posts, and in the “Algorithms” tab.

- extract_3D_features
- identify_fine_fast
- maximize_std_dev_fast
- test_measures_approx

# Using Information Theory to Explain Color Bias

In a previous post, I showed how information theory can be used to explain why it is that human beings perceive changes in luminosity logarithmically, and not linearly, as a function of actual luminosity (see, “Using Information Theory to Explain Color Perception“). In short, I showed that if you assume that what human beings perceive as luminosity is actually a measure of information content, then perceived luminosity should be a logarithmic function of actual physical luminosity, which is indeed the case (see, for example, Fechner’s Law). In this post, I’ll show how we can use similar concepts to explain why it is that human beings perceive the colors near the boundary of the green and blue portion of the spectrum as brighter than the colors near the red portion of the spectrum.

## Luminosity, Energy, Information, and Perception

In a very long paper posted below (that you absolutely do not need to read to understand what follows), I showed that assuming that light literally contains information has a lot of interesting consequences, and even implies that time-dilation will occur. In this post, I’ll take a slightly different approach that nonetheless assumes that the environment with which we interact can be thought of as literally containing information. This doesn’t require us to assume that we’re living in The Matrix, but rather, that at a fundamental level, all of our perceptions require representations of external stimuli to be generated, which means that our bodies are quite literally generating information. Because information is measured in bits, this view implies that we can measure the information content of external stimuli in bits.

Specifically, we can measure the sensory information triggered by a single photon landing on a human eye by making use of some basic principles from contemporary physics. For those who are unfamiliar with Planck’s equation, $E = hf$, in short, it gives the energy $(E)$ of a single “parcel” of light, known as a photon, as a function of its frequency $(f)$. The constant $h$ is known as Planck’s constant, given that Max Planck penned the equation, though Einstein also played a role in popularizing the view that light is “quantized” into photons through his work on the photoelectric effect.
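
Since $h$ is now fixed exactly by the SI, Planck’s equation is easy to evaluate numerically. Here is a quick Python illustration (my own, not part of the attached code):

```python
H = 6.62607015e-34  # Planck's constant in joule-seconds (exact since the 2019 SI revision)

def photon_energy(f_hz):
    """E = h * f: the energy of a single photon at frequency f (in hertz)."""
    return H * f_hz

E = photon_energy(540e12)  # a green photon at roughly 540 THz
print(E)  # about 3.58e-19 joules
```

Individual photon energies are minuscule, which is why luminosity, the total energy ejected by a source, is the quantity we actually perceive at ordinary scales.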

Returning to our analysis, as noted above, human beings perceive changes in luminosity logarithmically as a function of actual physical luminosity. So, for example, if a light source has a luminosity of $L$, then the level of luminosity perceived by a human observer of that source is given by $\log(L)$. We showed below that we can explain this phenomenon, as well as other related phenomena, by assuming that what is perceived as luminosity is actually a representation of the amount of information observed. That is, the perceived brightness of a light source is not a direct representation of the luminosity of the light, but is instead a representation of the amount of information required to represent the luminosity of the source. In crude (and imprecise) terms, if the luminosity of a light source is 10, then we need $\log(10)$ bits to represent its luminosity, meaning that what we perceive as brightness should actually be measured in bits.

In more precise terms, luminosity is also a measure of energy, but it is not a measure of the energy of the individual photons ejected by a light source (which is what Planck’s equation gives). Luminosity is instead a measure of the total energy ejected by a light source. As a result, when we take the logarithm of the luminosity of a given light source, we are taking the logarithm of an amount of energy. Generalizing from this, it should also be the case that the energy of an individual photon incident upon a human eye is perceived logarithmically. As it turns out, assuming that this is the case again leads to equations that are consistent with known results as to how human beings perceive colors.

## The Perceptual Information Content of a Photon

Summarizing the discussion above, my hypothesis is that what is perceived as light is actually a visual representation of the number of bits required to encode the amount of energy observed. In the case of measuring the luminosity of a light source, this hypothesis works out perfectly, since it implies that perceived luminosity is given by the logarithm of actual physical luminosity (see below for the relevant analysis). In the case of an individual photon, it implies that the perceived luminosity of the photon should be given by $\log(E) = \log(f) + C$, where the constant $C = \log(h)$ comes from Planck’s equation.

However, we also know that at a certain point on either side of the visible spectrum of light, perception falls off, meaning that as the frequency of a source increases past blue, our response vanishes, making the light invisible. Similarly, as the frequency of a source decreases below red, the light will again eventually become invisible. This implies that we’ll need to adjust our equation to account for the fact that at both ends of the visible spectrum of light, the perceived luminosity should approach zero. We can accomplish this by adding a “weight” given by,

$W = \sin[\pi (f - f_{min}) / (f_{max}- f_{min})]$,

where $f$ is the frequency of the photon in question, $f_{min}$ is the minimum perceptible frequency (i.e., the minimum visible frequency of red) and $f_{max}$ is the maximum perceptible frequency (i.e., the maximum visible frequency of blue/violet). Combining these two equations (but dropping the constant), we have the following equation relating the frequency of a photon to its perceived luminosity:

$L_P = \log(f) \sin[\pi (f - f_{min}) / (f_{max}- f_{min})]$.

Plugging in the reasonable values of $f_{min} = 350$ THz and $f_{max} = 850$ THz, we obtain the attached graph.

Note that this equation implies that perceived luminosity is maximized around a frequency of 607 THz, right at the boundary of the green and blue portions of the visible spectrum, which is consistent with known results as to how human beings perceive differences in luminosities across the color spectrum.
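
Since the graph itself isn’t reproduced here, here is a small pure-Python sketch that evaluates the same formula over the visible range and locates the peak numerically:

```python
import math

F_MIN, F_MAX = 350.0, 850.0  # visible-range bounds in THz, as above

def perceived_luminosity(f):
    """L_P = log(f) * sin(pi * (f - f_min) / (f_max - f_min))."""
    return math.log(f) * math.sin(math.pi * (f - F_MIN) / (F_MAX - F_MIN))

# scan the visible range in 0.1 THz steps and pick the frequency with the largest L_P
freqs = [F_MIN + 0.1 * i for i in range(int((F_MAX - F_MIN) / 0.1) + 1)]
f_peak = max(freqs, key=perceived_luminosity)
print(round(f_peak, 1))  # the peak falls between 606 and 607 THz
```

The argmax is unaffected by the base of the logarithm, since changing the base only rescales $L_P$ by a constant.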

In summation, whether or not we’re living in The Matrix, assuming that human beings perceive the world around us using encoded representations implies equations that are consistent with known results as to how we actually perceive the external world. Specifically, if we assume that what we perceive is information itself, and that the information content of a perceptual signal is given by the logarithm of the physical stimulus energy, then it seems that, as a general matter, we produce equations that are consistent with experimental data as to how human beings react to external stimuli.

# Applying My Model of AI to UCI Datasets

I’ve published a working paper demonstrating the accuracy of my model of AI when applied to four UCI datasets, which you can find on my researchgate homepage.

Over the four classification problems, the categorization algorithm had an average success rate of 92.833%, where success is measured by the percentage of categories that are consistent with the hidden classification data. Over the four classification problems, the prediction algorithm had an average success rate of 93.497%, where success is measured by the percentage of predictions that are consistent with the hidden classification data. All of the code necessary to run these algorithms, and apply them to the training data, is available on my researchgate homepage.

I’ve also attached a PDF of the working paper here.

A New Model of Artificial Intelligence: Application to Data I

# UPDATE: A New Model of Artificial Intelligence

The full working paper for my new model of artificial intelligence is available on researchgate here. I’ve also attached a PDF.

Regards,

Charles

A New Model of Artificial Intelligence

# A New Model of Artificial Intelligence

Below I’ll present a vectorized categorization algorithm, and a vectorized prediction algorithm, each of which have a worst-case, low-degree polynomial run time, even in extremely high dimensions. Together, these two algorithms allow for almost instantaneous inferences to be drawn from an enormous number of observations on an ordinary consumer device. As a result, I believe these algorithms will facilitate the commoditization of artificial intelligence, potentially sparking a revolution in the field, and a sea-change in the types of products available to consumers.

Specifically, I’ll show how these algorithms can be used to predict complex random-walks, predict projectile paths in three-dimensions, and classify three-dimensional objects, in each case making use of inferences drawn from millions of underlying data points, all using low-degree polynomial run time algorithms that can be executed on an ordinary consumer device. All of the code necessary to run these algorithms is attached.

## Vectorized Categorization and Prediction

In the post below entitled, “Fast N-Dimensional Categorization Algorithm,” I presented a categorization algorithm that can categorize a dataset of N-dimensional vectors with a worst-case run time of $O(\log(m) m^{N+1})$, where $m$ is the number of items in the dataset, and $N$ is the dimension of each vector in the dataset. The categorization algorithm I’ll present in this post is identical, except that it skips all of the steps that cannot be vectorized. As a result, the vectorized categorization algorithm has a worst-case run time of $O(m^2)$, where $m$ is the number of items in the dataset. The corresponding prediction algorithm is basically unchanged, and has a worst-case run time of $O(m)$. This means that even if our underlying dataset contains millions of observations, so long as we can structure those observations as high-dimensional vectors, we can use these algorithms to produce nearly instantaneous inferences, despite the volume of the underlying data.
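
To illustrate what vectorization buys here (this is my own sketch, not the attached scripts), the $m^2$ pairwise comparisons at the heart of a categorization pass can collapse into a single broadcasted NumPy operation, with no explicit double loop:

```python
import numpy as np

def pairwise_distances(data):
    """All m x m pairwise Euclidean distances in one broadcasted operation,
    i.e., O(m^2) vectorized work with no Python-level double loop."""
    return np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)

rng = np.random.default_rng(3)
data = rng.random((100, 50))  # m = 100 vectors of dimension N = 50
D = pairwise_distances(data)
print(D.shape)  # → (100, 100)
```

The work is still $O(m^2 N)$ in total, but it runs inside the numerical library rather than the interpreter, which is what makes million-observation datasets tractable on a consumer device.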

As a practical matter, on an ordinary consumer device, the dimension of the dataset will start to impact performance for $N \approx 10000$, and as a result, the run time isn’t truly independent of the dimension of the dataset, but will instead depend upon how the vectorized processes are implemented by the language and machine in question. Nonetheless, the bottom line is that these algorithms allow ordinary, inexpensive consumer devices to almost instantaneously draw complex inferences from millions of observations, giving ordinary consumer devices access to the building blocks of true artificial intelligence.

## Predicting Random Paths

Let’s begin by predicting random-walk style data. This type of data can be used to represent the states of a complex system over time, such as an asset price, a weather system, a sports game, the click-path of a consumer, or the state-space of a complex problem generally. As we work through this example, I’ll provide an overview of how the algorithms work, in addition to discussing their performance and accuracy. Note that if you’d like to test any of the data discussed in this article, you can use the scripts I’ve attached, which contain the same code used to generate the training data presented in the examples.

The underlying dataset will in this case consist of two categories of random paths, though the categorization algorithm and prediction algorithm will both be blind to these categories. Specifically, our training data consists of 1,000 paths, each of which contains 10,000 points, for a total of 10,000,000 observations. Of those 1,000 paths, 500 will trend upward, and 500 will trend downward. The index of the vector represents time, and the actual entry in the vector at a given index represents the $y$ value of the path at that time. We’ll generate the upward trending paths by making the $y$ value slightly more likely to move up than down as a function of time, and we’ll generate the downward trending paths by making the $y$ value slightly more likely to move down than up as a function of time.

For context, Figure 1 shows both a downward trending path and an upward trending path, each taken from the dataset.

Figure 2 shows the full dataset of 1,000 paths, each of which begins at the origin, and then traverses a random path over 10,000 steps, either moving up or down at each step. The upward trending paths move up with a probability of .52, and the downward trending paths move up with a probability of .48.
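
The attached scripts contain the actual generation code; the process described above can be sketched in NumPy as follows (the seed and helper names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)
STEPS, PER_CLASS = 10_000, 500

def make_paths(p_up, n):
    # each path starts at the origin and moves +1 with probability p_up, else -1
    steps = np.where(rng.random((n, STEPS)) < p_up, 1, -1)
    return np.cumsum(steps, axis=1)

up = make_paths(0.52, PER_CLASS)    # upward trending paths
down = make_paths(0.48, PER_CLASS)  # downward trending paths
data = np.vstack([up, down])        # 1,000 paths x 10,000 points each
print(data.shape)  # → (1000, 10000)
```

Note how small the per-step bias is: a 0.52 probability of moving up produces an expected drift of only 0.04 per step, so individual paths can wander far from the trend, which is what makes the classification problem non-trivial.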

After generating the dataset, the next step is to construct two data trees that we’ll use to generate inferences. One data tree consists of elements from the original dataset, called the anchor tree, and the other data tree consists of threshold values, called the delta tree.

The anchor tree is generated by repeatedly applying the categorization algorithm to the dataset. This will produce a series of subcategories at each depth of application. The top of the anchor tree contains a single nominal vector, and is used by the algorithm for housekeeping. The categories generated by the first application of the categorization algorithm have a depth of 2 in the anchor tree; the subcategories generated by two applications of the categorization algorithm have a depth of 3 in the anchor tree, and so on. The algorithm selects an anchor vector as representative of each subcategory generated by this process. As a result, each anchor vector at a depth of 2 in the anchor tree represents a category generated by a single application of the categorization algorithm, and so on.

As a simple example, assume our dataset consists of the integers {1,2,8,10}. In this case, the first application of the categorization algorithm generates the categories {1,2} {8} {10}, with the anchors being 1, 8, and 10. This is the actual output of the categorization algorithm when given this dataset, and it turns out that, in this case, the algorithm does not think that it’s appropriate to generate any deeper subcategories. For a more detailed explanation of how this whole process works, see the post below entitled, “Predicting and Imitating Complex Data Using Information Theory.”

As a result, in this case, the anchor tree will contain the nominal vector at its top, and then at a depth of 2, the anchors 1, 8, and 10 will be positioned in separate entries, indexed by different widths.

Represented visually, the anchor tree is as follows:

(Inf)

(1) (8) (10)

Each application of the categorization algorithm also produces a value called delta, which is the maximum difference tolerated for inclusion in a given category. In the example given above, delta is 1.0538. Specifically, if $x$ is in a category represented by the anchor vector $a$, then it must be the case that $||a - x|| < \delta_a$, where $\delta_a$ is the sufficient difference associated with the category represented by $a$. That is, $\delta_a$ is the difference between two vectors which, when met or exceeded, we treat as sufficient cause to categorize the vectors separately. The value of delta is determined by making use of processes I’ve developed that are rooted in information theory, but in short, delta is the natural, in-context difference between elements of the category in question. The value $\delta_a$ has the same depth and width position in the delta tree that the associated anchor vector $a$ has in the anchor tree. So in our example above, 1 and 2 are in the same category, since $|1 - 2| < 1.0538$, but 8 and 10 are not, since $|8 - 10| > 1.0538$.

Since the categorization algorithm is applied exactly once in this case, only one value of delta is generated, resulting in the following delta tree:

(Inf)

(1.0538) (1.0538) (1.0538)
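
Putting the two toy trees together, the membership test from the example can be written out directly (the delta value is the one produced by the categorization algorithm above):

```python
DELTA = 1.0538        # delta for the dataset {1, 2, 8, 10}, from the example above
ANCHORS = [1, 8, 10]  # depth-2 entries of the anchor tree

def same_category(x, a, delta):
    """Membership test: x joins a's category when their difference is below delta."""
    return abs(x - a) < delta

# assign each item in the dataset to every anchor it falls within delta of
categories = {a: [x for x in [1, 2, 8, 10] if same_category(x, a, DELTA)]
              for a in ANCHORS}
print(categories)  # → {1: [1, 2], 8: [8], 10: [10]}
```

This reproduces the categories {1,2} {8} {10}: 1 and 2 differ by less than delta, while 8 and 10 differ by 2, which exceeds it.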

Together, these two trees operate as a compressed representation of the subcategories generated by repeated application of the categorization algorithm, since we can take a vector from the original dataset and quickly determine which subcategory it could belong to by calculating the difference between that vector and each anchor, and testing whether it’s less than the applicable delta. This allows us to approximate operations on the entire dataset using only a fraction of the elements of the original dataset. Further, for the same reason, we can test a new data item that is not part of the original dataset against each anchor to determine which subcategory the new data item fits best. We can also test whether the new data item belongs in the dataset in the first instance, since if it is not within the applicable delta of any anchor, then the new data item is not a good fit for the dataset, and moreover, could not have been part of the original dataset. Finally, given a vector that contains $M$ out of $N$ values, we can predict the $N - M$ missing values by substituting the corresponding values from the anchors in the anchor tree, and then determining which substitution minimizes the difference between the resulting vector and the anchors in the tree.

For example, if $N = 5$, and $M = 3$, then we would substitute the two missing values in the input vector $x = (x_1,x_2,x_3,\ldots)$, using the last two dimensions of an anchor vector $a = (a_1,a_2,a_3,a_4,a_5)$, producing the prediction vector $z = (x_1,x_2,x_3,a_4,a_5)$. We do this for every anchor in the anchor tree, and test whether $||a - z|| < \delta_a$. If the norm of the difference between the anchor vector and the prediction vector is less than the applicable delta, then the algorithm treats that prediction as a good prediction, since the resulting prediction vector would have qualified for inclusion in the original dataset. As a result, all of the predictions generated by the algorithm will contain all of the information from the input vector, with any missing information that is to be predicted taken from the anchor tree.

This implies that our input vector $x$ could be within the applicable delta of multiple anchor vectors, thereby constituting a match for each related subcategory. As a result, the prediction algorithm returns both a single best-fit prediction, and an array of possible-fit predictions. That is, if our input vector produces prediction vectors that are within the applicable delta of multiple anchors, then each such prediction vector is returned in the array of possible fits. There will be, however, one fit for which the difference $||a - z||$ between the input vector and the applicable prediction vector is minimized, and this is returned by the algorithm as the best-fit prediction vector.
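
A minimal sketch of this prediction loop, using hypothetical anchors and deltas (the actual algorithm, including how the trees are built, is in the attached code):

```python
import numpy as np

def predict(x_partial, M, anchors, deltas):
    """Complete the input's missing N - M entries from each anchor, keep
    completions that fall within that anchor's delta, and return the best
    fit plus all possible fits. Returns (None, []) when no anchor accepts
    the input, i.e., the input is a poor fit for the dataset."""
    best, best_diff, possible = None, np.inf, []
    for a, delta in zip(anchors, deltas):
        z = np.concatenate([x_partial[:M], a[M:]])  # prediction vector
        diff = np.linalg.norm(a - z)                # only the first M entries differ
        if diff < delta:
            possible.append(z)
            if diff < best_diff:
                best, best_diff = z, diff
    return best, possible

# toy demo: two hypothetical anchors, N = 5, M = 3
anchors = [np.array([1.0, 2, 3, 4, 5]), np.array([10.0, 10, 10, 10, 10])]
deltas = [2.0, 2.0]
best, possible = predict(np.array([1.0, 2, 3, 0, 0]), 3, anchors, deltas)
print(best)  # → [1. 2. 3. 4. 5.]
```

A `None` return plays the role of the nominal vector at the top of the anchor tree: rather than forcing the input into a category, the algorithm flags it as a poor fit.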

Returning to our example, each path is treated as a 10,000 dimensional vector, with each point in a path represented by a number in the vector. So in this case, the first step of the process will be to generate the data trees populated by repeated application of the categorization algorithm to the dataset of path vectors. Because the categorization algorithm is completely vectorized, this process can be accomplished on an ordinary consumer device, despite the high dimension of the dataset, and the relatively large number of items in the dataset. Once this initial step is completed, we can begin to make predictions using the data trees.

We’ll begin by generating a new, upward trending path to use as our input vector, again consisting of $N = 10000$ observations. Then, we’ll incrementally increase $M$, the number of points from that path that are given to the prediction algorithm. This will allow us to measure the difference between the actual path and the predicted path as a function of $M$, the number of observations given to the prediction algorithm. If $M = 0$, giving the prediction algorithm a completely empty path vector, then the prediction algorithm will have no information from which it can draw inferences. As a result, every path in the anchor tree will be a possible path. That is, with no information at all, all of the paths in the anchor tree are possible.

In this case, the categorization algorithm compressed the 1,000 paths from the dataset into 282 paths that together comprise the anchor tree. The anchor tree has a depth of 2, meaning that the algorithm did not think that subdividing the top layer categories generated by a single application of the categorization algorithm was appropriate in this case. This implies that the algorithm didn’t find any local clustering beneath the macroscopic clustering identified by the first application of the categorization algorithm. This in turn implies that the data is highly randomized, since each top layer category is roughly uniformly diffuse, without any significant in-category clustering.

This example demonstrates the philosophy underlying these algorithms, which is to first compress the underlying dataset using information theory, and then analyze the compressed dataset efficiently using vectorized operations. In this case, since the trees have a depth of 2 and a width of 282, each prediction requires at most $O(282)$ vectorized operations, despite the fact that the inferences are ultimately derived from the information contained in 10,000,000 observations.

Beginning with $M = 0$ observations, each path in the anchor tree constitutes a trivial match, which again can be seen in Figure 2. That is, since the algorithm has no observations with which it can generate an inference, all of the paths in the anchor tree are possible. Expressed in terms of information, when the input information is minimized, the output uncertainty is maximized. As a result, the prediction algorithm is consistent with common sense, since it gradually eliminates possible outcomes as more information becomes available.

Note that we can view the predictions generated in this case as both raw numerical predictions, and more abstract classification predictions, since each path will be either an upward trending path, or a downward trending path. I’ll focus exclusively on classification predictions in another example below, where we’ll use the same categorization and prediction algorithms to predict which class of three-dimensional shapes a given input shape belongs to.

Returning to the example at hand, I’ve set $M = 50$, and Figure 3 shows the actual path of the new input vector, the average path generated by taking the simple average of all of the predicted paths, and the best-fit path. In this case, the prediction algorithm has only 0.5% of the path information, and not surprisingly, the average of all predicted paths is flat, since the array of predicted paths contains paths that trend in both directions. In fact, in this case, the array of predicted paths contains the full set of 282 possible paths, implying that the first 50 points in the input path did not provide much information from which inferences can be drawn, though the best-fit path in this case does point in the right direction.

For $M = 2000$, the actual path, average path, and best-fit path, all clearly trend in the same, correct direction, as you can see in Figure 4. Figure 5 contains the full array of predicted paths for $M = 2000$, which consists of 173 possible paths. Though the average path, and best-fit path both point in the correct, upward trending direction, the set of possible paths clearly contains a substantial number of downward trending paths.

Figure 6 shows the average error between the best-fit path and the actual path as a percentage of the $y$ value in the actual path, as a function of $M$. In this case, Figure 6 reflects only the portion of the data actually predicted by the algorithm, and ignores the first $M$ points in the predicted path, since the first $M$ points are taken from the input vector itself. That is, Figure 6 reflects only the $N - M$ values taken from the anchor tree, and not the first $M$ values taken from the input vector itself. Also note that the error percentages can and do in this case exceed 100% for low values of $M$.

As expected, the best-fit path converges to the actual path as $M$ increases, implying that the quality of prediction increases as a function of the number of observations. Figure 7 shows the same average error calculation for 25 randomly generated input paths.

Despite the overall convergence of the predictions to the actual path, and the ability of the algorithm to quickly classify an input path as trending upward or downward, this type of data is by its nature incapable of exact prediction. As a result, there is an inherent limit to the precision of any predictions made using this type of data, and the prediction algorithm I’ve presented is no exception: given data this randomized, it will necessarily generate some appreciable amount of error. In contrast, in the next section, we’ll predict three-dimensional projectile paths, which allow for tighter predictions.

As noted above, the prediction algorithm is also capable of determining whether a given input vector is a good fit for the underlying data in the first instance. In this case, this means that the algorithm can determine whether the path represented by an input vector belongs to one of the categories in the underlying dataset, or whether the input vector represents some other type of path that doesn’t belong to the original dataset. For example, if we generate a new input vector produced by a formula that has a .55 probability of decreasing at any given moment in time, then for $M \approx 7000$, the algorithm returns the nominal vector at the top of the anchor tree as its predicted vector, indicating that the algorithm thinks that the input vector does not belong to any of the underlying categories in the anchor tree.

This ability to not only predict outcomes, but also identify inputs as outside of the scope of the underlying dataset helps prevent bad predictions, since if an input vector is outside of the scope of the underlying dataset, then the algorithm won’t make a prediction at all, but will instead flag the input as a poor fit. This also allows us to expand the underlying dataset by providing random inputs to the prediction function, and then take the inputs that survive this process as additional elements of the underlying dataset. For more on this approach, see the post below entitled, “Predicting and Imitating Complex Data Using Information Theory.”

## Categorizations and Predictions in Higher Dimensional Spaces

### Predicting Projectile Paths

The same algorithms that we used in the previous section to predict paths in a two-dimensional space can be applied to an N-dimensional space, allowing us to analyze high-dimensional data in high-dimensional spaces. We’ll begin by predicting somewhat randomized, but nonetheless Newtonian projectile paths in three-dimensional space.

In this section, our training data consists of 750 projectile paths, each containing 3,000 points in Euclidean three-space, resulting in vectors with a dimension of $N = 9000$. The equations that generated the paths in the training data are Newtonian equations of motion, with bounded, but randomly generated, $x$ and $y$ velocities, no $z$ velocity other than downward gravitational acceleration, fixed initial $x$ and $y$ positions, and randomly generated, but tightly bounded, initial $z$ positions. Figure 8 shows a few of the paths from the dataset.
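
The attached code contains the actual generator; here is a NumPy sketch of paths matching this description (the time grid, velocity bounds, and initial-height bounds are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
T = np.linspace(0.0, 10.0, 3000)  # assumed time grid: 3,000 points per path
G = 9.8                           # gravitational acceleration

def make_projectile():
    vx, vy = rng.uniform(-5.0, 5.0, size=2)  # bounded random x and y velocities
    z0 = rng.uniform(95.0, 105.0)            # tightly bounded initial height
    x, y = vx * T, vy * T                    # fixed initial x and y positions (origin)
    z = z0 - 0.5 * G * T**2                  # no z velocity other than gravity
    return np.column_stack([x, y, z]).ravel()  # flatten to N = 3 * 3,000 = 9,000

paths = np.vstack([make_projectile() for _ in range(750)])
print(paths.shape)  # → (750, 9000)
```

Because each path is fully determined by three random initial conditions, two paths with similar initial conditions stay close everywhere, which is why the anchor-tree approach predicts this data so tightly.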

Note that in this case, we’re predicting a path that has a simple closed form formula without using any interpolation. Instead, the algorithm uses the exact same approach outlined above, which compresses the dataset into a series of anchor vectors, which in this case represent Newtonian paths, and then takes a new curve and attempts to place it in the resulting anchor tree of vectors.

Just as we did above, first we’ll generate a new curve, which will be represented as a 9,000 dimensional vector that consists of 3,000 points in Euclidean three-space. Then, we’ll incrementally provide points from the new curve to the prediction algorithm, and measure the accuracy of the resultant predictions. Figure 9 shows the average path, best-fit path, and actual path for $M = 1750$.

In this case, the percentage-wise average error between the predicted path and the actual path, shown in Figure 10, is much smaller than it was for the random path data. This is not surprising, given that each curve has a simple shape that is entirely determined by its initial conditions. As a result, so long as there is another curve in the anchor tree that has reasonably similar initial conditions, we’ll be able to accurately approximate, and therefore, predict the shape of the entire input curve.

### Shape and Object Classification

We can use the same techniques to categorize three-dimensional objects. We can then take the point data for a new object and determine to which category it best belongs by using the prediction algorithm. This allows us to take what is generally considered to be a hard problem, i.e., three-dimensional object classification, and reduce it to a problem that can be solved in a fraction of a second on an ordinary consumer device.

In this case, the training data consists of the point data for 600 objects, each shaped like a vase, with randomly generated widths that come in three classes of sizes: narrow, medium, and wide. That is, each class of objects has a bounded width, within which objects are randomly generated. There are 200 objects in each class, for a total of 600 objects in the dataset. Each object contains 483 points, producing a vector that represents the object with a total dimension of $N = 3 \times 483 = 1449$. I’ve attached a representative image from each class of object as Figures 11, 12, and 13. Note that the categorization and prediction algorithms are both blind to the classification labels in the data, which are hidden in the $N+1$ entry of each shape vector.
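
For concreteness, here is one hypothetical way to generate such a vase-like object as a 1,449-dimensional vector (the radius profile and width bounds are my own assumptions, not the attached code):

```python
import numpy as np

rng = np.random.default_rng(2)
# assumed width bounds for the three classes; the actual bounds are in the attached scripts
WIDTH_BOUNDS = {"narrow": (1.0, 2.0), "medium": (3.0, 4.0), "wide": (5.0, 6.0)}

def make_vase(cls):
    w = rng.uniform(*WIDTH_BOUNDS[cls])      # random width within the class bounds
    z = np.linspace(0.0, 1.0, 21)            # 21 heights x 23 angles = 483 points
    theta = np.linspace(0.0, 2 * np.pi, 23)
    Z, TH = np.meshgrid(z, theta)
    R = w * (1.0 + 0.3 * np.sin(np.pi * Z))  # bulging, vase-like radius profile
    pts = np.column_stack([(R * np.cos(TH)).ravel(),
                           (R * np.sin(TH)).ravel(),
                           Z.ravel()])
    return pts.ravel()                        # N = 3 * 483 = 1,449 entries

v = make_vase("medium")
print(v.shape)  # → (1449,)
```

Because the three classes occupy disjoint width bands, the point clouds for different classes are well separated in Euclidean norm, which is exactly the kind of structure the delta thresholds pick up.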

As we did above, we’ll generate a new object, provide only part of the point data for that object to the prediction algorithm, and then test whether the prediction algorithm assigns the new object to a category with an anchor from the same class as the input object.  That is, if the class of the anchor of the category to which the new input object is assigned matches the class of the input object, then we treat that prediction as a success.

I’ve randomly generated 300 objects from each class as a new input to the prediction algorithm, for a total of 900 input objects. I then provided the first $M = 1200$ values from each input vector to the prediction algorithm, producing 900 predictions. There are three possibilities for each prediction: success (i.e., a correct classification); rejection (i.e., the shape doesn’t fit into the dataset); and failure (i.e., an incorrect classification).

The results for each of the three classes are as follows:

| Class | Success Percentage | Rejected Data Percentage | Fail Percentage |
| --- | --- | --- | --- |
| One | 300 / 300 = 100% | 0 / 300 = 0% | 0 / 300 = 0% |
| Two | 300 / 300 = 100% | 0 / 300 = 0% | 0 / 300 = 0% |
| Three | 300 / 300 = 100% | 0 / 300 = 0% | 0 / 300 = 0% |

These results are of course the product of not only randomly generated training data, but also randomly generated inputs. Different datasets, and different inputs could therefore produce different results. Nonetheless, the obvious takeaway is that the prediction algorithm is extremely accurate.

The Future of Computation

The Church-Turing Thesis asserts that every model of computation can be simulated by a Universal Turing Machine. In effect, it asserts that any proposed model of computation is either inferior to, or equivalent to, a UTM. The thesis is a hypothesis, not a mathematical theorem, but it has turned out to be true for every known model of computation, though recent developments in quantum computing are likely to present a new wave of tests for this celebrated hypothesis.

In a post below entitled, “A Simple Model of Non-Turing Equivalent Computation,” I presented the mathematical outlines of a model of computation that is on its face not equivalent to a UTM. That said, I have not yet built any machines that implement this model of computation, so, not surprisingly, the Church-Turing Thesis still stands.

The simple insight underlying my model is that a UTM cannot generate computational complexity, and this can be proven mathematically (see the link above for a simple proof). This means that the computational complexity (i.e., the Kolmogorov Complexity) of the output of a UTM is always less than or equal to the complexity of the input that generated the output in question. This simple lemma has some alarming consequences, and in particular, it suggests that the behaviors of some human beings might be the product of a non-computable process, perhaps explaining how it is that some people are capable of producing ideas, predictions, and works of art, that appear to come from out of nowhere.

Since we can, as a theoretical matter, map a given input string $x$ to an output string $y$ that has a higher computational complexity than the input string, any device that actually does this as a physical matter over an infinite set of outputs is by definition not equivalent to a UTM. Expressed symbolically, $y = F(x)$, and $K(x) < K(y)$, for some infinite set of outputs $y$, where $K$ is the Kolmogorov complexity. If the set of possible outputs is infinite, then the device can consistently generate complexity, which is not possible using a UTM. If the device generates only a finite amount of complexity, then the device can of course be implemented by a UTM that has some complexity stored in the memory of the device. In fact, this is exactly the model of computation that I’ve outlined above, where a compressed, specialized tree is stored in the memory of a device, and then called upon to supplement the inputs to the device, generating the outputs of the device.
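Kolmogorov complexity is uncomputable, so we can’t check the inequality $K(x) < K(y)$ directly, but compressed length gives a crude upper-bound proxy. Here’s a hedged Python illustration using zlib as a stand-in for $K$; note that this only illustrates the measurement, and is not itself a counterexample to the thesis, since a fixed program adds at most a constant to the complexity of its output:

```python
import zlib

def approx_K(s: bytes) -> int:
    """Crude upper-bound proxy for Kolmogorov complexity:
    the length of the zlib-compressed string (level 9)."""
    return len(zlib.compress(s, 9))

x = b"ab" * 500            # highly regular string: compresses to almost nothing
y = bytes(range(256)) * 4  # far more varied string: compresses much less

# under the proxy, y is "more complex" than x, i.e. K(x) < K(y)
print(approx_K(x) < approx_K(y))
```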

The model of AI that I’ve presented above suggests that we can create simple, presumably cheap devices, that have specialized memories of the type I described above, that can provide fast, intelligent answers to complex questions, but nonetheless also perform basic computations. This would allow for the true commoditization of artificial intelligence, since these devices would presumably be both small and inexpensive to manufacture. This could lead to a new wave of economic and social changes, where cheap devices, possibly as small as transistors, have the power to analyze a large number of complex observations and make inferences from them in real-time.

In the post below entitled, “Predicting and Imitating Complex Data Using Information Theory,” I showed how these same algorithms can be used to approximate both simple, closed-form functions, and complex, discontinuous functions that have a significant amount of statistical randomness. As a result, we could store trees that implement everyday functions found in calculators, such as trigonometric functions, together with trees that implement complex tasks in AI, such as image analysis, all in a single device that makes use of the algorithms that I outlined above. This would create a single device capable of both general purpose computing, and sophisticated tasks in artificial intelligence. We could even imagine these algorithms being hardwired as “smart transistors” that have a cache of stored trees that are then loaded in real-time to perform the particular task at hand.

All of the algorithms I presented above are obviously capable of being simulated by a UTM, but as a general matter, this type of non-symbolic, stimulus and response computation can in theory produce input-output pairs that always generate complexity, and therefore, cannot be reproduced by a UTM. Specifically, if the device in question maps an infinite number of input strings to an infinite number of output strings that each have a higher complexity than the corresponding input string, then the device is simply not equivalent to a UTM, and its behavior cannot be simulated by a UTM with a finite memory, since a UTM cannot, as a general matter, generate complexity.

This might sound like a purely theoretical model, and it might be, but there are physical systems whose states change in complex ways as a result of simple changes to environmental variables. As a result, if we are able to find a physical system whose states are consistently more complex than some simple environmental variable that we can control, then we can use that environmental variable as an input to the system. This in turn implies that we can consistently map a low-complexity input to a high-complexity output. If the set of possible outputs appears to be infinite, then we would have a system that could be used to calculate functions that cannot be calculated by a UTM. That is, any such system would form the physical basis of a model of computation that cannot be simulated by a UTM, and a device capable of calculating non-computable functions.

Octave Scripts:

optimize_categories_N

generate_data_tree_N

predict_best_fit_tree_N

generate_categories_N

vector_entropy

Data Sets:

Random-Walks

3D Projectile

3D Shape Recognition

generate_random_trials

# Fast Unsupervised 3D Feature Extraction

In this post, I’ll present an unsupervised algorithm that can, in polynomial time, extract three-dimensional features from a two-dimensional image. The results are quite good, especially given the run time. This algorithm could have applications in computer vision, in particular medical imaging, and perhaps other fields, but even as a novelty, it’s a fun algorithm that can be run on an ordinary consumer device.

Feature Extraction

Several months ago, I presented an algorithm that can quickly identify features in an image with no prior information, which I explained in some detail in the post below this one entitled, “Fast Unsupervised Image Categorization Algorithm”. In short, the feature identification algorithm works by partitioning an image into maximally distinct regions by making use of methods rooted in information theory.

By changing the equation that governs the stopping condition in the feature identification algorithm, we can allow the algorithm to run longer, and identify fine-grain features in an image, similar to an edge detection algorithm. The fine-grain feature identification algorithm also has a polynomial run-time, but it is much slower, and as a result, I generally do not use it, though it was the original version of the algorithm.

However, I realized that the fine-grained algorithm is useful for making detailed distinctions between foreground and background. This is because the feature identification algorithm assigns a probability to each feature that gives the likelihood that the feature in question is part of the foreground of an image. It turns out that this probability is reliable: the background of an image generally has a probability of approximately 0, and the foreground generally has a probability of approximately 1. By turning this probability into a depth parameter (i.e., depth equals one minus the foreground probability), we can map a two-dimensional image into a three-dimensional space, where a background feature has a depth of 1, and a foreground feature has a depth of 0, in turn producing a three-dimensional rendering of the two-dimensional image, where the background features are recessed at a greater depth than the foreground features. If we use the fine-grain feature identification algorithm, the features will be smaller in scale, allowing for tighter depth information.
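The probability-to-depth mapping itself is a one-liner. A minimal Python sketch (assuming we already have a matrix of per-feature foreground probabilities, which is what the Octave scripts below actually produce):

```python
import numpy as np

def depth_from_foreground_prob(prob_mat):
    """Map each feature's foreground probability to a depth value:
    foreground (prob ~ 1) -> depth ~ 0; background (prob ~ 0) -> depth ~ 1."""
    return 1.0 - np.asarray(prob_mat, dtype=float)

# toy 2x2 grid of per-feature foreground probabilities
probs = np.array([[0.95, 0.05],
                  [0.80, 0.10]])
print(depth_from_foreground_prob(probs))
```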

The examples I’ve posted were generated using very low-resolution photos, of around 250 pixels x 250 pixels. Higher resolution photos produce very fine features, and correspondingly rich depth information, but of course take much longer to render.

This is of course not the algorithm of choice if you’re looking for a mathematically exact rendering. But, it could be useful whenever a fast approximation of depth is required, and the only tool available is information generated by an ordinary two-dimensional camera. In any case, it is certainly a useful heuristic for depth given two-dimensional information, and at a minimum, it is a fun algorithm to experiment with that produces visually meaningful results.

I’ve attached four scripts that you’ll need to run the algorithm, and the rest of the scripts you’ll need are available in the post just below this one.

extract_3D_features

find_bg_color

identify_fine_features

maximize_std_dev

# Fast Unsupervised Image Categorization Algorithm

In this post, I’m going to show how we can combine the image feature recognition algorithm I presented a few months ago, with the categorization and prediction algorithms I presented in the two posts below, to produce an algorithm that can quickly categorize images, without any supervision. The final algorithm I’ll present is O(n), where n is the number of pixels in the image we’re attempting to categorize.

This algorithm works so well that it can distinguish between paintings by different artists, in just a few seconds. It can also distinguish between articles of clothing, and other objects, but the focus of this post will be on its ability to distinguish between artists, because, in my opinion, this shows that the algorithm is capable of unsupervised abstraction, allowing it to say without any guidance, what it is that distinguishes a Van Gogh from a Caravaggio.

These algorithms use the same underlying categorization and prediction algorithms that I used below to imitate complex three-dimensional shapes, predict data, and approximate complex functions. In this case, I’ve added some additional scripts that allow the underlying algorithms to analyze images. As a result, the underlying categorization and prediction algorithms are shaping up to be a set of generalized AI algorithms that appear to only require a “hook” into the subject matter in question.

Finally, note that all of the algorithms I’ll present in this post are worst-case polynomial in time as a function of the number of pixels in the image. That is, even the algorithms we’ll use to construct categories in the first instance are polynomial in time as a function of the number of pixels / images in the database. As a result, these algorithms allow for what I believe to be industrial quality image recognition to be implemented on a consumer device, and can probably allow industrial machines to conduct real-time analysis of high-resolution video.

Summary of the Underlying Image Feature Identification Algorithm

In the post below entitled, “Image Recognition with no Prior Information”, I presented an algorithm that can quickly identify features within an image, without any prior information about the image, or the categories to which the image might belong. At a high level, this algorithm takes an image, and partitions it into objectively distinct features. But because the algorithm operates with no prior information at all, it isn’t looking for a particular object; instead, it attempts to identify every objectively distinct feature within an image. It can’t say what those features are, but it can nonetheless identify them as distinct from one another. As a result, this algorithm is only the first step of the image processing algorithms I’ll present in this post.

Like all of the other work that is part of this project, the feature recognition algorithm is rooted in information theory, and operates based upon the assumption that objectively distinct features within an image will contain different amounts of information. Specifically, the feature identification algorithm subdivides an image in a way that maximizes the difference in the amounts of information contained in each subdivision of the image. I call each subdivision of an image generated by the algorithm a region. Though it might seem pedantic to distinguish between a region and a feature, I’d like to reserve the term feature to denote an actual feature of an image, as identified by a human, whereas a region is a portion of an image identified by the algorithm that should contain a feature.
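As a toy illustration of the underlying idea, here’s a Python sketch that scores a single vertical split of a grayscale image by the difference between the Shannon entropies of the two halves. The real algorithm is recursive and more subtle, so treat this as a cartoon of the principle, not the implementation:

```python
import numpy as np

def shannon_entropy(region):
    """Shannon entropy (in bits) of a region's intensity histogram."""
    counts = np.bincount(np.asarray(region, dtype=np.uint8).ravel(), minlength=256)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_vertical_split(img):
    """Pick the column that maximizes the entropy difference between the
    left and right regions: a one-axis toy version of the partitioning idea."""
    best_col, best_diff = None, -1.0
    for col in range(1, img.shape[1]):
        diff = abs(shannon_entropy(img[:, :col]) - shannon_entropy(img[:, col:]))
        if diff > best_diff:
            best_col, best_diff = col, diff
    return best_col, best_diff

# a flat left half (zero entropy) glued to a noisy right half:
# the best split should fall exactly on the boundary, at column 8
rng = np.random.default_rng(0)
img = np.hstack([np.zeros((16, 8)), rng.integers(0, 256, (16, 8))])
col, diff = best_vertical_split(img)
print(col)  # 8
```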

The first, second, and third attached images are examples of an image that has been processed by the algorithm. The first image is a photo I took in Lake Como, Italy, and the second image shows the original image after being processed, with the brightness of each region adjusted by the algorithm to reflect how likely it thinks the region is to be part of the foreground of the image. The brighter a given region is, the more likely the algorithm thinks that the region is part of the foreground of the image. You can get a sense of how the algorithm partitions an image by looking at the boundaries of the regions in the second image. The third image shows a specific region identified by the algorithm, which is in this case an individual standing on a diving board, which can be found around the center of the original image.

You can do this yourself using the following code:

[region_matrix N S_mat prob_mat] = identify_features(I);

imshow(ent_contrast_darken_file(I,N,region_matrix, prob_mat));

figure, imshow(extract_region(I,N,region_matrix,27));

Constructing a Tractable Color Palette

At a high level, the technique we’ll employ is straightforward: first we’re going to partition an image using the feature identification algorithm described above, and that’s going to produce a series of regions, which should correspond to features within the image. Then, we’re going to calculate the average color of each region, which will create a dataset of average colors. Then, we’re going to categorize those colors using the vector categorization algorithm I presented in the two posts just below this one.
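The average-color step is simple enough to sketch in a few lines. Here’s a minimal Python/NumPy illustration (the actual script is calculate_average_feature_color, which takes additional arguments; the region-label matrix here is an assumed stand-in for the algorithm’s region_matrix):

```python
import numpy as np

def average_region_colors(img, region_labels):
    """img: H x W x 3 color array; region_labels: H x W integer matrix
    assigning each pixel to a region. Returns one average RGB vector
    per region, i.e., the dataset of average colors described above."""
    labels = np.unique(region_labels)
    return np.array([img[region_labels == r].mean(axis=0) for r in labels])

# toy 2x2 image partitioned into a red region and a blue region
img = np.array([[[255, 0, 0], [255, 0, 0]],
                [[0, 0, 255], [0, 0, 255]]], dtype=float)
regions = np.array([[0, 0],
                    [1, 1]])
print(average_region_colors(img, regions))
# [[255.   0.   0.]
#  [  0.   0. 255.]]
```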

In this section, I’ll walk through applying this process to a single image. In the next section, we’ll apply it to a series of images. In the last section, I’ll present a vectorized implementation of this entire process that will allow us to categorize new images using an algorithm that is O(n).

I’ve attached a photograph of one of Van Gogh’s iconic self-portraits that we’ll use as our first subject.

First, we’ll count the number of unique colors in the image using a vectorized script I found on Matlab’s website, written by Steve Eddins, which is available here:

https://blogs.mathworks.com/steve/2008/01/31/counting-occurrences-of-image-colors/

rgb_columns = reshape(I, [], 3);

[unique_colors, m, n] = unique(rgb_columns, 'rows');

color_counts = accumarray(n, 1);

size(unique_colors, 1)

The image contains 184064 unique colors, and since the image contains 768*608 = 466944 pixels, about 40 percent of the pixels contain unique colors. As a result, the image contains a very rich color palette, and therefore, quite a lot of color information for its size. Nonetheless, a human observer probably wouldn’t have much trouble quickly describing the color palette of the painting, which contains a lot of yellow, due to his large hat, with a pinkish and orange background, and some pale blues in his blazer.

This type of description is a form of compression, since we’ve taken a large amount of color information, and compressed that information into a tractable statement that is nonetheless a meaningful description of its subject. Similarly, the goal of this section will be to produce a compressed color palette that straddles the line between a set of colors that is digestible by a human, and a set of colors that is useful to a machine. The color sets we’ll produce will be too large to be described in one sentence, but small enough to be expressed visually in a single color bar that can be quickly gleaned by a human. The intended end result is to have the categorization algorithm operate like an enhanced human, that quickly compresses information, but nonetheless has access to details that a human observer would find overwhelming.

Continuing with the example of the self-portrait, first we’ll run the feature identification algorithm, which will produce a set of matrices we can use to partition the image into regions. Then, we’ll run a script that calculates and stores the average color of each region, which will produce a dataset of color vectors. Finally, we’ll run the categorization algorithm on that dataset of color vectors, which will produce a core set of color vectors that can be thought of as forming a basis for the colors in the image. I call these basis vectors the anchors of their color categories, since the categorization algorithm treats each anchor vector as the geometric center of a category, meaning that all other vectors in the category defined by the anchor are within some distance, delta, that the algorithm calculates based upon a variety of factors that I explain in the two posts below this one.

All of this can be done using the following code:

[region_matrix N S_mat prob_mat] = identify_features(I);

figure, imshow(ent_contrast_darken_file(I,N,region_matrix,prob_mat));

data_array = calculate_average_feature_color(I,N,region_matrix, prob_mat, 0, 0);

[data_categories_array category_vec anchor_array H_final delta] = optimize_categories_3D(data_array,0);

color_bar_compressed = display_color_array(anchor_array, data_categories_array);

figure, imshow(color_bar_compressed)

The feature identification algorithm generates 94 regions, and the color bar shows the compressed set of colors generated by the categorization algorithm, which consists of 44 anchor colors. Visually, this set of colors is clearly representative of the underlying painting, despite containing only about .02% of the unique colors in the image. As a result, even though average color is generally a very lossy metric for an image, because we’ve taken the average color of what are objectively distinct features of the image, we’ve ended up with meaningful color information, while nonetheless achieving an enormous amount of compression. Not to boast, but this is actually quite remarkable, since the algorithm has taken over one hundred thousand colors from an image, and produced, without any supervision, a set of 44 colors that are perfectly representative of the underlying image. Moreover, it would almost certainly be impossible for a human to sift through all of the colors in the original image and select a comparable set of colors. At the same time, these are the colors that a human probably would select if they could. As a result, this process generates the same type of compression a human-generated, verbal summary achieves, but on a mechanized scale, and expressed nonverbally, in the form of an image. Interestingly, it also suggests that nonverbal summaries of this type could allow machines to convey complex information to their human counterparts.

This of course does not preclude more sophisticated measures from being used in lieu of average color. Instead, I’m presenting this as a simple example of a more generalized approach to image recognition that consists of three primary steps:

(1) identify the features of an image using the feature identification algorithm described above;
(2) analyze the properties of each resulting region, producing a metric;
(3) categorize the resulting metrics.

If our metric is visually meaningful, then the categories we’ll construct in step three will cause the metrics associated with similar looking features to be categorized together, thereby creating categories of similar looking features. It turns out that average color works quite well, though there are undoubtedly variations on this theme that might be more useful in other contexts.
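To make the three-step recipe concrete, here’s a toy Python sketch in which step (3) is a simple greedy distance-threshold grouping — a plain stand-in for my categorization algorithm, which, unlike this sketch, calculates delta on its own:

```python
import numpy as np

def greedy_categorize(metrics, delta):
    """Step (3) stand-in: assign each metric vector (e.g., an average
    region color) to the first anchor within delta of it, creating a
    new anchor / category whenever no existing anchor is close enough."""
    anchors, assignments = [], []
    for v in metrics:
        for i, a in enumerate(anchors):
            if np.linalg.norm(v - a) <= delta:
                assignments.append(i)
                break
        else:
            anchors.append(v)
            assignments.append(len(anchors) - 1)
    return np.array(anchors), assignments

# two tight clusters of average-color vectors -> two categories
metrics = np.array([[250., 10., 10.], [245., 12., 8.],
                    [10., 10., 240.], [12., 8., 250.]])
anchors, assignments = greedy_categorize(metrics, delta=30.0)
print(len(anchors), assignments)  # 2 [0, 0, 1, 1]
```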

Categorizing Images

By applying the process above to a category of similar images, we can construct a palette that is representative of an entire category of images. In this case, we’re going to construct two palettes: one generated by analyzing a series of paintings by Van Gogh, and another generated by analyzing a series of paintings by John Singer Sargent. These two painters worked with somewhat similar color schemes, and, to a non-expert like myself, a superficially similar impressionistic style. Nonetheless, the paintings I’ve selected are distinguishable, and as a result, the two categories of paintings will produce different color palettes. We’ll analyze a set of 8 paintings by each artist, for a total of 16 images, all of which are attached as the last 16 images.

We can produce the palettes using the following code:

directory_path_A = "[Van Gogh]";

directory_path_D = "[Sargent]";

num_images_A = 8;

num_images_D = 8;

[category_tree1 delta_tree1 anchor_tree1 data_array1] = generate_single_path_img_data(directory_path_A, num_images_A);

[category_tree4 delta_tree4 anchor_tree4 data_array4] = generate_single_path_img_data(directory_path_D, num_images_D);

[data_categories_array1 category_vec1 anchor_array1 H_final1 delta1] = optimize_categories_3D(data_array1,0);

[data_categories_array4 category_vec4 anchor_array4 H_final4 delta4] = optimize_categories_3D(data_array4,0);

color_bar_compressed1 = display_color_array(anchor_array1, data_categories_array1);

color_bar_compressed4 = display_color_array(anchor_array4, data_categories_array4);

figure, imshow(color_bar_compressed1)

figure, imshow(color_bar_compressed4)

The Van Gogh set of images generates 711 features, ultimately compressed into a set of 268 anchor colors, while the Sargent set of images generates 600 features, which were compressed into a set of 243 anchor colors. Yet again, these algorithms achieve an incredible amount of compression, taking millions of colors in the underlying images, and generating a comparatively minuscule set of colors that a human being can easily observe and intuitively understand, albeit in a nonverbal manner.

The Van Gogh palette consists of an array of rich, and bright pastels, whereas the Sargent palette consists of what I thought to be a surprisingly dark set of colors, mixed in with brighter colors that are nonetheless closer to the grey portion of the spectrum when compared to the vibrant colors that make up the Van Gogh palette.

If we were to take yet another Van Gogh painting, and compare the color palette of that new painting to the Van Gogh palette we just produced, it would probably be a decent match. Moreover, the color palette of the new Van Gogh would almost certainly be a better match for the existing Van Gogh palette than the Sargent palette. As a result, if we’re given a new image, and we assume that the new image is either a Van Gogh or a Sargent, then by comparing the color palette of the new image to the two color palettes we just produced, we should be able to categorize that new image as either a Van Gogh or a Sargent.

The image categorization algorithm I’ll present in this section is O(n*log(n)), since we’ll need to run the feature identification algorithm on the new input image in order to construct its color palette. In the next section, I’ll present a vectorized version of this algorithm that is O(n), and that takes into account the color of every pixel in the image. As a result, the vectorized algorithm is not only faster, but also more powerful, since it takes into account all of the color information within an image. Nonetheless, the intuition and inspiration for the vectorized algorithm comes from this longer form process.

Let’s begin by generating a color palette for the beardless Van Gogh self-portrait, which I’ve attached, using the same code above from the previous section. Though this image was not part of the database that generated the Van Gogh palette, just looking at it, we can already see that it’s probably going to fit well. Not surprisingly, the resultant color palette looks a lot like a subset of the color palette for the entire Van Gogh series. As a result, it makes sense to assume that this set of colors will be a better fit for the Van Gogh palette than for the Sargent palette.

We can test this hypothesis using a prediction function I presented in the previous post, that takes a new data item, and attempts to categorize it into an existing dataset that’s already been analyzed using the categorization algorithm. At a high level, what the prediction algorithm does is attempt to place the new data item into an expanded set of anchor vectors that are structured as a tree. If it can place the new data item somewhere in the tree, then the new data item is a good match for the dataset. If it fails to place the new data item in the tree, then the new data item is not a good match for the dataset. For a full explanation as to how and why the prediction algorithm works, see the two posts just below this one.
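As a rough illustration of the placement test, here’s a flat (non-tree) Python stand-in: a new item places successfully if it falls within delta of its nearest anchor, and is rejected otherwise. The actual algorithm searches an expanded tree of anchors, so this is only a sketch of the idea:

```python
import numpy as np

def try_place(item, anchors, delta):
    """Return the index of the nearest anchor if the item lies within
    delta of it, or None if the item fails to place (a rejection).
    A flat approximation of the tree-structured search in the post."""
    anchors = np.asarray(anchors, dtype=float)
    dists = np.linalg.norm(anchors - np.asarray(item, dtype=float), axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= delta else None

anchors = [[200., 180., 40.], [30., 30., 120.]]  # toy "palette" of anchor colors
print(try_place([195., 185., 45.], anchors, delta=25.0))  # 0 (a match)
print(try_place([120., 120., 120.], anchors, delta=25.0))  # None (rejected)
```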

In this particular case, we can think of the prediction algorithm as attempting to fit each new feature color into an existing dataset of anchor colors. The greater the number of feature colors that match with a given dataset of anchor colors, the more confident we can be that the new image is, as a whole, a match for that dataset in terms of its color palette. Of course, knowing that two images have similar color palettes is not, as a general matter, enough information to conclude that the two images belong to the same category of images or objects. But in this case, we’re tacitly assuming that the painting is either a Van Gogh or a Sargent, and then determining to which category the painting belongs by comparing a compressed representation of its color palette to the Van Gogh and Sargent palettes we just compiled.

This can be done using the following code:

[region_matrix N S_mat prob_mat img_data_array score_vector1 delta_vector1] = AB_test_image(I, category_tree1, anchor_tree1, delta_tree1);

[region_matrix4 N4 S_mat4 prob_mat4 img_data_array4 score_vector4 delta_vector4] = AB_test_image(I, category_tree4, anchor_tree4, delta_tree4);

sum(score_vector1 != Inf)

sum(score_vector4 != Inf)

In short, the underlying prediction algorithm will attempt to place the average color of each region from the new image into each dataset. If it succeeds, it will return a score of 0, and if it fails, it will return a score of Inf, which indicates that the algorithm thinks that the new color doesn’t belong to the dataset. The last two lines of code count how many regions matched with the two datasets. In this case, the new Van Gogh matched with the Van Gogh dataset 50 times, and matched with the Sargent dataset 40 times. Each test involved attempting to place 80 regions, meaning that in both cases, a significant number of regions failed to categorize.

Though this approach works reasonably well, note that it tests whether the average color of an entire region is a good fit for the color palette in question. As a result, the test does not take into account the vast majority of color information in an image, since each region is going to contain a large number of pixels. In the next section, we’ll vectorize this algorithm, which will allow us to quickly test every pixel in an image, thereby creating a simple filter that can test whether the colors in an image fit well within the complex, and discontiguous sets of colors that comprise the palettes we’ve constructed. Ultimately, this will allow us to quickly categorize images using a simple filtering process.

Vectorized Image Categorization

The first three steps in the process above cannot, to my knowledge, be vectorized. That is, if we want to construct a color palette for a database of images, we need to run the feature identification algorithm, calculate the average color of each resulting feature, and then categorize the resulting dataset of colors. However, if we’d like to compare the colors within an image to that color palette, after it’s already been produced, then we can take advantage of the ready-baked vectorized procedures of Matlab and Octave, and possibly other languages as well.

In short, what we’ll do is simply test whether each pixel in an image is roughly within delta of at least one anchor color. If the pixel is within delta of some anchor, then it will pass the filter. If it is not within delta of some anchor, then it will not pass the filter. By calculating the percentage of pixels that survive filtration using this method, we can measure what percentage of pixels in the image are a match for the color palette in question. This in turn allows us to quickly determine to which category of images a given input image belongs, by choosing the category associated with the filter that had the highest pixel survival rate. That is, the input image is best suited in terms of color with the underlying category of images that generated the filter that does the least damage.
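A minimal NumPy sketch of this filter (Python for illustration; the Octave scripts below are the actual implementation, and I’m assuming here that “roughly within delta” means Euclidean distance in RGB space):

```python
import numpy as np

def palette_survival_rate(img, anchors, delta):
    """Fraction of pixels within delta (Euclidean RGB distance) of at
    least one anchor color. img: H x W x 3 array; anchors: K x 3 array."""
    pixels = img.reshape(-1, 1, 3).astype(float)      # N x 1 x 3
    anchors = np.asarray(anchors, dtype=float)[None]  # 1 x K x 3
    dists = np.linalg.norm(pixels - anchors, axis=2)  # N x K distances
    return float((dists.min(axis=1) <= delta).mean())

# toy image: the top half matches a one-anchor "palette", the bottom doesn't
img = np.zeros((2, 2, 3))
img[0] = [250, 10, 10]  # near the red anchor: survives the filter
img[1] = [10, 250, 10]  # far from it: filtered out
print(palette_survival_rate(img, [[255, 0, 0]], delta=20.0))  # 0.5
```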

For example, if we’d like to determine whether a given image is a Van Gogh or a Sargent, we can use the color palettes we generated above, and test that new image using a Van Gogh filter and a Sargent filter. The filter with the higher pixel survival rate is the better fit, and therefore tells us which of the two painters is most likely to have held the brush.

Returning to the beardless Van Gogh portrait we tested above, we can apply a Van Gogh filter and a Sargent filter to that image using the following code:

%the Van Gogh filter (this assumes the filter has already been applied to I, producing the filtered image I_Post and the surviving pixel count num_on_pixels)

figure, imshow(I_Post)

[row col x] = size(I);

num_pixels = row*col;

survival_rate1 = num_on_pixels/num_pixels

%the Sargent filter (again assuming I_Post and num_on_pixels have already been produced by the filter)

figure, imshow(I_Post)

[row col x] = size(I);

num_pixels = row*col;

survival_rate4 = num_on_pixels/num_pixels

The survival rate for the Van Gogh filter is 0.29383, or roughly 30% of the pixels, whereas the survival rate for the Sargent filter is 0.13676, or roughly 14% of the pixels.

As a general matter, this approach works astonishingly well, and gives wide margins of survival rates even in cases like this, where we’re distinguishing between similar images that are, frankly, difficult for even some humans to categorize. For example, if we’d like to A/B test for whether an image contains one of two dresses, then this approach will work, assuming that the two dresses have objectively distinct color palettes, even if they’re superficially quite similar to a human observer. Since this approach is not necessarily restricted to color, I believe this algorithm could mark the beginning of an entirely new approach to image recognition.

Octave Scripts:

display_color_array

ab_test_image

ret_avg_color_vec

calculate_average_feature_color

generate_single_path_img_data

find_bg_color

find_max

gather_regions_delta

gen_inf_prob_matrix

generate_s_mat_distribution

identify_features

inf_weighted_prob

left_right_delta

local_test

mw_sq_std_dev

score_image

spec_log

test_measures

find_most_distant_pair

generate_categories_3d

generate_data_space

left_right_delta_3d

optimize_categories_3d

partition_array_3d

partition_array_3d_bigly

spec_log

test_entropy_array_3d

vector_entropy

predict_best_fit_tree

generate_data_tree

extract_region

ent_contrast_darken_file

# Fast Numerical Function Approximation

In a previous post, I introduced an extremely fast prediction algorithm that appears to have an average run time complexity of significantly less than O(log(n)), where n is the number of data points in the underlying data set. In this post, I’m going to show how this algorithm can also be used as a fast and generalized method for numerical function approximation, regardless of the function’s type and complexity.

Unlike interpolation methods, this algorithm does not attempt to approximate the underlying function using continuous functions. Instead, the algorithm is rooted in assumptions derived from information theory that allow it to work with the natural domain and range of the underlying function, facilitating extremely precise approximations of a wide variety of smooth and discontinuous functions, all under one umbrella algorithm.

In this post, I’m going to explain how you can use the algorithm to approximate any function you’d like, and I’ll also present some data that demonstrates its performance and precision. For a full explanation as to how and why this algorithm works, see the post below entitled, “Predicting and Imitating Complex Data Using Information Theory”. In a follow up post, I’ll show how we can combine this algorithm with my image partition algorithm to construct a fast, unsupervised image recognition algorithm.

How to Use the Algorithm

The only thing you need in order to use the algorithm is a dataset of Matlab / Octave vectors containing the X, Y, and Z values of your function in three separate row vectors.
I’ll give some more examples below that make use of noisy data to demonstrate the versatility of the algorithm, but let’s begin with the simple linear function z = x + y + 1, evaluated along the line y = x, over the interval [-3,3].

First, we generate the x,y, and z vectors using the following code:

clear x y z
x = -3 : .012 : 3;
y = -3 : .012 : 3;
z = x + y + 1;
figure, scatter3(x,y,z)
view(70,15)

The x and y vectors contain the domain values of the function we’d like to approximate, and the z vector contains the corresponding range values of the function. In this case, we’re approximating the function z = x + y + 1, but you could of course choose any function you’d like, whether or not it has a closed form expression, so long as the data is represented as three separate row vectors.

Once we’ve generated the data for the function we’d like to approximate, we can instruct the algorithm to process the data, generating a set of data trees that it will ultimately use to approximate the function.

This step is done using the following code:

data_array = shape_to_data(x,y,z);
[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);

This is by far the most time-consuming part of the algorithm, but it is nonetheless polynomial in time as a function of the number of data points. Moreover, it’s something we’ll only need to do once, since this is the part of the algorithm that actually constructs our approximating function.

At a very high level, what the algorithm does is construct a data tree that can be used to approximate the underlying function at certain key locations within the domain of the function that are selected by the algorithm using assumptions rooted in information theory. New points in the domain that are subsequently fed to the algorithm are then quickly mapped into the data tree, resulting in very fast and very accurate function approximations.

In the previous post, I showed that this method is so accurate that it can reproduce complex three-dimensional objects it has analyzed when subsequently given random inputs, since the algorithm can quickly discern between new data points that belong to the data set / object, and data points that do not.

Continuing with our example above, let’s begin by approximating the underlying function at the point (x,y) = [2.5 2.5], and compare it to the result given by the original function.

This is done using the following code:

new_data_item{1} = [2.5 2.5 0];
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, 1);

In short, the algorithm takes the new data point (x,y) = [2.5 2.5], and attempts to place it into the data tree constructed from the domain and range of the underlying function.

In this case, we have,

predicted_z = 5.9920;

The original function gives a value of z = 2.5 + 2.5 + 1 = 6, for an error of approximately 0.008.

The last two arguments to the function are toggles that determine how the prediction algorithm behaves. The first toggle tells the algorithm to ignore the input z-value, but I nonetheless always use a z-value of 0 for consistency. The second toggle must be -1, 0, or 1, and determines how much time is spent searching the data tree for a good fit for the input. In this case, I’ve set the toggle to 1, which instructs the algorithm to search the entire data tree for the absolute best fit possible, in order to maximize the precision of our approximation. This of course increases the run time of the algorithm, but since we’re trying to approximate a function, precision takes precedence, especially given that the algorithm is still very fast.

If instead we set the toggle to -1, then we have,

predicted_z = 6.0400.

If we set the toggle to 0, then we have,

predicted_z = 5.9920.

As I mentioned above, this algorithm can also be used to imitate complex three-dimensional shapes, which requires a much higher volume of random inputs, in which case it might be advantageous to quickly process a large number of new data points approximately, rather than spend the incremental time required for greater precision. If you’d like to do that, simply set the toggle to -1 for the fastest placement and lowest precision, or to 0, for intermediate speed and intermediate precision.

Now let’s test the accuracy of the prediction over a large number of points that were not part of the original domain. The original domain had an increment of .012, so we’ll double the number of points by using an increment of .006, to ensure that we’re testing the algorithm at points in the domain that it didn’t evaluate previously, using the following code:

clear prediction_vector error_vector x y z
search_toggle = 1;
cnt = 1;
x = -3 : .006 : 3;
y = -3 : .006 : 3;
[temp num_data_points] = size(x);
tic;
for i = 1 : num_data_points
new_data_item{1} = [x(i) y(i) 0];
z(i) = x(i) + y(i) + 1;
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, search_toggle);
prediction_vector(i) = predicted_z;
error_vector(i) = z(i) - predicted_z;
endfor
toc
figure, scatter3(x,y,z)
view(40,15)
figure, scatter3(x,y,prediction_vector)
view(40,15)

Superficially, the two plots look very similar, so let’s plot the errors and have a look at some related statistics:

figure, plot(1:num_data_points, error_vector)
std(abs(error_vector(:)))
mean(abs(error_vector(:)))
mean(abs(error_vector(:)))/mean(abs(z(:)))

In this case, the errors have an average absolute value of 0.012659, and are generally tightly bounded, though they are clearly not continuously distributed, since again, the algorithm does not use any type of smooth approximation, but instead makes use of a discrete data tree.

Now let’s apply the algorithm to a discontinuous, noisy sine wave function generated by the following code:

clear x y z
cnt = 1;
for i = 0 : .0125 : 3*pi
x(cnt) = i + .25*rand();
y(cnt) = i + .25*rand();
z(cnt) = sin(x(cnt));
cnt = cnt + 1;
endfor
figure, scatter3(x,y,z)
view(70,15)
data_array = shape_to_data(x,y,z);
[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);

This is a sine wave with a noisy domain around the line y = x, over the interval [0,3*pi]. Since this will create a patchy domain and range, we’ll need to actually monitor the instances where the algorithm fails to map an input to the data tree. We do this by including a conditional that tests for whether the algorithm returns a predicted_z of infinity, which is how the algorithm signals that it thinks that an input is outside of the original domain of the function.

This can be done using the following code:

clear prediction_vector error_vector x y z
search_toggle = 1;
cnt = 1;
fail_count = 0;
x = 0 : .00625 : 3*pi;
y = 0 : .00625 : 3*pi;
[temp num_data_points] = size(x);
tic;
for i = 1 : num_data_points
x(cnt) = x(cnt) + .25*rand();
y(cnt) = y(cnt) + .25*rand();
z(cnt) = sin(x(cnt));
new_data_item{1} = [x(cnt) y(cnt) 0];
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, search_toggle);
if(predicted_z != Inf)
prediction_vector(cnt) = predicted_z;
error_vector(cnt) = z(cnt) - predicted_z;
cnt = cnt + 1;
else
fail_count = fail_count + 1;
endif
endfor
toc

We can generate the output and related statistics using the following code:

fail_count
x = x(1 : cnt - 1);
y = y(1 : cnt - 1);
z = z(1 : cnt - 1);
figure, scatter3(x,y,z)
view(40,15)
figure, scatter3(x,y,prediction_vector)
view(40,15)
figure, plot(1 : cnt - 1, error_vector)
std(abs(error_vector(:)))
mean(abs(error_vector(:)))
mean(abs(error_vector(:)))/mean(abs(z(:)))

In this case, it turns out the algorithm failed to place the new data point 24 times, which is about 1.6% of the total number of data points, though this can vary from run to run, since the domain and range include some random noise. The average absolute error is about 0.027, with oscillating peaks that appear to follow the underlying function.

Finally, let’s apply the same technique to a polynomial with a noisy domain, also around the line y = x, using the following code:

clear x y z
cnt = 1;
for i = 1 : .009 : 8
x(cnt) = i + .25*rand();
y(cnt) = i + .25*rand();
z(cnt) = x(cnt)^7 + y(cnt)^7 + 10;
cnt = cnt + 1;
endfor
figure, scatter3(x,y,z)
view(40,15)
data_array = shape_to_data(x,y,z);
[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);

We can generate the approximation using the following code:

search_toggle = 1;
clear prediction_vector error_vector x y z
cnt = 1;
fail_count = 0;
x = 1 : .007 : 8;
y = 1 : .007 : 8;
[temp num_data_points] = size(x);
tic;
for i = 1 : num_data_points
x(cnt) = x(cnt) + .25*rand();
y(cnt) = y(cnt) + .25*rand();
z(cnt) = x(cnt)^7 + y(cnt)^7 + 10;
new_data_item{1} = [x(cnt) y(cnt) 0];
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, search_toggle);
if(predicted_z != Inf)
prediction_vector(cnt) = predicted_z;
error_vector(cnt) = z(cnt) - predicted_z;
cnt = cnt + 1;
else
fail_count = fail_count + 1;
endif
endfor
toc

And, finally, we can generate the graphs and error statistics using the following code:

fail_count
x = x(1 : cnt - 1);
y = y(1 : cnt - 1);
z = z(1 : cnt - 1);
figure, scatter3(x,y,z)
view(70,20)
figure, scatter3(x,y,prediction_vector)
view(70,20)
figure, plot(1 : cnt - 1, error_vector)
std(abs(error_vector(:)))
mean(abs(error_vector(:)))
mean(abs(error_vector(:)))/mean(abs(z(:)))

As you can see, in all three cases, the approximations are true to the original functions, both in terms of the domain and range values, with very small errors compared to the range values, despite the fact that the functions are clearly very different in nature.

Octave Scripts:

find_most_distant_pair

generate_categories_3d

generate_data_tree

left_right_delta_3d

optimize_categories_3d

partition_array_3d

predict_best_fit_tree

shape_to_data

test_entropy_array_3d

generate_data_space

vector_entropy

# Predicting and Imitating Complex Data Using Information Theory

I’m going to present two prediction algorithms, one with a run time that is polynomial in the number of items in the underlying dataset, and another that is extremely fast, with an average case run time that appears to be significantly less than O(log(n)), where n is the number of items in the dataset.

I believe this second algorithm could turn out to be a powerful and novel tool in AI, since, in addition to accurately predicting data, it is so fast that it can quickly take an entire dataset, and reproduce it, regardless of the complexity of the original dataset. Another way to think about this is that this prediction algorithm is so good and so fast, that random inputs to the algorithm end up generating a new version of the original dataset, allowing it to expand the domain of a function, and in the extreme case, quickly imitate the structure of a three-dimensional object.

This is possible because these algorithms don’t attempt to interpolate data, and as a result, any inputs that don’t map well to the existing dataset are rejected. This allows us to provide the algorithm with random inputs, some of which will successfully map to the original dataset, which will accumulate over time, and eventually generate a new dataset that is similar to the original. Put in more formal terms, not only can this algorithm predict values in the range of a function, it can also reproduce and expand the domain of a function that it has analyzed.

I believe generalizations of this approach could yield even more impressive results in other areas of AI, beyond spatial pattern recognition. As a result, this is a long post that goes through all of the theory underlying the prediction techniques that allow these algorithms to work.

Categorizing Prior Data

The first step in both prediction algorithms is to use the categorization algorithm I presented below in the previous post entitled, “Fast N-Dimensional Categorization Algorithm”. As its name suggests, the categorization algorithm takes a dataset and partitions it into a series of categories. In order to get a better sense of how the prediction algorithms work, I’ll briefly review how the categorization algorithm works.

In addition to categorizing data, the categorization algorithm also produces two further outputs: a single value called “delta”, and a series of values called “anchors”. Each category generated by the algorithm has an anchor, which is the data point about which the category is centered. That is, the algorithm searches through the dataset and finds a set of data points about which the dataset is clustered. Each anchor defines a category, since the data within each category is clustered around its anchor. We don’t have to specify ex ante how many categories / clusters the algorithm should produce. Instead, the algorithm decides on its own how many categories are appropriate in the context of the dataset.

As noted, each category is centered around an anchor value. Specifically, all of the vectors within a given category are within delta of the anchor for that category. That is, the norm of the difference between each vector within a given category and the anchor for that category is always less than delta. In short, delta is the natural, in-context distance between elements of the same category, based upon some assumptions rooted in information theory that I explain in the previous post below.

For example, assume we have the following dataset of 3-space vectors:

data_array{1} = [5.19858 0.61943 0.00000];
data_array{2} = [9.82500 3.40620 10.0000];
data_array{3} = [4.76372 0.20080 0.00000];
data_array{4} = [5.54710 9.43410 10.0000];
data_array{5} = [6.91330 2.81581 0.00000];
data_array{6} = [9.05740 9.94230 20.0000];
data_array{7} = [2.86459 5.11068 0.00000];
data_array{8} = [5.99046 0.26573 0.00000];

The categorization algorithm can be applied to this dataset using the following code:

[data_categories_array category_vec anchor_array H_final delta] = optimize_categories_3d(data_array,0);

Application of the categorization algorithm to this dataset yields the following categorization:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

That is, the algorithm divides the dataset into four categories, which appear to be most sensitive to the z-value of the data in this case. The first category contains 5 elements, and the remaining three categories each contain a single element. Note that the anchors generated by the algorithm are returned in the anchor_array, and in this case, the anchor for the first category is [5.19858 0.61943 0.00000]. The value of delta produced by the algorithm is 5.0711. As a result, all of the vectors in the first category are within delta of the first anchor vector, which you can check by taking the norm of the difference between the anchor and any vector in the first category. Note that applying the algorithm generates a single value of delta for all four categories.
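That check is easy to run outside of Octave as well. Here is a minimal Python sketch, using the anchor, delta, and first-category values reported above, that verifies every member of the first category lies within delta of its anchor:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

anchor = [5.19858, 0.61943, 0.00000]  # anchor of the first category
delta = 5.0711                        # delta reported by the algorithm

category_one = [
    [5.19858, 0.61943, 0.00000],
    [4.76372, 0.20080, 0.00000],
    [6.91330, 2.81581, 0.00000],
    [2.86459, 5.11068, 0.00000],
    [5.99046, 0.26573, 0.00000],
]

# Every member of the category lies within delta of the category's anchor.
distances = [dist(v, anchor) for v in category_one]
print(all(d < delta for d in distances))  # True
print(round(max(distances), 4))           # the farthest member is just inside delta
```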

In short, all vectors in the same category are within delta of the category’s anchor. As a result, though the goal of the algorithm is to generate categories, we can recharacterize the algorithm as generating a set of anchors, and a single value of delta, which will together define the categories it ultimately produces. That is, if I identify certain vectors within a dataset as anchors, and specify a value of delta, then I’ve defined a categorization of the dataset. We will use this fact later on to construct the “fast” prediction algorithm that looks only to the anchors and the value of delta. But first, we’ll begin with some assumptions and observations rooted in information theory that will inform our “full” prediction algorithm, which we’ll then approximate with the fast algorithm.
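To make that recharacterization concrete, here is a hedged Python sketch, not the author’s Octave code: given only a set of anchors and a value of delta, nearest-anchor assignment reconstructs the categorization. Treating each singleton category’s sole member as its own anchor (an assumption on my part), it reproduces the 5-1-1-1 split from the example above:

```python
from math import dist

def categorize(points, anchors, delta):
    # Assign each point to its nearest anchor, provided it lies within delta.
    # A sketch of the idea only -- the real algorithm also chooses the anchors
    # and delta itself.
    categories = [[] for _ in anchors]
    outliers = []
    for p in points:
        d = [dist(p, a) for a in anchors]
        i = min(range(len(anchors)), key=lambda k: d[k])
        (categories[i] if d[i] < delta else outliers).append(p)
    return categories, outliers

data = [
    [5.19858, 0.61943, 0.0], [9.82500, 3.40620, 10.0],
    [4.76372, 0.20080, 0.0], [5.54710, 9.43410, 10.0],
    [6.91330, 2.81581, 0.0], [9.05740, 9.94230, 20.0],
    [2.86459, 5.11068, 0.0], [5.99046, 0.26573, 0.0],
]
anchors = [data[0], data[1], data[3], data[5]]  # assumed: singletons anchor themselves
cats, outliers = categorize(data, anchors, 5.0711)
print([len(c) for c in cats])  # [5, 1, 1, 1], matching the categorization above
```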

Categorization, Structure, and Entropy

The fundamental observation underlying nearly all of the work in AI that I’ve presented in this project is that if we want to find the “natural” structure of some artifact without any prior information, then we first represent the artifact as a mathematical object capable of producing a measure of entropy, such as a vector, matrix, graph, or set of probabilities, and then repeatedly partition that representation until we find the point where the entropy of the partition changes the most. That is, we take the object, and iteratively partition it into smaller and smaller objects, which will affect our measurement of the object’s entropy. During this process, there’s going to be a level of subdivision where the structure of the partitioned object will “crack”, producing the maximum rate of change to the entropy of the partition, thereby uncovering the actual structure of the artifact which the object represents.

For an intuitive, visual explanation of how and why this process works, see the post below entitled, “Image Recognition with no Prior Information”.

Turning this view on its head, we can measure how well a new data point fits into an existing dataset by measuring how much the entropy of the categorization changes after inclusion of the new data point. That is, if a new data point is a good fit for a dataset, then it shouldn’t disturb the existing category structure much, resulting in a low change to the entropy of the category structure. In contrast, if a new data point is a bad fit for a dataset, then it will significantly disturb the structure of the categories, resulting in a large change to the entropy of the category structure.

Continuing with the example above, assume we’re given the following new data point:

data_array{9} = [5.19858 0.62000 0.00000];

This is obviously very close to the anchor of the first category, and as a result, it’s reasonable to view this data point as a good fit for the existing dataset. To test whether this is actually the case, we insert this new data point into the dataset, and then rerun the categorization algorithm on the resulting dataset. Because we’re rerunning the categorization algorithm, it is possible for the anchors to change. As a result, we can’t predict to which category the new data point will belong, since those categories could change as a result of the inclusion of the new data point. Instead, we measure how well the new data point fits by measuring how much the entropy of the categorization changes as a result of its inclusion.

The entropy of the categorization is calculated by taking the number of data points in each category, and dividing each by the total number of data points, which will produce a probability for each category. We then take the entropy of this set of probabilities as if it represented an actual probability distribution.

In this case, as expected, the new data point does not change the category structure significantly, and instead, the new data point is categorized together with what was the anchor for the first category in the prior run, producing the following category structure.

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76371 0.20080 0.00000

[1,3] =

6.91329 2.81581 0.00000

[1,4] =

2.86458 5.11067 0.00000

[1,5] =

5.99046 0.26573 0.00000

[1,6] =

5.19858 0.62000 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

The first category contains 6 elements, and the remaining 3 categories each contain 1 element. As a result, the entropy of the categorization is given by the following:

H = (6/9)*log2(9/6) + 3*((1/9)*log2(9)) = 1.4466.

The entropy of the original categorization is 1.5488. As a result, the change to the entropy of the categorization is in this case small, and negative, -0.10218, since the new data item increases the size of the first category, concentrating more of the data into a single category, and in turn reducing the entropy of the categorization (note that entropy is minimized when one category contains all of the data points).
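The entropy arithmetic above can be packaged into a few lines of Python (a sketch of the calculation only; `categorization_entropy` is my name, not one of the Octave scripts):

```python
from math import log2

def categorization_entropy(sizes):
    # Entropy of a categorization: treat the category sizes, divided by the
    # total number of data points, as a probability distribution.
    n = sum(sizes)
    return sum((s / n) * log2(n / s) for s in sizes)

H_before = categorization_entropy([5, 1, 1, 1])  # original categorization
H_after  = categorization_entropy([6, 1, 1, 1])  # after including the new point
print(round(H_before, 4), round(H_after, 4), round(H_after - H_before, 5))
# 1.5488 1.4466 -0.10218
```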

Now assume that we are given the following new data point:

data_array{9} = [9.83000 3.40700 10.10000];

This is very close to the only vector in the second category, and as a result, this new data point should also be a good fit for the existing data. However, it’s not a good fit for the largest category, but is instead a good fit for a small category. In some sense, this implies that it’s actually not as good a fit as the first new data point we considered above, since this new data point is similar to only one existing data point, as opposed to being similar to a large number of existing data points.

Application of the categorization algorithm produces the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

[1,2] =

9.83000 3.40700 10.10000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

As expected, the category structure is not disturbed, and instead, the new data point is inserted into the same category as the existing data point to which it is most similar. In this case, the change in entropy is small, and positive, 0.10895, since the new data point increases the size of a small category, thereby increasing the entropy of the categorization (note that entropy is maximized when all categories contain an equal number of data points). The magnitude of the change in entropy is greater than it was in the previous example, which is consistent with our observation that while the new data point is certainly a good fit, it is not as good a fit as the first example we considered above.

Now assume that we’re given the following data point:

data_array{9} = [45.15000 16.80000 4.90000];

This data point is probably not a good fit for the existing data, as its distance from any anchor is significantly larger than delta. As a result, we should expect the inclusion of this new data point to disturb the category structure.

Application of the categorization algorithm produces the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

9.82500 3.40620 10.00000

[1,3] =

4.76372 0.20080 0.00000

[1,4] =

6.91330 2.81581 0.00000

[1,5] =

2.86459 5.11068 0.00000

[1,6] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

5.54710 9.43410 10.00000

[1,2] =

9.05740 9.94230 20.00000
}

[1,3] =
{
[1,1] =

45.15000 16.80000 4.90000

}

}

In this case, the change in entropy is significant, and negative, -0.32440, as the new data point forced the existing data into a smaller number of larger categories. Put informally, in the context of the new data point, the rest of the data seems closer together, and as a result, the categorization algorithm increases the value of delta, which is in this case 11.902, thereby concentrating all of the existing data into two categories, and creating a third category for the new data point. It is reasonable to conclude that this new data point is, therefore, not a good fit for the existing data.

Now assume we are given the following new data point:

data_array{9} = [15031.60000 1001.69000 10386.80000];

This is obviously not a good fit for the existing data, and therefore, will probably significantly disturb the existing category structure.

Application of the categorization algorithm yields the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

9.82500 3.40620 10.00000

[1,3] =

4.76371 0.20080 0.00000

[1,4] =

5.54710 9.43410 10.00000

[1,5] =

6.91329 2.81581 0.00000

[1,6] =

9.05740 9.94230 20.00000

[1,7] =

2.86458 5.11067 0.00000

[1,8] =

5.99046 0.26572 0.00000

}

[1,2] =
{
[1,1] =

15031.60000 1001.69000 10386.80000

}

}

In this case, the change in entropy is even more significant, and negative, -1.0455, as all of the existing data is forced into a single category due to an increase in delta, which is in this case 299.93.

As a result, this approach allows us to construct an objective binary test for whether or not a new data point is a good fit for an existing dataset. Specifically, if a new data point ends up in a category of its own, then it is not a good fit for the existing data. If it is included in a category that contains at least two data points, then it is a good fit. Further, we can also measure how well the data fits into the dataset by measuring the change in entropy of the categorization. This will in turn allow us to distinguish among good fits, and among bad fits, thereby allowing us to objectively determine best fits and worst fits, and everything in between.

Predicting Unknown Variables

The key takeaway from the analysis in the sections above is that we can measure how well a new data point fits into an existing dataset by measuring the change in entropy of the categorization that results from the inclusion of the new data point. In this case, I’ve used my own categorization algorithm to demonstrate this concept, but we could in theory use any categorization / clustering algorithm, and then test the entropy of the resultant categorization in the same manner.

As a result, the principle is a general one: we can measure how well a new data point fits into a dataset by measuring the change in the entropy of the structure of the dataset that results from inclusion of the new data point.

Now assume that we are given a new data point, but that the z-value is unknown. For example, assume that we’re given the vector [5.19859 0.61944 Z]. Further, assume that we’d like to predict the value of Z. Without any context, there’s no value of Z that’s preferable to any other. That is, there’s no ex ante value of Z that makes more sense than any other, and as a result, we need a dataset from which we can derive a value of Z that makes sense in the context of that dataset. As a result, predicting the value of a single hidden variable is a fundamentally different problem than the other problems I’ve addressed previously in this project, none of which require any prior data. This is because a dataset, or an image, can create its own context from which information can be extracted using the methods I presented in the other posts below. In contrast, a single data point is just that – a point in isolation, with no relationships to consider, and therefore, no context to consider.

At the risk of harping on the wonders of music, which I’m certainly guilty of, this is like asking for the key of a song when you’ve only been given one note. The correct answer in this case is, “I don’t know”. If you hear two notes, you have a better idea. Three notes, better still. As the number of notes played increases over time, you can more confidently guess the key of the song, which is the context in which you perceive the notes. Clever composers consciously toy with these ideas, playing the same melody in the context of two different key signatures, and deliberately crafting melodies that have no obvious key signature. Jazz musicians are notorious for doing this, particularly at the end of a song, where they’ll throw in a “surprise ending” that completely changes the context of everything you’ve just listened to. For those who are interested, have a listen to the end of this Ella Fitzgerald song, which I’ve cued up in the following link https://www.youtube.com/watch?v=mshV7ug8cdE&feature=youtu.be&t=438. At the 7:27 mark, the pianist suddenly veers off course, quietly changing the key, and therefore, the context of the entire song.

Similarly, data needs context in order to be understood, and a single data point by definition has only a cosmic, universal context, which I’m certainly not prepared to address. As a result, if we want to predict some unknown variable, we need an assumed context in which we’re predicting that unknown variable.

So let’s consider the new data point in the context of the original dataset we analyzed above. The original dataset generates four categories, which are in turn defined by four anchors. As a result, if the new data point is a good fit for the dataset, then the value of Z has to be similar to the z-value of at least one anchor. Therefore, we can iterate through the z-values of each anchor, treating the z-value of each anchor as an assumed value for Z, insert the resultant data point into the existing dataset, and categorize the resulting new dataset. If there is at least one assumed value of Z that causes the new data point to be categorized together with an existing data point, then the x and y values of the new data point are a good fit for the existing dataset. If there are multiple assumed values of Z that cause the new data point to be categorized with some existing data points, then we can choose the value of Z that minimizes the change in the entropy of the categorization. That is, first we test to see if the new data point is a good fit by testing whether there’s at least one assumed value of Z that causes it to be categorized together with some existing data points, and then, if there’s more than one such value, we choose the value of Z that caused the least disturbance to the existing category structure.

In short, first we test to see if the new data point is a good fit, and if it is a good fit, then we choose the z-value that produces the best fit. We’ll address this topic again at the end of this section.

Let’s begin by assuming that Z = 0.00000, i.e., the z-value of the first anchor. This implies that,

data_array{9} = [5.19859 0.61944 0.00000];

Application of the categorization algorithm yields the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

[1,6] =

5.19859 0.61944 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

The change in entropy is in this case -0.10218, and as expected, the new data item fits well into the first category, since it is very close to the anchor for the first category.

Now assume that Z = 10, i.e., the z-value of the second anchor.

Application of the categorization algorithm yields the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

[1,5] =
{
[1,1] =

5.19859 0.61944 10.00000

}

}

In this case, the new data point is categorized on its own, which implies that assuming Z = 10 does not produce a good fit. The change in entropy is in this case 0.3312, since a new category is created, thereby increasing the entropy of the categorization. Assuming Z = 20 produces the same outcome, together implying that Z = 0 is the only good fit for the (x,y) values [5.19858 0.61944].
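Both entropy figures can be recovered from the category sizes alone. Assuming the entropy of a categorization is the Shannon entropy of the distribution of category sizes (my reading of the text, not a definition taken from the attached scripts), here's a short Python sketch:

```python
from math import log2

def categorization_entropy(sizes):
    """Shannon entropy of the distribution of category sizes, in bits."""
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes)

# Original 8-point dataset: one category of 5 points, plus 3 singletons.
base = categorization_entropy([5, 1, 1, 1])

# Z = 0: the new point joins the first category.
# Z = 10: the new point forms its own category.
print(round(categorization_entropy([6, 1, 1, 1]) - base, 5))     # -0.10218
print(round(categorization_entropy([5, 1, 1, 1, 1]) - base, 4))  # 0.3312
```

The computed changes match the two figures quoted above.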

Note that this method does not interpolate the value of Z, but instead tests known values of Z taken from the dataset itself. As a result, using this approach on an outlier data point will cause the data point to fail to categorize (i.e., end up in a category by itself) for each assumed value of Z. Put simply, if the new data point fails to categorize for every known value of Z, then in that case, our dataset doesn’t provide any information about the value of Z, and therefore, we’re no better off than we would be in attempting to predict the value of Z in a vacuum.

Approximating the Full Algorithm

Though the method above is useful, theoretically interesting, and reasonably fast, we can approximate it using another approach that is extremely fast. Specifically, note that when the new data point is not an outlier, the category structure is not significantly disturbed, and in the worst case of a quasi-outlier, a new category of one is created, without otherwise disturbing the existing categories. This in turn implies that the value of delta should not change much as a result of the inclusion of the new data point in the dataset, provided it’s a reasonable fit for the dataset. Therefore, we can approximate the behavior of the prediction method described above by simply testing whether or not the new data point is within delta of each anchor, using the assumed z-values of each anchor. If this turns out to be the case, then we know that our new data point is a reasonably good fit for the existing dataset, since there is a category into which it would be included, if we ignore the effects of including that new data point. In short, if we “freeze” the existing category structure and the value of delta, we can simply test whether our new data item is within delta of some anchor.

Let’s run through the same example given above, but this time, rather than rerun the categorization algorithm, we’re going to simply test whether or not the new data item is within delta of some anchor, using the original value of delta = 5.0711, using each assumed value of Z.

Our first assumed value of Z is 0, producing the vector [5.19859 0.61944 0]. This new data point is very close to the first anchor, and as a result, the norm of the difference between our new data item and the first anchor is approximately 0, which is less than delta. This implies that assuming Z = 0 causes our new data item to be a good fit for the data, and since the norm of the difference is roughly zero, there’s really no point in testing the rest of the anchors, at least in this case.

Now assume that our new data point is instead [150.60000 89.12000 Z]. Because delta = 5.0711, there is clearly no assumed value of Z that will cause this vector to be categorized using this approach.
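The frozen-delta test can be sketched as follows (in Python rather than the original Octave; the anchors are taken to be the first element of each category in the example above, with delta = 5.0711):

```python
import math

def candidate_z_values(new_xy, anchors, delta):
    """For each anchor, assume its z-value for the new point and test whether
    the completed point lies within delta of that anchor. Since the assumed
    z-value equals the anchor's, this reduces to a 2D distance test."""
    x, y = new_xy
    return [az for (ax, ay, az) in anchors
            if math.hypot(x - ax, y - ay) <= delta]

anchors = [(5.19858, 0.61943, 0.0),
           (9.82500, 3.40620, 10.0),
           (5.54710, 9.43410, 10.0),
           (9.05740, 9.94230, 20.0)]
delta = 5.0711

print(candidate_z_values((5.19859, 0.61944), anchors, delta))  # [0.0]
print(candidate_z_values((150.6, 89.12), anchors, delta))      # []
```

The first point matches only the first anchor, yielding the predicted value Z = 0; the outlier fails to categorize for every assumed z-value.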

As a result, outlier data points will fail to categorize using this approximated approach as well, and so this approach allows us to quickly say whether or not a new data point is a good fit for the dataset. As such, it approximates the binary test of the algorithm above, as to whether or not a new data point can be included in the dataset at all, which in turn tells us whether or not our dataset can tell us anything about the unknown value Z.

We can, if we want, also measure how well the new data point fits by measuring the change in entropy that results from inserting the new data point into the dataset. This is useful in the case where the data point is within delta of two anchors, producing what should therefore be two reasonable values of Z. Though we ultimately won’t make use of this technique in the interest of speed, I’ll spend a moment describing how we could do this. First, we again assume that the structure of the dataset is unchanged, but for the inclusion of the new data point. We then measure the change in entropy that results from including the new data point in each category into which it fits. We then choose the category that changes the entropy of the dataset the least. This approach will ensure that we always choose the largest category, as opposed to the tightest fit. Which approach is best will depend upon our goal. If we want our predictions to be as close as possible to the existing data, then we should choose the category for which the distance between our new data item and the anchor is minimized, i.e., the tightest fit. If instead we want to assume that the structure of our dataset should change as little as possible, then we should choose the category that results in the smallest change in entropy.
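The two tie-breaking rules can be sketched side by side (a Python sketch; the entropy measure over category sizes and the candidate list format are my assumptions, not taken from the attached scripts):

```python
from math import log2

def entropy(sizes):
    """Shannon entropy of the distribution of category sizes, in bits."""
    n = sum(sizes)
    return -sum((s / n) * log2(s / n) for s in sizes)

def choose_category(candidates, sizes, rule="entropy"):
    """candidates: (category_index, distance_to_anchor) pairs for each category
    the new point fits into. 'tightest' picks the nearest anchor; 'entropy'
    picks the category whose growth changes the entropy the least."""
    if rule == "tightest":
        return min(candidates, key=lambda c: c[1])[0]
    def change(i):
        grown = sizes[:]
        grown[i] += 1
        return entropy(grown) - entropy(sizes)
    return min(candidates, key=lambda c: change(c[0]))[0]

# A large category (5 points) vs. a singleton: the entropy rule prefers the
# large category even though the singleton's anchor is closer.
print(choose_category([(0, 2.0), (1, 0.5)], [5, 1], "entropy"))   # 0
print(choose_category([(0, 2.0), (1, 0.5)], [5, 1], "tightest"))  # 1
```

This illustrates the point above: minimizing the entropy change always favors the largest category, whereas the tightest fit favors the nearest anchor.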

Though it might seem odd to make use of an operating assumption that the structure of the dataset should change the least, note that if we were to actually include each new data point into the dataset, then if we didn’t make this assumption, eventually we would erode and possibly even destroy the structure of the categories. That is, repeatedly choosing the tightest fit could end up completely changing the structure of the categories. Nonetheless, because we are using this approach to predict data, and not to actually update the data in our dataset, this concern is academic, at least in this case.

Finally, note that in any case, the runtime of this algorithm is O(M), where M is the number of anchors, which is necessarily less than or equal to n, the number of data points in the dataset. As a practical matter, the value of M will depend upon the data. If the data is highly randomized, then M will be approximately equal to n. If the data has structure, then M could be extremely small compared to n, resulting in an extremely fast runtime. For example, it’s not inconceivable that a dataset comprised of millions of data points consists of only a few top level categories, which would result in a negligible runtime. Nonetheless, when dealing with a large number of data points, it could be useful to further subdivide the dataset into smaller, more detailed categories, beyond the top level categories generated by one level of categorization. This is the approach that we will ultimately take, producing a remarkably fast and powerful prediction algorithm.

Prediction Using a Data Tree

As noted in the previous section, it could in some cases be useful to create deeper categories, beyond the top level categories generated by one single application of the categorization algorithm. Returning to our example above, one application of the algorithm produced a first category that contained five elements. We could in turn run the categorization algorithm again on that category, which will produce the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

6.91330 2.81581 0.00000

}

[1,3] =
{
[1,1] =

2.86459 5.11068 0.00000

}

}

That is, because the first category contains multiple elements, it is possible to run the categorization algorithm one more time to see if any further meaningful subdivision is possible. It turns out that in this case, the first category can be further subdivided into a set of three points that are all very close to each other, and two remaining points that are relatively isolated.

The value of delta produced by the categorization algorithm is in this case 0.90492, which is significantly smaller than the value produced at the top level of categorization, 5.0711. As a result, by categorizing one level down, we can produce categories that are significantly tighter, which will in turn allow us to match new data points within much smaller categories, thereby presumably producing tighter predicted values of Z.

By repeatedly applying the categorization algorithm in this fashion, we will produce a hierarchy of categories, beginning with the top layer categories generated by applying the algorithm to the original dataset, and continuing downward until we produce singletons, and categories that simply cannot be further subdivided. Expressed visually, this will produce a tree of categories, and as a result, a tree of anchors, and a tree of values of delta.
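Structurally, the construction looks like the following sketch (Python; toy_categorize is a hypothetical stand-in for the actual categorization algorithm, returning the categories, one anchor per category, and delta):

```python
def build_tree(points, categorize, min_size=2):
    """Recursively apply a categorization function to produce a tree of
    categories, each node recording its anchors and its value of delta."""
    categories, anchors, delta = categorize(points)
    node = {"anchors": anchors, "delta": delta, "children": []}
    for cat in categories:
        # Recurse only on categories that can be meaningfully subdivided.
        if min_size <= len(cat) < len(points):
            node["children"].append(build_tree(cat, categorize, min_size))
    return node

def toy_categorize(points):
    """Hypothetical stand-in: split the points around their mean x-value,
    take the first point of each category as its anchor, use a fixed delta."""
    mean_x = sum(p[0] for p in points) / len(points)
    cats = [c for c in ([p for p in points if p[0] <= mean_x],
                        [p for p in points if p[0] > mean_x]) if c]
    return cats, [c[0] for c in cats], 1.0

# Two well-separated pairs of points produce two child categories.
tree = build_tree([(0, 0, 0), (1, 0, 0), (10, 0, 0), (11, 0, 0)], toy_categorize)
print(len(tree["children"]))  # 2
```

The recursion bottoms out at singletons and at categories that cannot be further subdivided, exactly as described above.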

If we want to find the tightest fit for a new data point, we can begin at the bottom of the anchor tree, and test whether that new data point fits into any of the bottom level anchors, which will by definition correspond to the smallest values of delta. If that doesn’t work, then we proceed one level up, and test whether the new data point fits within any of the anchors within the second lowest level of the tree.
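In Python, the descent looks roughly like this (a sketch only; the actual implementation is the attached Octave script predict_best_fit_tree):

```python
import math

def best_fit_in_tree(new_xy, levels):
    """levels: the anchor tree flattened level by level, deepest first, as
    (anchors, delta) pairs. Returns the z-value of the first anchor within
    delta of the new (x, y) point, or None if the point fails to categorize."""
    x, y = new_xy
    for anchors, delta in levels:
        for ax, ay, az in anchors:
            if math.hypot(x - ax, y - ay) <= delta:
                return az
    return None

# Deepest level first: the tight subcategory from the example above, then
# the four top-level anchors with the top-level delta.
levels = [([(5.19858, 0.61943, 0.0)], 0.90492),
          ([(5.19858, 0.61943, 0.0), (9.82500, 3.40620, 10.0),
            (5.54710, 9.43410, 10.0), (9.05740, 9.94230, 20.0)], 5.0711)]

print(best_fit_in_tree((5.19, 0.619), levels))   # 0.0
print(best_fit_in_tree((150.6, 89.12), levels))  # None
```

Because the deepest levels are searched first, a match there corresponds to the smallest value of delta, and hence the tightest available fit.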

This method is implemented by the attached algorithms, generate_data_tree, which iteratively applies the categorization algorithm until the data trees are produced, and predict_best_fit_tree, which takes in a new data item and attempts to place it at the lowest possible level in the resultant data trees. Assuming the dataset is fixed, we’ll only have to produce the trees once, meaning that the only algorithm we’ll need to run to predict the z-value of a new data point is predict_best_fit_tree.

We can apply this method to the new data point [5.19000 0.61900 Z] using the following code:

% generate the category, delta, and anchor trees from the original dataset
[category_tree, delta_tree, anchor_tree] = generate_data_tree(data_array);
new_data_item{1} = [5.19 0.619 0];

% place the new data item at the lowest possible level of the anchor tree
[category_index, predicted_z, final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1);

The returned value predicted_z gives us our desired prediction for the value of z, which will in this case be 0, as expected.

The runtime complexity of predict_best_fit_tree will depend upon the dataset in question, but common sense suggests that it is unlikely to exceed O(log^2(n)), since the number of top level categories is probably not going to exceed M = log(n), and the depth of the tree is probably not going to exceed log(M). As a practical matter, predict_best_fit_tree runs in a tiny fraction of a second on an ordinary commercial laptop. This makes it practical to repeatedly feed it random inputs, ultimately allowing it to reconstruct and expand both the domain and the range of a function that has been analyzed by the generate_data_tree function, which I’ll address in the next section.

Imitating Complex Data

In the big picture, interpolation methods take computationally complex data, and force it into some computationally cheap shape, like a line, curve, or surface. In contrast, the approach I’ve outlined above inserts new data into the roughly natural, and therefore, computationally complex structure of the original dataset, but in a computationally cheap manner. In short, rather than fight the structure of the underlying data, we work with it, and as a result, we won’t have a smooth domain unless the original domain is smooth. Similarly, we won’t have a smooth range unless the original range is smooth. Instead, what the approach above produces is a dataset that is quantized into a discrete number of increasingly smaller spheres, each centered around data points that are, at various levels of categorization, treated as the center points for the categories they define.

As a result, repeated application of the predict_best_fit_tree algorithm to randomly generated new data will allow us to identify new data points that are similar to the original domain and range of the function. This is accomplished by the attached imitate_data script, which, when given the data tree of a function, and the approximate boundaries of its domain, can effectively reproduce the original dataset.
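A minimal sketch of the idea (Python; the sampling scheme and the use of frozen top-level anchors are my simplification of the attached imitate_data script):

```python
import math
import random

def imitate(anchors, delta, bounds, n_samples=1000, seed=0):
    """Sample random (x, y) points within the given bounds and keep those that
    fall within delta of some anchor, assigning them that anchor's z-value.
    Points that fail to categorize are discarded in this sketch."""
    rng = random.Random(seed)
    (x0, x1), (y0, y1) = bounds
    kept = []
    for _ in range(n_samples):
        x, y = rng.uniform(x0, x1), rng.uniform(y0, y1)
        for ax, ay, az in anchors:
            if math.hypot(x - ax, y - ay) <= delta:
                kept.append((x, y, az))
                break
    return kept

# A single anchor at the origin: the imitation fills a disc of radius delta.
points = imitate([(0.0, 0.0, 5.0)], 1.0, ((-2, 2), (-2, 2)))
```

Because kept points inherit the z-values of the anchors they match, the imitation reproduces both the domain and the range of the original dataset, down to its patchiness.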

I’ve attached a few examples of how these algorithms can be used to imitate complex datasets, and a text file that contains the Octave code that will let you generate the datasets for yourself. Any point that fails to categorize is painted grey and mapped to the 0-plane. All points that categorize successfully are painted blue.

1. The Cylindrical “Flower Pot”

In this example, I’ve used a built-in Octave function to produce a three-dimensional surface that looks a bit like a flower pot. I then take the underlying point information, process it using the generate_data_tree function, which repeatedly categorizes the point information as if it were a dataset, and then imitate that dataset using the imitate_data function. The results are pretty awesome, and clearly reproduce the overall three-dimensional shape of the surface.

The first attached image shows the original surface, the second image shows the imitated surface generated by the imitate_data function, and the third image shows the top-level categories generated by the categorization algorithm. Proximate ‘o’ rings painted the same color are part of the same category. Large rings are part of a large category, and small rings are part of a small category.

Though this is just a surface, in theory, these algorithms could reproduce both the surface and the internal structure of a three-dimensional object, suggesting that these algorithms might have applications in 3D printing, AutoCAD, etc.

2. The Noisy Sine Wave

This is a noisy, discontinuous sine curve that I then reproduced using the imitate_data function. Macroscopically, the reproduction almost looks smooth, but up close you can quite plainly see that the algorithm reproduces not only the overall shape and specific values of the function, but also the patchiness of the domain and range of the function.

3. The Quadratic Curve

This is a simple quadratic curve that is smooth compared to the other functions.

The relevant scripts are here:

find_most_distant_pair

generate_data_samples

generate_categories_3d

generate_data_tree

generate_n_random_categories_3d

imitate_data

left_right_delta_3d

optimize_categories_3d

partition_array_3d

predict_best_fit_tree

shape_to_data

test_entropy_array_3d

display_categorization_3d

# Note on the Riemann Zeta Function

As you know, I’ve been doing research in information theory, and some of my recent work has some superficial connections to the Zeta function, since I was making use of logarithms of complex numbers.

I noticed the following graph:

In short, we’d be looking at the Zeta function using values of s produced by powers of the logarithm function:

$\zeta(\log^n(-1)/x)$.

There appear to be several zeros for n = 6.

You can see it yourself on Wolfram using the following link:

But, I don’t know enough about this topic to be sure that this is actually interesting, since I know that at least some of the zeros of this function are considered “trivial”.

Nonetheless, I thought I’d share this in case some of you found it interesting and had insights.

Regards,

Charles