# Charles Davì

## Fast Unsupervised Image Categorization Algorithm

In Uncategorized on January 16, 2019 at 7:06 pm

In this post, I’m going to show how we can combine the image feature recognition algorithm I presented a few months ago, with the categorization and prediction algorithms I presented in the two posts below, to produce an algorithm that can quickly categorize images, without any supervision. The final algorithm I’ll present is O(n), where n is the number of pixels in the image we’re attempting to categorize.

This algorithm works so well that it can distinguish between paintings by different artists, in just a few seconds. It can also distinguish between articles of clothing, and other objects, but the focus of this post will be on its ability to distinguish between artists, because, in my opinion, this shows that the algorithm is capable of unsupervised abstraction, allowing it to say without any guidance, what it is that distinguishes a Van Gogh from a Caravaggio.

These algorithms use the same underlying categorization and prediction algorithms that I used below to imitate complex three-dimensional shapes, predict data, and approximate complex functions. In this case, I’ve added some additional scripts that allow the underlying algorithms to analyze images. As a result, the underlying categorization and prediction algorithms are shaping up to be a set of generalized AI algorithms that appear to only require a “hook” into the subject matter in question.

Finally, note that all of the algorithms I’ll present in this post are worst-case polynomial in time as a function of the number of pixels in the image. That is, even the algorithms we’ll use to construct categories in the first instance are polynomial in time as a function of the number of pixels / images in the database. As a result, these algorithms allow for what I believe to be industrial quality image recognition to be implemented on a consumer device, and can probably allow industrial machines to conduct real-time analysis of high-resolution video.

Summary of the Underlying Image Feature Identification Algorithm

In the post below entitled, “Image Recognition with no Prior Information”, I presented an algorithm that can quickly identify features within an image, without any prior information about the image, or the categories to which the image might belong. At a high level, this algorithm takes an image, and partitions it into objectively distinct features. But because the algorithm operates with no prior information at all, it isn’t looking for a particular object, but is instead looking for every objectively distinct feature within an image. In short, this first algorithm attempts to identify all of the features within an image, but because this algorithm has no prior information, it can’t say what these features are, but can nonetheless identify them as distinct from one another. As a result, this algorithm is only the first step of the image processing algorithms I’ll present in this post.

Like all of the other work that is part of this project, the feature recognition algorithm is rooted in information theory, and operates based upon the assumption that objectively distinct features within an image will contain different amounts of information. Specifically, the feature identification algorithm subdivides an image in a way that maximizes the difference in the amounts of information contained in each subdivision of the image. I call each subdivision of an image generated by the algorithm a region. Though it might seem pedantic to distinguish between a region and a feature, I’d like to reserve the term feature to denote an actual feature of an image, as identified by a human, whereas a region is a portion of an image identified by the algorithm that should contain a feature.

The first, second, and third attached images are examples of an image that has been processed by the algorithm. The first image is a photo I took in Lake Como, Italy, and the second image shows the original image after being processed, with the brightness of each region adjusted by the algorithm to reflect how likely it thinks the region is to be part of the foreground of the image. The brighter a given region is, the more likely the algorithm thinks that the region is part of the foreground of the image. You can get a sense of how the algorithm partitions an image by looking at the boundaries of the regions in the second image. The third image shows a specific region identified by the algorithm, which is in this case an individual standing on a diving board, which can be found around the center of the original image.

You can do this yourself using the following code:

[region_matrix N S_mat prob_mat] = identify_features(I);

imshow(ent_contrast_darken_file(I,N,region_matrix, prob_mat));

figure, imshow(extract_region(I,N,region_matrix,27));

Constructing a Tractable Color Palette

At a high level, the technique we’ll employ is straightforward: first we’re going to partition an image using the feature identification algorithm described above, and that’s going to produce a series of regions, which should correspond to features within the image. Then, we’re going to calculate the average color of each region, which will create a dataset of average colors. Then, we’re going to categorize those colors using the vector categorization algorithm I presented in the two posts just below this one.

In this section, I’ll walk through applying this process to a single image. In the next section, we’ll apply it to a series of images. In the last section, I’ll present a vectorized implementation of this entire process that will allow us to categorize new images using an algorithm that is O(n).

I’ve attached a photograph of one of Van Gogh’s iconic self-portraits that we’ll use as our first subject.

First, we’ll count the number of unique colors in the image using a vectorized script I found on Matlab’s website, written by Steve Eddins, which is available here:

https://blogs.mathworks.com/steve/2008/01/31/counting-occurrences-of-image-colors/

I = imread(“[IMAGE FILE]”); rgb_columns = reshape(I, [], 3);

[unique_colors, m, n] = unique(rgb_columns, ‘rows’);

color_counts = accumarray(n, 1); size(unique_colors, 1)

The image contains 184064 unique colors, and since the image contains 768*608 = 466944 pixels, about 40 percent of the pixels contain unique colors. As a result, the image contains a very rich color palette, and therefore, quite a lot of color information for its size. Nonetheless, a human observer probably wouldn’t have much trouble quickly describing the color palette of the painting, which contains a lot of yellow, due to his large hat, with a pinkish and orange background, and some pale blues in his blazer.

This type of description is a form of compression, since we’ve taken a large amount of color information, and compressed that information into a tractable statement that is nonetheless a meaningful description of its subject. Similarly, the goal of this section will be to produce a compressed color palate that straddles the line the between a set of colors that is digestible by a human, and a set of colors that is useful to a machine. The color sets we’ll produce will be too large to be described in one sentence, but small enough to be expressed visually in a single color bar that can be quickly gleaned by a human. The intended end result is to have the categorization algorithm operate like an enhanced human, that quickly compresses information, but nonetheless has access to details that a human observer would find overwhelming.

Continuing with the example of the self-portrait, first we’ll run the feature identification algorithm, which will produce a set of matrices we can use to partition the image into regions. Then, we’ll run a script that calculates and stores the average color of each region, which will produce a dataset of color vectors. Finally, we’ll run the categorization algorithm on that dataset of color vectors, which will produce a core set of color vectors that can be thought of as forming a basis for the colors in the image. I call these basis vectors the anchors of their color categories, since the categorization algorithm treats each anchor vector as the geometric center of a category, meaning that all other vectors in the category defined by the anchor are within some distance, delta, that the algorithm calculates based upon a variety of factors that I explain in the two posts below this one.

All of this can be done using the following code:

[region_matrix N S_mat prob_mat] = identify_features(I);

figure, imshow(ent_contrast_darken_file(I,N,region_matrix,prob_mat));

data_array = calculate_average_feature_color(I,N,region_matrix, prob_mat, 0, 0);

[data_categories_array category_vec anchor_array H_final delta] = optimize_categories_3D(data_array,0);

color_bar_compressed = display_color_array(anchor_array, data_categories_array);

figure, imshow(color_bar_compressed)

The feature identification algorithm generates 94 regions, and the color bar shows the compressed set of colors generated by the categorization algorithm, which consists of 44 anchor colors. Visually, this set of colors is clearly representative of the underlying painting, despite containing only about .02% of the unique colors in the image. As a result, even though average color is generally a very lossy metric for an image, because we’ve taken the average color of what are objectively distinct features of the image, we’ve ended up with meaningful color information, while nonetheless achieving an enormous amount of compression. Not to boast, but this is actually quite remarkable, since the algorithm has taken over one hundred thousand colors from an image, and produced, without any supervision, a set of 44 colors that are perfectly representative of the underlying image. Moreover, it would almost certainly be impossible for a human to sift through all of the colors in the original image and select a comparable set of colors. At the same time, these are the colors that a human probably would select if they could. As a result, this process generates the same type of compression a human-generated, verbal summary achieves, but on a mechanized scale, and expressed nonverbally, in the form of an image. Interestingly, it also suggests that nonverbal summaries of this type could allow machines to convey complex information to their human counterparts.

This of course does not preclude more sophisticated measures from being used in lieu of average color. Instead, I’m presenting this as a simple example of a more generalized approach to image recognition that consists of three primary steps:

(1) identify the features of an image using the feature identification algorithm described above;
(2) analyze the properties of each resulting region, producing a metric;
(3) categorize the resulting metrics.

If our metric is visually meaningful, then the categories we’ll construct in step three will cause the metrics associated with similar looking features to be categorized together, thereby creating categories of similar looking features. It turns out that average color works quite well, though there are undoubtedly variations on this theme that might be more useful in other contexts.

Categorizing Images

By applying the process above to a category of similar images, we can construct a palette that is representative of an entire category of images. In this case, we’re going to construct two palettes: one generated by analyzing a series of paintings by Van Gogh, and another generated by analyzing a series of paintings by John Singer Sargent. These two painters worked with somewhat similar color schemes, and to a non-expert like myself, a superficially similar impressionistic style. Nonetheless, the paintings I’ve selected are distinguishable, and as a result, the two categories of paintings will produce different color palettes. We’ll analyze a set of 8 paintings by each artist, for a total of 16 images, each of which are attached as the last 16 images.

We can produce the palettes using the following code:

directory_path_A = “[Van Gogh]”;

directory_path_D = “[Sargent]”;

num_images_A = 8;

num_images_D = 8;

[category_tree1 delta_tree1 anchor_tree1 data_array1] = generate_single_path_img_data(directory_path_A, num_images_A);

[category_tree4 delta_tree4 anchor_tree4 data_array4] = generate_single_path_img_data(directory_path_D, num_images_D);

[data_categories_array1 category_vec1 anchor_array1 H_final1 delta1] = optimize_categories_3D(data_array1,0);

[data_categories_array4 category_vec4 anchor_array4 H_final4 delta4] = optimize_categories_3D(data_array4,0);

color_bar_compressed1 = display_color_array(anchor_array1, data_categories_array1);

color_bar_compressed4 = display_color_array(anchor_array4, data_categories_array4);

figure, imshow(color_bar_compressed1)

figure, imshow(color_bar_compressed4)

The Van Gogh set of images generates 711 features, ultimately compressed into a set of 268 anchor colors, while the Sargent set of images generates 600 features, which were compressed into a set of 243 anchor colors. Yet again, these algorithms achieve an incredible amount of compression, taking millions of colors in the underlying images, and generating a comparatively minuscule set of colors that a human being can easily observe and intuitively understand, albeit in a nonverbal manner.

The Van Gogh palette consists of an array of rich, and bright pastels, whereas the Sargent palette consists of what I thought to be a surprisingly dark set of colors, mixed in with brighter colors that are nonetheless closer to the grey portion of the spectrum when compared to the vibrant colors that make up the Van Gogh palette.

If we were to take yet another Van Gogh painting, and compare the color palette of that new painting to the Van Gogh palette we just produced, it would probably be a decent match. Moreover, the color palette of the new Van Gogh would almost certainly be a better match for the existing Van Gogh palette than the Sargent palette. As a result, if we’re given a new image, and we assume that the new image is either a Van Gogh or a Sargent, then by comparing the color palette of the new image to the two color palettes we just produced, we should be able to categorize that new image as either a Van Gogh or a Sargent.

The image categorization algorithm I’ll present in this section is O(n*log(n)), since we’ll need to run the feature identification algorithm on the new input image in order construct its color palette. In the next section, I’ll present a vectorized version of this algorithm that is O(n), that takes into account the color of every pixel in the image. As a result, the vectorized algorithm is not only faster, but it’s also more powerful, since it takes into account all of the color information within an image. Nonetheless, the intuition and inspiration for the vectorized algorithm comes from this longer form process.

Let’s begin by generating a color palette for the beardless Van Gogh self-portrait, which I’ve attached, using the same code above from the previous section. Though this image was not part of the database that generated the Van Gogh palette, just looking at it, we can already see that it’s probably going to fit well. Not surprisingly, the resultant color palette looks a lot like a subset of the color palette for the entire Van Gogh series. As a result, it makes sense to assume that this set of colors will be a better fit for the Van Gogh palette than for the Sargent palette.

We can test this hypothesis using a prediction function I presented in the previous post, that takes a new data item, and attempts to categorize it into an existing dataset that’s already been analyzed using the categorization algorithm. At a high level, what the prediction algorithm does is attempt to place the new data item into an expanded set of anchor vectors that are structured as a tree. If it can place the new data item somewhere in the tree, then the new data item is a good match for the dataset. If it fails to place the new data item in the tree, then the new data item is not a good match for the dataset. For a full explanation as to how and why the prediction algorithm works, see the two posts just below this one.

In this particular case, we can think of the prediction algorithm as attempting to fit each new feature color into an existing dataset of anchor colors. The greater the number of feature colors that match with a given dataset of anchor colors, the more confident we can be that the new image is, as a whole, a match for that dataset in terms of its color palette. Of course, knowing that two images have similar color palettes is not, as a general manner, enough information to conclude that the two images belong to the same category of images or objects. But in this case, we’re tacitly assuming that the painting is either a Van Gogh or a Sargent, and then determining to which category the painting belongs by comparing a compressed representation of its color palette to the Van Gogh and Sargent palettes we just compiled.

This can be done using the following code:

[region_matrix N S_mat prob_mat img_data_array score_vector1 delta_vector1] = AB_test_image(I, category_tree1, anchor_tree1, delta_tree1);

[region_matrix4 N4 S_mat4 prob_mat4 img_data_array4 score_vector4 delta_vector4] = AB_test_image(I, category_tree4, anchor_tree4, delta_tree4);

sum(score_vector1 != Inf)

sum(score_vector4 != Inf)

In short, the underlying prediction algorithm will attempt to place the average color of each region from the new image into each dataset. If it succeeds, it will return a score of 0, and if it fails, it will return a score of Inf, which indicates that the algorithm thinks that the new color doesn’t belong to the dataset. The last two lines of code count how many regions matched with the two datasets. In this case, the new Van Gogh matched with the Van Gogh dataset 50 times, and matched with the Sargent dataset 40 times. Each test involved attempting to place 80 regions, meaning that in both cases, a significant number of regions failed to categorize.

Though this approach works reasonably well, note that it tests whether the average color of an entire region is a good fit for the color palette in question. As a result, the test does not take into account the vast majority of color information in an image, since each region is going to contain a large number of pixels. In the next section, we’ll vectorize this algorithm, which will allow us to quickly test every pixel in an image, thereby creating a simple filter that can test whether the colors in an image fit well within the complex, and discontiguous sets of colors that comprise the palettes we’ve constructed. Ultimately, this will allow us to quickly categorize images using a simple filtering process.

Vectorized Image Categorization

The first three steps in the process above cannot, to my knowledge, be vectorized. That is, if we want to construct a color palette for a database of images, we need to run the feature identification algorithm, calculate the average color of each resulting feature, and then categorize the resulting dataset of colors. However, if we’d like to compare the colors within an image to that color palette, after it’s already been produced, then we can take advantage of the ready-baked vectorized procedures of Matlab and Octave, and possibly other languages as well.

In short, what we’ll do is simply test whether each pixel in an image is roughly within delta of at least one anchor color. If the pixel is within delta of some anchor, then it will pass the filter. If it is not within delta of some anchor, then it will not pass the filter. By calculating the percentage of pixels that survive filtration using this method, we can measure what percentage of pixels in the image are a match for the color palette in question. This in turn allows us to quickly determine to which category of images a given input image belongs, by choosing the category associated with the filter that had the highest pixel survival rate. That is, the input image is best suited in terms of color with the underlying category of images that generated the filter that does the least damage.

For example, if we’d like to determine whether a given image is a Van Gogh or a Sargent, we can use the color palettes we generated above, and test that new image using a Van Gogh filter and a Sargent filter. The filter with the higher pixel survival rate is the better fit, and therefore tells us which of the two painters is most likely to have held the brush.

Returning to the beardless Van Gogh portrait we tested above, we can apply a Van Gogh filter and a Sargent filter to that image using the following code:

%the Van Gogh filter

figure, imshow(I_Post)

[row col x] = size(I);

num_pixels = row*col;

survival_rate1 = num_on_pixels/num_pixels

%the Sargent filter

figure, imshow(I_Post)

[row col x] = size(I);

num_pixels = row*col;

survival_rate4 = num_on_pixels/num_pixels

The survival rate for the Van Gogh filter is 0.29383, or roughly 30% of the pixels, whereas the survival rate for the Sargent filter is 0.13676, or roughly 14% of the pixels.

As a general matter, this approach works astonishingly well, and gives wide margins of survival rates even in cases like this, where we’re distinguishing between similar images that are, frankly, difficult for even some humans to categorize. For example, if we’d like to A/B test for whether an image contains one of two dresses, then this approach will work, assuming that the two dresses have objectively distinct color palettes, even if they’re superficially quite similar to a human observer. Since this approach is not necessarily restricted to color, I believe this algorithm could mark the beginning of an entirely new approach to image recognition.

Finally, any scripts not attached to this post that are required to run the code can be found in the other posts I referenced.

Scripts:

display_color_array

ab_test_image

ret_avg_color_vec

calculate_average_feature_color

generate_single_path_img_data

## Fast Numerical Function Approximation

In Uncategorized on January 7, 2019 at 6:43 pm

In a previous post, I introduced an extremely fast prediction algorithm that appears to have an average run time complexity of significantly less than O(log(n)), where n is the number of data points in the underlying data set. In this post, I’m going to show how this algorithm can also be used as a fast and generalized method for numerical function approximation, regardless of the function’s type and complexity.

Unlike interpolation methods, this algorithm does not attempt to approximate the underlying function using continuous functions. Instead, the algorithm is rooted in assumptions derived from information theory that allow it to work with the natural domain and range of the underlying function, facilitating extremely precise approximations of a wide variety of smooth, and discontinuous functions, all under one umbrella algorithm.

In this post, I’m going to explain how you can use the algorithm to approximate any function you’d like, and I’ll also present some data that demonstrates its performance and precision. For a full explanation as to how and why this algorithm works, see the post below entitled, “Predicting and Imitating Complex Data Using Information Theory”. In a follow up post, I’ll show how we can combine this algorithm with my image partition algorithm to construct a fast, unsupervised image recognition algorithm.

How to Use the Algorithm

The only thing you need in order to use the algorithm is a dataset of Matlab / Octave vectors containing the X, Y, and Z values of your function in three separate row vectors.
I’ll give some more examples below that make use of noisy data to demonstrate the versatility of the algorithm, but let’s begin with the simple linear function z = x + y + 1, evaluated along the line y = x, over the interval [-3,3].

First, we generate the x,y, and z vectors using the following code:

clear x y z
x = -3 : .012 : 3;
y = -3 : .012 : 3;
z = x .+ y .+ 1;
figure, scatter3(x,y,z)
view(70,15)

The x and y vectors contain the domain values of the function we’d like to approximate, and the z vector contains the corresponding range values of the function. In this case, we’re approximating the function z = x + y + 1, but you could of course choose any function you’d like, whether or not it has a closed form expression, so long as the data is represented as three separate column vectors.

Once we’ve generated the data for the function we’d like to approximate, we can instruct the algorithm to process the data, generating a set of data trees that it will ultimately use to approximate the function.

This step is done using the following code:

data_array = shape_to_data(x,y,z);
[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);

This is by far the most time-consuming part of the algorithm, but it is nonetheless polynomial in time as a function of the number of data points. Moreover, it’s something we’ll only need to do once, since this is the part of the algorithm that actually constructs our approximating function.

At a very high level, what the algorithm does is construct a data tree that can be used to approximate the underlying function at certain key locations within the domain of the function that are selected by the algorithm using assumptions rooted in information theory. New points in the domain that are subsequently fed to the algorithm are then quickly mapped into the data tree, resulting in very fast and very accurate function approximations.

In the previous post, I showed that this method is so accurate that it can reproduce complex three-dimensional objects it has analyzed when subsequently given random inputs, since the algorithm can quickly discern between new data points that belong to the data set / object, and data points that do not.

Continuing with our example above, let’s begin by approximating the underlying function at the point (x,y) = [2.5 2.5], and compare it to the result given by the original function.

This is done using the following code:

new_data_item{1} = [2.5 2.5 0];
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, 1);

In short, the algorithm takes the new data point (x,y) = [2.5 2.5], and attempts to place it into the data tree constructed from the domain and range of the underlying function.

In this case, we have,

predicted_z = 5.9920;

The original function gives a value of z = 2.5 + 2.5 + 1 = 6, for an error of approximately 0.008.

The last two arguments to the function are toggles that determine how the prediction algorithm behaves. The first toggle tells the algorithm to ignore the input z-value, but I nonetheless always use a z-value of 0 for consistency. The second toggle must be -1, 0, or 1, and determines how much time is spent searching the data tree for a good fit for the input. In this case, I’ve set the toggle to 1, which instructs the algorithm to search the entire data tree for the absolute best fit possible, in order to maximize the precision of our approximation. This of course increases the run time of the algorithm, but since we’re trying to approximate a function, precision takes precedence, especially given that the algorithm is still very fast.

If instead we set the toggle to -1, then we have,

predicted_z = 6.0400.

If we set the toggle to 0, then we have,

predicted_z = 5.9920.

As I mentioned above, this algorithm can also be used to imitate complex three-dimensional shapes, which requires a much higher volume of random inputs, in which case it might be advantageous to quickly process a large number of new data points approximately, rather than spend the incremental time required for greater precision. If you’d like to do that, simply set the toggle to -1 for the fastest placement and lowest precision, or to 0, for intermediate speed and intermediate precision.

Now let’s test the accuracy of the prediction over a large number of points that were not part of the original domain. The original domain had an increment of .012, so we’ll double the number of points by using an increment of .006, to ensure that we’re testing the algorithm at points in the domain that it didn’t evaluate previously, using the following code:

clear prediction_vector error_vector x y z
search_toggle = 1;
cnt = 1;
x = -3 : .006 : 3;
y = i = -3 : .006 : 3;
[temp num_data_points] = size(x);
tic;
for i = 1 : num_data_points
new_data_item{1} = [x(i) y(i) 0];
z(i) = x(i) + y(i) + 1;
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, search_toggle);
prediction_vector(i) = predicted_z;
error_vector(i) = z(i) – predicted_z;
endfor
toc
figure, scatter3(x,y,z)
view(40,15)
figure, scatter3(x,y,prediction_vector)
view(40,15)

Superficially, the two plots look very similar, so let’s plot the errors and have a look at some related statistics:

figure, plot(1:num_data_points, error_vector)
std(abs(error_vector(:)))
mean(abs(error_vector(:)))
mean(abs(error_vector(:)))/mean(abs(z(:)))

In this case, the errors have an average absolute value of 0.012659, and are generally tightly bounded, though they are clearly not continuously distributed, since again, the algorithm does not use any type of smooth approximation, but instead makes use of a discrete data tree.

Now let’s apply the algorithm to a discontinuous, noisy sine wave function generated by the following code:

clear x y z
cnt = 1;
for i = 0 : .0125 : 3*pi
x(cnt) = i + .25*rand();
y(cnt) = i + .25*rand();
z(cnt) = sin(x(cnt));
cnt = cnt + 1;
endfor
figure, scatter3(x,y,z)
view(70,15)
data_array = shape_to_data(x,y,z);
[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);

This is a sine wave with a noisy domain around the line y = x, over the interval [0,3*pi]. Since this will create a patchy domain and range, we’ll need to actually monitor the instances where the algorithm fails to map an input to the data tree. We do this by including a conditional that tests for whether the algorithm returns a predicted_z of infinity, which is how the algorithm signals that it thinks that an input is outside of the original domain of the function.

This can be done using the following code:

clear prediction_vector error_vector x y z
search_toggle = 1;
cnt = 1;
fail_count = 0;
x = 0 : .00625 : 3*pi;
y = 0 : .00625 : 3*pi;
[temp num_data_points] = size(x);
tic;
for i = 1 : num_data_points
x(cnt) = x(cnt) + .25*rand();
y(cnt) = y(cnt) + .25*rand();
z(cnt) = sin(x(cnt));
new_data_item{1} = [x(cnt) y(cnt) 0];
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, search_toggle);
if(predicted_z != Inf)
prediction_vector(cnt) = predicted_z;
error_vector(cnt) = z(cnt) – predicted_z;
cnt = cnt + 1;
else
fail_count = fail_count + 1;
endif
endfor
toc

We can generate the output and related statistics using the following code:

fail_count
x = x(1 : cnt – 1);
y = y(1 : cnt – 1);
z = z(1 : cnt – 1);
figure, scatter3(x,y,z)
view(40,15)
figure, scatter3(x,y,prediction_vector)
view(40,15)
figure, plot(1 : cnt – 1, error_vector)
std(abs(error_vector(:)))
mean(abs(error_vector(:)))
mean(abs(error_vector(:)))/mean(abs(z(:)))

In this case, it turns out the algorithm failed to place the new data 24 times, which is about 1.6% of the total number of data points. But, this can vary upon each run since the domain and range include some random noise. The average absolute error is about 0.027, with oscillating peaks that appear to follow the underlying function.

Finally, let’s apply the same technique to a polynomial with a noisy domain, also around the line y = x, using the following code:

clear x y z
cnt = 1;
for i = 1 : .009 : 8
x(cnt) = i + .25*rand();
y(cnt) = i + .25*rand();
z(cnt) = x(cnt)^7 + y(cnt)^7 + 10;
cnt = cnt + 1;
endfor
figure, scatter3(x,y,z)
view(40,15)
data_array = shape_to_data(x,y,z);
[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);

We can generate the approximation using the following code:

search_toggle = 1;
clear prediction_vector error_vector x y z
cnt = 1;
fail_count = 0;
x = 1 : .007 : 8;
y = 1 : .007 : 8;
[temp num_data_points] = size(x);
tic;
for i = 1 : num_data_points
x(cnt) = x(cnt) + .25*rand();
y(cnt) = y(cnt) + .25*rand();
z(cnt) = x(cnt)^7 + y(cnt)^7 + 10;
new_data_item{1} = [x(cnt) y(cnt) 0];
[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item, 1, search_toggle);
if(predicted_z != Inf)
prediction_vector(cnt) = predicted_z;
error_vector(cnt) = z(cnt) – predicted_z;
cnt = cnt + 1;
else
fail_count = fail_count + 1;
endif
endfor
toc

And, finally, we can generate the graphs and error statistics using the following code:

fail_count
x = x(1 : cnt – 1);
y = y(1 : cnt – 1);
z = z(1 : cnt – 1);
figure, scatter3(x,y,z)
view(70,20)
figure, scatter3(x,y,prediction_vector)
view(70,20)
figure, plot(1 : cnt – 1, error_vector)
std(abs(error_vector(:)))
mean(abs(error_vector(:)))
mean(abs(error_vector(:)))/mean(abs(z(:)))

As you can see, in all three cases, the approximations are true to the original functions, both in terms of the domain and range values, with very small errors compared to the range values, despite the fact that the functions are clearly very different in nature.

Octave Scripts:

find_most_distant_pair

generate_categories_3d

generate_data_tree

left_right_delta_3d

optimize_categories_3d

partition_array_3d

predict_best_fit_tree

shape_to_data

test_entropy_array_3d

generate_data_space

## Predicting and Imitating Complex Data Using Information Theory

In Uncategorized on January 4, 2019 at 3:00 pm

I’m going to present two prediction algorithms, one with a run time that is polynomial in the number of items in the underlying dataset, and another that is extremely fast, with an average case run time that appears to be significantly less than O(log(n)), where n is the number of items in the dataset.

I believe this second algorithm could turn out to be a powerful and novel tool in AI, since, in addition to accurately predicting data, it is so fast that it can quickly take an entire dataset, and reproduce it, regardless of the complexity of the original dataset. Another way to think about this is that this prediction algorithm is so good and so fast, that random inputs to the algorithm end up generating a new version of the original dataset, allowing it to expand the domain of a function, and in the extreme case, quickly imitate the structure of a three-dimensional object.

This is possible because these algorithms don’t attempt to interpolate data, and as a result, any inputs that don’t map well to the existing dataset are rejected. This allows us to provide the algorithm with random inputs, some of which will successfully map to the original dataset, which will accumulate over time, and eventually generate a new dataset that is similar to the original. Put in more formal terms, not only can this algorithm predict values in the range of a function, it can also reproduce and expand the domain of a function that it has analyzed.

I believe generalizations of this approach could yield even more impressive results in other areas of AI, beyond spatial pattern recognition. As a result, this is a long post that goes through all of the theory underlying the prediction techniques that allow these algorithms to work.

Categorizing Prior Data

The first step to both prediction algorithms is to use the categorization algorithm I presented below in the previous post entitled, “Fast N-Dimensional Categorization Algorithm”. When applied to a dataset, the categorization algorithm takes a dataset, and, as its name suggests, partitions it into a series of categories. In order to get a better sense of how the prediction algorithms work, I’ll briefly review how the categorization algorithm works.

In addition to categorizing data, the categorization algorithm will also produce two sets of variables: a single value called “delta”, and a series of values called “anchors”. Each category generated by the algorithm has an anchor, which is the data point about which the category is centered. That is, the algorithm searches through the dataset and finds a set of data points about which the dataset is clustered. Each anchor defines a category, since the data within each category is clustered around its anchor. We don’t have to specify how many categories / clusters the algorithm should produce ex ante. Instead, the algorithm decides on its own how many categories are appropriate in the context of the dataset.

As noted, each category is centered around an anchor value. Specifically, all of the vectors within a given category are within delta of the anchor for that category. That is, the norm of the difference between each vector within a given category and the anchor for that category is always less than delta. In short, delta is the natural, in-context distance between elements of the same category, based upon some assumptions rooted in information theory that I explain in the previous post below.

For example, assume we have the following dataset of 3-space vectors:

data_array{1} = [5.19858 0.61943 0.00000];
data_array{2} = [9.82500 3.40620 10.0000];
data_array{3} = [4.76372 0.20080 0.00000];
data_array{4} = [5.54710 9.43410 10.0000];
data_array{5} = [6.91330 2.81581 0.00000];
data_array{6} = [9.05740 9.94230 20.0000];
data_array{7} = [2.86459 5.11068 0.00000];
data_array{8} = [5.99046 0.26573 0.00000];

The categorization algorithm can be applied to this dataset using the following code:

[data_categories_array category_vec anchor_array H_final delta] = optimize_categories_3D(data_array,0);

Application of the categorization algorithm to this dataset yields the following categorization:

data_categories_array =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000
}

That is, the algorithm divides the dataset into four categories, which appear to be most sensitive to the z-value of the data in this case. The first category contains 5 elements, and the remaining three categories each contain a single element. Note that the anchors generated by the algorithm are returned in the anchor_array, and in this case, the anchor for the first category is [5.19858 0.61943 0.00000]. The value of delta produced by the algorithm is 5.0711. As a result, all of the vectors in the first category are within delta of the first anchor vector, which you can check by taking the norm of the difference between the anchor and any vector in the first category. Note that applying the algorithm generates a single value of delta for all four categories.

In short, all vectors in the same category are within delta of the category’s anchor. As a result, though the goal of the algorithm is to generate categories, we can recharacterize the algorithm as generating a set of anchors, and a single value of delta, which will together define the categories it ultimately produces. That is, if I identify certain vectors within a dataset as anchors, and specify a value of delta, then I’ve defined a categorization of the dataset. We will use this fact later on to construct the “fast” prediction algorithm that looks only to the anchors and the value of delta. But first, we’ll begin with some assumptions and observations rooted in information theory that will inform our “full” prediction algorithm, which we’ll then approximate with the fast algorithm.

Categorization, Structure, and Entropy

The fundamental observation underlying nearly all of the work in AI that I’ve presented in this project is that if we want to find the “natural” structure of some artifact without any prior information, then we first represent the artifact as a mathematical object capable of producing a measure of entropy, such as a vector, matrix, or set of probabilities, and then repeatedly partition that representation until we find the point where the entropy of the partition changes the most. That is, we take the object, and iteratively partition it into smaller and smaller objects, which will affect our measurement of the object’s entropy. During this process, there’s going to be a level of subdivision where the structure of the partitioned object will “crack”, producing the maximum rate of change to the entropy of the partition, thereby uncovering the actual structure of the artifact which the object represents.

For an intuitive, visual explanation of how and why this process works, see the post below entitled, “Image Recognition with no Prior Information”.

Turning this view on its head, we can measure how well a new data point fits into an existing dataset by measuring how much the entropy of the categorization changes after inclusion of the new data point. That is, if a new data point is a good fit for a dataset, then it shouldn’t disturb the existing category structure much, resulting in a low change to the entropy of the category structure. In contrast, if a new data point is a bad fit for a dataset, then it will significantly disturb the structure of the categories, resulting in a large change to the entropy of the category structure.

Continuing with the example above, assume we’re given the following new data point:

data_array{9} = [5.19858 0.62000 0.00000];

This is obviously very close to the anchor of the first category, and as a result, it’s reasonable to view this data point as a good fit for the existing dataset. To test whether this is actually the case, we insert this new data point into the dataset, and then rerun the categorization algorithm on the resulting dataset. Because we’re rerunning the categorization algorithm, it is possible for the anchors to change. As a result, we can’t predict to which category the new data point will belong, since those categories could change as a result of the inclusion of the new data point. Instead, we measure how well the new data point fits by measuring how much the entropy of the categorization changes as a result of its inclusion.

The entropy of the categorization is calculated by taking the number of data points in each category, and dividing each by the total number of data points, which will produce a probability for each category. We then take the entropy of this set of probabilities as if it represented an actual probability distribution.

In this case, as expected, the new data point does not change the category structure significantly, and instead, the new data point is categorized together with what was the anchor for the first category in the prior run, producing the following category structure.

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76371 0.20080 0.00000

[1,3] =

6.91329 2.81581 0.00000

[1,4] =

2.86458 5.11067 0.00000

[1,5] =

5.99046 0.26573 0.00000

[1,6] =

5.19858 0.62000 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

The first category contains 6 elements, and the remaining 3 categories each contain 1 element. As a result, the entropy of the categorization is given by the following:

H = (6/9)*log2(9/6) + 3*((1/9)*log2(9)) = 1.4466.

The entropy of the original categorization is 1.5488. As a result, the change to the entropy of the categorization is in this case small, and negative, -0.10218, since the new data item increases the size of the first category, concentrating more of the data into a single category, and in turn reducing the entropy of the categorization (note that entropy is minimized when one category contains all of the data points).

Now assume that we are given the following new data point:

data_array{9} = [9.83000 3.40700 10.10000];

This is very close to the only vector in the second category, and as a result, this new data point should also be a good fit for the existing data. However, it’s not a good fit for the largest category, but is instead a good fit for a small category. In some sense, this implies that it’s actually not as good of a fit as the first new data point we considered above, since this new data point is similar to only one existing data point, as opposed to being similar to a large number of existing data points.

Application of the categorization algorithm produces the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.198583 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

[1,2] =

9.83000 3.40700 10.10000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

As expected, the category structure is not disturbed, and instead, the new data point is inserted into the same category as the existing data point to which it is most similar. In this case, the change in entropy is small, and positive, 0.10895, since the new data point increases the size of a small category, thereby increasing the entropy of the categorization (note that entropy is maximized when all categories contain an equal number of data points). In this case, the magnitude of the change in entropy is greater than it was in the previous example, which is consistent with our observation that while the new data point is certainly a good fit, it is not as a good of a fit as the first example we considered above.

Now assume that we’re given the following data point:

data_array{9} = [45.15000 16.80000 4.90000];

This data point is probably not a good fit for the existing data, as its distance from any anchor is significantly larger than delta. As a result, we should expect the inclusion of this new data point to disturb the category structure.

Application of the categorization algorithm produces the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

9.82500 3.40620 10.00000

[1,3] =

4.76372 0.20080 0.00000

[1,4] =

6.91330 2.81581 0.00000

[1,5] =

2.86459 5.110678 0.00000

[1,6] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

5.54710 9.43410 10.00000

[1,2] =

9.05740 9.94230 20.00000
}

[1,3] =
{
[1,1] =

45.15000 16.80000 4.90000

}

}

In this case, the change in entropy is significant, and negative, -0.32440, as the new data point forced the existing data into a smaller number of larger categories. Put informally, in the context of the new data point, the rest of the data seems closer together, and as a result, the categorization algorithm increases the value of delta, which is in this case 11.902, thereby concentrating all of the existing data into two categories, and creating a third category for the new data point. It is reasonable to conclude that this new data point is, therefore, not a good fit for the existing data.

Now assume we are given the following new data point:

data_array{9} = [15031.60000 1001.69000 10386.80000];

This is obviously not a good fit for the existing data, and therefore, will probably significantly disturb the existing category structure.

Application of the categorization algorithm yields the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

9.82500 3.40620 10.00000

[1,3] =

4.76371 0.20080 0.00000

[1,4] =

5.54710 9.43410 10.00000

[1,5] =

6.91329 2.81581 0.00000

[1,6] =

9.05740 9.94230 20.00000

[1,7] =

2.86458 5.11067 0.00000

[1,8] =

5.99046 0.26572 0.00000

}

[1,2] =
{
[1,1] =

15031.60000 1001.69000 10386.80000

}

In this case, the change in entropy is even more significant, and negative, -1.0455, as all of the existing data is forced into a single category due to an increase in delta, which is in this case 299.93.

As a result, this approach allows us to construct an objective binary test for whether or not a new data point is a good fit for an existing dataset. Specifically, if a new data point ends up in a category of one, on its own, then it is not a good fit for the existing data. If it is included in a category that contains at least two data points, then it is a good fit. Further, we can also measure how well the data fits into the dataset by measuring the change in entropy of the categorization. This will in turn allow us to distinguish among good fits, and among bad fits, thereby allowing us to objectively determine best fits and worst fits, and everything in between.

Predicting Unknown Variables

The key takeaway from the analysis in the sections above is that we can measure how well a new data point fits into an existing dataset by measuring the change in entropy of the categorization that results from the inclusion of the new data point. In this case, I’ve used my own categorization algorithm to demonstrate this concept, but we could in theory use any categorization / clustering algorithm, and then test the entropy of the resultant categorization in the same manner.

As a result, the principle is a general one: we can measure how well a new data point fits into a dataset by measuring the change in the entropy of the structure of the dataset that results from inclusion of the new data point.

Now assume that we are given a new data point, but that the z-value is unknown. For example, assume that we’re given the vector [5.19859 0.61944 Z]. Further, assume that we’d like to predict the value of Z. Without any context, there’s no value of Z that’s preferable to any other. That is, there’s no ex ante value of Z that makes more sense than any other, and as a result, we need a dataset from which we can derive a value of Z that makes sense in the context of that dataset. As a result, predicting the value of a single hidden variable is a fundamentally different problem than the other problems I’ve addressed previously in this project, none of which require any prior data. This is because a dataset, or an image, can create its own context from which information can be extracted using the methods I presented in the other posts below. In contrast, a single data point is just that – a point in isolation, with no relationships to consider, and therefore, no context to consider.

At the risk of harping on the wonders of music, which I’m certainly guilty of, this is like asking what the key of a song is when you’re only given one note. The correct answer is in this case, “I don’t know”. If you hear two notes, you have a better idea. Three notes, better still. As the number of notes played increases over time, you can more confidently guess the key of the song, which is the context in which you perceive the notes. Clever composers consciously toy with these ideas, playing the same melody in the context of two different key signatures, and deliberately crafting melodies that have no obvious key signature. Jazz musicians are notorious for doing this, particularly at the end of a song, where they’ll throw in a “surprise ending” that completely changes the context of everything you’ve just listened to. For those that are interested, have a listen to the end of this Ella Fitzgerald song, which I’ve cued up in the following link https://www.youtube.com/watch?v=mshV7ug8cdE&feature=youtu.be&t=438. At the 7:27 mark, the pianist suddenly goes off the reservation, quietly changing the key, and therefore, the context of the entire song.

Similarly, data needs context in order to be understood, and a single data point by definition has only a cosmic, universal context, which I’m certainly not prepared to address. As a result, if we want to predict some unknown variable, we need an assumed context in which we’re predicting that unknown variable.

So let’s consider the new data point in the context of the original dataset we analyzed above. The original dataset generates four categories, which are in turn defined by four anchors. As a result, if the new data point is a good fit for the dataset, then the value of Z has to be similar to the z-value of at least one anchor. Therefore, we can iterate through the z-values of each anchor, treating the z-value of each anchor as an assumed value for Z, insert the resultant data point into the existing dataset, and categorize the resulting new dataset. If there is at least one assumed value of Z that causes the new data point to be categorized together with an existing data point, then the x and y values of the new data point are a good fit for the existing dataset. If there are multiple assumed values of Z that cause the new data point to be categorized with some existing data points, then we can choose the value of Z that minimizes the change in the entropy of the categorization. That is, first we test to see if the new data point is a good fit by testing whether there’s at least one assumed value of Z that causes it to be categorized together with some existing data points, and then, if there’s more than one such value, we choose the value of Z that caused the least disturbance to the existing category structure.

In short, first we test to see if the new data point is a good fit, and if it is a good fit, then we choose the z-value that produces the best fit. We’ll address this topic again at the end of this section.

Let’s begin by assuming that Z = 0.00000, i.e., the z-value of the first anchor. This implies that,

data_array{9} = [5.19859 0.61944 0.00000];

Application of the categorization algorithm yields the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

[1,6] =

5.19859 0.61944 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

}

The change in entropy is in this case -0.10218, and as expected, the new data item fits well into the first category, since it is very close to the anchor for the first category.

Now assume that Z = 10, i.e., the z-value of the second anchor.

Application of the the categorization algorithm yields the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

6.91330 2.81581 0.00000

[1,4] =

2.86459 5.11068 0.00000

[1,5] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

9.82500 3.40620 10.00000

}

[1,3] =
{
[1,1] =

5.54710 9.43410 10.00000

}

[1,4] =
{
[1,1] =

9.05740 9.94230 20.00000

}

[1,5] =
{
[1,1] =

5.19859 0.61944 10.00000

}

}

In this case, the new data point is categorized on its own, which implies that assuming Z = 10 does not produce a good fit. The change in entropy is in this case 0.3312, since a new category is created, thereby increasing the entropy of the categorization. Assuming Z = 20 produces the same outcome, together implying that Z = 0 is the only good fit for the (x,y) values [5.19858 0.61944].

Note that this method does not interpolate the value of Z, but instead tests known values of Z taken from the dataset itself. As a result, using this approach on an outlier data point will cause the data point to fail to categorize (i.e., end up in a category by itself) for each assumed value of Z. Put simply, if the new data point fails to categorize for every known value of Z, then in that case, our dataset doesn’t provide any information about the value of Z, and therefore, we’re no better off than we would be in attempting to predict the value of Z in a vacuum.

Approximating the Full Algorithm

Though the method above is useful, theoretically interesting, and reasonably fast, we can approximate it using another approach that is extremely fast. Specifically, note that when the new data point is not an outlier, the category structure is not significantly disturbed, and in the worst case of a quasi-outlier, a new category of one is created, without otherwise disturbing the existing categories. This in turn implies that the value of delta should not change much as a result of the inclusion of the new data point into the dataset, provided it’s a reasonable fit for the dataset. Therefore, we can approximate the behavior of the prediction method described above by simply testing whether or not the new data point is within delta of each anchor, using the assumed z-values of each anchor. If this turns out to be the case, then we know that our new data point is a reasonably good fit for the existing dataset, since there is a category to which it would be included, if we ignore the effects of the inclusion of that new data point. In short, if we “freeze” the existing category structure and the value of delta, we can simply test whether our new data item is within delta of some anchor.

Let’s run through the same example given above, but this time, rather than rerun the categorization algorithm, we’re going to simply test whether or not the new data item is within delta of some anchor, using the original value of delta = 5.0711, using each assumed value of Z.

Our first assumed value of Z is 0, producing the vector [5.19859 0.61944 0]. This new data point is very close to the first anchor, and as a result, the norm of the difference between our new data item and the first anchor is approximately 0, which is less than delta. This implies that assuming Z = 0 causes our new data item to be a good fit for the data, and since the norm of the difference is roughly zero, there’s really no point in testing the rest of the anchors, at least in this case.

Now assume that our new data point is instead [150.60000 89.12000 Z]. Because delta = 5.0711, there is clearly no assumed value of Z that will cause this vector to be categorized using this approach.

As a result, outlier data points will fail to categorize using this approximated approach as well, and so this approach allows us to quickly say whether or not the not a new data point is a good fit for the dataset. As such, it approximates the binary test of the algorithm above, as to whether or not a new data point can be included in the dataset at all, which in turn tells us whether or not our dataset can tell us anything about the unknown value Z.

We can, if we want, also measure how well the new data point fits by measuring the change in entropy that results from inserting the new data point into the dataset. This is useful in the case where the data point is within delta of two anchors, producing what should therefore be two reasonable values of Z. Though we ultimately won’t make use of this technique in the interest of speed, I’ll spend a moment describing how we could do this. First, we again assume that the structure of the dataset is unchanged, but for the inclusion of the new data point. We then measure the change in entropy that results from including the new data point into each category to which it fits. We then chose the category that changes the entropy of the dataset the least. This approach will ensure that we always choose the largest category, as opposed to the tightest fit. Which approach is best will depend upon our goal. If we want our predictions to be as close as possible to the existing data, then we should choose the category for which the distance between our new data item and the anchor is minimized, i.e., the tightest fit. If instead we want to assume that the structure of our dataset should change as little as possible, then we should choose the category that results in the smallest change in entropy.

Though it might seem odd to make use of an operating assumption that the structure of the dataset should change the least, note that if we were to actually include each new data point into the dataset, then if we didn’t make this assumption, eventually we would erode and possibly even destroy the structure of the categories. That is, repeatedly choosing the tightest fit could end up completely changing the structure of the categories. Nonetheless, because we are using this approach to predict data, and not to actually update the data in our dataset, this concern is academic, at least in this case.

Finally, note that in any case, the runtime of this algorithm is O(M), where M is the number of anchors, which is necessarily less than or equal to n, the number of data points in the dataset. As a practical matter, the value of M will depend upon the data. If the data is highly randomized, then M will be approximately equal to n. If the data has structure, then M could be extremely small compared to n, resulting in an extremely fast runtime. For example, it’s not inconceivable that a dataset that comprised of millions of data points consists of only a few top level categories, which would result in a negligible runtime. Nonetheless, when dealing with a large number of data points, it could be useful to further subdivide the dataset into smaller, more detailed categories, beyond the top level categories generated by one level of categorization. This is the approach that we will ultimately take, producing a remarkably fast and powerful prediction algorithm.

Prediction Using a Data Tree

As noted in the previous section, it could in some cases be useful to create deeper categories, beyond the top level categories generated by one single application of the categorization algorithm. Returning to our example above, one application of the algorithm produced a first category that contained five elements. We could in turn run the categorization algorithm again on that category, which will produce the following:

data_categories_array =
{
[1,1] =
{
[1,1] =

5.19858 0.61943 0.00000

[1,2] =

4.76372 0.20080 0.00000

[1,3] =

5.99046 0.26573 0.00000

}

[1,2] =
{
[1,1] =

6.91330 2.81581 0.00000

}

[1,3] =
{
[1,1] =

2.86459 5.11068 0.00000

}

}

That is, because the first category contains multiple elements, it is possible to run the categorization algorithm one more time to see if any further meaningful subdivision is possible. It turns out that in this case, the first category can be further subdivided into a set of three points that are all very close to each other, and two remaining points that are relatively isolated.

The value of delta produced by the categorization algorithm is in this case 0.90492, which is significantly smaller than the value produced at the top level of categorization, 5.0711. As a result, by categorizing one level down, we can produce categories that are significantly tighter, which will in turn allow us to match new data points within much smaller categories, thereby presumably producing tighter predicted values of Z.

By repeatedly applying the categorization algorithm in this fashion, we will produce a hierarchy of categories, beginning with the top layer categories generated by applying the algorithm to the original dataset, and continuing downward until we produce singletons, and categories that simply cannot be further subdivided. Expressed visually, this will produce a tree of categories, and as a result, a tree of anchors, and a tree of values of delta.

If we want to find the tightest fit for a new data point, we can begin at the bottom of the anchor tree, and test whether that new data point fits into any of the bottom level anchors, which will by definition correspond to the smallest values of delta. If that doesn’t work, then we proceed one level up, and test whether the new data point fits within any of the anchors within the second lowest level of the tree.

This method is implemented by the attached algorithms, generate_data_tree, which iteratively applies the categorization algorithm until the data trees are produced, and predict_best_fit_tree, which takes in a new data item and attempts to place it at the lowest possible level in the resultant data trees. Assuming the dataset is fixed, we’ll only have to produce the trees once, meaning that the only algorithm we’ll need to run to predict the z-value of a new data point is predict_best_fit_tree.

We can apply this method to the new data point [5.19000 0.61900 Z] using the following code:

[category_tree delta_tree anchor_tree] = generate_data_tree(data_array);
new_data_item{1} = [5.19 0.619 0];

[category_index predicted_z final_delta] = predict_best_fit_tree(anchor_tree, delta_tree, new_data_item,1);

The returned value predicted_z gives us our desired prediction for the value of z, which will in this case be 0, as expected.

The runtime complexity of predict_best_fit_tree will depend upon the dataset in question, but common sense suggests that it is unlikely to exceed O(log^2(n)), since the number of top level categories is probably not going to exceed M = log(n), and the depth of the tree is probably not going to exceed log(M). As a practical matter, predict_best_fit_tree runs in a tiny fraction of a second on an ordinary commercial laptop. This makes it practical to repeatedly feed it random inputs, ultimately allowing it to reconstruct and expand both the domain and the range of a function that has been analyzed by the generate_data_tree function, which I’ll address in the next section.

Intimating Complex Data

In the big picture, interpolation methods take computationally complex data, and force it into some computationally cheap shape, like a line, curve, or surface. In contrast, the approach I’ve outlined above inserts new data into the roughly natural, and therefore, computationally complex structure of the original dataset, but in a computationally cheap manner. In short, rather than fight the structure of the underlying data, we work with it, and as a result, we won’t have a smooth domain unless the original domain is smooth. Similarly, we won’t have a smooth range unless the original range is smooth. Instead, what the approach above produces is a dataset that is quantized into a discrete number of increasingly smaller spheres, each centered around data points that are, at various levels of categorization, treated as the center points for the categories they define.

As a result, repeated application of the predict_best_fit_tree algorithm to randomly generated new data will allow us to identify new data points that are similar to the original domain and range of the function. This is accomplished by the attached imitate_data script, which, when given the data tree of a function, and the approximate boundaries of its domain, can effectively reproduce the original dataset.

I’ve attached a few examples of how these algorithms can be used to imitate complex datasets, and a text file that contains the Octave code that will let you generate the datasets for yourself. Any point that fails to categorize is painted grey and mapped to the 0-plane. All points that categorize successfully are painted blue.

1. The Cylindrical “Flower Pot”

In this example, I’ve used a built-in Octave function to produce a three-dimensional surface that looks a bit like a flower pot. I then take the underlying point information, process it using the generate_data_tree function, which repeatedly categorizes the point information as if it were a dataset, and then imitate that dataset using the imitate_data function. The results are pretty awesome, and clearly reproduce the overall three-dimensional shape of the surface.

The first attached image shows the original surface, the second image shows the imitated surface generated by the imitate_data function, and the third image shows the top-level categories generated by the categorization algorithm. Proximate ‘o’ rings painted the same color are part of the same category. Large rings are part of a large category, and small rings are part of a small category.

Though this is just a surface, in theory, these algorithms could reproduce both the surface, and the internal structure of a three-dimensional object, suggesting that these algorithms might have applications in 3D printing, autocad, etc.

2. The Noisy Sine Wave

This is a noisy, discontinuous sine curve that I then reproduced using the imitate_data function. Macroscopically, the reproduction almost looks smooth, but up close you can quite plainly see that the algorithm reproduces not only the overall shape and specific values of the function, but also the patchiness of the domain and range of the function.

This is a simple quadratic curve that is smooth compared to the other functions.

The relevant scripts are here:

find_most_distant_pair

generate data samples

generate_categories_3d

generate_data_tree

generate_n_random_categories_3d

imitate_data

left_right_delta_3d

optimize_categories_3d

partition_array_3d

predict_best_fit_tree

shape_to_data

test_entropy_array_3d

display_categorization_3d

## Note on the Riemann Zeta Function

In Uncategorized on December 28, 2018 at 12:28 pm

As you know, I’ve been doing research in information theory, and some of my recent work has some superficial connections to the Zeta function, since I was making use of logarithms of complex numbers.

I noticed the following graph:

In short, we’d be looking at the Zeta function using values of s produced by powers of the logarithm function:

$zeta(log^n (-1) /x).$

There appears to be several zeros for n = 6.

You can see it yourself on wolfram using the following link:

But, I don’t know enough about this topic to be sure that this is actually interesting, since I know that at least some of the zeros of this function are considered “trivial”.

Nonetheless, I thought I’d share this in case some of you found it interesting and had insights.

Regards,

Charles

## On the Applications of Information Theory to Physics

In Uncategorized on December 25, 2018 at 9:24 am

I’ve attached an updated draft of what is in effect a summary of the work I’ve been doing applying information theory to physics, that I thought some of you would find interesting, given the popularity of my work on the connections between information theory and artificial intelligence.

Despite the imposing title, the attached note is actually a straightforward explainer that a computer scientist or mathematician with almost no background in physics will have no trouble understanding. In fact, it arguably doubles as a quick explainer on what are in my opinion some very interesting issues in physics. In any case, I hope you enjoy it.

Happy Holidays,

Charles

Non-Local Interactions, Spin, the Strong and Weak Forces, Inertia, and Faraday Induction (1) (1)

## Fast N-Dimensional Categorization Algorithm

In Uncategorized on December 17, 2018 at 5:08 pm

In a previous post entitled, “Using Information Theory to Create Categories with No Prior Information”, I presented an algorithm that can quickly construct categories given a linear dataset with no other prior information. In this post, I’ll present a generalization of that algorithm that can be applied to a dataset of n-dimensional vectors, again with no prior information. Like all of the other work I’ve presented as part of this project, this algorithm is again rooted in information theory and computer theory. For an explanation as to why this algorithm works, you should see the post below, as this is really just a generalization of the previous algorithm applied to higher dimensional data, with no meaningfully new theory.

In this post, I’m going to focus on the results of this algorithm, which are quite good. I’ll also discuss the runtime complexity of the algorithm, which I believe to be worst-case $O(log(n)n^{D+1})$, where D is the dimension of the dataset, making this algorithm exceptionally fast for what it is, which is a data categorization algorithm.

As a general matter, the only thing that this algorithm requires is a dataset that has a measure function that maps to a real number. In simpler terms, all we need in order to make use of this algorithm is a function that can compare the “distance” between any two data points in the dataset. This means that we can apply this algorithm to any n-dimensional dataset simply by making use of the norm of the difference between any two vectors in the dataset, and that we can also apply this algorithm to non-Euclidean spaces, so long as we have a well-defined measure function on the space.

In a follow up post, I’ll show how we can actually use this algorithm to produce a well-defined measure in classification problems by iterating through different measures until we produce the “right” measure that results in the maximum change in the structure of our classification.

Runtime Complexity

The first thing that this algorithm does is test the symmetry of the dataset using a function called “partition_array”, which is attached together with all of the other scripts you need to run this algorithm. Partition_array repeatedly subdivides the space in which the dataset exists into smaller and smaller “chunks”, until each chunk contains a maximally different amount of information. In this case, just imagine breaking a cube of 3-space into smaller and smaller equally sized subcubes. As we do this, the number of data points within each subcube will vary as a function of the number of subdivisions, with the number of data points per subcube generally decreasing as a function of the number of subdivisions, simply because each subcube will get smaller as we increase the number of subdivisions.

For reasons explained below, this algorithm stops once it finds the partition size that maximizes the standard deviation of the information contained within each subcube. This will produce a number N, which is how many times we’ve subdivided each dimension in order to achieve this maximization. It turns out that N is a measure of how symmetrically the data is distributed within its own space. If N is large, then the data is symmetrically distributed. If N is small, then the data is asymmetrically distributed.

The partition_array algorithm begins by making an initial guess as to the value of N, which it sets to the log of the number of the items in the dataset. Then, depending upon the characteristics of the dataset, it will either increase its guess, or decrease its guess, by either multiplying its first guess by 2, or dividing its first guess by 2, respectively. It will continue doing this until it finds the value of N that maximizes the standard deviation of the information contained in each subcube, which will cause the value of N to increase, or decrease, exponentially. However, the algorithm prevents N from increasing past the number of items in the dataset, and also prevents it from decreasing below 1. As a result, if n is the number of items in the dataset, then the partition_array algorithm runtime complexity cannot exceed log(n).

For each value of N generated by partition_array, another function called “test_entropy” is called. This is the function that actually tests the information content of each subcube. As part of this function, we will have to test each item in the dataset to determine which subcube it belongs to. Because we divide each dimension N times, which generates $N^3$ subcubes, and N is necessarily less than or equal to n, it follows that the number of subcubes cannot exceed $n^3$. As a result, we’ll have to do at most $n^4$ comparisons each time we call test_entropy.

As a general matter, this implies that the runtime complexity of this phase of the algorithm is necessarily less than,

$log(n)n^{D+1}$,

where D is the dimension of the space. In this case, D = 3, and therefore, the worst-case runtime complexity is $log(n)n^4$.

Though this is only the preprocessing phase of the algorithm, it is actually the part of the algorithm with the highest worst-case runtime complexity. As a result, this algorithm has an overall worst-case runtime complexity of $O(log(n)n^{D+1})$, making it extremely fast for what it does.

Anecdotally, I’ve noticed that the algorithm makes short work of data that actually has structure, and struggles with data that is truly randomized. This is not surprising, since truly randomized data should be highly symmetrical, forcing the preprocessing stage into the worst-case runtime (i.e., producing a very large value of N).

Application to Data

I’ve attached a set of scripts that make it easy to test the algorithm, which should be called from the Octave command line as follows:

data_categories_array = optimize_categories_3D(data_array);

[X Y Z S C] = display_categorization_3D(data_categories_array,num_items);

figure, scatter3(X,Y,Z,S,C);

The generate_n_random_categories_3D script does exactly what its name suggests, which is to generate categories of random vectors in 3-space. The script creates what we know to be categories of data by randomly generating a set of seed values, around which it will generate some random data points.

So, for example, if num_categories = 2, and num_items = 1000, then it will create 2 categories of data that each contain 500 data points. This is accomplished by generating 2 seed locations, about which each category is centered. These seed locations can be anywhere from [0 0 0] to [base base base]. As a result, the greater the value of base is, the greater the space in which the data exists. Therefore, a small number of categories and a large value of base increases the probability that our categories will be very far apart.

The min_spread and max_spread control how diffuse the data is around each category’s central point. The first category generated has the minimum diffusion, and the last category generated has the maximum diffusion, with diffusion increasing linearly with each category generated. A high value of adjustment in essence forces each category’s central point to be generated along the same line, and a zero value of adjustment allows the central points to be generated randomly throughout the entire space from [0 0 0] to [base base base].

The “optimize_categories_3D” algorithm takes the data, and sorts it into categories, using the process I described above, which are then formatted for display by the “display_categorization_3D” function.

Data points are represented by the little ‘o’ rings in the attached images. When two data points are part of the same category, the display algorithm paints them with the same color. However, because category colors are randomly generated, it’s possible for two different categories of data to be assigned approximately the same color, and, though extremely unlikely, it’s even possible they’re assigned the exact same color. As a result, local colors are what to look for if you want to determine what data points are part of the same category.

The size of the ring representing a data point is determined by the size of the category to which the data point belongs. If the data point is part of a big category, then the ring will be big. If the data point is part of a small category, then the ring will be small.

I’ve attached graphs from two series of datasets, one for which the value of adjustment is high, forcing categories to be generated along a roughly straight line, and another for which the value of adjustment is zero, allowing categories of data to be spontaneously generated anywhere within the space bounded by [0 0 0] to [base base base].

The first series consists of 11 datasets, with all of the inputs to the random data generator remaining constant for each data set, except the value of max_spread, which increases from the first dataset to the last dataset. Each dataset in the first series consists of 50 categories, with 10 points in each category, for a total of 500 data points per data set. The first dataset in this series looks like a series of tightly packed rings, since for the first dataset, the min_spread equals the max_spread, meaning that each of the categories in this dataset should be roughly identically distributed, and just seeded at different locations in the space. As we move through this series of datasets, max_spread increases, meaning that the first category in each dataset will be more tightly packed than the last category in that same data set. This is why the last dataset in this series looks like a lot like a confetti cannon, since the first category in that data set is tightly packed, and the last category is extremely diffuse, representing the maximum possible spread for that entire data set.

Looking at the results, you can clearly see that the algorithm generates intuitively correct categorizations, grouping tightly packed clusters into a small number of relatively large categories, and diffuse clusters into a large number of relatively small categories. The last two charts show how many categories the algorithm actually generates for each series. Specifically, as max_spread increases, the number of categories generated for each of the 11 datasets in this series is as follows:

46 42 46 66 73 78 88 86 106 85 126

Note that 50 categories is in some sense the “correct” answer, since that is the number of clusters generated by the underlying random data generator. However, as the diffusion of each cluster increases, the clusters will become more diffuse, causing each cluster to lose its structure, and causing neighboring clusters to overlap. As a result, we should expect the number of categories identified by the algorithm to increase as a function of diffusion, which is exactly what happens.

The next series of datasets is generated using exactly the same variables, except the value of adjustment is set to 0, and the difference between the min and max spread is smaller to account for the smaller space the data occupies (this is a consequence of adjustment being set to 0). As a result, in this case, the seed values for the clusters propagate randomly throughout the entire space. This causes the datasets to start out in little tightly packed clusters that are randomly distributed, and eventually diffuse into generally unstructured data with a few small local clusters that the algorithm does a great job of categorizing together, when appropriate.

As max_spread increases, the number of categories generated for each of the 11 data sets is as follows:

46 180 265 321 261 297 331 329 365 375 246

Again, 50 is arguably the “correct” answer. However, in this case, the algorithm is clearly far more sensitive to diffusion in the data, which isn’t surprising, since in this case, we haven’t constrained the diffusion at all, but instead, have allowed the data to propagate randomly throughout the entire space.

To generate the “confetti cannon” data, use the following:

base = 10;
num_items = 500;
num_categories = 50;

for i = 0 : 10

data_categories_array = optimize_categories_3D(data_array);
[X Y Z S C] = display_categorization_3D(data_categories_array,num_items);
figure, scatter3(X,Y,Z,S,C)

endfor

To generate the randomly clustered data, use the following:

base = 25;
num_items = 500;
num_clusters = 50;

for i = 0 : 10

clear data_categories_array
tic; data_categories_array = optimize_categories_3D(data_array);toc
[X Y Z S C] = display_categorization_3D(data_categories_array, 500);
figure, scatter3(X,Y,Z,S,C)

endfor

In a follow up post, I’ll show how we can categorize new data, and predict to which category new data fits best, using the opposite approach: that is, we find the dataset of best-fit for new data by including the new data into a series of datasets, and the dataset that changes the least in terms of the structure of its categories is the dataset to which the new data fits best. We measure the change in structure of the categories by using the entropy of the categorization, just as we did above. In short, new data fits best where it disturbs category structure the least, and we measure change in category structure using the entropy of the categorization.

The relevant Octave scripts are available here:

display_categorization_3D

find_most_distant_pair

generate_categories_3D

generate_n_random_categories_3D

left_right_delta_3D

optimize_categories_3D

partition_array_3D

test_entropy_array_3D

## A Simple Model of Non-Turing Equivalent Computation

In Uncategorized on December 14, 2018 at 5:54 pm
Alan Turing’s statement of what is now known as the Church-Turing Thesis asserts that all forms of “mechanical” computation are equivalent to a Universal Turing Machine. The Church-Turing Thesis is not a mathematical theorem, but is instead a hypothesis that has turned out to be true as an empirical matter. That is, every model of computation that has ever been proposed has been proven to be capable of computations that are either equivalent to, or a subset of, the computations that can be performed by a UTM.

Nonetheless, as I’ve discussed in previous posts, there are good reasons to believe that nature itself is capable of generating non-computable processes. First off, it is mathematically impossible for a UTM to generate information (see the post below for a simple proof). Nonetheless, human beings do this all the time. In fact, I’m doing exactly that right now.

At the same time, Alan Turing is one of my personal heroes, so it’s not easy for me to disagree with him. He’s also a cosmically intelligent human being that I don’t disagree with, without extreme caution. As a result, I’ll hedge my bets and say that what I present below is exactly what he was hoping to be true: that there is a simple model of computation beyond the UTM, and that nature is, as common sense suggests, stranger and more complicated than even the strangest fiction.

The “Box”

We naturally associate computation with symbolic computation, since that’s what we’re taught to do from the time we’re born. That is, computation is what you to do numbers and symbols, and computers are tools you use to do a lot of computations in a short amount of time. This view is embedded in the Church-Turing Thesis itself, in that Turing described computation as an inherently “mechanical” process. But when you compress the input to a UTM maximally, what you really have is a “trigger”. That is, there’s some string that we provide as the input to some box, and then that box spontaneously generates some output. That input string will probably be totally unrecognizable as anything that should be associated with the output that it will ultimately generate, since, by definition, we’ve eliminated all the compressible structure in the input, leaving what will be a maximally randomized kernel.

Abstracting from this, we can define a kind of computation that is not the product of mathematical operations performed on a set of symbols. Instead, we can view the input to a computational device as a trigger for the outputs that follow. This view allows us to break free of the mathematical restraints of mechanical computation, which prevent us from generating complexity. That is, as noted above, and proven below, the complexity of the output of a UTM can never exceed the complexity of its input. If, however, we treat the input to a device as a trigger, and not the subject of some mathematical operation, then we can map a finite string to anything we’d like, including, for example, a particular configuration of a continuous surface.

Implementing this would require a continuous system that responds to inputs. Some kind of wave seems to be the best candidate, but I’m not claiming to know what physical systems we should use. Instead, I’m presenting what I believe to be a simple mathematical model of computation that is on its face non-computable, and leaving it up to the adults in the room to figure out how to implement it.

Assume that we have some input m, that we give to our computational device X.
In particular, let’s assume X is always in some state that we’ll represent as a series of integer values mapped to an uncountable space. That is, the state of X can be represented as an uncountable multi-set of integers, which for convenience we’ll imagine being represented as heights in a plane. Specifically, imagine an otherwise flat plane that has some height to it, with the height of the plane at a given point representing the value of the state of X at that point, and the aggregate set of heights representing the overall state of X at a given moment in time.

When X is given some input m, assume that its state changes from some X_0, to some other state X(m).

We could assume that the state of X changes for some period of time after receiving the input m, or switches immediately to X(m). Either way, the important part is to assume that X is, by definition, capable of being in states that require an infinite amount of information to fully represent. We could of course also allow X to be in states that require only a finite amount of information to represent, but the bit that will make X decidedly non-computable is the ability to be in a state that is in effect a chaotic wave. That is, a discontinuous set of integer values mapped onto a continuous surface, that, as a result, cannot be compressed into some finite representation.

If such a system exists, X would be, by its nature, non-computable, since, first of all, X generates states that require an uncountable number of integers to fully represent. Since a UTM can generate only finite outputs, the states of X cannot be simulated by a UTM. Further, X is capable of being in non-compressible, infinitely complex states, which means that there is no finite compression of at least some of the states of X, again implying that, even if we could “read” every point on the surface of X, there are at least some states of X that could not be compressed into a finite string that could in turn generate the complete state of X on some UTM.

As a result, if such a system exists, then we would be able to take a finite string, and trigger an output that contains an infinite amount of information, generating complexity. Further, by juxtaposing two waves directly on top of each other, we would presumably generate constructive interference, the result of which we could define as the “sum” of two inputs. By juxtaposing two slightly offset waves, we would presumably generate destructive interference, the result of which we could define as the “difference” between two inputs. As a result, if we could understand and control such a system, perhaps we could perform otherwise non-computable computations.

## Correlation, Computability, and the Complexity of Music

In Uncategorized on December 13, 2018 at 11:23 am

If we know truly nothing at all about a data set, then the fact that the data is presented to us as a collection of vectors does not necessarily imply that there is any connection between the underlying dimensions of the vectors. That is, it could just be a coincidence that has caused these otherwise independent dimensions of data to be combined into vector form. This suggests that whether we create categories based upon the vectors as a whole, or the individual dimensions of the vectors, will depend upon our ex ante assumptions about the data.

Even in the case where the components of the vectors are not statistically correlated, it could still nonetheless be rational to treat the components as part of a whole. This would be the case whenever a combination of underlying characteristics affects the whole. Color is a good example of this. As a general matter, we’ll probably want to categorize colors as a whole (i.e., a single RGB vector), but the individual components of a data set of colors might not necessarily be correlated with each other. That is, we could be given a data set of colors where the red, blue, and green luminosity levels are all statistically independent of each other. Nonetheless, the combinations of color luminosities determine the perceived colors, and therefore, we can rationally construct categories using entire vectors, and not just the components of the vectors, despite the fact that the components of the vectors might be statistically independent of each other. In this case, this is driven by a perceptual phenomenon, since it just happens to be the case that the brain combines different exogenous wavelengths of light into a single perceived color.
This example highlights the distinction between (1) statistical correlation between the components of a vector, and (2) the co-relevance of the components of a vector in their contribution to some whole that is distinct from its underlying components.

Similar ideas apply to music, where a chord produces something that is distinct from its individual components. That is, when a single note is played, there is no harmony, since by definition there is only one note. This is in some sense a point of semantics, but it can also be expressed mathematically. That is, when a single note is played, there is no relationship to be considered between any two notes, since there is, of course, only one note. When two notes are played, however, not only are there two auditory signals being generated, but there is a third distinct artifact of this arrangement, which is the relationship between the two notes. As we add notes to the chord, the number of relationships between the notes increases.

We can count these relationships using simple combinatorics. For example, 3 notes played simultaneously creates 7 distinct perceptual artifacts. Specifically, there are the 3 individual notes; the 3 combinations of any two notes; and the 1 combination of all 3 notes. An untrained musician might not be conscious of these relationships, whereas a trained musician will be. But in either case, in a manner more or less analogous to how a blue and a red source produce magenta, which is a non-spectral color, two or more notes generate higher order perceptual experiences that are fundamentally different than those generated by their individual components. That is, harmony is a perceptual experience that must be distinct from the signals generated by its underlying components, since certain combinations of notes are known to be harmonious, whereas others are not, and instead produce dissonance (i.e., combinations of notes that “clash”).

Unlike visual art, which exists in a Euclidean space, music extends through time, in a fairly rigorous manner, with definite mathematical relationships between the notes played at a given moment, and between the notes played over time. Moreover, these relationships change in a non-linear manner as a function of the underlying variables. For example, a “major third” is arguably the most pleasant sound in music, and is generally associated with an expression of joy (think of the melody from Beethoven’s, “Ode to Joy”). One half-step down (which is the minimum decrement in the 12-tone scale), and we find the minor third, which, while harmonious, is generally associated with an expression of sadness (think of the opening to Beethoven’s, “Moonlight Sonata”). One whole-step down from a major third, we find a harmonically neutral combination between the root of a chord and the second note in the related scale. That is, adding this note to a chord doesn’t really change the character of the chord, but adds a bit of richness to it. In contrast, one whole step up from a major third, and we find a tri-tone, which is a dissonance so visceral, it’s generally associated with evil in cheesy horror movies, producing a demented sound (have a listen to the opening of “Totentanz”, by Franz Liszt).

In short, though there are undoubtedly patterns in music, the underlying space is extremely complex, and varies in an almost chaotic manner (at least over small intervals) as a function of its fundamental components.

This suggests that generating high-quality, complex music is probably a much harder problem than generating high-quality visual art. With all due respect to visual artists, the fact is that you can add statistical noise to the positions of pixels in a Picasso, and it will still look similar to the original piece. Similarly, you can add statistical noise to its colors, and nonetheless produce something that looks close to the original piece. As a result, it suggests that you can “approximate” visual art using statistical techniques. This is a consequence of the space in which visual art exists, which is a Euclidean physical space, and a roughly logarithmic color space. In contrast, if you “blur” Mahler’s 5th Symphony, changing the notes slightly, you’re going to produce a total disaster. This is a consequence of the underlying space in music, which is arguably chaotic over small intervals, though it certainly has patterns over large intervals.

Upon reflection, it is, therefore, actually remarkable that human beings can take something as complex as a symphony, which will have an enormous number of relationships to consider, that change randomly as a function of their underlying variables, and reduce it to a perceptual experience that is either harmonious or not. The ability to create something so complex that is nonetheless structured, and perceived in a unitary manner by others, borders on the astonishing.

It suggests that the minds of people like Mozart, Beethoven, and Brahms, could provide insights into how some human beings somehow operate as net contributors of structured information, despite the fact that it is mathematically impossible for a classical computer to generate “new information”, since the Kolmogorov complexity of the output of a Turning Machine is always less than or equal to the complexity of its input. That is, a Turing Machine can alter, and destroy information, but it cannot create new information that did not exist beforehand.
This can be easily proven as follows:

Let K(x) denote the computational complexity of the string x, and let y = U(x) denote the output of a UTM when given x as input. Because x generated y, by definition, K(y) <= |x|. Put informally, K(y) is the length, measured in bits, of the shortest program that generates y on a UTM. Since x generates y when x is given as the input to a UTM, it follows that K(y) can’t be bigger than the length of x. This in turn implies that K(y) <= K(x) + C. That is, we can generate y by first running the shortest program that will generate x, which has a length of K(x), and then feed x back into the UTM, which will in turn generate y. This is simply a UTM that “runs twice”, the code for which will have a length of C that does not depend upon x, which proves the result. That is, there’s a UTM that always runs twice, and the code for that machine is independent of the particular x under consideration.

We could, therefore, take the view that meaningful non-determinism is the best evidence for computation beyond what is possible by a UTM.

That is, if a source generates outputs, the complexities of which consistently exceed the aggregate complexities of any apparent inputs, then that source simply cannot be computable, since, as we just proved above, computable processes cannot generate complexity. If it is also the case that this source generates outputs that have structure, then we cannot say that this source is simply producing random outputs. Therefore, any such source would be a net contributor of structured information, which means that the source would be a non-random, non-computable source.

I am clearly suggesting the possibility that at least some human beings are capable of producing artifacts, the complexities of which exceed the aggregate complexities of any obvious sources of information. In short, human creativity might be the best evidence for non-random, non-computable processes of nature, which would in turn, imply that at least some human beings are fundamentally different from all known machines. This view suggests that, similarly, our greatest mathematicians weren’t operating as theorem provers, beginning with assumptions and mechanistically deducing conclusions, but were perhaps arriving at conclusions that did not follow from any obvious available sources of information, with minds that made use of processes of nature that we do not yet fully understand. This is probably why these people are referred to as geniuses. That is, the artifacts produced by people like Newton, Gauss, and Beethoven are astonishing precisely because they don’t follow from any obvious set of assumptions, but are instead only apparent after they’ve already been articulated.

But in addition to the admittedly anecdotal narrative above, there is also a measure of probability developed by Ray Solomonoff that provides a more convincing theoretical justification for the view that human creativity probably isn’t the product of a computable process. Specifically, Solomonoff showed that if we provide random inputs to a UTM (e.g., a binary coin toss), then the probability of that UTM generating a given output string x is given by,

$p \approx 1/2^{K(x)}$,

where K(x) is the same Kolmogorov complexity of x we just discussed above. That is, the probability that a UTM given random inputs generates a given string x is approximately equal to $1/2^{K(x)}$.

We can certainly iteratively generate all binary inputs to a UTM, and it is almost certainly the case that, for example, no one has stumbled upon the correct input to a UTM that will generate Mahler’s 5th symphony. So, if we want to argue that the creative process is nonetheless the product of a computable process, it follows that the computable process, in this case, Mahler’s creative process, is the product happenstance, where a series of random inputs serendipitously found their way to Gustav Mahler, ultimately causing his internal mental process to generate a masterpiece.

In addition to sounding ridiculous when stated in these terms, it turns out that Solomonoff’s equation above also casts serious doubt on this as a credible possibility. Specifically, because we’ve presumably yet to find the correct input to a UTM that will generate Mahler’s 5th symphony, this input string is presumably fairly large. This implies that the probability that an encoded version of Mahler’s 5th will be generated by a UTM given random inputs is extremely low. As a result, we’re left with the conclusion that large, high-complexity artifacts that nonetheless have structure are probably not the product of a random input being fed to a UTM. Moreover, such artifacts are even less likely to be the product of pure chance, since K(x) <= |x| + C. That is, we can just feed the input x to a UTM that simply copies its input, so a string is never more complex than its own length plus a constant. As a result, assuming x is an encoding of Mahler’s 5th symphony, we’re probably far more likely to randomly generate y, for which U(y) = x, than we are to generate x itself. But as we just showed above, both of these outcomes have probabilities so small that it’s probably more sensible to assume that we just don’t understand how some people think.

As a result, Solomonoff’s equation expresses something we all know to be true in mathematical terms: I can toss coins for a billion years, and I’ll still never produce something like Mahler’s 5th. In the jargon of computer theory, Mahler’s 5th Symphony might be the best evidence that the Church-Turing Thesis is false.

This view is even more alarming when you consider the algorithmic probability of generating a DNA molecule…

## Using Information Theory to Create Categories with No Prior Information

In Uncategorized on December 11, 2018 at 12:03 pm

In a previous post below entitled, “Image Recognition with No Prior Information”, I introduced an algorithm that can identify structure in random images with no prior information by making use of assumptions rooted in information theory. As part of that algorithm, I showed how we can use the Shannon entropy of a matrix to create meaningful, objective categorizations ex ante, without any prior information about the data we’re categorizing. In this post, I’ll present a generalized algorithm that can take in an arbitrary data set, and quickly construct meaningful, intuitively correct partitions of the data set, again with no prior information.

Though I am still conducting research on the run-time of the algorithm, the algorithm generally takes only a few seconds to run on an ordinary commercial laptop, and I believe the worst case complexity of the algorithm to be $O(log(n)n^2 + n^2 + n)$, where n is the number of data points in the data set.

The Distribution of Information

Consider the vector of integers  x = [1 2 3 5 10 11 15 21]. In order to store, convey, or operate on x, we’d have to represent x in some form, and as a result, by its nature, x has some intrinsic information content. In this case, x consists of a series of 8 integers, all less than $2^5 = 32$, so we could, for example, represent the entire vector as a series of 8, 5-bit binary numbers, using a total of 40 bits. This would be a reasonable measure of the amount of information required to store or communicate x. But for purposes of this algorithm, we’re not really interested in how much information is required to store or communicate a data set. Instead, we’re actually more interested in the information content of certain partitions of the data set.

Specifically, we begin by partitioning x into equal intervals of Δ = (21-1) / N, where N is some integer. That is, we take the max and min elements, take their difference, and divide by some integer N, which will result in some rational number that we’ll use as an interval with which we’ll partition x. If we let N = 2, then Δ = 10, and our partition is given by {1 2 3 5 10 11} {15 21}. That is, we begin with the minimum element of the set, add Δ, and all elements less than or equal to that minimum + Δ are included in the first subset. In this case, this would group all elements from 1 to 1 + Δ = 11. The next subset is in this case generated by taking all numbers greater than 11, up to and including 1 + 2Δ = 21. If instead we let N = 3, then Δ = 6 + ⅔, and our partition is in this case given by {1 2 3 5} {10 11} {15 21}.

As expected, by increasing N, we decrease Δ, thereby generating a partition that consists of a larger number of smaller subsets. As a result, the information content of our partition changes as a function of N, and eventually, our partition will consist entirely of single element sets that contain an equal amount of information (and possibly, some empty subsets). Somewhere along the way from N = 1 to that maximum, we’ll reach a point where the subsets contain maximally different amounts of information, which we treat as an objective point of interest, since it is an a priori partition that generates maximally different subsets, at least with respect to their information content.

First, we’ll need to define our measure of information content. For this purpose, we assume that the information content of a subset is given by the Shannon information,

$I = log(1/p)$,

where p is the probability of the subset. In this case, the partition is a static artifact, and as a result, the subsets don’t have real probabilities. Nonetheless, each partition is by definition comprised of some number of m elements, where m is some portion of the total number of elements M. For example, in this case, M = 8, and for Δ = 6 + ⅔, the first subset of the partition consists of m = 4 elements. As a result, we could assume that p = m/M = ½, producing an information content of log(2) = 1 bit.

As noted above, as we iterate through values of N, the information contents of the subsets will change. For any fixed value of N, we can measure the standard deviation of the information contents of the subsets generated by the partition. There will be some value of N for which this standard deviation is maximized. This is the value of N that will generate maximally different subsets of x that are all nonetheless bounded by the same interval Δ.

This is not, however, where the algorithm ends. Instead, it is just a pre-processing step we’ll use to measure how symmetrical the data is. Specifically, if the data is highly asymmetrical, then a low value of N will result in a partition that consists of subsets with different numbers of elements, and therefore, different measures of information content. In contrast, if the data is highly symmetrical, then it will require several subdivisions until the data is broken into unequally sized subsets.

For example, consider the set {1, 2, 3, 4, 15}. If N = 2, then Δ = 7, and immediately, the set set is broken into subsets of {1, 2, 3, 4} and {15}, which are of significantly different sizes when compared to the size of the original set. In contrast, given the set {1, 2, 3, 4, 5}, if N = 2, then Δ = 2, resulting in subsets of {1, 2, 3} {4, 5}. It turns out that the standard deviation is in this case maximized when N = 3, resulting in a partition given by {1, 2} {3} {4, 5}. We can then express the standard deviation of the partition as an Octave vector:

std( [ log(5/2) log(5) log(5/2) ] ) = 0.57735.

As a result, we can use N as a measure of symmetry, which will in turn inform how we ultimately group elements together into categories. The attached partition_vec script finds the optimum value of N.

Symmetry, Dispersion, and Expected Category Sizes

To begin, recall that N can be used as a measure of the symmetry of a data set. Specifically, returning to our example x = [1 2 3 5 10 11 15 21] above, it turns out that N = 2, producing a maximum standard deviation of 1.1207 bits. This means that it requires very little subdivision to cause x to be partitioned into equally bounded subsets that contain maximally different amounts of information. This suggests that there’s a decent chance that we can meaningfully partition x into categories that are “big” relative to its total size. Superficially, this seems reasonable, since, for example, {1 2 3 5} {10 11 15} {21} would certainly be a reasonable partition.

As a general matter, we assume that a high value of N creates an ex ante expectation of small categories, and that a low value of N creates an ex ante expectation of large categories. As a result, a high value of N should correspond to a small initial guess as to how different two elements need to be in order to be categorized separately, and a low value of N should correspond to a large initial guess for this value.

It turns out that the following equation, which I developed in the context of recognizing structure in images, works quite well:

$divisor = N^{N(1-symm)}/ 10$,

where divisor is the number by which we’ll divide the standard deviation of our data set, generating our initial minimum “guess” as to what constitutes a sufficient difference between elements in order to cause them to be categorized separately. That is, our initial sufficient difference will be proportional to s / divisor.

“Symm” is, as the name suggests, yet another a measure of symmetry, and a measure of dispersion about the median of a data set. Symm also forms the basis of a very simple measure of correlation that I’ll discuss in a follow up post. Specifically, symm is given by the square root of the average of the sum of the squares of the differences between the maximum and minimum elements of a set, the second largest and second smallest element of a set, and so on.

For example, in the case of the vector x, symm is given by,

$[\frac{1}{8} ( (21 - 1)^2 + (15 - 2)^2 + (11 -3)^2 + (10 - 5)^2 + (10 - 5)^2 + (11 -3)^2 + (15 - 2)^2 + (21 - 1)^2 )]^{\frac{1}{2}},$

which is 12.826.

As we increase the distance between the elements from the median of a data set, we increase symm. As we decrease this distance, we decrease symm. For example, symm([ 2 3 4 5 6 ]) is greater than symm([ 3 3.5 4 4.5 5 ]). Also note that a data set that is “tightly packed” on one side of its median and diffuse on the other will have a lower value of symm than another data set that is diffuse on both sides of its median. For example, symm([ 1 2 3 4 4.1 4.2 4.3 ]) is less than symm([ 1 2 3 4 5 6 7 ]).

For purposes of our algorithm, symm is yet another a measure of how big we should expect our categories to be ex ante. A large value of symm implies a data set that is diffuse about its median, suggesting bigger categories, whereas a small value of symm implies a data set that is tightly packed about its median, suggesting smaller categories. For purposes of calculating divisor above, we first convert the vector in question into a probability distribution by dividing by the sum over all elements of the vector. So in the case of x, we first calculate,

x’ = [0.014706 0.029412 0.044118 0.073529 0.147059 0.161765 0.220588 0.308824].

Then, we calculate symm(x’) = 0.18861. Putting it all together, this implies divisor = 0.30797.

Determining Sufficient Difference

One of the great things about this approach is that it allows us to define an objective, a priori measure of how different two elements need to be in order to be categorized separately. Specifically, we’ll make use of a technique I described in a previous post below that iterates through different minimum sufficient differences, until we reach a point where the structure of the resultant partition “cracks”, causing the algorithm to terminate, ultimately producing what is generally a high-quality partition of the data set. The basic underlying assumption is that the Shannon entropy of a mathematical object can be used as a measure of the object’s structural complexity. The point at which the Shannon entropy changes the most over some interval is, therefore, an objective local maximum where a significant change in structure occurs. I’ve noticed that this point is, across a wide variety of objects, including images and probabilities, where the intrinsic structure of an object comes into focus.

First, let maxiterations be the maximum number of times we’ll allow the main loop of the algorithm to iterate. We’ll want our final guess as to the minimum sufficient difference between categories to be s, the standard deviation of the data set. At the same time, we’ll want our initial guess to be proportional to s / divisor. As a result, we use a counter that begins at 0, and iterates up to divisor, in increments of,

$increment = divisor / maxiterations$,

This allows us to set the minimum sufficient difference between categories to,

$delta = s(counter / divisor)$,

which will ultimately cause delta to begin at 0, and iterate up to s, as counter iterates from 0 to divisor, increasing by increment upon each iteration.

After calculating an initial value for delta, the main loop of the algorithm begins by selecting an initial “anchor” value from the data set, which is in this case simply the first element of the data set. This anchor will be the first element of the first category of our partition. Because we are assuming that we know nothing about the data, we can’t say ex ante which item from the data set should be selected as the initial element. In the context of image recognition, we had a bit more information about the data, since we knew that the data represented an image, and therefore, had a good reason to impose additional criteria on our selection of the initial element. In this case, we are assuming that we have no idea what the data set represents, and as a result, we simply iterate through the data set in the order in which it is presented to us. This means that the first element of the first category of our partition is simply the first element of the data set.

We then iterate through the data set, again in the order in which it is presented, adding elements to this first category, provided the element under consideration is within delta of the anchor element. Once we complete an iteration through the data set, we select the anchor for the second category, which will in this case be the first available element of the data set that was not included in our first category. This process continues until all elements of the data set have been included in a category, at which point the algorithm measures the entropy of the partition, which is simply the weighted average information content of each category. That is, the entropy of a partition is,

$H(P) = \Sigma (p log(1 / p))$.

We then store H, increment delta, and repeat this entire process for the new value of delta, which will produce another partition, and therefore, another measure of entropy. Let H_1 and H_2 represent the entropies of the partitions generated by delta_1 and delta_2, respectively. The algorithm will calculate $(H_1 - H_2)^2$, and compare it to the maximum change in entropy observed over all iterations, which is initially set to 0. That is, as we increment delta, we measure the rate of change in the entropy of the partition as a function of delta, and store the value of delta for which this rate of change is maximized.

As noted above, it turns out that, as a general matter, this is a point at which the inherent structure of a mathematical object comes into focus. This doesn’t imply that this method produces the “correct” partition of a data set, or an image, but rather, that it is expected to produce reasonable partitions ex ante based upon our analysis and assumptions above.

Minimum Required Structure

We can’t say in advance exactly how this algorithm will behave, and as a result, I’ve also included a test condition in the main loop of the algorithm that ensures that the resultant partition has a certain minimum amount of structure. That is, in addition to testing for the maximum change in entropy, the algorithm also ensures that the resultant partition has a certain minimum amount of structure, as measured using the entropy of the partition.

In particular, the minimum entropy required is given by,

$H_{min} = (1 - symm)log(numitems)$,

where symm is the same measure of symmetry discussed above. This minimum is enforced by simply having the loop terminate the moment the entropy of the partition tests below the minimum required entropy.

Note that as the number of items in our data set increases, the maximum possible entropy of the partition, given by log(numitems), increases as well. Further, as our categories increase in size, and decrease in number, the entropy of the resultant partition will decrease. If symm is low (i.e., close to 0), then we have good reason to expect a partition that contains a large number of narrow categories, meaning that we shouldn’t allow the algorithm to generate a low entropy partition. If symm is high, then we can be more comfortable allowing the algorithm to run a bit longer, producing a lower entropy partition. See, “Image Recognition with No Prior Information” on my researchgate page for more on the theory underlying this algorithm, symm, and the Shannon entropy generally.

Applying the Algorithm to Data

Let’s consider the data sets generated by the following code:

for i = 1 : 25

endfor

for i = 26 : 50

endfor

This is of course just random data, but to give the data some intuitive appeal, we could assume that base = 4, adjustment = 3, spread = .1, and interpret the resulting data as heights measured in two populations of people: one population that is significantly shorter than average (i.e., around 4 feet tall), and another population that is significantly taller than average (i.e., around 7 feet tall). Note that in Octave, rand() produces a random number from 0 to 1. This is implemented by the attached generate_random_data script.

Each time we run this code, we’ll generate a data set of what we’re interpreting as heights, that we know to be comprised of two sets of numbers: one set clustered around 4, and the other set clustered around 7. If we run the categorization algorithm on the data (which I’ve attached as optimize_linear_categories), we’ll find that the average number of categories generated is around 2.7. As a result, the algorithm does a good job at distinguishing between the two sets of numbers that we know to be present.

Note that as we decrease the value of adjustment, we decrease the difference between the sets of numbers generated by the two loops in the code above. As a result, decreasing the value of adjustment blurs the line between the two categories of numbers. This is reflected in the results produced by the algorithm, as shown in the attached graph, in which the number of categories decreases exponentially as the value of adjustment goes from 0 to 3.

Similarly, as we increase spread, we increase the intersection between the two sets of numbers, thereby again decreasing the distinction between the two sets of numbers. However, the number of categories appears to grow linearly in this case as a function of spread over the interval .1 to .7, with adjustment fixed at 3.

As a general matter, these results demonstrate that the algorithm behaves in an intuitive manner, generating a small number of wide categories when appropriate, and a large number of narrow categories when appropriate.

Generalizing this algorithm to the n-dimensional case should be straightforward, and I’ll follow up with those scripts sometime over the next few days. Specifically, we can simply substitute the arithmetic difference between two data points with the norm of the difference between two n-dimensional vectors. Of course, some spaces, such as an RGB color space, might require non-Euclidean measures of distance. Nonetheless, the point remains, that the concepts presented above are general, and should not, as a general matter, depend upon the dimension of the data set.

The relevant Octave / Matlab scripts are available here:

generate_linear_categories

generate_random_data

left_right_delta

optimize_linear_categories

partition_vec

spec_log

test_entropy_vec

vector_entropy

## A Mathematical Theory of Partial Information

In Uncategorized on November 24, 2018 at 10:57 am

The fundamental observation underlying all of information theory is that probability and information are inextricably related to one another through Shannon’s celebrated equation,

I = log(1/p),

where I is the optimal code length for a signal with a probability of p. This equation in turn allows us to measure the information content of a wide variety of mathematical objects, regardless of whether or not they are actually sources that generate signals. For example, in the posts below, I’ve shown how this equation can be used to evaluate the information content of an image, a single color, a data set, and even a particle. In each of these instances, however, we evaluated the information content of a definite object, with known properties. In this post, I’ll discuss how we can measure the information content of a message that conveys partial information about an uncertain event, in short, answering the question of, “how much did I learn from that message?”

Exclusionary Messages

Let’s begin with a simple example. Consider a three-sided dice, with equally likely outcomes we’ll label A, B, and C. If we want to efficiently encode and record the outcomes of some number of throws, then Shannon’s equation above implies that we should assign a code of length log(3) bits to each of the three possible outcomes.

Now imagine that we have received information that guarantees that outcome A will not occur on the next throw. This is obviously a hypothetical, and in reality, it would generally not be possible to exclude outcomes with certainty in this manner, but let’s assume for the sake of illustration that an oracle has informed us that A will not occur on the next throw. Note that this doesn’t tell us what the next throw will be, since both B and C are still a possibility. It does, however, provide us with some information, since we know that the next throw will not be A.

Now imagine that, instead, our oracle told us that the next throw is certain to be A. That is, our oracle knows, somehow, that no matter what happens, the next throw will be A. In this case, we have specified a definite outcome. Moreover, this outcome has a known ex ante probability, which in turn implies that it has an information content. Specifically, in this case, the probability of A is 1/3, and its information content is log(3) bits. Since learning in advance that A will occur is not meaningfully distinguishable from actually observing A (at least for this purpose), upon learning that A will be the next outcome, we receive log(3) bits of information. As a result, we receive log(3) bits of information upon receipt of the message from our oracle, and learn nothing at all upon throwing the dice, since we already know the outcome of the next throw. Note that this doesn’t change the total amount of information observed, which is still log(3) bits. Instead, the message changes the timing of the receipt of that information. That is, either we receive log(3) bits from the oracle, or we observe log(3) bits upon throwing the dice. The only question is when we receive the information, not how much information we receive.

Now let’s return to the first case where our oracle told us that the outcome will definitely not be A. In this case, our oracle has excluded an event, rather than specified a particular event. As a result, we cannot associate this information with any single event in the event space, since two outcomes are still possible. We call this type of information an exclusionary message, since it excludes certain possibilities from the event space of an uncertain outcome. As a general matter, any information that reduces the set of possible outcomes can be viewed as an exclusionary message. For example, if instead our oracle said that the next outcome will be either B or C, then this message can be recharacterized as an exclusionary message conveying that A will not occur.

If we receive an exclusionary message that leaves only one possible outcome for a particular trial, then the information content of that message should equal the information content of observing the actual outcome itself. Returning to our example above, if our oracle tells us that B and C will not occur upon the next throw, then that leaves A as the only possible outcome. As a result, knowing that neither of B and C will occur should convey log(3) bits of information. Therefore, intuitively, we expect an exclusionary message that leaves more than one possible outcome remaining to convey less than log(3) bits of information. For example, if our oracle says that B will not occur, then that message should convey fewer than log(3) bits, since it conveys less information than knowing that neither of B and C will occur.

As a general matter, we assume that the information content of an exclusionary message that asserts the non-occurrence of an event that otherwise has a probability of p of occurring is given by,

I = log(1/(1-p)).

The intuition underlying this assumption is similar to that which underlies Shannon’s equation above. Specifically, an event with a high ex ante probability that fails to occur carries a great deal of surprisal, and therefore, a great deal of information. In contrast, a low probability event that fails to occur carries very little surprisal, and therefore, very little information. Note that, as a result, the information content of an exclusionary message will depend upon the ex ante probability for the event excluded by the message, which is something we will address again below when we consider messages that update probabilities, as opposed to exclude events.

Returning to our example, if our oracle informs us that A will not occur on the next throw, then that message conveys log(3/2) bits of information. Upon receipt of that message, the probabilities of B and C should be adjusted to the conditional probabilities generated by assuming that A will not occur. In this case, this implies that B and C each have a probability of 1/2. When we ultimately throw either a B or a C, the total information received from the message and observing the throw is log(3/2) + log(2) = log(3) bits. This is consistent with our observation above that the oracle does not change the total amount of information received, but instead, merely changes the timing of the receipt of that information.

If instead our oracle informs us that neither of A and B will occur, then that message conveys log(3) bits of information, the same amount of information that would be conveyed if our oracle told us that C will occur. This is consistent with our assumption that the amount of information contained in the message should increase as a function of the number of events that it excludes, since this will eventually lead to a single event being left as the only possible outcome.

If we receive two separate messages, one informing us of the non-occurrence of A, and then another message that informs us of the non-occurrence of B, both received prior to actually throwing the dice, then this again leaves C as the only possible outcome. But, we can still measure the information content of each message separately. Specifically, the first message asserts the non-occurrence of an event that has a probability of 1/3, and therefore, conveys log(3/2) bits of information. The second message, however, asserts the non-occurrence of an event that has a probability of 1/2 after receipt of the first message. That is, after the first message is received, the probabilities of the remaining outcomes are adjusted to the conditional probabilities generated by assuming that A does not occur. This implies that upon receipt of the second message, B and C each have a probability of 1/2. As a result, the second message conveys log(2) bits of information, since it excludes an outcome that has a probability of 1/2. Together, the two messages convey log(3/2) + log(2) = log(3) bits of information, which is consistent with our assumption that whether a message identifies a particular outcome, or excludes all outcomes but one, then the same amount of information should be conveyed in either case.

As a general matter, this approach ensures that the total information conveyed by exclusionary messages and through observation is always equal to the original ex ante information content of the outcome that is ultimately observed. As a result, this approach “conserves” information, and simply moves its arrival through time.

As a general matter, when we receive a message asserting the non-occurrence of an event with a probability of p*, then we’ll update the remaining probabilities to the conditional probabilities generated by assuming the non-occurrence of the event. This means all remaining probabilities will be divided by 1 – p*. Therefore, the total information conveyed by the message followed by the observation of an event with a probability of p is given by,

log(1/(1-p*)) + log((1-p*)/p) = log(1/p).

That is, the total information conveyed by an exclusionary message and any subsequent observation is always equal to the original ex ante information content of the observation.

Partial Information and Uncertainty

This approach also allows us to measure the information content of messages that don’t predict specific outcomes, but instead provide partial information about outcomes. For example, assume that we have a row of N boxes, each labelled 1 through N. Further, assume that exactly one of the boxes contains a pebble, but that we don’t know which box contains the pebble ex ante, and assume that each box is equally like to contain the pebble ex ante. Now assume that we receive a message that eliminates the possibility that box 1 contains the pebble. Because all boxes are equally likely to contain the pebble, the information content of that message is log(N/N-1). Now assume that we receive a series of messages, each eliminating one of the boxes from the set of boxes that could contain the pebble. The total information conveyed by these messages is given by,

log(N/(N-1)) + log((N-1)/(N-2)) + … + log(2) = log(N).

That is, a series of messages that gradually eliminate possible locations for the pebble conveys the same amount of information as actually observing the pebble. Note that simply opening a given box would constitute an exclusionary message, since it conveys information that will either reveal the location of the pebble, or eliminate the opened box from the set of possible locations for the pebble.

As a general matter, we can express the uncertainty as to the location of the pebble as follows:

U = Log(N) – I.

Messages that Update Probabilities

In the previous section, we considered messages that exclude outcomes from the event space of a probability distribution. In practice, information is likely to come in some form that changes our expectations as to the probability of an outcome, as opposed to eliminating an outcome as a possibility altogether. In the approach we developed above, we assumed that the information content of the message in question is determined by the probability of the event that it excluded. In this case, there is no event being excluded, but instead, a single probability being updated.

Let’s begin by continuing with our example above of the three-sided dice and assume that we receive a message that updates the probability of throwing an A from 1/3 to 1/4. Because the message conveys no information about the probabilities of B and C, let’s assume that their probabilities maintain the same proportion to each other. In this particular case, this implies that each of A and B have a probability of 3/8. Though its significance is not obvious, we can assume that the updated probabilities of A and B are conditional probabilities, specifically, the result of division by a probability, which in this case would be 8/9. That is, in our analysis above, we assumed that the remaining probabilities are adjusted by dividing by the probability that the excluded event does not occur. In this case, though there is no event being excluded, we can, nonetheless, algebraically solve for a probability, division by which, will generate the updated probabilities for the outcomes that were not the subject of the message.

Continuing with our example above, the message updating the probability of A from 1/3 to 1/4 would in this case have an information content of log(9/8). Upon observing either B or C after receipt of this message, the total information conveyed would be log(9/8) + log(8/3) = log(3). Note that information is not conserved if we were to subsequently observe A, but this is consistent with our analysis above, since throwing an A after receipt of an exclusionary message regarding A would imply that we’ve observed an infinite amount of information.

Interestingly, this approach implies the existence of a probability, the significance of which is not obvious. Specifically, if we receive a message with an information content of i, then since,

i = log(1/1 – p),

the probability associated with that information is given by,

$p = 1 - 1/2^i$.

This is the same form of probability we addressed in the post below, “Using Information Theory to Explain Color Perception”. In the analysis above, this was the probability of an event excluded by a message. If we assume that, similarly, in this case, this is the probability of some event that failed to occur, then the information content of the message would again increase as a function of surprisal, with high probability events that fail to occur carrying more information than low probability events.

We can still make use of this probability to inform our method, even though we don’t fully understand its significance. Specifically, this probability implies that messages that update probabilities always relate to probabilities that are reduced. That is, just like an exclusionary message eliminates an event from the outcome space, a message that updates a probability must always be interpreted as reducing the probabilities of some outcomes in the event space, meaning that the conditional probabilities of the outcomes that are not the subject of the message will be increased. Since we divide by 1 – p to generate those conditional probabilities, assuming that the conditional probabilities decrease implies that the probability 1 – p > 1, which in turn implies that p < 0. As a result, assuming that p is in fact a probability provides insight into our method, regardless of whether or not we fully understand the significance of the probability.

For example, if we receive a message that increases the probability of A to 2/3, then we would interpret that message as decreasing the probability of both B and C to 1/6. That is, we recharacterize the message so that the subject of the message is actually outcomes B and C. Recall that we determine p by looking to the conditional probabilities of the outcomes that are not the subject of the message, and so in this case, we have (1/3)/(1-p) = 2/3, which implies that 1 – p = 1/2. Therefore, the information content of the message is log(2), and upon observing an A, the total information received is log(2) + log(3/2) = log(3), i.e., the original ex ante information content of the outcome A.