Using Information Theory to Inform Belief

When multiple sources of information, or different experimental tests, produce different, possibly even superficially conflicting probabilities for the same event, the question arises as to how those probabilities should be treated. There’s no obvious objective solution to this problem, and as a practical matter, even a crude method, such as taking a simple average of the probabilities, will probably work just fine for something like determining the probabilities of a coin toss. However, when dealing with arbitrary data, perhaps even with no prior information about the mechanics that generated the data, such an approach is likely to eliminate too much information about the structure of the data itself, generating useless answers.

Any solution to this problem is really an answer to the question of how much weight we should ascribe to each probability. That is, whatever probability we ultimately arrive at as the “correct” answer can be expressed as a linear combination of the underlying probabilities, even if that’s not how we actually arrived at the correct answer. While this might seem like a trivial statement of algebra, it offers insight into the fundamental question that underlies any solution to this problem, which is, “how much attention should I pay to this piece of information?”

Shannon Coding

In 1948, Claude Shannon showed that there was a surprising connection between probability and information. Though he was arguably only trying to construct optimally short encodings for messages, he ended up uncovering a deep, and fundamental relationship between the nature of probability and information. To establish an intuition, consider the string x = (aaab), and assume that we want to encode x as a binary string. First we need to assign a binary code to each of a and b. Since a appears more often than b, if we want to minimize the length of our encoding of x, then we should assign a shorter code to a than we do to b. For example, if we signify the end of a binary code with a 1, we could assign the code 1 to a, and 01 to b. As such, our encoding of x would be 11101, and since x contains 4 characters, the average number of bits per character in our encoding of x is 5/4. Now consider the string x = (ab). In this case, there are no opportunities for this type of compression because all characters appear an equal number of times. The same would be true of x = (abcbca), or x = (qs441z1zsq), each of which has a uniform distribution of characters. In short, we can take advantage of the statistical structure of a string, assigning longer codes to characters that appear less often, and shorter codes to characters that appear more often. If all characters appear an equal number of times, then there are no opportunities for this type of compression.

Shannon showed that, in general, we can construct an optimally short encoding, without any loss of information, if we assign a code of length log(1/p) bits to a signal that has a probability of p. That is, if the frequency of a signal from a source is p, and we want to encode that source, then we should assign a code of length log(1/p) to the signal. It follows that if a source has N possible signals, and the probability of each signal is 1/N (i.e., the signals have a uniform distribution), then the expected code length of a signal generated by the source is log(N).
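To make the relationship concrete, here is a minimal Python sketch (the function name is mine, not from the original posts) that computes the optimal code length log(1/p) for each character of a string, using the character’s frequency within the string as its probability:

```python
from collections import Counter
from math import log2

def code_lengths(s):
    """Optimal (Shannon) code length, log2(1/p), for each character of s,
    where p is the character's frequency within s."""
    counts = Counter(s)
    n = len(s)
    return {c: log2(n / k) for c, k in counts.items()}

# For x = "aaab": p(a) = 3/4, so a gets a code of length log2(4/3) bits,
# while p(b) = 1/4 gives b a longer code of length log2(4) = 2 bits.
lengths = code_lengths("aaab")

# For a uniform string like "abcbca", every character has probability 1/3,
# so every code length is log2(3): no compression is available.
uniform = code_lengths("abcbca")
```

Note that log2(1/p) is generally not an integer; practical prefix codes must round to integer lengths, but Shannon’s bound gives the idealized target.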

Resource Management and Human Behavior

Even though it is ultimately a human being that will assign a code to a signal, and therefore ascribe information to the signal, if we take a step back, we see that, using an optimal encoding, a low probability event carries more information than a high probability event. Moreover, Shannon’s coding method is objectively optimal, in that it is not possible to construct a better encoding without losing information (if we’re considering only the distribution of the signals). That is, a source generates signals, not information, but once a human being observes a source, a representational system is required if that person wants to record, or communicate, the signals generated by that source, which will produce information. As a result, though a human being must take the active steps of observing a source, and then encoding its signals, there is an objective component to the statement that low probability events carry more information than high probability events. This is consistent with common sense, and possibly even how our brains filter information. Put crudely, as an example, high probability stories are generally considered boring (e.g., “I went to my local pub last night”), whereas extraordinary, and unlikely tales generally garner a great deal of attention and inquiry (e.g., “I met a celebrity at my local pub last night”).

You could take the view that this is just an oddity of human nature, and that people prefer unusual tales. Alternatively, you could also take the view that it is instead the consequence of something much more profound about how our brains filter out the information that constantly bombards our senses. Specifically, it seems possible that our decisions allocate resources to the experiences most likely to generate the lowest probability signals, thereby maximizing our information intake. Without spending too much time on the topic, this view could explain why human beings are generally drawn towards things and experiences that are presented as exclusive, rare, or unusual, and less enthusiastic about things and experiences that are presented as common, or mundane. In this view, pretensions and preferences are the product of a rational resource management system that allocates the senses towards experiences that are most likely to maximize the amount of information observed.

Information and Probability

Returning to the original problem above, the probability of an event should then be a function of the probabilities implied by a data set, and the importance of each data point in our data set. For simplicity, let’s assume that we have total confidence in the data from which we derive our probabilities. That is, we’re assuming our data is free from errors and reflects events that actually occurred, and were properly measured and recorded.

Let’s begin with the classic example of a coin toss, and assume that 5 people are all tossing perfectly identical coins, 25 times each, and recording the results. If one of those 5 people came back and said that they had tossed 25 heads in a row, we would probably view that as extraordinary, but possible. Considering this person’s data in isolation, the observed probability of heads would be 1, and the observed probability of tails would be 0. In contrast, our ex ante expectation, assuming the coins were expected to be fair, would have been that the probability of each of heads and tails is 1/2. Assuming that all of the other people come back with roughly equal probabilities for heads and tails, our outlier data set would assign a probability of 1 to an event that we assumed to have a probability of 1/2, and a probability of 0 to another event that we assumed to have a probability of 1/2.

Before we consider how much weight to ascribe to this “outlier” data, let’s consider another case that will inform our method. Now assume that instead, 4 out of the 5 people report roughly equal probabilities for heads and tails, whereas one person reports back that their coin landed on its side 13 times out of 25 tosses, and that the remaining tosses were split roughly equally between heads and tails. That is, rather than landing on heads or tails, the coin actually landed on its side 13 times. This is an event that is generally presumed to have a probability of 0, and ignored. So upon hearing this piece of information, we would probably be astonished (recall that we are not calling the data itself into question). That is, something truly incredible has happened, since an event that has a probability of effectively zero has in fact happened 13 times. Considering the data reported by this person in isolation, we’d have an event that generally has an ex ante probability of 0 being assigned an observed probability of about 1/2, and then two events that typically have an ex ante probability of 1/2 being assigned observed probabilities of about 1/4.

Now ask yourself, which is more remarkable:

(1) the first case, where the person reported that an event with an ex ante probability of 1/2 had an observed probability of 1; or

(2) the second case, where the person reported that an event that had an ex ante probability of 0 had an observed probability of 1/2?

Though the magnitude of the difference between the ex ante probability and the observed probability is the same in both cases, the clear answer is that the second case is far more surprising. This suggests an inherent asymmetry in how we respond to probabilities, one that is biased towards low probability events. That is, “surprisal” is not a function of just the difference between the expected probability and the observed probability, but also of the probability itself, with low probability events generating greater “surprisal”.

If we return to Shannon’s formula for the optimal code length for a signal, we see this same asymmetry, since low probability signals are assigned longer codes than high probability signals. This suggests the possibility that we can mirror surprisal in observations by weighting probabilities using the information content of the probability itself. That is, we can use the optimal code length for a signal with a given probability to inform our assessment as to how important the data underlying a particular reported probability is.

Under this view, the reports generated by the people flipping coins will contain different amounts of information. The more low probability events a given report contains, the greater the information contained in the report, implying that data that describes low probability events quite literally contains more information than data that describes high probability events. As a result, we assume that for a given event, the data underlying a low reported probability contains more information than the data underlying a high reported probability.

Information-Weighted Probabilities

Rather than simply weight the probabilities by their individual code lengths (which might work just fine in certain contexts), I’ve been making use of a slightly different technique in image recognition, which looks at how much the information associated with a given data point deviates from the average amount of information contained across all data points in the data set. Specifically, the algorithm I’ve been making use of partitions an image into equally sized regions, with the size of the region adjusted until the algorithm finds the point at which each region contains a maximally different amount of information. That is, for a given region size, the algorithm measures the average entropy of each region (by analyzing the color distribution of each region), and then calculates the standard deviation of the entropies over all regions. It keeps iterating through region sizes until it finds the size that maximizes the standard deviation of the entropies over all regions. This ends up working quite well at uncovering structure in arbitrary images with no prior information, with the theory being that if two regions within an image contain legitimately different features, then they should contain objectively different quantities of information. By maximizing the information contrast of an image, under this theory, we maximize the probability that each region contains a different feature than its neighboring regions. I then have a second algorithm, essentially an automaton, that “crawls” around the image and gathers up similar regions (see the paper entitled, “A New Model of Artificial Intelligence” for more on this approach).

By analogy, if we have a set of probabilities, {p_1, p_2, p_3, \ldots , p_n}, all purportedly informing us as to the probability of the same event, then we can assign weights to each probability that reflect how “unusual” the information content of the probability itself is in the context of the other probabilities. We begin by taking the logarithm of the reciprocal of each probability, log(1/p_i), which gives us the optimal code length for a signal with a probability of p_i. We’re not interested in encoding anything, but are instead using the code length associated with a probability to measure the information contained in the data that generated the probability in the first instance. This will produce a set of code lengths {l_1, l_2, l_3, \ldots , l_n}, one for each probability, with lower probabilities having longer code lengths than higher probabilities. We then calculate the average code length over the entire set, u. This is our average information content, from which we will determine how “unusual” a probability is, ultimately generating the weight we’ll assign to the probability. In short, if a probability was generated by an unusually large, or unusually small, amount of information relative to the average amount of information, then we’re going to assign it a larger weight.

Specifically, we calculate the weight for each probability as follows:

w_i = (l_i - u)^2 + 1.

That is, we take the squared deviation of each code length from the mean and add 1 to it, producing a set of weights that are greater than or equal to 1. This produces a probability given by the weighted sum,

p = \frac{1}{W} (w_1p_1 + w_2p_2 + \ldots + w_np_n),

where W is the sum over all weights.

This formulation gives greater weight to outlier probabilities than those that are closer to the average probability, but because we begin with the code length associated with each probability, and not the probability itself, we overweight outlier low probabilities more than outlier high probabilities. That is, this method reflects our observations above, that events that have lower probabilities generate more surprisal, and therefore, “get more attention” in this method. Put differently, this method balances both (1) how “unusual” a data point is compared to the rest of the data set, and (2) how much information went into generating the data point. Since lower probability events are generated by data that contains more information (which is admittedly an assumption of this model), unusual low probabilities are given more weight than unusual high probabilities, and average probabilities have the same, minimum weight.

For example, the information-weighted probability of the probabilities {0.8, 0.8, 0.05} is 0.47. In contrast, the simple average of those probabilities is 0.55.
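The full method can be sketched in a few lines of Python. Note that I’ve used base-2 logarithms here; the text above doesn’t fix the base of the logarithm, so the exact weighted value this sketch produces can differ from the figure quoted above:

```python
from math import log2

def info_weighted_probability(probs):
    """Weighted sum of probabilities, where each weight is the squared
    deviation of the probability's optimal code length from the mean
    code length, plus 1."""
    lengths = [log2(1 / p) for p in probs]           # l_i = log(1/p_i)
    u = sum(lengths) / len(lengths)                  # mean code length
    weights = [(l - u) ** 2 + 1 for l in lengths]    # w_i = (l_i - u)^2 + 1
    W = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / W

p = info_weighted_probability([0.8, 0.8, 0.05])
# The low-probability outlier 0.05 receives the largest weight, pulling
# the result well below the simple average of 0.55.
```

Because the weights depend only on deviations of code lengths from their mean, a set of identical probabilities simply returns that probability, with every weight equal to the minimum value of 1.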

Note that this method doesn’t require any ex ante probabilities at all, so the question of what “prior” probability to use is moot. That is, this method takes the data as it is, with no expectations at all, and generates a probability based upon the assumed information content of the data that generated each probability. As a result, this method could be useful for situations where no prior information about the data is available, and we are forced to make sense of the data “in a vacuum”.


Information Theory and Time-Dilation

All,

I’ve assembled all of my additional notes on time-dilation and information theory on my blog at researchgate here:

https://www.researchgate.net/project/Information-Theory-16

I’ve also attached pdf copies of the documents here.

Regards,

Charles

A Unified Model of the Gravitational, Electrostatic, and Magnetic Forces (43)

Non-Local Interactions, Quantum Spin, the Strong and Weak Forces, and Inertia (21)

CompModofTD5-1v2 (35)

Mass_Energy_and_Momentum

MomentumMagnetismWaves25


On_the_Value_of_Gamma


Fast Image Categorization Algorithm

Following up on previous posts (available on my research blog here) in which I presented algorithms rooted in information theory that identify structure in images, I’ve come up with a series of algorithms that together allow for very fast image categorization.

Using the algorithms presented in previous posts, the process starts with breaking an image into a set of equally sized rectangular regions. These regions are sized so as to maximize the standard deviation of the entropies of the regions. That is, we size the regions so that they contain maximally different amounts of information. This process alone uncovers quite a bit of structure within an image, and is extremely fast, generally taking only a few seconds on an ordinary laptop. See the previous posts for more on this process.
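As a rough Python sketch of this sizing step (a simplification of my own: a grayscale image and a fixed-bin histogram stand in for the color-distribution entropy the actual Octave scripts use):

```python
import numpy as np

def region_entropy(region, bins=16):
    """Shannon entropy of the region's intensity histogram."""
    hist, _ = np.histogram(region, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def best_region_size(img, sizes):
    """Choose the region size that maximizes the standard deviation of
    the per-region entropies, i.e., the 'information contrast'."""
    h, w = img.shape
    best_size, best_std = None, -1.0
    for s in sizes:
        entropies = [region_entropy(img[i:i + s, j:j + s])
                     for i in range(0, h - s + 1, s)
                     for j in range(0, w - s + 1, s)]
        std = float(np.std(entropies))
        if std > best_std:
            best_size, best_std = s, std
    return best_size, best_std
```

Given a grayscale image, something like best_region_size(img, range(8, 65, 8)) would return the candidate size at which the regions carry maximally different amounts of information.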

Next, the algorithm assembles these regions into larger features, using an automaton that “crawls” around the image, grouping similar regions together into larger contiguous features. The question arises as to how similar two regions need to be in order to be considered part of a larger contiguous feature. Previously, I answered this question by grouping regions that are within one standard deviation of each other (as measured using a metric I describe in previous posts). This method works very well in general, and is very fast. However, it misses smaller structures in more complex images, such as busy city streets.

So instead, this set of algorithms iterates through different multiples of the standard deviation, beginning with a very small fraction of one standard deviation, and iterating up to one standard deviation. The algorithm then chooses the multiple of the standard deviation for which the structure of the features begins to “crystallize”. That is, the algorithm stops partitioning the image at the point where the structure of the resulting features is most stable. The stability of the structure of the features is tested by measuring the stability of the entropy of the distribution of the features. That is, once the distribution of the features identified by the algorithms begins to stabilize, the algorithm stops partitioning the image.

For example, if between .5*std_dev and .6*std_dev, the number and distribution of features is roughly constant, and is the most stable when compared to all other intervals, then the algorithm will choose delta = .5*std_dev as the optimal value for purposes of testing how similar neighboring regions must be in order to be grouped into a single contiguous feature.

In general, the algorithm measures the difference between H(i), the entropy of the distribution of features using a delta of (i*alpha*std_dev), and H(i+1), the entropy of the distribution of features using a delta of ((i+1)*alpha*std_dev), and chooses the value of i that minimizes,

|H(i) - H(i + 1)|.
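In Python, the selection step might look like the following sketch, where features_for_delta is a hypothetical stand-in for the region-grouping automaton: given a delta, it returns the frequency of each feature found at that threshold:

```python
from math import log2

def distribution_entropy(counts):
    """Entropy of the distribution of features, given each feature's frequency."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def most_stable_delta(deltas, features_for_delta):
    """Choose the delta at which the feature distribution 'crystallizes':
    the one minimizing |H(i) - H(i+1)| across consecutive deltas."""
    H = [distribution_entropy(features_for_delta(d)) for d in deltas]
    diffs = [abs(H[i] - H[i + 1]) for i in range(len(H) - 1)]
    i_best = min(range(len(diffs)), key=diffs.__getitem__)
    return deltas[i_best]
```

Note that a single feature spanning the whole image corresponds to counts of [1], for which distribution_entropy returns 0, matching the zero-entropy limiting case discussed below.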

The theory underlying this approach is that when we don’t distinguish between any of the regions within an image (by using a delta of infinity), the information content of the resultant partition (i.e., the entropy of the distribution of features) is zero, since we are in this case treating the image as one giant feature. That is, the difference between any two regions is finite using any method of comparison, and so if delta is set to infinity, all of the regions will be indistinguishable, producing one feature: the entire image. This feature appears exactly once (since it is simply the entire image), meaning H = -1*log(1) = 0.

As we begin to distinguish between regions, by using a smaller delta, we’ll end up partitioning the image into more and more features. As a result, the entropy of the distribution of features will increase (since it increases as a function of the number of features), reaching the maximum at the point where we distinguish between each region within the image, thereby maximizing the number of features. Though there could be local maxima along the way, as a general matter, the entropy of the distribution of features will increase as we reduce delta, thereby distinguishing between more and more regions within the image, generating a finer partition of the image.

There should, however, be a point along the way at which we actually uncover the “real” structure of the image, which means that the partition should be roughly stable for some interval near that point. Put loosely, the real structure of the image should be “in focus” within some small window around the objectively correct partition.

These algorithms attempt to find that correct partition by choosing the value of delta for which the derivative of the entropy of the distribution of features, as a function of delta, is minimized. This is the point at which the structure of the features, and the number of the features, changes the least as we alter delta, and where we assume the structure of the image is most likely to be “in focus”.

I have attached examples showing the structure identified by these algorithms. The images attached to this post have been weighted by their S-score, so that the most notable features are the brightest, and the least notable the darkest. The S-score is, in short, a measure of how notable a region is in terms of its information content. The shading here “paints” a feature according to the average S-score over all regions within the feature.

Finally, once all of that pre-processing is sorted, another algorithm goes through the features that have been identified in one image, and compares them to the features identified in another, by using six statistical properties of the features, each of which is invariant under rotation:

(1) The partition score N, which is the point at which the “information contrast” of the feature is maximized. See previous posts for more on this.

(2) The standard deviation of the entropy of the feature, as calculated using the partition score N. This is a measure of how much variation there is in the statistical color structure of the feature. A “boring” feature will have a low score, and a “complex” feature that changes color distributions often will have a high score.

(3) The average entropy of the feature, as calculated using the partition score N. This is a measure of how evenly distributed the color information is, with features that have a very concentrated color distribution scoring low, and features that have a very wide distribution of colors scoring high.

(4), (5), and (6) The average colors across the R, G, and B channels of the feature. These are used to capture the “actual” average color of the feature.

The norm of the difference of the resultant 6-vectors for the two features being compared is then calculated, and averaged over all features, weighted by the importance of each feature, as measured using the “S-score” I presented in previous posts. The larger the resultant score, the less similar the two images are. Categorization can then be achieved by comparing the scores generated when one image is compared to a series of images from two or more different categories, allowing us to predict to which category the input image most likely belongs.
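As a sketch of this comparison step in Python (the dictionary keys, and the assumption that the features of the two images are already paired one-to-one, are mine; see the Octave scripts for the actual implementation):

```python
import math

def feature_vector(f):
    """The six rotation-invariant statistics: partition score N, the std
    and average of the feature's entropy, and the average R, G, B colors."""
    return (f["N"], f["ent_std"], f["ent_avg"],
            f["avg_r"], f["avg_g"], f["avg_b"])

def image_difference(features_a, features_b, s_scores):
    """S-score-weighted average of the norms of the differences between
    paired feature vectors; a larger score means less similar images."""
    total, weight = 0.0, 0.0
    for fa, fb, s in zip(features_a, features_b, s_scores):
        va, vb = feature_vector(fa), feature_vector(fb)
        total += s * math.dist(va, vb)   # Euclidean norm of the 6-vector difference
        weight += s
    return total / weight
```

Comparing an image to itself yields a score of 0, with the score growing as the paired features diverge in their six statistics.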

Pre-processing an image using the attached algorithms takes anywhere between a few seconds and a few minutes (on a very low-fi laptop), depending upon the size of the image, and its complexity.

Comparing two pre-processed images takes fractions of a second, making processing time for comparing pre-processed images generally negligible. This suggests that these algorithms would likely be able to process video in real, or close to real-time, on an industrial machine. This would allow, for example, live video to be compared to pre-processed databases of images, in order to determine the most likely subject matter of the features within the video.

The code should be called from the Octave command line as follows:

I = imread("[FILE PATH]");
[region_matrix, N, S_mat, feature_score_vector] = identify_features(I);
imshow(ent_contrast_darken(I, N, region_matrix, S_mat));

To compare two images that have been processed, use the following:

average_score = IMG_Comparison(region_matrix_1, N_1, S_mat_1, feature_score_vector_1, region_matrix_2, N_2, S_mat_2, feature_score_vector_2)

Octave Scripts

ent_contrast_darken

extract_region

gather_regions_delta

generate_weight_matrix

identify_features

IMG_Comparison

local_test