A Mathematical Theory of Partial Information

The fundamental observation underlying all of information theory is that probability and information are inextricably related to one another through Shannon’s celebrated equation,

I = log(1/p),

where I is the optimal code length for a signal with a probability of p. This equation in turn allows us to measure the information content of a wide variety of mathematical objects, regardless of whether or not they are actually sources that generate signals. For example, in the posts below, I’ve shown how this equation can be used to evaluate the information content of an image, a single color, a data set, and even a particle. In each of these instances, however, we evaluated the information content of a definite object, with known properties. In this post, I’ll discuss how we can measure the information content of a message that conveys partial information about an uncertain event, in short, answering the question, “How much did I learn from that message?”
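
Before turning to partial information, it may help to see Shannon’s equation in code. Here’s a minimal Python sketch (the function name surprisal is just illustrative) that computes optimal code lengths in bits:

```python
import math

def surprisal(p):
    """Optimal code length, in bits, for an outcome with probability p."""
    return math.log2(1 / p)

print(surprisal(1/2))  # 1.0 bit, e.g., one side of a fair coin
print(surprisal(1/3))  # log(3) ~ 1.585 bits, e.g., one face of a fair three-sided die
```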

Exclusionary Messages

Let’s begin with a simple example. Consider a three-sided die, with equally likely outcomes we’ll label A, B, and C. If we want to efficiently encode and record the outcomes of some number of throws, then Shannon’s equation above implies that we should assign a code of length log(3) bits to each of the three possible outcomes.

Now imagine that we have received information that guarantees that outcome A will not occur on the next throw. This is obviously a hypothetical, and in reality, it would generally not be possible to exclude outcomes with certainty in this manner, but let’s assume for the sake of illustration that an oracle has informed us that A will not occur on the next throw. Note that this doesn’t tell us what the next throw will be, since both B and C are still possible. It does, however, provide us with some information, since we know that the next throw will not be A.

Now imagine that, instead, our oracle told us that the next throw is certain to be A. That is, our oracle knows, somehow, that no matter what happens, the next throw will be A. In this case, we have specified a definite outcome. Moreover, this outcome has a known ex ante probability, which in turn implies that it has an information content: the probability of A is 1/3, and its information content is log(3) bits. Since learning in advance that A will occur is not meaningfully distinguishable from actually observing A (at least for this purpose), we receive log(3) bits of information upon receipt of the message from our oracle, and learn nothing at all upon throwing the die, since we already know the outcome of the next throw. Note that this doesn’t change the total amount of information observed, which is still log(3) bits. Instead, the message changes the timing of the receipt of that information. That is, either we receive log(3) bits from the oracle, or we observe log(3) bits upon throwing the die. The only question is when we receive the information, not how much information we receive.

Now let’s return to the first case where our oracle told us that the outcome will definitely not be A. In this case, our oracle has excluded an event, rather than specified a particular event. As a result, we cannot associate this information with any single event in the event space, since two outcomes are still possible. We call this type of information an exclusionary message, since it excludes certain possibilities from the event space of an uncertain outcome. As a general matter, any information that reduces the set of possible outcomes can be viewed as an exclusionary message. For example, if instead our oracle said that the next outcome will be either B or C, then this message can be recharacterized as an exclusionary message conveying that A will not occur.

If we receive an exclusionary message that leaves only one possible outcome for a particular trial, then the information content of that message should equal the information content of observing the actual outcome itself. Returning to our example above, if our oracle tells us that B and C will not occur upon the next throw, then that leaves A as the only possible outcome. As a result, knowing that neither B nor C will occur should convey log(3) bits of information. Therefore, intuitively, we expect an exclusionary message that leaves more than one possible outcome remaining to convey less than log(3) bits of information. For example, if our oracle says only that B will not occur, then that message should convey fewer than log(3) bits, since it conveys less information than knowing that neither B nor C will occur.

As a general matter, we assume that the information content of an exclusionary message that asserts the non-occurrence of an event that otherwise has a probability of p of occurring is given by,

I = log(1/(1-p)).

The intuition underlying this assumption is similar to that which underlies Shannon’s equation above. Specifically, an event with a high ex ante probability that fails to occur carries a great deal of surprisal, and therefore, a great deal of information. In contrast, a low probability event that fails to occur carries very little surprisal, and therefore, very little information. Note that, as a result, the information content of an exclusionary message will depend upon the ex ante probability of the event excluded by the message, which is something we will address again below when we consider messages that update probabilities, as opposed to excluding events.

Returning to our example, if our oracle informs us that A will not occur on the next throw, then that message conveys log(3/2) bits of information. Upon receipt of that message, the probabilities of B and C should be adjusted to the conditional probabilities generated by assuming that A will not occur. In this case, this implies that B and C each have a probability of 1/2. When we ultimately throw either a B or a C, the total information received from the message and observing the throw is log(3/2) + log(2) = log(3) bits. This is consistent with our observation above that the oracle does not change the total amount of information received, but instead, merely changes the timing of the receipt of that information.

If instead our oracle informs us that neither A nor B will occur, then that message conveys log(3) bits of information, the same amount of information that would be conveyed if our oracle told us that C will occur. This is consistent with our expectation that the amount of information contained in the message should increase as a function of the number of events that it excludes, since excluding events eventually leaves a single event as the only possible outcome.

If we receive two separate messages, one informing us of the non-occurrence of A, and then another informing us of the non-occurrence of B, both received prior to actually throwing the die, then this again leaves C as the only possible outcome. But we can still measure the information content of each message separately. Specifically, the first message asserts the non-occurrence of an event that has a probability of 1/3, and therefore conveys log(3/2) bits of information. The second message, however, asserts the non-occurrence of an event that has a probability of 1/2 after receipt of the first message. That is, after the first message is received, the probabilities of the remaining outcomes are adjusted to the conditional probabilities generated by assuming that A does not occur, which implies that B and C each have a probability of 1/2. As a result, the second message conveys log(2) bits of information, since it excludes an outcome that has a probability of 1/2. Together, the two messages convey log(3/2) + log(2) = log(3) bits of information, which is consistent with our assumption that a message identifying a particular outcome and a set of messages excluding all outcomes but one should convey the same amount of information.
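
Here’s a minimal Python sketch of this bookkeeping (the helper name is just illustrative). It renormalizes the remaining probabilities after each exclusionary message and confirms that the two messages together convey log(3) bits:

```python
import math

def exclusion_info(probs, excluded):
    """Bits conveyed by a message excluding `excluded`, plus the
    renormalized conditional distribution over the remaining outcomes."""
    p = probs[excluded]
    remaining = {k: v / (1 - p) for k, v in probs.items() if k != excluded}
    return math.log2(1 / (1 - p)), remaining

probs = {"A": 1/3, "B": 1/3, "C": 1/3}
i1, probs = exclusion_info(probs, "A")  # log(3/2) ~ 0.585 bits
i2, probs = exclusion_info(probs, "B")  # log(2) = 1 bit
print(i1 + i2)                          # log(3) ~ 1.585 bits in total
```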

As a general matter, this approach ensures that the total information conveyed by exclusionary messages and through observation is always equal to the original ex ante information content of the outcome that is ultimately observed. As a result, this approach “conserves” information, and simply moves its arrival through time.

As a general matter, when we receive a message asserting the non-occurrence of an event with a probability of p*, then we’ll update the remaining probabilities to the conditional probabilities generated by assuming the non-occurrence of the event. This means all remaining probabilities will be divided by 1 – p*. Therefore, the total information conveyed by the message followed by the observation of an event with a probability of p is given by,

log(1/(1-p*)) + log((1-p*)/p) = log(1/p).

That is, the total information conveyed by an exclusionary message and any subsequent observation is always equal to the original ex ante information content of the observation.

Partial Information and Uncertainty

This approach also allows us to measure the information content of messages that don’t predict specific outcomes, but instead provide partial information about outcomes. For example, assume that we have a row of N boxes, each labelled 1 through N. Further, assume that exactly one of the boxes contains a pebble, but that we don’t know which box contains the pebble ex ante, and assume that each box is equally likely to contain the pebble ex ante. Now assume that we receive a message that eliminates the possibility that box 1 contains the pebble. Because all boxes are equally likely to contain the pebble, the information content of that message is log(N/(N-1)). Now assume that we receive a series of messages, each eliminating one of the boxes from the set of boxes that could contain the pebble. The total information conveyed by these messages is given by,

log(N/(N-1)) + log((N-1)/(N-2)) + … + log(2) = log(N).

That is, a series of messages that gradually eliminate possible locations for the pebble conveys the same amount of information as actually observing the pebble. Note that simply opening a given box would constitute an exclusionary message, since it conveys information that will either reveal the location of the pebble, or eliminate the opened box from the set of possible locations for the pebble.

As a general matter, we can express the uncertainty as to the location of the pebble as follows:

U = log(N) - I,

where I is the total information conveyed by the messages received up to that point.
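
Here’s a minimal sketch of this process for N = 8 boxes, tracking the cumulative information I and the remaining uncertainty U as boxes are eliminated one at a time (the variable names are just illustrative):

```python
import math

N = 8
remaining = N           # boxes that could still contain the pebble
I = 0.0                 # cumulative information received so far
for _ in range(N - 1):  # eliminate boxes until only one remains
    # The eliminated box holds the pebble with probability 1/remaining,
    # so the message conveys log(remaining/(remaining - 1)) bits.
    I += math.log2(remaining / (remaining - 1))
    remaining -= 1
    print(f"boxes left: {remaining}, I = {I:.3f}, U = {math.log2(N) - I:.3f}")
# The sum telescopes to I = log(N) = 3 bits, at which point U = 0.
```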

Messages that Update Probabilities

In the previous section, we considered messages that exclude outcomes from the event space of a probability distribution. In practice, information is likely to come in some form that changes our expectations as to the probability of an outcome, as opposed to eliminating an outcome as a possibility altogether. In the approach we developed above, we assumed that the information content of the message in question is determined by the probability of the event that it excluded. In this case, there is no event being excluded, but instead, a single probability being updated.

Let’s begin by continuing with our example above of the three-sided die and assume that we receive a message that updates the probability of throwing an A from 1/3 to 1/4. Because the message conveys no information about the probabilities of B and C, let’s assume that their probabilities maintain the same proportion to each other. In this particular case, this implies that B and C each have a probability of 3/8. Though its significance is not obvious, we can assume that the updated probabilities of B and C are conditional probabilities, specifically, the result of division by a probability, which in this case would be 8/9. That is, in our analysis above, we assumed that the remaining probabilities are adjusted by dividing by the probability that the excluded event does not occur. In this case, though there is no event being excluded, we can nonetheless algebraically solve for a probability, division by which will generate the updated probabilities for the outcomes that were not the subject of the message.

Continuing with our example above, the message updating the probability of A from 1/3 to 1/4 would in this case have an information content of log(9/8). Upon observing either B or C after receipt of this message, the total information conveyed would be log(9/8) + log(8/3) = log(3). Note that information is not conserved if we were to subsequently observe A, but this is consistent with our analysis above, since throwing an A after receipt of an exclusionary message regarding A would imply that we’ve observed an infinite amount of information.
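
This calculation can be expressed as a short sketch (the function name is just illustrative). Given the old and new probabilities of the message’s subject, we solve for the divisor 1 - p that maps the old probabilities of the other outcomes to their updated values, and take log(1/(1 - p)) as the information content of the message:

```python
import math

def update_info(q_old, q_new):
    """Bits conveyed by a message updating its subject's probability from
    q_old to q_new, with the other outcomes keeping their proportions."""
    one_minus_p = (1 - q_old) / (1 - q_new)  # divisor applied to the others
    return math.log2(1 / one_minus_p)

i = update_info(1/3, 1/4)   # log(9/8) ~ 0.170 bits
obs = math.log2(1 / (3/8))  # observing B (or C) at its updated probability 3/8
print(i + obs)              # log(3) ~ 1.585 bits: information is conserved
```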

Interestingly, this approach implies the existence of a probability, the significance of which is not obvious. Specifically, if we receive a message with an information content of i, then since,

i = log(1/(1 - p)),

the probability associated with that information is given by,

p = 1 - 1/2^i.

This is the same form of probability we addressed in the post below, “Using Information Theory to Explain Color Perception”. In the analysis above, this was the probability of an event excluded by a message. If we assume that, similarly, in this case, this is the probability of some event that failed to occur, then the information content of the message would again increase as a function of surprisal, with high probability events that fail to occur carrying more information than low probability events.

We can still make use of this probability to inform our method, even though we don’t fully understand its significance. Specifically, treating p as a genuine probability implies that messages that update probabilities must always relate to probabilities that are reduced. That is, just as an exclusionary message eliminates an event from the outcome space, a message that updates a probability must always be interpreted as reducing the probabilities of the outcomes that are its subject, meaning that the conditional probabilities of the outcomes that are not the subject of the message will be increased. To see why, suppose instead that those conditional probabilities decreased. Since we divide by 1 – p to generate them, this would require 1 – p > 1, which in turn implies that p < 0, which is impossible for a probability. As a result, assuming that p is in fact a probability constrains our method, regardless of whether or not we fully understand the significance of that probability.

For example, if we receive a message that increases the probability of A to 2/3, then we would interpret that message as decreasing the probability of both B and C to 1/6. That is, we recharacterize the message so that the subject of the message is actually outcomes B and C. Recall that we determine p by looking to the conditional probabilities of the outcomes that are not the subject of the message, and so in this case, we have (1/3)/(1-p) = 2/3, which implies that 1 – p = 1/2. Therefore, the information content of the message is log(2), and upon observing an A, the total information received is log(2) + log(3/2) = log(3), i.e., the original ex ante information content of the outcome A.
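
In code, the recharacterized message works out the same way; we treat A as an outcome that is not the subject of the message, and solve for the divisor that takes its probability from 1/3 to 2/3:

```python
import math

one_minus_p = (1/3) / (2/3)     # divisor taking A from 1/3 to 2/3, i.e., 1/2
i = math.log2(1 / one_minus_p)  # log(2) = 1 bit
obs = math.log2(1 / (2/3))      # observing A at its updated probability 2/3
print(i + obs)                  # log(3) ~ 1.585 bits, the ex ante content of A
```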

Using Information Theory to Explain Color Perception

RGB encoding is an incredibly useful representational schema that has helped facilitate the proliferation and current ubiquity of high quality digital images. Nonetheless, the color space generated by RGB vectors using its natural Euclidean boundaries as a subset of 3-space is not representative of how human beings actually perceive differences in color. We could of course chalk this up to subjectivity, or some biological processes that cause human beings to inaccurately distinguish between colors. Instead, I think that human beings perceive colors in a perfectly rational, efficient manner, and that information theory can help us understand why it is that human beings don’t perceive changes in color and luminosity linearly. Specifically, I believe that human beings naturally, and rationally, construct efficient codes for the colors they perceive, leading to changes in perceived colors that are logarithmic, rather than linear in nature.

The Information Content of a Color

The fundamental observation that underlies all of information theory is the following equation due to Claude Shannon:

I = \log(1/p),

where I is the optimal code length (measured in bits) of a signal with a probability of p. In this particular case, the ultimate goal is to take an RGB color vector, map it to a probability, and from that, generate a code length using the equation above that we will treat as the information content of the color represented by the RGB vector (see the posts below for more on this approach).

In previous posts, I used this method to measure the information content of a set of pixels, despite the fact that a set of pixels is not a source, but is instead a static artifact. Nonetheless, a set of pixels will have some distribution of colors that, while fixed, can be viewed as corresponding to a probability distribution. As a result, we can measure the information content of a group of pixels by treating the frequency of each color as a probability, and then calculating the entropy of the resulting probability distribution. But unlike a set of pixels, a single color has no distribution, and is instead, using RGB encoding, a single 3-vector of integers. As a result, there’s no obvious path to our desired probability.

Let’s begin by being even more abstract, and attempt to evaluate the information content of a single color channel in an RGB vector. This will be some integer x with a value from 0 to 255. If we’d like to make use of the equation above, we’ll have to associate x with some probability, and generate a code for it, the length of which we can treat as the information content of the color channel. One obvious approach would be to view x as a fraction of 255. This will produce a number p = x/255 that can be easily interpreted (at least mathematically) as a probability. But upon examination, we’ll see that this approach is unsatisfying from a theoretical perspective.

First, note that x is a measure of luminosity that we intend to map to a probability. Luminosity is an objective unit of measurement that does not depend upon its source. As a result, the probability we assign to x should similarly not depend upon the particular source we’re making use of, if we want our method to be objective. That is, a particular level of luminosity should be associated with the same probability regardless of the source that is generating it.

To make things less abstract, imagine that we had a series of 5 identical light bulbs bundled together. The more lights we have on, the greater the luminosity generated by the device. If we assign probabilities to these luminosity levels based upon the proportion of lights that are on, then using four out of five lights would correspond to a probability of 4/5. Our code length for that outcome would then be \log(5/4). Now imagine that we have a group of 10 identical light bulbs bundled together. On this device, using 4 of the lights produces a probability of 4/10, and a code length of \log(10/4). In both cases, the luminosity generated would be the same, since in both cases, exactly 4 lights are on. This implies that, as a general matter, if we use this approach, our code lengths will depend upon the particular light source used, and will, therefore, not be an objective mapping from luminosity to probability.
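
The source-dependence is easy to check directly; under this fraction-of-maximum approach, the same physical luminosity (4 bulbs lit) produces two different code lengths:

```python
import math

# Four bulbs lit on each device, but two different code lengths:
print(math.log2(5 / 4))   # 5-bulb device: ~0.322 bits
print(math.log2(10 / 4))  # 10-bulb device: ~1.322 bits
```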

Instead, what we’re really looking for is something akin to what Ray Solomonoff called the “universal prior”. That is, we’d like an objective ex ante probability that we can ascribe to a given luminosity without any context or other prior information. If we give this probability physical meaning, in this case, our probability would be an objective ex ante probability for seeing a particular luminosity. This might sound like a tall order, but it turns out that by using Shannon’s equation above, we can generate priors that are useful in the context of understanding how human beings perceive colors, even if we don’t believe that they are actually universal probabilities. In short, I think that the human brain uses universal priors because they work, not because they’re actually correct answers to cosmic questions like, “what’s the probability of seeing blue?”

Information and Luminosity

Thanks to modern physics, we know that light is quantized, and comes in little “chunks” called photons. This means that more luminous light sources literally generate more photons, and therefore, more information. For those that are interested, I’ve written a fairly in-depth study of this topic, and others, which you can find here:

A Computational Model of Time-Dilation

This does not, however, imply that perceived luminosity will vary in proportion to the actual luminosity of the source. Instead, I assume that perceived luminosity is proportional to the amount of information triggered by signals generated by our sense organs upon observing a light source. Under this view, what human beings perceive as luminosity is actually a measure of information content, and not a measure of luminosity itself. That is, one light appears to glow brighter than another because the brighter light triggers a signal in our sense organs that contains more information than the signal triggered by the darker light source. This implies that sense organs that detect light are designed to measure information, not luminosity. Moreover, this implies that what we perceive as light is a representation of the amount of information generated by a light source, and not the actual luminosity generated by the light source. In crude terms, human vision is a bit counter, not a light meter.

If this is true, then it’s not the number of photons generated by a light source that we perceive, but the number of bits required to represent the number of photons generated by the light source. We can make sense of this by assuming that our perceptions are representational, and that our brain has a code for “light”, and a code for luminosity. When we see a light with a given luminosity, our brain recognizes it as a light, and generates a code for “light”, and then attaches a code for the perceived level of luminosity, which then triggers a sense impression of a light with a particular brightness. This view assumes that what we experience as sight is quite literally a representation, triggered by codes that are generated by our senses.

This view might be philosophically troubling, but we’ll see shortly that it produces the right answers, so, as a pragmatist, I’m willing to consider the possibility that what I perceive is what Kant would call phenomena (i.e., a representation of a thing), and not noumena (i.e., the underlying thing itself). In some sense, this is trivially the case, since we interact with the world through our senses, and as a result, our senses are our only source of information about the external world. But when we measure luminosity with something other than our eyes, it turns out that there’s a systematic disconnect between perceived luminosity, and actual measured luminosity, suggesting that there is real physical meaning to Kant’s distinction between phenomena and noumena. That is, the fact that human beings perceive luminosity as a logarithmic function of actual measured luminosity suggests the possibility that, as a general matter, our perceptions are representational.

Leaving the philosophy, and returning to the mathematics, this view implies that the total signal information generated by our sense organs upon observing a light source with an actual luminosity of L should be given by,

i = \log(L) + C,

where C is a constant that represents the length of the code for perceiving light.

Note that, in this case, i is not the information content of a Shannon code. Instead, i is the raw signal information generated by our sense organs upon observing the light source itself. So let’s take things a step further, and assume that our sense organs are efficient, and make use of compression, which would mean that at some point along the set of processes that generate our perceptions, the initial code with a length of i is compressed. Further, for simplicity, let’s assume that it’s compressed using a Shannon code. In order to do so, we’ll need an ex ante probability. Again, we’re not suggesting that this probability is the “correct” probability of observing a particular signal, but rather, that it is a useful prior.

We can accomplish this by taking Shannon’s equation and solving for p. Specifically, we see that given an amount of information i, the probability associated with that information is given by,

p = 1/2^i.

Again, we can agonize over the physical meaning of this probability, and say that it’s the correct ex ante prior for observing a given amount of information i. This has some intuitive appeal, since intuition suggests that as a general matter, low information events should be far more likely than high information events, which could explain why we find unusual events exciting and interesting. But we don’t need to take it that far. Instead, we can assume that our sense organs make use of a universal prior of this sort because it’s useful as a practical matter, not because it’s actually correct as an objective matter.

In this case, note that p = 1/L (ignoring the constant term C), so plugging p back into Shannon’s original equation, we obtain the following:

I = \log(L).

In short, we end up at the same place. This suggests that whether or not our sense organs make use of signal compression, in the specific case of perceiving luminosity, we end up with a code whose length is given by the logarithm of the luminosity of the source, which is consistent with how human beings actually perceive luminosity.

In summation, human beings perceive luminosity logarithmically as a function of actual luminosity because what we are perceiving is a representation of the information content of a light source, not actual light, and not a representation of light.

If we take an RGB channel value as a proxy for luminosity, we can now finally express the information content of a single color channel value x as simply I = \log(x).
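
In code, this is a one-line mapping from a channel value to bits. As a sketch, I clamp the value at 1 to avoid log(0) for a fully dark channel; that convention is mine, not something derived above:

```python
import math

def channel_info(x):
    """Information content, in bits, of a single RGB channel value x."""
    return math.log2(max(x, 1))  # clamp at 1 to avoid log(0)

print(channel_info(128))  # 7.0 bits
print(channel_info(255))  # ~7.994 bits
```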

Color and Entropy

Luminosity is of course only one aspect of color, since colors also convey actual color information. To address this aspect, I assume that color itself is the consequence of the brain’s distinctions between the possible distributions of luminosity across different wavelengths of light. That is, a light source will have some total luminosity across all of the wavelengths that it generates, and the color ultimately perceived is the result of the distribution of that total luminosity among the individual wavelengths of light produced by the source. This is consistent with the fact that scalar multiples of a given RGB vector don’t change the color that is generated, but instead simply change the luminosity of the color.

Let’s consider the specific case of a given color vector (x y z). The total luminosity of the vector is simply L = x + y + z. As a result, we can construct a distribution of luminosity given by,

(p_1 p_2 p_3) = \frac{(x y z)}{L}.

We can then take the entropy of (p_1 p_2 p_3), which will give us a measure of the diffusion of the luminosity across the three channels. The maximum diffusion occurs when each channel has an equal amount of luminosity, which produces no color at all, only a scale from black to white. This suggests that color itself is the perceptual result of asymmetry in the distribution of information across wavelengths of light. In short, a high entropy color contains very little color information, and a low entropy color contains a lot of color information.
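
Here’s a minimal sketch of this calculation, which normalizes a color vector by its total luminosity and computes the entropy of the resulting distribution:

```python
import math

def color_entropy(rgb):
    """Entropy, in bits, of the luminosity distribution of an RGB vector."""
    L = sum(rgb)
    return -sum((c / L) * math.log2(c / L) for c in rgb if c > 0)

print(color_entropy((128, 128, 128)))  # log(3) ~ 1.585 bits: maximum diffusion, a gray
print(color_entropy((255, 0, 0)))      # 0.0 bits: all luminosity in one channel
```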

Comparing Colors Using Information Theory

Combining the information content of the color channels of an RGB vector, and the entropy of the vector, we can construct an overall measure of the difference between two colors (x y z) and (a b c) as follows:

\delta L = ||\log((x y z)) - \log((a b c))||,

and,

\delta H = (H_1 - H_2)^2 / \log(3)^2,

where H_1 and H_2 are the respective entropies of (x y z) and (a b c). That is, we take the logarithm of the color vectors component-wise, then take the norm of the difference, which gives us a measure of the difference in luminosity between the two colors. Then we take the difference between the entropies of the two color vectors, which gives us a measure of how different the two colors are in terms of how much color information they convey. Note that each of H_1 and H_2 is necessarily less than or equal to \log(3).

Though we can of course consider these two metrics separately, we can also combine them into a single metric that allows us to compare two colors:

\delta T = 50 \delta L (\delta H + 1).

Since the logarithmic scale creates small differences between colors, I’ve chosen a multiplier of 50 to make the differences more noticeable, but this is otherwise an arbitrary scalar.
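
Putting the pieces together, here’s a sketch of the full metric as defined above. I use base-2 logarithms, consistent with measuring information in bits; the clamp of channel values at 1 (to avoid log(0)) and the treatment of pure black as zero entropy are my own conventions:

```python
import math

def entropy(rgb):
    """Entropy of the luminosity distribution; pure black is treated as 0."""
    L = sum(rgb)
    if L == 0:
        return 0.0
    return -sum((c / L) * math.log2(c / L) for c in rgb if c > 0)

def delta_T(c1, c2, scale=50):
    """Overall information-theoretic difference between two RGB colors."""
    # Luminosity difference: norm of the channel-wise log difference.
    d_L = math.sqrt(sum((math.log2(max(a, 1)) - math.log2(max(b, 1))) ** 2
                        for a, b in zip(c1, c2)))
    # Color difference: squared entropy gap, normalized by log(3)^2.
    d_H = (entropy(c1) - entropy(c2)) ** 2 / math.log2(3) ** 2
    return scale * d_L * (d_H + 1)

print(delta_T((0, 0, 0), (0, 0, 26)))  # two nearby colors on the blue channel
```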


I’ve attached a color bar that begins with black (0 0 0), and increases linearly in increments of 25.5 along the blue channel, together with a graph that shows the difference between each pair of adjacent colors, as measured by \delta T. As you can see, the colors become more similar as we traverse the color bar from left to right, despite the fact that their distances in Euclidean space are constant.


I’ve also attached another color bar that iterates through eight colors, but in this case, the increment moves from one channel to another, as if we were counting from 0 to 7 in binary using three bits.

The actual colors used in this case are as follows:

0 0 0
0 0 128
0 128 0
0 128 128
128 0 0
128 0 128
128 128 0
128 128 128

As you can see, yet again, the measured differences in color reflect how human beings actually perceive changes in color. Note that the greatest difference is between a non-primary blue and primary red, which is intuitively correct both analytically (since there’s no intersection between the two colors) and visually (since there is a great deal of contrast between the two colors).
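
As a usage example, assuming the delta_T sketch above is in scope, the differences between adjacent pairs in this list can be computed as follows:

```python
colors = [(0, 0, 0), (0, 0, 128), (0, 128, 0), (0, 128, 128),
          (128, 0, 0), (128, 0, 128), (128, 128, 0), (128, 128, 128)]

# Difference between each pair of adjacent colors, using delta_T from above.
for c1, c2 in zip(colors, colors[1:]):
    print(c1, "->", c2, ":", round(delta_T(c1, c2), 2))
```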

Perception and Symbolic Computation

In short, if we assume that human beings perceive colors as encoded representations of underlying physical objects, then by applying principles of information theory, we can generate equations that accurately reflect how human beings actually perceive colors. Taken as a whole, this is rather remarkable, since it suggests that the human body might, at least in some cases, operate like an efficient processor of representational information. This implies that our perceptions are the result of symbolic computations that manipulate representations, rather than one-to-one approximations of external objects. In less abstract terms, this view suggests that the way we perceive objects is more like the abstract symbolic representation of the number π than a specific decimal approximation of the number represented by that symbol. In short, it could be that human beings not only think in terms of symbolic representations, but that we actually perceive in terms of symbolic representations. If correct, this suggests that attempts to mimic human intelligence should, at least in part, be focused on intelligent symbolic computation, and not just statistical techniques such as machine learning.