# A Note on Absolute Context

I’m in the middle of writing a formal paper and related software on automated hypothesis testing, but I thought I’d outline the basic idea: information theory can be used to provide an absolute context in which error is evaluated, letting us determine whether a hypothesis is exact, imprecise, or simply incorrect. I explained the core of it on Twitter last week: every observation carries a certain amount of information, and because length itself can be associated with information, we can measure the net information of a hypothesis as the difference between (x) the information content of an observation and (y) the information content of the error between the observation and the hypothesis.

## Error, Length, and Information

I’ll present a fuller discussion of this topic in a formal research paper I’m working on, but in short, any physical system can be used to store information, so long as you can identify and control its states. As a result, a physical system that can be in any one of $N$ states can be used to represent $N$ characters, or any other set of $N$ distinct elements. As a practical example, just imagine a dial that you rotate, that has $N$ settings. This is a physical system that can be in any one of $N$ states, and therefore has the same storage capacity as $\log(N)$ binary bits. The same is true of a shoe box: either the lid is on, or it isn’t, and as a result, a shoe box is a physical system that has two states, and can store $\log(2) = 1$ bit of information. This is not how normal people think about the world, but it is how an information theorist should, in my opinion.
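As a quick sketch of the state-counting rule above (in Python, purely illustrative):

```python
from math import log2

def capacity_bits(num_states):
    """Information capacity, in bits, of a physical system that can be
    in any one of num_states distinguishable states."""
    return log2(num_states)

# A shoe box (lid on / lid off) has two states: one bit.
print(capacity_bits(2))   # 1.0
# A dial with 16 settings has the capacity of four binary bits.
print(capacity_bits(16))  # 4.0
```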

Now consider a line with a point upon it at some location. Each possible location for the point will represent a unique state of this system. And of course, the longer the line is, the more possible locations there will be for the point, producing a system that has a larger number of states as the length of the line increases. Again, this is not how normal people think about lines, since they’re primarily conceptual devices imposed upon empty spaces, or objects, and not technically “real”. But, if you had, for example, some string, and a piece of glitter, you could use this system to store information. As a practical matter, if the string is unmarked, it will be hard for you to tell the difference between the possible locations for the speck of glitter along its length. To remedy this, you could mark the string with equally spaced lines that indicate the beginning of one segment and the end of another. Though this is decidedly primitive, a string with $N$ etchings on it together with a piece of glitter can literally store $\log(N)$ bits of information.

As a practical and theoretical matter, there will be some minimum length, below which, you cannot measure distance. That is, your vision is only so good, and even if you have the assistance of a machine, its resolution will still be finite. Therefore, in all circumstances, there will be some minimum length $\delta$ that is the minimum segment size into which you can divide any length. As a result, when presented with a line of length $l$, the maximum number of possible discernible locations along its length upon which you could place an object is given by,

$N = \lceil \frac{l}{\delta} \rceil.$
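A minimal sketch of this count, in Python (the length and resolution values here are made up for illustration):

```python
from math import ceil

def discernible_locations(length, delta):
    """N = ceil(l / delta): the number of distinguishable positions on
    a line of the given length, where delta is the smallest measurable
    segment."""
    return ceil(length / delta)

# A line of 1000 units read at a resolution of 3 units:
print(discernible_locations(1000, 3))  # 334
```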

Similarly, there will be some practical limit to the number of objects $M$ that you can place upon any given line, and as a result, the maximum number of states that can be generated by placing objects upon any given line is $N^M$. Therefore, the information capacity of a system comprised of $M$ points along a line is given by,

$\log(N^M) = M\log(N) \approx M(\log(l) - \log(\delta)).$
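This capacity formula can be checked numerically; here is a small Python sketch, with illustrative values, assuming base-2 logarithms as in the rest of this note:

```python
from math import ceil, log2

def line_capacity_bits(length, delta, num_objects):
    """Bits stored by placing num_objects objects on a line:
    log2(N^M) = M * log2(N), with N = ceil(length / delta)."""
    n = ceil(length / delta)
    return num_objects * log2(n)

# One object among 1024 discernible positions stores 10 bits;
print(line_capacity_bits(1024, 1, 1))  # 10.0
# three objects store three times as much.
print(line_capacity_bits(1024, 1, 3))  # 30.0
```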

For any given observer in a particular context, the values of $\delta$ and $M$ will be fixed. As a result, the information capacity of a line is really a function of the length of the line, since the other two parameters will be fixed, both as a practical and theoretical matter. This equation is supported by the fact that you can encode any vector with a norm of $l = ||v||$ using $\log(l) + C$ bits. For example, if we want to express a vector in two-space, we need to specify only the length of the vector, which requires $\log(l)$ bits, and the angle the vector forms with either the horizontal or vertical axis. Because the angle does not change as a function of the norm of the vector, it requires a constant amount of information to represent, resulting in a total number of bits given by $\log(l) + C$ bits. Though this is the amount of information required to represent a vector, which is distinct from the amount of information that a vector can store, in reality, these two numbers are the same for any physical system, if we don’t make use of compression in the representation.
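A rough Python sketch of this encoding argument; the 16-bit angle budget below is an assumption of mine, standing in for the constant $C$:

```python
from math import hypot, log2

def polar_encoding_bits(x, y, angle_bits=16):
    """Approximate bits to encode a 2-D vector as (norm, angle):
    log2 of the norm for the length, plus a fixed angular resolution
    (angle_bits, a stand-in for the constant C)."""
    return log2(hypot(x, y)) + angle_bits

# The angle term is constant, so doubling the norm adds one bit:
print(polar_encoding_bits(3, 4))
print(polar_encoding_bits(6, 8))
```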

As a result, every error, which will be the norm of the difference between two vectors $\epsilon = ||x - y||$, will be associated with an amount of information given by,

$I_{\epsilon} = \log(||x - y||).$
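In Python, with made-up observation and hypothesis vectors:

```python
from math import log2

def error_information(x, y):
    """I_eps = log2(||x - y||): bits associated with the error between
    observation x and hypothesis y (Euclidean norm of the difference)."""
    eps = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
    return log2(eps)

# Observation (50, 50, 50) against hypothesis (30, 30, 30):
print(error_information((50, 50, 50), (30, 30, 30)))
```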

To make things more concrete, let’s consider an example involving a given observed level of RGB luminosity $x = (L,L,L)$. Now let’s suppose we are told by a third party that the level of luminosity is hypothesized to be $y = (H,H,H)$. The error in this case is given by,

$\epsilon = ||x - y||,$

and therefore, the information associated with the error is $\log(||x - y||).$ The net information of the hypothesis is given by,

$h = \log(||x||) - \log(||x - y||).$

I’ve chosen $h$ in part because capital $H$ is associated with information through Shannon’s equation for entropy. This equation allows us to distinguish between exact, imprecise, and incorrect answers, since $h$ is either equal to $\log(||x||)$ (exact), positive but less than $\log(||x||)$ (imprecise), or negative (incorrect). As a simple example, the following Octave code takes a given observed luminosity as input, and iterates through increasing hypothetical luminosities from $(0,0,0)$ until it produces an incorrect hypothesis, using the criterion above.

```octave
% Observed luminosity: a gray RGB triple (L, L, L).
initial_lum = 50;
obs_color = ones(1, 3) * initial_lum;

% Hypothesized luminosity, starting from (0, 0, 0).
hyp_lum = 0;
hyp_color = [hyp_lum hyp_lum hyp_lum];

h = 0;

% Information content of the observation: log2(||x||).
inf_cont = log2(norm(obs_color));

% Increment the hypothesized luminosity until the net information
% h = log2(||x||) - log2(||x - y||) turns negative, i.e., until the
% hypothesis becomes incorrect.
while (h >= 0)

  hyp_lum = hyp_lum + 1;
  hyp_color = [hyp_lum hyp_lum hyp_lum];

  error = norm(obs_color - hyp_color)

  h = inf_cont - log2(error)

endwhile

% display_color is a helper (not shown here) that renders an RGB
% triple as an image.
figure, imshow(display_color(obs_color))
figure, imshow(display_color(hyp_color))
```
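For readers without Octave, the same classification rule can be sketched in Python; the function name is mine, and the boundary case $h = 0$ is treated as imprecise to match the loop condition above:

```python
from math import log2

def classify_hypothesis(obs, hyp):
    """Net information h = log2(||x||) - log2(||x - y||), plus the
    resulting label: exact, imprecise, or incorrect."""
    def norm(v):
        return sum(c * c for c in v) ** 0.5
    err = norm([a - b for a, b in zip(obs, hyp)])
    if err == 0:
        # Zero error: h attains its maximum, log2(||x||).
        return log2(norm(obs)), "exact"
    h = log2(norm(obs)) - log2(err)
    return h, ("imprecise" if h >= 0 else "incorrect")

obs = (50, 50, 50)
print(classify_hypothesis(obs, (50, 50, 50))[1])     # exact
print(classify_hypothesis(obs, (49, 49, 49))[1])     # imprecise
print(classify_hypothesis(obs, (150, 150, 150))[1])  # incorrect
```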