The Surprisal of a Distribution

The probability of throwing a billion heads in a row with a fair coin is no different from that of any other specific sequence of a billion coin tosses. If we measure the surprisal of observing a billion heads in a row using Shannon information, it's the same as for every other sequence, since all sequences are equally probable and together form a uniform distribution. At the same time, as a practical matter, observing a billion heads in a row is beyond surprising, and would probably be a story you tell for quite a while (putting the time constraints aside).
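To make the first point concrete, here is a minimal sketch of the Shannon surprisal of one specific sequence of fair coin tosses. The function name is mine, purely for illustration, and the calculation assumes independent tosses with probability 1/2 each.

```python
import math

def sequence_surprisal_bits(n_tosses: int) -> float:
    """Surprisal, in bits, of any one specific sequence of fair coin tosses."""
    # Every specific sequence has probability 2**-n, so -log2(2**-n) = n.
    # Working per toss avoids underflow when n is very large.
    bits_per_toss = -math.log2(0.5)  # exactly 1 bit for a fair coin
    return n_tosses * bits_per_toss

# The result depends only on the number of tosses, never on which
# sequence was observed: all heads scores the same as any other string.
print(sequence_surprisal_bits(10))
print(sequence_surprisal_bits(1_000_000_000))
```

Because the function never looks at the sequence itself, this measure cannot single out the all-heads string.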

If, however, we look at the distribution of heads and tails within each sequence of throws, then we get a very different measure of surprisal. Specifically, there is only one sequence that consists of only heads, and so the probability of that distribution is extremely low, in context, since nearly balanced distributions are each produced by an enormous number of sequences over a billion coin tosses. We can abstract further, and look at the structure of the distribution, which in this case consists entirely of a single outcome. There are exactly two distributions with this structure –

All heads and all tails.

We can abstract one more time, and look at the entropy of each distribution. And again, there are exactly two zero-entropy distributions –

All heads and all tails.

The surprisal of observing an enormous, uniform run of tosses, measured using any of these other probabilities, will be significant, since the probability will be very low, in context.
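The three views above can be sketched numerically. This is an illustration under my own assumptions, not a definitive treatment: the function names are hypothetical, the coin is fair, and I use a small number of tosses because computing binomial coefficients for a billion tosses is impractical.

```python
import math

def count_surprisal_bits(n: int, heads: int) -> float:
    """Surprisal of seeing exactly `heads` heads in n fair tosses, in any order."""
    # P(heads = k) = C(n, k) / 2**n, so surprisal = n - log2(C(n, k)).
    return n - math.log2(math.comb(n, heads))

def uniform_structure_surprisal_bits(n: int) -> float:
    """Surprisal of a single-outcome structure: all heads or all tails."""
    # Two sequences out of 2**n qualify, so P = 2 / 2**n and surprisal = n - 1.
    return n - 1.0

def empirical_entropy_bits(n: int, heads: int) -> float:
    """Shannon entropy of the observed heads/tails frequencies."""
    entropy = 0.0
    for count in (heads, n - heads):
        if count:  # a zero count contributes nothing
            p = count / n
            entropy -= p * math.log2(p)
    return entropy

n = 20
print(count_surprisal_bits(n, 20))         # all heads: the full n bits
print(count_surprisal_bits(n, 10))         # balanced count: only a few bits
print(uniform_structure_surprisal_bits(n)) # all heads or all tails
print(empirical_entropy_bits(n, 20), empirical_entropy_bits(n, 10))
```

The all-heads count carries the same number of bits as the sequence itself, since only one sequence produces it. What changes is the context: a balanced count carries only a few bits, so under these views the uniform run stands far apart from typical outcomes, and the gap grows with the number of tosses.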

This intuition is backed up by common sense –

The uniform string of a billion heads or tails is surprising because it has an objective property that is special and rare, even though every other string is generated with the same probability. You don’t notice the other strings in real life because they’re not special in this view, even though they all carry the same underlying probability.