On the Probability of Independent Variables

Imagine you have N lightbulbs, and each lightbulb can be either on or off. Further, posit that you don’t know what it is that causes any of the lightbulbs to be on or off. For example, you don’t know whether there’s a single switch, or N switches, and moreover, we leave open the possibility that there’s a logic that determines the state of a given bulb, assuming the state of all other bulbs. We can express this formally, as a rule of implication F from subset C \subset S to S, such that given any C, the state of S is determined.

This is awfully formal, but it’s worth it, as you can now describe what you know to be true: there’s basically no way there are N switches for any N lightbulbs in the same room.

The rule of implication F is by definition a graph in that if the state of lightbulb A implies the state of lightbulb B, then you can draw an edge from A to B. There is only one graph that has no rule of implication, i.e., the empty graph. As a consequence, the least likely outcome, given no information ex ante, is that there is no correlation between the states of the lightbulbs. This would require N independent switches.

As a general matter, given N variables, the least likely outcome is that either all variables are perfectly independent, or all variables are perfectly dependent, since there is only one empty graph, and only one complete graph. It is therefore more reasonable to assume some correlation between a set of variables, than not, given no information ex ante. This is similar to Ramsey’s theorem on friends and strangers, except it’s probabilistic, and not limited to a single size graph, and is instead true in general.

More generally, this implies that the larger a set of random elements is, the less likely it is the elements are independent of each other. To test this empirically, I generated a dataset of 1,000 random vectors, with 1,000 randomly generated classifiers, 1, 2, 3, and 4, and tried to predict the classifier of each vector. I ran Black Tree‘s size-based confidence prediction algorithm on this dataset, which should produce an accuracy of 25%, since these are random vectors, unless something unusual is going on. Unfortunately, this is not included in the free version of Black Tree, but you can do something similar, since it pulls clusters near a given vector and then uses the modal class of the cluster as the predicted class of the given vector. The bigger the cluster, the higher the confidence metric. However, the bigger the cluster, the less likely the vectors in the cluster are to be independent of each other, according to this analysis. And as it turns out, the accuracy actually increases as a function of cluster size, starting out around 27%, and peaking around 70%. The average accuracy for the 90th percentile confidence is about 34%. This makes no sense, in the absence of a theory like the one above. Here’s the code I used to generate the dataset:

clear all

num_rows = 1000;
N = 1000;

dataset = 10*rand(num_rows,N);
dataset(:,N+1) = randi([1 4],[num_rows 1]);

file = “/Users/charlesdavi/Desktop/random_vectors.txt”;

Here are the results, with accuracy plotted as a function of confidence, together with the distribution of accuracies at 90% confidence. I ran this algorithm 5,000 times, over 5,000 random subsets of this dataset, and this is the result over all iterations, so it cannot be a fluke outcome. It seems instead the theory is true empirically.


Note that this is not causation. It is instead best thought of as a fact of reality, just like Ramsey’s friends and strangers theorem. Specifically, in this case, it means that information about some of the predictions is almost certainly in their related clusters, because the vectors in the cluster are so unlikely to be independent of one another, despite the fact they were randomly generated. In fact, because they were randomly generated, and not deliberately made to be independent of one another, the more likely outcome is that the vectors are not independent. Therefore, we can use the classifiers of the vectors in the cluster to predict the classifier of the input vector. The larger the cluster, the less likely the vectors are to be mutually independent, resulting in accuracy increasing as a function of cluster size.

As a general matter, if you deliberately cause a set of systems to be totally independent or totally dependent, then those systems are of course not random. If however, you allow the systems to be randomly generated, then the most likely case is that they’re partially dependent upon each other. This is most certainly counterintuitive, but it makes perfect sense, since allowing a random process to control the generation of the systems causes you to lose control of their relationships, and the most likely outcome is that their relationships are dependent upon one another.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s