Clustering Mass-Scale Datasets
In previous articles, I’ve introduced algorithms that can quickly process datasets consisting of tens of millions of vectors on a consumer device, allowing for meaningful analysis of the microstates of a thermodynamic system.
In the original article on the topic, I showed how my deep learning algorithms can be used to identify the macrostates of a thermodynamic system, given its microstates.
In a follow-up article, I showed how to radically compress a dataset consisting of tens of millions of vectors into a dataset of just a few thousand vectors, ultimately using this technique to make radically efficient predictions (about 500 predictions per second, running on an iMac), with perfect accuracy, classifying the rate of expansion of a gas.
In this article, I’ll present a method for measuring how ordered a thermodynamic system is, using a combination of techniques from deep learning and information theory. Specifically, I’ll demonstrate how these techniques can correctly identify the fact that an expanding gas has a progression of states, and therefore a measurable mathematical order as it expands, whereas a stationary gas is essentially unordered, in that any of its microstates could in theory appear at any point in time.
Order, Entropy, and Variance
Imagine repeatedly watching the same gas expand, from roughly the same initial volume, to roughly the same final volume, with the number of particles being fixed over each instance of expansion. Further, assume that you took some fixed number of snapshots of the system as it progressed from compressed to expanded, at roughly the same relative moments in time.
Now assume you ran a clustering algorithm on the individual states, as observed. Intuitively, you would expect all of the final, most expanded states to get clustered together, since they are the most similar, structurally. Similarly, you would expect all the initial, compressed states to get clustered together, again, because they are the most similar, structurally.
Now, rather than using typical classifiers, let’s put the index at which the state occurs in a hidden dimension. So, for example, the first state of an instance of the gas expanding will have a classifier of 1, the second state a classifier of 2, and so on. Since each state of the gas will have a classifier that tells us when it occurred, this will create a distribution of timestamps for each cluster, and if the sequence of observations is truly ordered in terms of timing, then all of the states clustered together should occur at exactly the same timestamp (assuming a relative timestamp, expressed as a length of time since inception).
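To make the labeling scheme concrete, here is a minimal Python sketch using NumPy. The run counts, step counts, and particle counts are arbitrary assumptions for illustration, and the snapshots are random placeholders rather than real particle data:

```python
import numpy as np

# Hypothetical setup: num_runs instances of the gas expanding, each observed
# at num_steps evenly spaced moments; each snapshot is a (num_particles, 3)
# array of particle positions (random placeholders here).
num_runs, num_steps, num_particles = 3, 5, 100
rng = np.random.default_rng(0)

snapshots = []   # every observed state, across all runs
timestamps = []  # integer index of each state within its run: 1, 2, ..., num_steps
for run in range(num_runs):
    for step in range(1, num_steps + 1):
        snapshots.append(rng.random((num_particles, 3)))
        timestamps.append(step)

timestamps = np.array(timestamps)
```

Because every run contributes exactly one state per index, each integer timestamp appears `num_runs` times in the full dataset.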
So, for example, in a truly ordered system, all of the initial states would be clustered together, with no other states; all of the secondary states would be clustered together, with no other states; and so on. This is probably not going to work perfectly in the case of a thermodynamic system, since the motions of the particles are highly randomized, but nonetheless, intuitively, you’d expect a stationary gas to be less ordered than an expanding gas: a stationary gas doesn’t exhibit any noticeable macroscopic change, whereas an expanding gas does, since the volume occupied by the gas is increasing.
We can, therefore, use entropy to measure the consistency of the timestamps within a cluster, since the timestamps will literally generate a distribution. For example, assume that an ideally expanding gas is such that all of its initial states get clustered together, with no other states. The timestamps associated with this cluster would look something like (1, 1, …, 1), if we use simple integer timestamps. This distribution will have an entropy of zero, which indicates perfect consistency. In contrast, if a cluster is composed of similar states that nonetheless occur at totally different points in the sequence, then its entropy will be non-zero.
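As a sketch of this measurement (in Python with NumPy; the function name is my own), the Shannon entropy of a cluster’s empirical timestamp distribution is zero exactly when all of its members share a single timestamp:

```python
import numpy as np

def timestamp_entropy(timestamps):
    """Shannon entropy (in bits) of the empirical timestamp distribution."""
    _, counts = np.unique(timestamps, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# A perfectly consistent cluster: every member is an initial state.
assert timestamp_entropy([1, 1, 1, 1]) == 0.0

# A cluster mixing states from four different moments: maximal inconsistency.
print(timestamp_entropy([1, 2, 3, 4]))  # 2.0 bits
```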
For these same reasons, we can also use the standard deviation to measure the consistency of the timestamps associated with a cluster. Returning to the example of the ideally expanding gas, the distribution of timestamps will have a standard deviation of zero, whereas a cluster with a diverse set of timestamps will obviously have a non-zero standard deviation.
Together, these two measures allow us to quantify the extent to which a system is ordered, from two different perspectives:
(1) The entropy allows us to measure the multiplicity of outcomes possible;
(2) The standard deviation allows us to measure the variance of when those outcomes occur in time.
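Putting the two measures together, here is a minimal Python sketch (assuming NumPy; the function names are my own) that averages the per-cluster entropy and standard deviation over all clusters, mirroring the per-cluster averages reported later in the article:

```python
import numpy as np

def timestamp_entropy(ts):
    """Shannon entropy (in bits) of the empirical timestamp distribution."""
    _, counts = np.unique(ts, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def order_measures(labels, timestamps):
    """Average entropy and average standard deviation of the timestamps,
    taken over all clusters. labels[i] is the cluster assigned to state i;
    timestamps[i] is the integer index of state i in its sequence."""
    labels = np.asarray(labels)
    timestamps = np.asarray(timestamps, dtype=float)
    entropies, stds = [], []
    for c in np.unique(labels):
        ts = timestamps[labels == c]
        entropies.append(timestamp_entropy(ts))
        stds.append(float(ts.std()))
    return float(np.mean(entropies)), float(np.mean(stds))

# Perfectly ordered: cluster k contains exactly the two states at time k.
assert order_measures([0, 1, 2, 0, 1, 2], [1, 2, 3, 1, 2, 3]) == (0.0, 0.0)

# Disordered: each cluster mixes states from two different times.
print(order_measures([0, 0, 1, 1], [1, 2, 1, 2]))  # (1.0, 0.5)
```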
Application to Examples in Thermodynamics
As a demonstration, I’ve put together some code attached below that applies these ideas to two datasets:
(A) one is a stationary gas in a fixed volume, represented by 75,000,000 Euclidean three-vectors;
(B) the other is a gas in an expanding volume, also represented by 75,000,000 three-vectors.
The thesis above would imply that the expanding gas is more ordered than the stationary gas, and this is in fact the case.
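The real datasets aren’t reproduced here, but as an illustration of their shape, here is a small synthetic stand-in in Python (NumPy); the sizes and volumes are arbitrary assumptions, far smaller than the 75,000,000-vector datasets used in the article:

```python
import numpy as np

rng = np.random.default_rng(1)
num_steps, particles_per_state = 10, 1000

# (A) Stationary gas: every snapshot drawn uniformly from the same fixed box.
stationary = [rng.uniform(-1.0, 1.0, (particles_per_state, 3))
              for _ in range(num_steps)]

# (B) Expanding gas: the box containing the particles grows at each step.
expanding = [rng.uniform(-r, r, (particles_per_state, 3))
             for r in np.linspace(0.1, 1.0, num_steps)]
```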
Beginning with the stationary gas, the first step is to cluster the dataset, using specialized algorithms that I’ve developed to handle enormous datasets of this type.
This takes about 15 minutes, running on an iMac.
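My specialized clustering algorithms are in the archive linked below; as an illustrative stand-in only, here is a minimal k-means in Python (NumPy) that summarizes each state by a single assumed feature, the mean distance of its particles from the origin, which tracks the volume the gas occupies:

```python
import numpy as np

def cluster_states(states, k, iters=50, seed=0):
    """Minimal 1-D k-means over per-state features; an illustrative stand-in
    for the large-scale clustering algorithms used in the article. Each state
    (an (n, 3) array of positions) is summarized by the mean distance of its
    particles from the origin."""
    feats = np.array([[np.sqrt((s ** 2).sum(axis=1)).mean()] for s in states])
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.abs(feats - centers.T), axis=1)
        for j in range(k):
            if (labels == j).any():  # leave empty clusters where they are
                centers[j] = feats[labels == j].mean(axis=0)
    return labels

# States with clearly different spreads get clustered by spread.
states = [np.full((10, 3), v) for v in [0.1, 0.1, 5.0, 5.0, 10.0, 10.0]]
labels = cluster_states(states, k=3)
assert labels[0] == labels[1] and labels[4] == labels[5]
assert labels[0] != labels[4]
```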
Again, rather than attempting to classify the data, we are testing for how consistent its positions in time are, so the classifier dimensions are instead integer indexes that tell us at what point in the sequence the state occurred. You could argue that the integers are arbitrary, but this is actually incorrect: using an argument similar to the one in my article on the connections between information and error, you can show that so long as your observations are evenly spaced in time, the actual distance in time between observations does not matter for this purpose.
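A quick numerical check of this point (Python with NumPy; the `timestamp_entropy` helper is my own, defined inline): the entropy depends only on which observations coincide in time, and uniformly rescaling evenly spaced timestamps just scales the standard deviation, without reordering any comparison between clusters:

```python
import numpy as np

def timestamp_entropy(ts):
    """Shannon entropy (in bits) of the empirical timestamp distribution."""
    _, counts = np.unique(ts, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Integer indexes versus "real" evenly spaced times (say, every 0.25 seconds).
indexes = np.array([1, 2, 2, 3, 4])
seconds = 0.25 * indexes

# Entropy is unchanged: it only sees which timestamps coincide.
assert timestamp_entropy(indexes) == timestamp_entropy(seconds)

# The standard deviation scales by the spacing, so comparisons between
# clusters come out the same in either unit.
assert np.isclose(seconds.std(), 0.25 * indexes.std())
```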
Returning to the measurements described above, we would expect both the entropy and the standard deviation of the expanding gas to be lower than those associated with the stationary gas, and this is in fact the case.
Running the code attached below produced the following measurements:

Stationary gas:

Average entropy per cluster: 3.9069.

Average standard deviation per cluster: 4.3234.

Expanding gas:

Average entropy per cluster: 0.87500.

Average standard deviation per cluster: 0.43970.
The command line code is below, and the full archive of my A.I. code is available on ResearchGate.
Note that you will need to download my archive (approximately 73 MB) to run the command line code.