# Predictions Using Mass-Scale Data

Following up on the previous article below, attached is an updated command-line script that allows for efficient predictions over datasets composed of tens of millions of observations in Euclidean space.

The dataset in the attached command-line code models a gas expanding in Euclidean space at one of two rates of expansion, and the prediction task is to correctly identify the rate of expansion as either the “fast one” or the “slow one”.

Each state of the gas consists of 10,000 points in Euclidean space, and each sequence of the gas expanding consists of 15 states, for a total of 150,000 three-dimensional vectors per sequence.

Four frames from one sequence of the gas expanding.

There are 300 sequences, for a total of 45,000,000 three-dimensional vectors.
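The sizes above can be checked with a few lines of Python. The sketch below also includes a hypothetical `make_sequence` generator that produces a scaled-down expanding point cloud. The linear scaling rule and the Gaussian starting cloud are assumptions made for illustration, not the model used in the attached script.

```python
import random

# Sizes stated in the article.
POINTS_PER_STATE = 10_000
STATES_PER_SEQUENCE = 15
NUM_SEQUENCES = 300

vectors_per_sequence = POINTS_PER_STATE * STATES_PER_SEQUENCE  # 150,000
total_vectors = NUM_SEQUENCES * vectors_per_sequence           # 45,000,000


def make_sequence(rate, n_points=100, n_states=STATES_PER_SEQUENCE, seed=0):
    """Hypothetical generator: a point cloud scaled outward by `rate` per state.

    Returns a list of `n_states` states, each a list of (x, y, z) tuples.
    """
    rng = random.Random(seed)
    base = [
        (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        for _ in range(n_points)
    ]
    states = []
    for t in range(n_states):
        scale = 1.0 + rate * t  # assumed linear expansion
        states.append([(x * scale, y * scale, z * scale) for x, y, z in base])
    return states


slow_sequence = make_sequence(rate=0.1)
fast_sequence = make_sequence(rate=0.3)
```

The generator defaults to 100 points per state so that it runs instantly. Passing `n_points=POINTS_PER_STATE` gives a sequence of the size described above.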

It is generally very difficult to analyze datasets that involve this many vectors, but the algorithms I’ve developed can nonetheless quickly and efficiently cluster and then make predictions over datasets of this type on an ordinary consumer device.

The first step is to sort the data using an operator I introduced in this article, specially designed for mass-scale data.

This first step took about 10 minutes, running on an iMac.

The next step is to embed each state of the gas on the real number line, using algorithms I introduced in this article.

This step drastically compresses the dataset, from 45,000,000 three-dimensional vectors to 300 15-dimensional vectors, with each vector representing a sequence of states of the gas expanding. It took about 25 seconds.
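To make the shape of this compression concrete, here is a minimal sketch in Python. It maps each state to a single real number and each sequence to a vector with one entry per state. The scalar used here, the mean distance of the points from their centroid, is a stand-in chosen for illustration and is not the embedding from the article referenced above. The names `embed_state` and `embed_sequence` are hypothetical.

```python
import math


def embed_state(points):
    """Map one state (a list of 3D points) to a single real number.

    Assumed stand-in embedding: mean distance of the points from their centroid.
    """
    n = len(points)
    centroid = (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )
    return sum(math.dist(p, centroid) for p in points) / n


def embed_sequence(states):
    """Map a sequence of states to a vector with one real number per state."""
    return [embed_state(state) for state in states]


# Two tiny states: the second is the first scaled outward by a factor of 2.
state_a = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
state_b = [(2.0, 0.0, 0.0), (-2.0, 0.0, 0.0)]
sequence_vector = embed_sequence([state_a, state_b])
print(sequence_vector)  # [1.0, 2.0]
```

Applied to a sequence of 15 states, `embed_sequence` returns a 15-dimensional vector, so 300 sequences reduce to 300 such vectors regardless of how many points each state contains.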

The next step is to cluster the real-number vectors, which took about 0.06 seconds.
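As a rough illustration of clustering vectors of this kind, the sketch below assigns to each vector the set of all vectors within a fixed distance of it. This is a generic threshold-based approach, assumed here for illustration. The function name `threshold_clusters`, the value of `delta`, and the toy vectors are all hypothetical, and the attached script may cluster differently.

```python
import math


def threshold_clusters(vectors, delta):
    """For each vector, return the indices of all vectors within `delta` of it."""
    return [
        [j for j, other in enumerate(vectors) if math.dist(v, other) <= delta]
        for v in vectors
    ]


# Toy embedded sequences (3 states each, for brevity): two slow, two fast.
vectors = [
    (1.0, 1.1, 1.2),
    (1.0, 1.1, 1.21),
    (1.0, 1.3, 1.6),
    (1.0, 1.31, 1.6),
]
clusters = threshold_clusters(vectors, delta=0.05)
print(clusters)  # [[0, 1], [0, 1], [2, 3], [2, 3]]
```

With only 300 vectors of 15 dimensions each, even this quadratic all-pairs comparison involves just 90,000 distance computations.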

The final step is to actually make predictions, which in this case took a total of 0.70 seconds over the entire dataset of 300 sequences.

The accuracy in this case is perfect, with no errors, given only the first 5 of the 15 observations in each input vector. That is, if you give the prediction algorithm the first 5 states of the gas, it correctly classifies the expansion rate every time.
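The idea of predicting from only the first 5 entries of each vector can be sketched with a plain nearest-neighbor rule, shown below in Python. This is an assumed stand-in, not the prediction algorithm in the attached script. The toy vectors come from a hypothetical `make_vector` helper with made-up rates and noise, so the result says nothing about the real dataset.

```python
import math
import random


def make_vector(rate, rng, n_states=15, jitter=0.01):
    """Hypothetical embedded sequence: linear growth at `rate` plus small noise."""
    return [1.0 + rate * t + rng.uniform(-jitter, jitter) for t in range(n_states)]


def prefix_distance(a, b, k):
    """Euclidean distance between the first k entries of two vectors."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(k)))


def predict(query, training, k=5):
    """Return the label of the training vector nearest to `query` on its first k entries."""
    nearest = min(training, key=lambda item: prefix_distance(query, item[0], k))
    return nearest[1]


rng = random.Random(0)
dataset = [(make_vector(0.1, rng), "slow") for _ in range(20)]
dataset += [(make_vector(0.3, rng), "fast") for _ in range(20)]

# Leave-one-out evaluation: predict each vector from all of the others.
correct = 0
for i, (vector, label) in enumerate(dataset):
    training = dataset[:i] + dataset[i + 1:]
    if predict(vector, training) == label:
        correct += 1
accuracy = correct / len(dataset)
print(accuracy)  # 1.0 on this toy data
```

The toy classes are well separated by construction, so the perfect score here is expected and only demonstrates the mechanics of classifying from a 5-entry prefix.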

My complete set of algorithms is available on ResearchGate, though I’ve also attached the command line code for this particular dataset as a PDF.

7-22CMNDLINE