This is just the first post in what will be a torrent of new algorithms I’ve developed for efficiently handling datasets composed of large numbers of observations.
This set of algorithms implements the dictionary of states referenced in my original article on the topic (the latest article on the topic is here, and links back to the original), allowing for radically efficient comparison between very complex observations, in this case by reducing a set of observations to a single dictionary of states.
The net effect of this particular set of algorithms (attached below) is compression: it takes a set of complex observations, each of which could consist of thousands of Euclidean datapoints (or significantly more), and quickly eliminates duplicate states, reducing the dataset to a set of unique observations.
The example in the code below takes a sequence of observations intended to model a simple example from thermodynamics, an expanding gas, which is plotted below.
Each state of the gas consists of 10,000 points in Euclidean space, and there are 10 observations per sequence, with 10 sequences, for a total of 1,000,000 vectors.
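To make the idea concrete, here is a minimal sketch of this kind of state deduplication, assuming a straightforward approach in which each observation (a matrix of points) is hashed by its byte representation and only one representative per unique state is kept. The function name `unique_states`, the 3-dimensional points, and the hashing scheme are all my own illustrative assumptions, not the author's actual implementation:

```python
import time
import numpy as np

def unique_states(observations):
    """Hypothetical sketch: reduce a list of observations (each an N x d
    array) to a dictionary of unique states, plus a state id per observation.
    Duplicates are detected by hashing each array's raw bytes."""
    seen = {}         # byte key -> state id
    dictionary = []   # one representative array per unique state
    ids = []          # state id assigned to each observation
    for obs in observations:
        key = obs.tobytes()
        if key not in seen:
            seen[key] = len(dictionary)
            dictionary.append(obs)
        ids.append(seen[key])
    return dictionary, ids

# Synthetic stand-in for the expanding-gas example: 10 sequences of
# 10 observations each, every observation being 10,000 points (3D here
# for illustration), with deliberate repeats across sequences.
rng = np.random.default_rng(0)
base = [rng.random((10_000, 3)) for _ in range(10)]            # 10 distinct states
observations = [base[t] for _ in range(10) for t in range(10)]  # 100 observations

start = time.perf_counter()
dictionary, ids = unique_states(observations)
elapsed = time.perf_counter() - start

print(len(observations), len(dictionary))  # 100 observations reduced to 10 states
```

Byte-level hashing treats states as duplicates only when they match exactly; tolerating small numerical differences between states would require a different comparison, which is presumably where the author's algorithm does the real work.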
Compression is in this case achieved in about 8 seconds, on an iMac.
This could be useful on its own, but sometime this week I will follow up with another set of algorithms that use this initial step to enable fast prediction over these types of datasets, which consist of massive numbers of observations and would otherwise be intractable, even on consumer devices. I will then follow up with a comprehensive approach to natural language processing using these same algorithms.
The code necessary for this example is available here:
All additional code can be found on my ResearchGate page, under “Project Log”.