# A Note on Layering Classifications

In a previous research note, I showed how to use my deep learning application Prometheus to classify videos. The technique I employed was to apply image classification to each frame of the video, which produces a vector of classifications for each video file. That is, a video consists of some number of frames, and so if we classify each frame as an image, we will generate a vector of classifications that contains as many entries as the video contains frames.

I’ve been experimenting with broadening this approach, by repeatedly applying my classification and prediction algorithm. Specifically, I build a first model using the “generate data tree” function, as applied to a portion of the training set. Then, I take a new portion of the training set, and feed it to the prediction algorithm, which will generate classification predictions. During this process, the algorithm compares each input to every entry in the data tree that underlies the prediction model. By saving this information in a vector, we’ll generate a vector of differences associated with each input.

We can then classify those vectors of differences, by calling the “generate data tree” function again, using those vectors as the training dataset. If we do this repeatedly, we’re going to generate a chain of models, and we can then pump new testing inputs through that chain, generating a classification prediction at each stage of the chain. The end result of this process will be a vector of classifications: the same product produced by my video classification technique. We can then classify those vectors of classifications, with the expectation that this final classification will make use of multiple layers of information extracted from the original dataset.
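Since the “generate data tree” function is part of my software, here is a minimal Python sketch of the chaining idea, in which each hypothetical model is stood in for by the matrix of training data it retains, and the norm of the difference plays the role of the comparison (all names and sizes here are illustrative assumptions):

```python
import numpy as np

def difference_vector(x, stored):
    # compare input x to every stored row, returning the vector of norms
    return np.linalg.norm(stored - x, axis=1)

rng = np.random.default_rng(0)
m1 = rng.random((20, 5))         # stand-in for the first model's data tree
x = rng.random(5)                # a new input

d1 = difference_vector(x, m1)    # 20 differences: one per entry of m1
m2 = rng.random((15, d1.size))   # stand-in for a model trained on difference vectors
d2 = difference_vector(d1, m2)   # the next link in the chain

print(d1.shape, d2.shape)        # (20,) (15,)
```

Each stage consumes the previous stage’s difference vector, so the chain as a whole turns one input into a sequence of increasingly abstract comparisons.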

We could also generate a chain of length three, and then put the bottom two models in the chain into a loop. That is, if the models are M1, M2, and M3, we could run the output of M3 back into the input of M2, producing a loop that would run some fixed number of times, generating a fixed number of classifications. This would cut down on model generation time, but still allow us to generate a vector of classifications. You need a minimum of three models to do this because the input to M1 is the actual training data, whereas the inputs to M2 and M3 are vectors of differences.

I’m also going to experiment with random “filters”, rather than using the trees generated during this process. That is, we compare the vectors to randomly generated vectors, measure the differences, and store them as a difference vector. The hope would be that each class will have a signature pattern when applied to the same set of filters.
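The random filter idea can be sketched as follows; the filter count and dimension here are arbitrary choices for illustration, not values from my algorithms:

```python
import numpy as np

rng = np.random.default_rng(1)
filters = rng.random((8, 5))     # 8 random "filters", fixed for all inputs

def signature(x, filters):
    # difference vector of input x against each random filter
    return np.linalg.norm(filters - x, axis=1)

a = np.zeros(5)                  # stand-ins for inputs from two classes
b = np.ones(5)
sig_a = signature(a, filters)
sig_b = signature(b, filters)
print(sig_a.shape)               # (8,): one difference per filter
```

Because the filters are fixed, two inputs from the same class should produce similar signatures, while inputs from different classes should produce distinct ones.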

I’m currently testing this and will follow up in the next few days with code and a research note.

# A Note on Delimited Sequences

In a previous research note, I introduced a method of delimiting sequences of data, by testing the ratio of adjacent entries in the sequence. This process generates mutually exclusive categories, simply because it places delimiters in the sequence of data, which we can then interpret as indicators that mark the beginning and end of categories. I showed how this method can be used to quickly find the boundaries of objects in images, and of course, it can also be used to categorize data.

However, rather than read the data in the order in which it is presented, we can also compare each element to every other element, thereby generating non-mutually exclusive categories. That is, we perform the same process that I described in the previous research note, but rather than testing only adjacent entries, we test a given entry against every other entry in the dataset, as if they were adjacent. This will generate a category for each element of the dataset: we test each element against every other element, and if the test generates a delimiter, then we exclude the other element from the category in question; if the test does not generate a delimiter, then we include it.
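As a sketch of the pairwise version, assuming a fixed ratio threshold `Delta` stands in for the delimiter test described in the earlier note (there, the threshold is derived from the data rather than fixed by hand):

```python
def categories(data, Delta):
    # for each element, build a category containing every other element
    # whose ratio test does not generate a delimiter
    cats = []
    for i, x in enumerate(data):
        cat = [x]
        for j, y in enumerate(data):
            if i != j and abs(y / x - 1) <= Delta:
                cat.append(y)
        cats.append(cat)
    return cats

data = [1.0, 1.05, 1.1, 5.0, 5.2]
cats = categories(data, Delta=0.15)
print(cats[0])   # [1.0, 1.05, 1.1] -- note that categories can overlap
print(cats[3])   # [5.0, 5.2]
```

Each element defines its own category, and the same element can appear in several of them, which is exactly what makes the categories non-mutually exclusive.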

We can of course also produce mutually exclusive categories using this technique by simply tracking which elements have already been selected.

In the research note below this one, I noted that there is good reason to believe that there is a single objective in-context minimum difference for distinction, which I call $\delta$, and that two of my categorization algorithms produce very similar values for $\delta$ when applied to the same dataset, despite the fact that the algorithms are very different. Specifically, one produces mutually exclusive categories, and the other produces non-mutually exclusive categories. Nonetheless, both produce very similar values of $\delta.$

The question is then, does the delimiter process, which also produces a measure of distinction I call $\Delta$, produce roughly the same value for $\Delta$, whether we’re generating mutually exclusive categories, or non-mutually exclusive categories?

I’m going to test this hypothesis over the next few days, and follow up with the results.

Since it’s easy to generate either mutually exclusive or non-mutually exclusive categories using this approach, regardless of the operator we use to compare two elements of a dataset, a more general question suggests itself:

Is there an objective, in-context level of distinction associated with every operator, as applied to a dataset?

My original categorization algorithm uses the norm of the difference between two vectors to compare elements of a dataset. But my library contains algorithms that use other operators, such as intersection and inequality, and we can imagine others, like taking the union of sets. These are just trivial variations on my main theme of AI, which is to iterate through levels of distinction, and select the level that generates the greatest change in the entropy of the object in question.

Restating the question: is there a single, objective threshold value, analogous to $\delta$, that is associated with every operator as applied to a given dataset?

To test this hypothesis, we’d have to generate mutually exclusive categories using the operator in question, note the associated value of $\delta$, and then generate non-mutually exclusive categories, and compare the resultant value of $\delta$ to the prior value of $\delta$. I will test this hypothesis as well, but possibly in a separate note, since it is a much larger topic.

# Measuring Dataset Consistency

Even if two datasets are derived from the same pool of observations, it could still be the case that there are unaccounted for differences between the two datasets that are not apparent to a human observer. Any such discrepancies could change the way we think about the data, and could, for example, justify building two separate models, or suggest the existence of inconsistencies in the way the datasets were generated, undermining the validity of any inferences drawn from the datasets. Below, I’ll show how we can use one of my algorithms to measure the internal consistency of a single dataset, and the consistency between two datasets.

## Measuring Internal Consistency

In a previous article, I mentioned that I had written an algorithm that can quickly generate non-mutually exclusive categories on a dataset. Like nearly all of my algorithms, this “graph categorization algorithm” generates a measure of distinction $\delta$, that tells us how different two data points need to be in order to justify distinguishing between them in the context of the dataset. Specifically, if vectors $x$ and $y$ are both in some dataset, then we distinguish between $x$ and $y$ only if $||x - y|| > \delta$. That is, if two vectors are within $\delta$ of each other, then we treat them as equivalent, whereas if the norm of their difference exceeds $\delta$, then we distinguish between them.

The structure of the dataset will of course affect the value of $\delta$. Generally speaking, if the data is spread out, then $\delta$ will be large, and if the data is concentrated, then $\delta$ will be small. As a result, when we add new data to a dataset, we will almost certainly change the value of $\delta$. However, if we add new data that is drawn from the same underlying dataset, then the value of $\delta$ shouldn’t change much. That is, if the original dataset is sufficiently large, then we’re not learning anything from the new data – we’re just including more examples of the same type of data. As a result, we can use $\delta$ as a measure of how much the inclusion of new data changes the structure of a dataset, by evaluating $\delta$ before and after the inclusion of the new data.

Let’s begin by incrementally adding new data to a dataset, and measuring the value of $\delta$ at each iteration. The dataset will in this case consist of 10-dimensional vectors of random numbers generated by Octave. The expectation is that the value of $\delta$ should stabilize once the dataset is sufficiently large, since once we have enough data, the appropriate level of distinction should become clear, and roughly invariant with respect to new data being added. Stated differently, assuming my algorithms work, there should be a single level of distinction for a dataset, that shouldn’t change as we add new data, assuming the new data is of the same type as the existing data.

We can accomplish this with the following Octave code:

```
N = 10;
data_matrix = [];

for i = 1 : 500

  % append one new random vector to the dataset
  data_matrix = [data_matrix; rand(1, N)];

  [final_graph_matrix, final_delta] = generate_dataset_graph(data_matrix, N);

  data_vector(i) = final_delta;

  final_delta

endfor

figure, plot(1:500, data_vector)
```

This will add one new vector at a time to the dataset, categorize the dataset using the “graph categorization algorithm”, print the resultant value of $\delta$, and store it in a vector that is later displayed as a graph. Below is the graph generated by the code above, which will vary somewhat each time you run the code, since the dataset is generated randomly. Note that even though the graph categorization algorithm is fast, and has a polynomial runtime, the code above involves calling the algorithm $500$ times, so it’s going to take a little while to run. I’ve named the graph generated by plotting $\delta$ as a function of the size of the dataset the consistency curve for the dataset. The average over the consistency curve is in this case $1.0147$, and the standard deviation is $0.10186$.

The consistency curve for the dataset of random vectors.

As you can see, the value of $\delta$ stabilizes near the average, which is shown as the orange line in the graph above. This result is consistent with the hypothesis that there should be a single objective value of $\delta$ that is intrinsic to the dataset in the abstract, and not dependent upon the number of observations. Obviously, without a sufficiently large number of observations, you can’t determine this value. But nonetheless, at the risk of being overly philosophical, there really is an underlying process that generates every dataset. Therefore, the correct minimum difference that warrants distinction in the context of a particular dataset is a function of that process, not the observed dataset. The observed dataset gives us a window into that process, and allows us to generate an approximation of the true, underlying in-context difference that warrants distinction between observations generated by that process.

The idea that there is a single, in-context measure of distinction for a dataset is further supported by the observation that replacing the graph categorization algorithm, with my original categorization algorithm (“optimize_categories_N”), in the code above produces very similar values for $\delta$. This is actually remarkable, because these two algorithms make use of different processes, and generate different outputs: the graph categorization algorithm generates non-mutually exclusive categories, whereas the original categorization algorithm generates mutually exclusive categories. Nonetheless, they generate approximately the same values of $\delta$, which supports the notion that just like a dataset has a “true” mean, and standard deviation, there is also a true minimum difference that warrants distinction – i.e., a true value of $\delta$.

It also supports the idea that my method of selecting the level of distinction that generates the greatest change in the entropy of the object in question is the correct way to find this value of $\delta$, since this is the method that both algorithms have in common, despite the fact that they produce different outputs using that value of $\delta$.

Now let’s repeat the same process using the Wine Dataset, which is courtesy of the UCI Machine Learning Repository. The same hypothesis should hold, which is that the value of $\delta$ should stabilize around the “correct”, in-context level of distinction for the dataset as we add more data. This is in fact exactly what happens, as you can see in the chart below. The average value over the consistency curve is in this case $5.5301$, and the standard deviation is $1.0820$.

The consistency curve for the Wine Dataset.

The volatility of the consistency curve should decline as a function of the number of observations, and if it doesn’t, then this implies that new observations carry new information about the structure of the dataset. Since this could of course be the case, not all datasets will produce consistency curves that stabilize, and so we can use this curve as a measure of the internal consistency of a dataset. That is, the greater the volatility of this curve, the more “shocks” there are to the structure of the dataset from new observations. If this persists, then there might not be a single process at work generating the data, which means that what we’re observing might actually be transitory in nature, and not a stable process that we can observe and predict. Alternatively, it could be the case that the underlying process has a high complexity. That is, if the underlying process has a high Kolmogorov complexity, then it will generate a large number of novel observations, each of which will shock the dataset. Finally, the dataset could also contain interjections of noise, which will also shock the dataset if the noise is novel every time, which is plausible, since noise is presumably the product of a Kolmogorov-random process that generates novel observations.

The consistency curve for the Ionosphere Dataset.

Above is the consistency curve for the Ionosphere Dataset, which is also courtesy of the UCI Machine Learning Repository. The average over the curve is $1.8694$, and the standard deviation is $0.44548$. As you can see, it’s highly volatile, suggesting that new observations significantly change the structure of the dataset. Interestingly, this dataset produces a large percentage of “rejected” inputs when my prediction algorithm is applied to it. A rejected input indicates that the prediction algorithm believes that the input is beyond the scope of the training dataset. If a dataset is not internally consistent, then randomly selected observations are more likely to be novel observations, and therefore, outside the scope of the training dataset. Therefore, we would expect a large percentage of rejections for a dataset that is not internally consistent, which is exactly what happens in the case of the Ionosphere Dataset.

It’s tempting to think that a countably infinite number of observations (which is obviously not physically possible), would allow us to discern the “true” level of distinction using this process, but I’m not sure that’s correct. I haven’t worked through the mathematics carefully, yet, but even a superficial analysis implies that the notion of entropy has to change when you have a countable set, since Shannon’s equation does not work with a countable set. Specifically, you can’t have a uniform distribution on a countable set using ordinary probabilities, and therefore, you need a new measure of entropy if you’re going to make use of countable sets. But, as far as we know, observation is finite, so this question is academic, at least for now.

## Measuring Consistency Between Datasets

If we have two datasets of observations that were ostensibly generated using the same procedures, and sampled from the same source, then the value of $\delta$ shouldn’t change much when we combine the datasets. If $\delta$ does change significantly, then either the two datasets are incomplete on their own, or, there’s some underlying difference between them that’s unaccounted for.

There are probably a number of reasonable ways to go about testing the consistency between two datasets using $\delta$, but the method I’ve decided to use is to generate three consistency curves: one for the first dataset, one for the second dataset, and one for the combined dataset. Then, we can measure both the average value and standard deviation of each consistency curve. If the average value of the combined curve changes significantly, or its standard deviation increases significantly, in each case as compared to the two individual curves, then it suggests that combining the datasets significantly disturbed the structure of the individual datasets. This in turn suggests that the two datasets are in fact distinct. In contrast, if the average value of $\delta$ is not significantly changed, and the standard deviation is unchanged or decreases, then it suggests that the two datasets are consistent.
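A minimal sketch of this comparison, where the tolerance parameter is my own assumption rather than a value fixed by this process (each argument is a (mean, standard deviation) pair for one consistency curve):

```python
def combined_is_consistent(stats_a, stats_b, stats_ab, tol=0.25):
    # heuristic: the combined curve's mean should stay near the individual
    # means, and its standard deviation should not blow up
    (mean_a, sd_a), (mean_b, sd_b), (mean_ab, sd_ab) = stats_a, stats_b, stats_ab
    mean_ref = (mean_a + mean_b) / 2
    mean_ok = abs(mean_ab - mean_ref) <= tol * mean_ref
    sd_ok = sd_ab <= (1 + tol) * max(sd_a, sd_b)
    return mean_ok and sd_ok

# reported values below: Wine, Parkinsons (14 dimensions), and the combined dataset
print(combined_is_consistent((5.5301, 1.0820), (10.662, 2.8521), (9.5724, 4.4233)))
# False: the jump in standard deviation flags the datasets as distinct
```

With the values reported in this note, the heuristic flags the combination: the combined curve’s standard deviation rises well beyond either individual curve’s.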

If the two datasets are both reasonably large, and their individual consistency curves stabilize, then if the combined consistency curve is drastically more volatile, we can be more confident that the datasets are not incomplete, but that instead, there is some bona fide difference between them. If they’re both sampled from the same source, then there must be some difference in the sampling process that explains the resultant differences between the datasets. As a result, we can also use this process to identify inconsistencies in sampling methods used to gather data, as well as distinguish between datasets that are superficially similar, but nonetheless have some subtle differences that might not be apparent to a human observer.

We’ll apply this process by combining the Parkinsons Dataset, which is again courtesy of the UCI Machine Learning Repository, and the Wine Dataset. The Parkinsons Dataset has $23$ dimensions, and the Wine Dataset has $14$. As a result, we can’t combine them without reducing the dimension of the Parkinsons Dataset to $14$. Reducing the dimension of the Parkinsons Dataset will obviously affect the dataset, but we’re using it for a very limited purpose, which is to demonstrate that when two datasets that are clearly not consistent with each other are combined, the consistency curve will be drastically impacted.

The consistency curve for the Parkinsons Dataset, limited to $N = 14$.

Above is the consistency curve for the Parkinsons Dataset, limited to $14$ dimensions of data. The average over the curve is $10.662$, and the standard deviation is $2.8521$. Though there are some shocks, it trends reasonably close to the average, suggesting that the dataset is reasonably internally consistent, even when limited to $14$ dimensions. Below is the consistency curve for the Parkinsons Dataset using the full $23$ dimensions of data. The average over the curve below is $13.282$, and the standard deviation is $3.6951$. The inclusion of the additional $9$ dimensions obviously affected the consistency curve significantly, but this was expected.

The consistency curve for the Parkinsons Dataset, using all $N = 23$ dimensions.

I combined the two datasets into a single dataset with the rows from the Wine Dataset first, and the rows from the Parkinsons Dataset afterward, without changing the order of the rows in either dataset. I then ran the same process on the combined dataset, which generated the consistency curve below. The Wine Dataset contains $178$ rows, and the Parkinsons Dataset contains $195$ rows, and you can clearly see that the consistency curve breaches the average after roughly $200$ observations, suggesting that the inclusion of the Parkinsons Dataset drastically altered the dataset, which is consistent with our hypothesis. The average over the curve is $9.5724$, and the standard deviation is $4.4233$.

The consistency curve for the combined dataset.

Deciding whether two datasets are distinct is a binary question, but the purpose of this process is to provide data that informs a decision either way, rather than an objective threshold for distinction. In this case, the results are rather obvious. Nonetheless, the decision to distinguish will depend upon what you’re doing with the data. That is, even if there’s a significant degree of inconsistency between two datasets, it might not matter for certain purposes, which means that we can’t set an objective point at which distinction is necessary, without also taking into account the purpose for which the datasets are being used.

This process of comparing consistency curves could be a powerful tool for statisticians looking to identify errors and inconsistencies in their sampling procedures, and for data scientists deciding whether to build separate models for ostensibly similar datasets. Though it’s not an area of interest for me, I suspect this methodology could also be used to facilitate forgery detection and DNA analysis, since this process would uncover discrepancies in superficially similar datasets, which could be generated by examining real-world objects.

## Applications to Data Compression

If we have an extremely large dataset, then we probably don’t want to use the entire dataset as a training dataset, since this will require a significant amount of time, and, at least when using my software, generate a large model (i.e., the data structure that models the dataset will be large). A large model will slow down predictions, so for both reasons, we should use as little data as possible in training the learning algorithm.

In order to compress a dataset, we could randomly select data from the dataset until the consistency curve stabilizes. Once the consistency curve stabilizes, we can form the reasonable expectation that there shouldn’t be any further shocks to the sampled dataset from new observations, and therefore, any new information from the dataset will be of only marginal importance. Of course, this could be wrong, and there could be some corner of the dataset that is significant, that we just happened to miss in our sample. But nonetheless, as a practical matter, this should work just fine, and if at some future point our model starts to generate a significant number of errors, then we can retrain it.
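Here is a sketch of the sampling procedure in Python. The stopping rule (a rolling window and a relative tolerance) and the stand-in delta function are my own assumptions for illustration; the actual delta calculation is my categorization algorithm, which isn’t reproduced here:

```python
import numpy as np

def compress_dataset(data, delta_fn, window=20, tol=0.05):
    # randomly sample rows one at a time, recomputing delta after each
    # addition; stop once the last `window` values of delta vary by less
    # than tol relative to their mean (i.e., the curve has stabilized)
    rng = np.random.default_rng(0)
    order = rng.permutation(len(data))
    deltas = []
    for k in range(1, len(order) + 1):
        sample = data[order[:k]]
        deltas.append(delta_fn(sample))
        recent = deltas[-window:]
        if k >= window and max(recent) - min(recent) <= tol * np.mean(recent):
            return sample, deltas
    return data[order], deltas

# toy stand-in for the delta calculation: the overall spread of the sample
data = np.random.default_rng(1).random((500, 10))
sample, deltas = compress_dataset(data, lambda s: float(np.std(s)))
print(sample.shape)   # typically far fewer than 500 rows
```

The returned sample is the compressed training set, and the list of deltas is the consistency curve of the sample, which can be inspected to confirm that it did in fact stabilize.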

## Changes in Entropy; Physics

For those that are interested in further reading, in a previous article, I discussed how the entropy of a dataset changes as we add new data. The results are similar to the results presented above, and actually form the basis of my prediction algorithm. For those interested in making sense of the Kolmogorov complexity of a physical process, you can see my paper on the applications of information theory to physics, though it is quite long.

# Non-Mutually Exclusive Categories / Real-Time Deep Learning

My main categorization algorithm that underlies Prometheus generates mutually exclusive categories. But we can use a similar method to generate categories that aren’t mutually exclusive. Specifically, we can generate a value $\delta$, and then ask, as a general matter, whether two elements of a dataset are within $\delta$ of each other. Represented visually, we can assign each data point in the dataset a vertex in a discrete graph, and if two data points are within $\delta$ of each other, then we connect them with an edge.

We can generate $\delta$ using my main categorization algorithm, which will produce an in-context value of $\delta$, or we can instead use another technique I introduced previously that measures the local consistency of data. Using the local consistency technique, if we have $M$ elements in our dataset, we would produce an $M \times M$ matrix, where entry $i,j$ is $1$ only if data points $i$ and $j$ are within $\delta$ of each other.

We would then iterate through different values of $\delta$, and select the value that generates the greatest change in the entropy of the matrix. For an explanation of why this would work, you can have a look at my main research paper on AI.
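Here is a rough Python sketch of the matrix construction and the entropy-based selection of $\delta$. The particular entropy measure, treating the matrix as a Bernoulli source of edges, is just one reasonable reading of “the entropy of the matrix”, and the candidate grid is an assumption:

```python
import numpy as np

def adjacency(data, delta):
    # M x M matrix whose (i, j) entry is 1 iff points i and j are within delta
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    return (dists <= delta).astype(int)

def matrix_entropy(A):
    # entropy of the matrix viewed as a Bernoulli source of edges
    p = A.mean()
    if p in (0.0, 1.0):
        return 0.0
    return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)))

def select_delta(data, num_steps=100):
    # iterate through candidate values of delta, and select the one that
    # generates the greatest change in the entropy of the matrix
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    candidates = np.linspace(0, dists.max(), num_steps)
    entropies = [matrix_entropy(adjacency(data, d)) for d in candidates]
    changes = np.abs(np.diff(entropies))
    return float(candidates[int(np.argmax(changes)) + 1])

# two well-separated clusters of points
rng = np.random.default_rng(2)
data = np.vstack([rng.random((5, 2)), rng.random((5, 2)) + 5.0])
delta = select_delta(data)
A = adjacency(data, delta)
```

The resulting matrix is symmetric with ones on the diagonal, and defines the graph described below: each row is the category generated by one element of the dataset.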

This will produce a category associated with each element of the dataset, where another element is a member of that category only if it is within $\delta$ of the element that defines the category.

We can use the resultant matrix to define a graph that is associated with the dataset. This graph will show, visually, which data points are sufficiently similar to be connected by an edge, which in turn allows for the quantization of distance by path length between all elements of the dataset.

The script to generate this categorization / graph is available on my researchgate blog.

There are other variations on this theme that we could use, like calculating a measure of the entropy of the graph, rather than the entropy of the matrix that defines the graph, but this works, and so, that’s it for now.

I plan to use this technique in a real-time deep learning algorithm called Ayin ( ע ) that I’m currently working on, which should (hopefully) be ready in the next few days. I will also further optimize my existing algorithms to make maximum use of vectorization, which I hope will push performance over the edge, allowing for truly real-time deep learning on consumer devices.

# Prometheus GUI Application

I’m happy to announce that a free, non-commercial version of my Prometheus Deep Learning Engine is now available as a GUI Application.

Below / Attached is a research note that explains how to use Prometheus to do basic machine learning, and deep learning video classification.

Everything you need to download, and install Prometheus, can be found in the research note:

https://www.researchgate.net/publication/335224609_Autonomous_Deep_Learning

If you’re interested in purchasing a commercial version of Prometheus, please send me an email using the email address on my SSRN page.


# Prometheus AI Engine

I’ve put together a command line interface that allows for autonomous machine learning and deep learning using my AI algorithms:

https://www.researchgate.net/project/Information-Theory-SEE-PROJECT-LOG

The user simply selects the training file and testing file using a GUI, and a single core learning algorithm automatically generates a model, and then predictions. I’ll follow up with a research paper demonstrating how this one application can do everything from basic machine learning, to deep learning video classification.

I’m also working on a full-blown, industrial-quality Python application that will allow for GUI-based access to all of my AI algorithms, ultimately allowing non-experts to do the work of machine learning and deep learning, simply because it’s so easy to do using my algorithms. With the addition of a GUI, only minimal training will be necessary to accomplish otherwise very sophisticated tasks in AI.

The business case for this application is obvious: you don’t need as many data scientists when you have an application that can spontaneously generate models that work fast and accurately. Shrewdness aside, this will also allow data scientists to leverage the power of this application to do things that were simply impossible beforehand, and focus on more interesting questions.

If you’re interested, or know someone who would be interested, please send me an email. You can find my email address by clicking the “Contact” button on my SSRN page.

# A Note on Unbalanced Datasets

Note that if you’re using my algorithms on a dataset where the dimensions contain drastically different values, then performance could be negatively affected. One simple way to identify this problem is to measure the standard deviation of the log base 10 of the entries of each row of the dataset.

So for example, if a row contains the entries [1 10 100], we would calculate log([1 10 100]) = [0 1 2], and then calculate the standard deviation of that vector, which is in this case 1. If the standard deviation of this vector is above 1, then you should probably experiment with weighting the data before training the algorithms. This can be done by a simple loop that divides the outlier large dimensions by powers of 10, and then tests the resultant accuracy, of course picking the weights that generate the greatest accuracy.
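A quick sketch of the check in Python; the sample standard deviation reproduces the value of $1$ for the example row, and the reweighting shown is a simple order-of-magnitude normalization. The accuracy-driven weighting loop described above requires the training algorithm itself, so it isn’t reproduced here:

```python
import numpy as np

def scale_spread(row):
    # sample standard deviation of the log base 10 of a row's entries;
    # reproduces the value of 1 for the example row [1 10 100]
    return float(np.std(np.log10(np.abs(row)), ddof=1))

row = np.array([1.0, 10.0, 100.0])
print(scale_spread(row))            # 1.0

# a simple reweighting: divide each dimension by a power of 10, reducing
# every dimension to the same order of magnitude
weights = 10.0 ** np.floor(np.log10(np.abs(row)))
print(row / weights)                # [1. 1. 1.]
```

In practice you would compute the weights from the training data once, and then apply the same weights to every testing input.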

On a related note, I’m generalizing my state-space optimization algorithm to allow for totally arbitrary input data, where the algorithm will decide on its own how much to weight a particular dimension of a dataset.

Note that this is not what gradient descent and other interpolation algorithms do. Those types of algorithms use the weights to classify or predict data. My AI algorithms can already simulate basically any machine learning or deep learning algorithm as they stand.

This additional step will instead allow my algorithms to identify which dimensions are relevant when given a dataset that might contain a significant amount of irrelevant information (i.e., a significant number of “noise” dimensions), as opposed to unbalanced but relevant “signal” dimensions. That is, this process will allow my algorithms to autonomously identify which dimensions from a dataset are most relevant, and then construct categories using those dimensions.

# A Brief Note on Resource Management and State-Space Navigation

Shannon’s entropy equation can be used to solve resource management problems:

Let’s say we have 5 nodes in a search space, and we expect each node to be about the same distance from some goal state. In this case, there’s no reason to allocate more computational resources to one node than any other, since ex ante we expect all nodes to perform the same.

Now imagine that we expect one of those nodes to be significantly closer to the goal state than the others. It would be rational in that case to dedicate more computational power to the states generated by that node.

We can use Shannon’s equation to answer the question of how much additional power should be allocated.
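One way to sketch this is to treat the ex ante closeness estimates as a probability distribution via a softmax, which is an assumption on my part, since the mapping from estimates to probabilities isn’t fixed above, and then allocate the budget in proportion, with Shannon’s equation measuring how concentrated the allocation is:

```python
import math

def allocate(scores, budget):
    # convert ex ante closeness scores into a probability distribution,
    # and allocate the compute budget in proportion
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Shannon's equation: the entropy of the allocation
    entropy = -sum(p * math.log2(p) for p in probs)
    allocation = [round(p * budget) for p in probs]
    return allocation, entropy

# five nodes expected to perform the same: uniform allocation, maximum entropy
alloc, h = allocate([0, 0, 0, 0, 0], budget=100)
print(alloc)        # [20, 20, 20, 20, 20]

# one node expected to be much closer to the goal: it gets most of the budget
alloc2, h2 = allocate([2, 0, 0, 0, 0], budget=100)
print(h2 < h)       # True: a concentrated allocation has lower entropy
```

The uniform case yields the maximum entropy of $\log_2 5$, and the entropy drops as the allocation concentrates on the promising node.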

We can also use information theory to help guide the search process itself. If we know nothing about a state-space, we’ll probably want to fully unpack all the nodes until we have a robust set of nodes that have some minimum threshold of diversity.

Generating that last, sufficiently diverse state-space will of course require producing a sequence of state-spaces, each with its own measure of entropy.

We can then apply my standard technique in AI, which is to select the state-space that generates the greatest change in the entropy of the state-space during that process.

That state-space is the one whose generation produced the greatest change in the structure of the search, and we can use it as our starting point for more carefully unpacking the depths below.

Finally, generating a state-space is generally very computationally intensive, and likely to be exponential as a function of the number of iterations (i.e., a state-space can be represented as a tree whose breadth grows exponentially as a function of depth). As a result, it’s rational to try to compress the number of dimensions used in navigating the state-space.

I introduced an approach to dimension compression in the note below this one that makes use of my categorization algorithm, but one possibly faster approach is to simply test how sensitive the value we’re attempting to predict is to changes in the input dimension in question. That is, we would start with dimension 1, and test the response of the function we’re evaluating to changes in the value of that dimension, and repeat this process for each dimension. Then, we allocate resources according to how sensitive the output variable is to the dimension in question.

So for example, if we’re trying to find the minimum of the function $z = x^5 + y$, the value of $z$ is going to be far more sensitive to changes in the value of $x$ than changes in the value of $y$. This is easy to test, so we can do this type of test quickly for each dimension of a dataset, and then allocate resources using Shannon’s equation (where the most sensitive dimensions get the most resources), or simply select some fixed number of the most sensitive dimensions. This could dramatically reduce the workload of the state-space navigation function for high-dimensional functions.
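The sensitivity test can be sketched with a central finite difference on the example function; the step size and evaluation point are arbitrary choices for illustration:

```python
def z(x, y):
    return x ** 5 + y

def sensitivity(f, point, dim, h=1e-3):
    # central finite difference: response of f to a small change in one input
    lo, hi = list(point), list(point)
    lo[dim] -= h
    hi[dim] += h
    return abs(f(*hi) - f(*lo)) / (2 * h)

point = (2.0, 2.0)
sx = sensitivity(z, point, 0)   # approximately dz/dx = 5 * 2^4 = 80
sy = sensitivity(z, point, 1)   # approximately dz/dy = 1
print(sx > sy)                  # True: z is far more sensitive to x
```

Running this test once per dimension gives a vector of sensitivities, which can then feed the Shannon-based allocation described above, or simply be used to select some fixed number of the most sensitive dimensions.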

I’ll follow up with code and a research note.