Vector Entropy and Periodicity

Revisiting the topic of periodicity, the first question I tried to answer is, given a point, how do you find the corresponding points in the following cycles of the wave? If for example, you have a simple sine wave, then this reduces to looking for subsequent points in the domain that have exactly the same range value. However, this is obviously not going to work if you have a complex wave, or if you have some noise.

You would be in this view looking for the straightest line possible across the range, parallel to the x-axis, that hits points in the wave, since those points will for a perfectly straight line, have exactly equal range values. So then the question becomes, using something like this approach, how do I compare two subsets of the wave, in the sense that, one could be better than the other, in that one more accurately captures some sub-frequency of the total wave. This would be a balancing act between the number of points captured, and the error between their respective values. For example, how do you compare two points that have exactly the same range value, and ten points that have some noise?

The Logarithm of a Vector

This led me to what is fairly described as vector entropy, in that it is a measure of diffusion that produces a vector quantity. And it seems plausible that maximizing this value will allow you to pull the best sets of independent waves from a total wave, though I’ve yet to test this hypothesis, so for now, I’ll just introduce the notion of vector entropy, which first requires defining the logarithm of a vector.

Defining the logarithm of a vector is straight forward, at least for this purpose:

If \log(v_1) = v_2, then 2^{||v_2||} = ||v_1||, and \frac{v_1}{||v_1||} = \frac{v_2}{||v_2||}.

That is, raising 2 to the power of the norm of v_1 produces the norm of v_2, and both vectors point in the same direction. Note that because ||v_2|| = \log(||v_1||), and v_2 = \log(v_1), it follows that ||v_2|| = ||\log(v_1)||.

This also implies a notion of bits as vectors, where the total amount of information is the norm of the vector, which is consistent with my work on the connections between length and information. It also implies that if you add two opposing vectors, the net information is zero. As a consequence, considering physics for a moment, two offsetting momentum vectors would carry no net momentum, and no net information, which is exactly how I describe wave interference.

Vector Entropy

Now, simply read the wave from left to right (assuming a wave in the plane), and each point will define a v_i = (x_i,y_i) vector, in order. Take the vector difference between each adjacent pair of vectors, and take the logarithm of that difference, as defined above. Then take the vector sum over the resultant set of difference vectors. This will produce a vector entropy, and the norm of that vector entropy is the relevant number of bits.

Expressed symbolically, we have,

\overrightarrow{H} = \sum_{\forall i} \log(v_i - v_{i+1}),

Where ||\overrightarrow{H}|| has units of bits.

Lemma 1. The norm of ||\overrightarrow{H}|| is maximized when all \Delta_i = v_i - v_{i+1} are equal in magnitude and direction.

Proof. We begin by proving that,

\bar{H} = \sum_{\forall k} \log(||\Delta_k||),

is maximized when the norms of all \Delta_i are equal. Assume this is not the case, and so there is some ||\Delta_i|| > ||\Delta_j||. We can restate \bar{H} as,

\bar{H} = \sum_{\forall k \neq (i,j)}[\log(||\Delta_k||)] + \log(\Delta_i) + \log(\Delta_j).

Now let L = ||\Delta_i|| + ||\Delta_j||, and let F(x) =  \log(L - x) + \log(x). Note that if x = ||\Delta_j||, then F(x) = \log(\Delta_i) + \log(\Delta_j). Let us maximize F(x), which will in turn maximize \bar{H} by taking the first derivative of F, with respect to x, which yields,

F' = \frac{1}{x} - \frac{1}{L - x}.

Setting F' to zero, we find x = \frac{L}{2}, which in turn implies that F(x) has an extremal point when the arguments to the two logarithm functions are equal to each other. The second derivative is negative for any positive value of L, and because L is the sum of the norm of two vectors, L is always positive, which implies that F is maximized when ||\Delta_i|| = ||\Delta_j||. Since we assumed that \bar{H} is maximized for ||\Delta_i|| > ||\Delta_j||, we have a contradiction. And because this argument applies to any pair of vectors, it must be the case that all vectors have equal magnitudes.

For any set of vectors, the norm of the sum is maximized when all vectors point in the same direction, and taking the logarithm does not change the direction of the vector. Therefore, in order to maximize ||\overrightarrow{H}||, it must be the case that all \Delta_i point in the same direction. Note that if all such vectors point in the same direction, then,

||\overrightarrow{H}|| = \sum_{\forall i} ||\log(v_i - v_{i+1})|| = \bar{H},

which completes the proof. □

Note that this is basically the same formula I presented in a previous note on spatial diffusion, though in this case, we have a vector quantity of entropy, which is, as far as I know, a novel idea, but these ideas have been around for decades, so it’s possible someone independently discovered the same idea.


Compressing Data Over Time

In a previous article, I introduced an algorithm that can partition data observed over time, as an analog of my core image partitioning algorithm, which of course operates over space. I’ve refined the technique, and below is an additional algorithm that can partition a dataset (e.g., an audio wave file) given a fixed compression percentage, and it runs in about .5 seconds, per 1 second of audio (44100 Hz mono).

Here are the facts:

If you fix compression at about 90% of the original data, leaving a file of 10% of the original size, you get a decent sounding file, that visually plainly preserves the original structure of the underlying wave. Compression in this case takes about .5 seconds, per 1 second of underlying mono audio, which is close to real time (run on an iMac). If you want the algorithm to solve for “ideal” compression, then that takes about 1 minute, per 1 second of underlying mono audio, which is obviously not real time, but still not that bad.

What’s interesting, is that not only is the audio quality pretty good, even when, e.g., compressing 98% of the underlying audio data, you also preserve the visual shape of the wave (see above). For a simple spoken audio dataset, my algorithm places the “ideal” compression percentage at around 98%, which is not keyed off of any normal notions of compression, because it’s not designed for humans, but is instead designed for a machine to be able to make sense of the actual underlying data, despite compression. So, even if you think the ostensibly ideal compression percentage sounds like shit, the reality is, you get a wave file that contains the basic structure of the original underlying wave, with radical compression, which i’ll admit, I’ve yet to really unpack and apply to any real tasks (e.g., speech recognition, which I’m just starting). However, if your algorithms (e.g., classification or prediction) work on the underlying wave file, then it is at least at this point not yet absurd that they would also work on the compressed wave, and that’s what basically all of my work in A.I. is based upon:

Compression for machines, not people.

And so, any function that turns on macroscopic structure, and not particulars, which is obviously the case for many tasks in A.I., like classification, can probably be informed by a dataset that makes use of far more compression than a human being would like. Moreover, if you have fixed capacity for parallel computing, then these types of algorithms allow you to take an underlying signal, and compress it radically, so that you can then generate multiple threads, so that you can, e.g., apply multiple experiments to the same compressed signal. Note that this is not yet fully vectorized, though it plainly can be, because it relies upon the calculation of independent averages. So even if, e.g., Octave doesn’t allow for full vectorization in this case, as a practical matter, you can definitely make it happen, because the calculations are independent.

Attached is the applicable code, together with a simple audio dataset of me counting in English.

Counting Dataset

Partitioning Data Over Time

My basic image partition algorithm partitions an image into rectangular regions that maximize the difference between the average colors of adjacent regions. The same can be done over time, for example, when given a wave file that contains audio, other amplitude data, or any time-series generally. And the reasoning is the same, in that minor changes in timing prevent row by row comparison of two sets of observations over time, just like noise in an image prevents pixel by pixel comparison in space, making averaging useful in both cases, since you group observations together, blurring any idiosyncrasies due to small changes in observation.

Below is the result of this process as applied to a simple sin wave in the plane, with the partitions generated by the algorithm on the left, and the resultant square wave produced by replacing each value with the average associated with the applicable region. Also attached is the Octave command line code.

Note that I made what is arguably a mathematical error in the function that partitions the time-series. Attached is a standalone function that accomplishes exactly this. Also attached is an algorithm that does the same given a fixed compression percentage.