A Note on Multiplicity in Time

I’ve been exploring notions of physical multiplicity for years, and I think you can make rigorous sense of it using the ideas I’ve already developed in physics. Specifically, I think a multiverse is fine, and in fact, you can imagine multiplicity expressed through a space that actually exists, where all possible outcomes from a given moment into the next are actually physically real. Though this implies a question:

How are they arranged in that space?

As stated in a previous note, as far as I know, multiplicity of outcome is real, given the same initial conditions, since collisions require only conservation of momentum (assuming nothing else changes in the collision). So given a moment in time, all future outcomes would be physically extant in the space of time itself, which would imply a space that grows, but is fixed at all generated points. That is, you’d have a tree that produces, from inception, all possible next states. Instead, I think you also need a source that generates basically the same set of initial conditions over and over, which would be like the Big Bang on repeat, with no material differences between proximate instances. This would cause mass to move not only through Euclidean space, but also through the space of time itself, creating a space that is, at any fixed point, basically the same forever, but nonetheless not truly fixed, with particles that have a velocity in both physical space and time, entering and leaving, onto the next.

We can therefore imagine a line through the space of time itself, where nothing changes, because all exchanges of momentum net to zero, causing no change at all to the entire Universe, along that line. That is, this is the freak outcome Universe where every exchange of momentum nets to zero change, producing a static portrait of inception itself –

It’s the coin that always lands on its side, forever.

Imagine this as a line through a plane, with increasingly distant outcomes at increasing distances from this line –

This is all you need to imagine the space of time, organized in the plane.

Now imagine gravity and charge diffuse over both space and time. For if they instead diffuse over only space, then you end up with a law of gravity that, e.g., decreases in strength as a linear function of distance (just use basic trigonometry, assuming a constant rate and distribution of emission of force carriers, projected toward a line at increasing distance from the point origin of the force carriers). This is, of course, wrong, because gravity obeys a square law of diffusion. So now instead assume that gravity diffuses over both time and space –

You end up with a square law of diffusion, which is correct.

This suggests the possibility that both gravity and charge are emitted through the space of time itself, in addition to Euclidean space.
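The dimensional bookkeeping behind this argument can be checked with a toy model: if force carriers are emitted isotropically from a point at a constant rate, their density over a shell at distance r falls as 1/r when they spread over two dimensions, and as 1/r² when they spread over three. Adding one dimension (here, the space of time) turns a linear law into a square law. This is a minimal sketch of that claim, not a derivation; the function name and numbers are illustrative.

```python
import math

def carrier_density(n_carriers, r, dims):
    """Density of force carriers crossing a shell of radius r, assuming
    isotropic emission from a point source at a constant rate.
    dims=2: carriers spread over a circle (circumference 2*pi*r).
    dims=3: carriers spread over a sphere (surface area 4*pi*r**2)."""
    if dims == 2:
        return n_carriers / (2 * math.pi * r)
    return n_carriers / (4 * math.pi * r ** 2)

# Doubling the distance halves the density in two dimensions (a linear
# law of diffusion), but quarters it in three (a square law).
ratio_2d = carrier_density(1000, 2.0, 2) / carrier_density(1000, 1.0, 2)
ratio_3d = carrier_density(1000, 2.0, 3) / carrier_density(1000, 1.0, 3)
print(ratio_2d, ratio_3d)  # 0.5 0.25
```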

But if this is the case, then it poses a related problem:

This implies that identical masses would be positioned proximately to each other in the space of time itself, subject to gravity. They would therefore attract each other, through gravity, which could literally cause a collision between two physically proximate moments in time. Since this doesn’t seem to happen, there must be a mechanism that prevents it, though I think dark energy could be the result of gravity from proximate outcomes in time (note we wouldn’t be able to otherwise interact with the associated mass). And I think you can use charge and perhaps other forces to make sure that there’s no meaningful intrusion of mass from one moment in time into the next. For example, just imagine the entire Universe rotated so that the next copy of itself is oriented so as to maximize electrostatic repulsion between the original and the copy, which is positioned at some distance in another space, which I treat as time. This would create repulsion between moments in time, ensuring that only small intrusions occur, which seem to happen at the quantum scale, where energy comes and goes over short periods of time. This would not apply to light, which could be a means of communication through time.

Moreover, because we can’t physically measure complex numbers, I believe that distance in time actually has complex units, for good reason, rooted at least in part in Relativity. That is, I suspect there’s another space altogether that we progress through, as stated above, that actually has units of complex length (see Footnote 7 of A Unified Model of the Gravitational, Electrostatic, and Magnetic Forces).

I’ve argued that multiplicity and waves are physically related, and this makes perfect sense, because waves are by definition a distribution over some space. Now consider not thermodynamic reversibility, but true mathematical reversibility, in the context of time –

How many initial conditions can give rise to the same final outcome?

Because conservation of momentum is typically the only constraint for collisions, the answer is an infinite number of initial conditions. So now imagine moving backwards through time, from a given state of a system, to its possible prior states, and what do you have?

A wave, an actual quantum wave.

So this suggests at least the possibility that the transition from a point particle to a wave is the result of a change that flips a switch on the direction of time itself, causing a particle to propagate forwards through time, as if it were propagating backwards.
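The backward multiplicity above can be made concrete with a toy sampler: fix a final total momentum, and draw prior two-particle states that all sum to it. The 1D setting, unit masses, and the sampling range are assumptions for illustration; the point is that every sampled prior conserves momentum, yet no two need be alike, so moving backwards you get a distribution rather than a point.

```python
import random

def backward_states(p_total, k, seed=0):
    """Sample k pairs of prior momenta (p1, p2) that all sum to the same
    final total p_total (1D, unit masses, for brevity). Any split works,
    so the set of valid priors is infinite; a distribution over them is
    what the note above likens to a wave."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        p1 = rng.uniform(-10.0, 10.0)
        pairs.append((p1, p_total - p1))
    return pairs

priors = backward_states(5.0, 1000)
```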

A Note on Physical Waves

As far as I know, exchanges of momentum between colliding systems are permitted, provided they conserve vector momentum. This suggests multiplicity of outcome, since there are an infinite number of exchanges of momentum between, for example, two colliding particles that will conserve momentum. I happen to quantize basically everything in my model of physics, but it doesn’t matter, because you still get multiplicity of outcome, albeit finite in number. Note that a wave can be thought of as a set of individual, interacting frequencies that together produce a single composite system. I’m not an expert on the matter, and I’m just starting to look into these things, but I don’t believe there’s any meaningful multiplicity to the outcome of a set of juxtaposed frequencies; instead, I believe you end up with the same wave every time. This would make perfect sense if the quantity of momentum possessed by a wave is incapable of subdivision, which would either produce an interaction between two individual waves or not. You could, for example, have wave interference at offsetting points of two waves, each possessing equal quantity, in opposite directions, when and where they interact, producing a zero height at each such point. As the probability of interaction increases, you’d have an increasingly uniform zero wave.
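The zero-interference case at the end of this note can be checked numerically. Modeling "equal quantity, in opposite directions" as two waves of equal amplitude and opposite phase (that modeling choice is mine), their superposition is the zero wave at every sample point:

```python
import math

N = 1000
ts = [i / N for i in range(N)]
f = 5.0  # frequency in cycles per unit time, arbitrary

# Two waves of equal frequency and amplitude, exactly out of phase.
wave_a = [math.sin(2 * math.pi * f * t) for t in ts]
wave_b = [math.sin(2 * math.pi * f * t + math.pi) for t in ts]

# Their superposition nets to zero height at every point.
total = [a + b for a, b in zip(wave_a, wave_b)]
```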

Interestingly, this suggests the possibility that rules of physics actually have complexity, in the sense that you might have primitive rules for some interactions that impose what is in this case binary quantization (i.e., either it happens or it doesn’t). This is alluded to in Section 1.4 of the first link above, where I discuss applications of Kolmogorov complexity to physical systems.

A Note on Physical Entropy

I’m fairly certain I’ve already written about a concept I call “Net Effective Momentum” (NEM), though I’ve frankly written too many articles to find the reference, and I’m trying to do some more work, so I’ll just explain it again. The basic idea is that macroscopic objects, and even fundamental particles, can have motions that aren’t truly rectilinear, and simply appear rectilinear at some scale of observation. The notion of the Net Effective Momentum is intended to capture the effective, macroscopic direction of momentum for a system or particle. For a thermodynamic system, or a system at rest, this should be basically zero, since its motions are presumably highly random, producing no real macroscopic displacement over any sensible period of time, with offsetting motions in basically every direction. In contrast, the NEM of a photon in a vacuum is exactly its true momentum as a vector quantity, because it has exactly one direction of motion.

So the question becomes, how do you measure this as a practical matter? I think I just came up with a decent method of doing exactly that. If you want to calculate the NEM, you find the scale of observation in time and space that minimizes the standard deviation of the observed momenta. In contrast, if you want to uncover the true underlying motions, you select the scale of observation that maximizes the standard deviation of the observed momenta. For a photon in a vacuum, the difference between these two will be the zero vector, for all scales of observation. For a system with some wobble, like a struck baseball, there will be a sizable difference between these two measurements. Specifically, the NEM will produce exactly one velocity vector, whereas the underlying velocities will be numerous and disparate.
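As a rough illustration of this measurement procedure (a sketch, not the actual implementation), here is a 1D version that treats a window-averaged velocity as the observed momentum at that scale, assuming unit mass; the trajectory, the candidate scales, and all names are invented for the example. A drifting trajectory with a fast wobble has maximal spread at the finest scale and minimal spread at the coarsest, which recovers the drift as the NEM:

```python
import math
import statistics

def window_momenta(positions, w):
    """Observed momenta (unit mass assumed) at observation scale w:
    average velocity over consecutive, non-overlapping windows of w steps."""
    return [(positions[i + w] - positions[i]) / w
            for i in range(0, len(positions) - w, w)]

# A 1D trajectory with macroscopic drift (0.1 per step) plus a fast wobble.
n = 1000
positions = [0.1 * t + 0.5 * math.sin(t) for t in range(n)]

scales = [1, 5, 20, 100]
spreads = {w: statistics.pstdev(window_momenta(positions, w)) for w in scales}

# The scale minimizing the spread recovers the NEM (the drift, ~0.1);
# the scale maximizing it exposes the underlying wobble.
nem_scale = min(spreads, key=spreads.get)
micro_scale = max(spreads, key=spreads.get)
```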

This in turn suggests a new notion of entropy, defined by the difference between these two measurements, as follows:

\bar{H} = \sum_{\forall i} \log(||\rho_{n} - \rho_{i}||),

where \rho_{n} is the NEM, and each \rho_{i} is taken from the set of observed momenta that maximize the standard deviation of the observations. Note that for a photon, \bar{H} will be zero, since for a photon, \rho_{n} = \rho_{i}, for all i.
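As a sketch (not the author's actual code), \bar{H} can be computed directly from this definition. Base-2 logarithms, and the convention that a zero difference contributes zero (matching the photon claim above), are assumptions I've made for the example:

```python
import math

def entropy_bar(nem, momenta):
    """\bar{H} = sum over i of log2 ||rho_n - rho_i||, for 2D vectors.
    Base 2 is an assumed choice, so the units are bits. A zero difference
    (the photon case, rho_n == rho_i) is taken to contribute zero,
    an assumed convention consistent with the note above."""
    total = 0.0
    for rho in momenta:
        d = math.hypot(nem[0] - rho[0], nem[1] - rho[1])
        if d > 0:
            total += math.log2(d)
    return total

# A "wobbly" system: NEM at the origin, underlying momenta spread around it.
h = entropy_bar((0.0, 0.0), [(2.0, 0.0), (0.0, 2.0), (-4.0, 0.0)])
print(h)  # log2(2) + log2(2) + log2(4) = 4.0
```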

I’m still working on this, and, more importantly, related matters, but I thought it was worth sharing in advance of further consideration. In particular, in a previous article on entropy as a vector quantity, I proved that \bar{H} is maximized when the arguments to the logarithm function are all the same vector, just like the Shannon entropy is maximized for uniform probabilities. I’m not proving this tonight, but it seems to imply that if \rho_{n} is the zero vector, and each \rho_{i} is non-zero, then the value of \bar{H} is maximized, in that changing the magnitude or direction of any \rho_{i} will reduce the value of \bar{H}. This makes intuitive sense, but it’s not a proof, so I’ll return with a proof that this is or is not the case, hopefully within a few days. And again, I need to think more about this, but it suggests what could be an elegant definition for a thermodynamic system:

A thermodynamic system is any system observed in a frame of reference for which \rho_{n} is the zero vector.

This would imply, for example, that co-moving with a photon, you could treat the photon as a thermodynamic system. Interestingly, this would be a one-state system that never changes position or property in a true vacuum, and therefore requires zero bits of information to describe, in that it simply exists, in that frame of reference. I don’t want to muddy the water with too much philosophy, but it suggests pure existence is more primordial than information, since something that exists and never changes doesn’t require any bits to describe, since it has one state. However, it still needs a label to distinguish it from other systems, but that’s a different consideration. And this is literally the case, because you can imagine a table with two columns: one column has labels, which requires some number of bits, depending upon the number of systems, and the other column describes the state of each system. In the case of a photon observed from a co-moving frame of reference, you could argue that this second column should be blank.

Partitioning Datasets

A while back, I had an idea for an algorithm that I gave up on, simply because I had too much going on, but the gist is this: my algorithms can flag predictions that are probably wrong, so you pop all those rows into a queue, and you let the rest of the predictions go through in what will be real time, even on a personal computer. The idea is to apply a separate model to these “rejected” rows, since they probably don’t work with the model generated by my algorithm. This would allow you to efficiently process the simplest corners of a dataset in polynomial time, and then apply more computationally intense methods to the remainder, using threading and all the normal capacity allocation techniques, which will still allow you to fly in close to real time; you just delay the difficult rows until they’re ready. The intuition is, you stage prediction based upon whether the data is locally consistent or not, and this can vary row by row within a dataset. And this really is a bright-line, binary distinction (just read the paper in the last link), so you can rationally allocate processing capacity in this way: if a prediction is “rejected”, you bounce it to a queue until it reaches some critical mass, and you then apply whatever method you’ve got that works for data that isn’t locally consistent, which is basically everyone else’s models of deep learning.
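The staging idea can be sketched as a small routing loop. All names here are illustrative, not the actual software: the fast model, slow model, reliability test, and batch size are stand-ins for whatever models and rejection criterion you'd actually plug in.

```python
from collections import deque

def staged_predict(rows, fast_model, slow_model, is_reliable, batch_size=4):
    """Stage predictions: rows whose fast prediction looks locally
    consistent go straight through; the rest are queued and handled by a
    heavier model once the queue reaches a critical mass."""
    queue, results = deque(), {}
    for i, row in enumerate(rows):
        pred = fast_model(row)
        if is_reliable(row, pred):
            results[i] = pred  # accepted: real-time path
        else:
            queue.append((i, row))  # rejected: defer to the slow model
            if len(queue) >= batch_size:
                while queue:
                    j, r = queue.popleft()
                    results[j] = slow_model(r)
    while queue:  # flush any remainder
        j, r = queue.popleft()
        results[j] = slow_model(r)
    return results
```

A toy run, with rounding as the fast model and the identity as the (trivially correct) slow model: `staged_predict([0.1, 0.9, 0.5, 0.4], round, lambda r: r, lambda r, p: abs(r - p) < 0.3, batch_size=2)` routes the first two rows through the fast path and queues the last two.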

Another Thought on Waves

If we interpret waves literally, then you can have a wave that doesn’t have an exact location, but instead has a density or quantity function given a location. That is, I can’t tell you where a wave is, though I can delimit its boundaries, and provide a function that, given a coordinate, will tell you the density or quantity at that point. As for whether information is conserved physically, I show it isn’t in some cases, simply because energy is not conserved, unlike momentum, which is, apparently, always conserved, as far as I know. Specifically, gravity causes unbounded acceleration, which violates conservation of energy (macroscopic potential energy just doesn’t make any sense), but you don’t need to violate the conservation of momentum, if you assume that either an offset occurs when gravity gives up energy (e.g., the emission of some other particle or set of particles), or gravity has non-finite momentum to begin with (see Equations (9) and (10) of A Computational Model of Time-Dilation). Gravity is by definition unusual, since it cannot be insulated against, and appears to have the ability to give up unbounded quantities of momentum to other systems. At a minimum, the number of gravitational force carriers that can be emitted by a mass of any size appears to be unbounded over time. As a result, the force carrier of gravity is not light. The same is true of electrostatic charge and magnetism, neither of which can possibly be carried by a photon, given these properties.

If information is conserved in this case, then when a particle transitions from a point particle to a wave, the amount of information required to describe the particle should be constant. Let’s assume, arguendo, that the amount of information required to describe the properties of the particle in question doesn’t change. That is, for example, the code for an electron is the same whether it’s in a wave state, or a point state. If this is the case, then the only remaining property is its position, which is now substituted by a function that describes the density of the electron at all positions in space, which will in turn delimit its boundaries, if it has any (i.e., a density of zero at all points past the boundary). Again, assuming information is conserved, it implies that the amount of information required to describe the density function of the wave will be equal to the amount of information required to describe its position, as a point particle. If it turns out that space is truly infinite, then that function cannot have finite complexity.

Plain English Summary of Algorithms

I imagine if you read this blog, you can probably figure out for yourself how things work, though it’s always nice to have a high-level explanation, since even for a sophisticated reader, this could mean the difference between taking the time to truly understand something, and simply dismissing it in the interest of limited time. As a result, I’ve written a very straightforward explanation of the core basics of my deep learning software, which links to more formal papers that describe it in greater detail. I did this because I’m giving a brief talk at a MeetUp today, and since I already did the work, I figured I’d share it publicly.

Black Tree AutoML (Plain English Summary).

Defining a Wave

It just dawned on me that you can construct a clean definition of a total wave, as a collection of individual waves, by simply stating their frequencies and their offsets from some initial position. For example, we can define a total wave T as a set of frequencies \{f_1, f_2, \ldots, f_k\}, and a set of positional offsets \{\delta_1, \delta_2, \ldots, \delta_k \}, where each f_i is a proper frequency, and each \delta_i is the distance from the starting point of the wave to where frequency f_i first appears in the total wave. This would create a juxtaposition of waves, just like you find in an audio file. Then, you just need a device that translates this representation into the relevant sensory phenomena, such as a speaker that takes the frequencies and articulates them as an actual sound. The thing is, this is even cleaner than an uncompressed audio file, because there’s no averaging of the underlying frequencies –

You would instead define the pure, underlying tones individually, and then express them, physically on some device.
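A minimal rendering of this definition, assuming unit-amplitude sinusoids, offsets measured in seconds, and a 44100 Hz sample rate (all choices mine, not part of the definition):

```python
import math

def total_wave(freqs, offsets, duration, rate=44100):
    """Render a total wave T from paired frequencies f_i and positional
    offsets delta_i: each component is silent until its offset, then
    contributes a unit-amplitude sine for the rest of the signal."""
    n = int(duration * rate)
    samples = [0.0] * n
    for f, delta in zip(freqs, offsets):
        start = int(delta * rate)
        for i in range(start, n):
            t = (i - start) / rate
            samples[i] += math.sin(2 * math.pi * f * t)
    return samples

# Two pure tones: 440 Hz from the start, 660 Hz entering at 0.5 seconds.
wave = total_wave([440.0, 660.0], [0.0, 0.5], duration=1.0)
```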

Vector Entropy and Periodicity

Revisiting the topic of periodicity, the first question I tried to answer is, given a point, how do you find the corresponding points in the following cycles of the wave? If, for example, you have a simple sine wave, then this reduces to looking for subsequent points in the domain that have exactly the same range value. However, this is obviously not going to work if you have a complex wave, or if you have some noise.

You would, in this view, be looking for the straightest line possible across the range, parallel to the x-axis, that hits points in the wave, since those points will, for a perfectly straight line, have exactly equal range values. So then the question becomes, using something like this approach, how do I compare two subsets of the wave, in the sense that one could be better than the other, in that one more accurately captures some sub-frequency of the total wave? This would be a balancing act between the number of points captured, and the error between their respective values. For example, how do you compare two points that have exactly the same range value to ten points that have some noise?
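In the noise-free case, the idea reduces to something very simple: run a level line across the peaks, collect the sample indices it hits, and read the period off the gaps between them. This is a simplified sketch of that special case only (the tolerance and the choice of the maximum as the level are mine); the balancing act for noisy waves described above is left open.

```python
import math

def estimate_period(samples, tol=1e-6):
    """Estimate a wave's period by collecting sample indices whose range
    values match the global maximum (a level line across the peaks),
    then averaging the gaps between consecutive matches."""
    peak = max(samples)
    hits = [i for i, s in enumerate(samples) if abs(s - peak) < tol]
    gaps = [b - a for a, b in zip(hits, hits[1:])]
    return sum(gaps) / len(gaps)

# A sine with a 100-sample period, sampled over four cycles.
wave = [math.sin(2 * math.pi * t / 100) for t in range(400)]
```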

The Logarithm of a Vector

This led me to what is fairly described as vector entropy, in that it is a measure of diffusion that produces a vector quantity. And it seems plausible that maximizing this value will allow you to pull the best sets of independent waves from a total wave, though I’ve yet to test this hypothesis, so for now, I’ll just introduce the notion of vector entropy, which first requires defining the logarithm of a vector.

Defining the logarithm of a vector is straightforward, at least for this purpose:

If \log(v_1) = v_2, then 2^{||v_2||} = ||v_1||, and \frac{v_1}{||v_1||} = \frac{v_2}{||v_2||}.

That is, raising 2 to the power of the norm of v_2 produces the norm of v_1, and both vectors point in the same direction.
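A direct implementation of this definition (the zero-vector convention, and the caveat in the comments, are my additions):

```python
import math

def vlog(v):
    """Logarithm of a vector, per the definition above: the result points
    in the same direction as v, and its norm is log2 of the norm of v.
    Note: norms below 1 give a negative log2, which flips the direction;
    the definition implicitly assumes norms of at least 1. The zero
    vector is mapped to the zero vector, an assumed convention."""
    norm = math.hypot(*v)
    if norm == 0:
        return tuple(0.0 for _ in v)
    scale = math.log2(norm) / norm
    return tuple(x * scale for x in v)

print(vlog((8.0, 0.0)))  # (3.0, 0.0), since 2^3 = 8
```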

This also implies a notion of bits as vectors, where the total amount of information is the norm of the vector, which is consistent with my work on the connections between length and information. It also implies that if you add two opposing vectors, the net information is zero. As a consequence, considering physics for a moment, two offsetting momentum vectors would carry no net momentum, and no net information, which is exactly how I describe wave interference.

Vector Entropy

Now, simply read the wave from left to right (assuming a wave in the plane), and each point will define a vector v_i = (x_i, y_i), in order. Take the vector difference between each adjacent pair of vectors, and take the logarithm of that difference, as defined above. Then take the vector sum over the resultant set of difference vectors. This will produce a vector entropy, and the norm of that vector entropy is the relevant number of bits.

Expressed symbolically, we have,

\overrightarrow{H} = \sum_{\forall i} \log(v_i - v_{i+1}),

where ||\overrightarrow{H}|| has units of bits.
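The steps above can be sketched directly (a self-contained illustration, with base-2 logs and 2D points assumed):

```python
import math

def vlog(v):
    """Vector logarithm: same direction, norm replaced by log2 of the norm."""
    norm = math.hypot(*v)
    scale = math.log2(norm) / norm
    return (v[0] * scale, v[1] * scale)

def vector_entropy(points):
    """\vec{H} = sum over i of log(v_i - v_{i+1}), reading the wave's
    points left to right; the norm of the result is in bits."""
    hx = hy = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:]):
        lx, ly = vlog((x1 - x2, y1 - y2))
        hx, hy = hx + lx, hy + ly
    return (hx, hy)

# Three collinear points with equal steps: each difference is (4, 0),
# contributing vlog((4, 0)) = (2, 0), so H = (4, 0), i.e., 4 bits.
H = vector_entropy([(8.0, 0.0), (4.0, 0.0), (0.0, 0.0)])
```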

Lemma 1. The vector entropy \overrightarrow{H} is maximized when all \Delta_i = v_i - v_{i+1} are equal in magnitude and direction.

Proof. We begin by proving that,

\bar{H} = \sum_{\forall k} \log(||\Delta_k||),

is maximized when the norms of all \Delta_i are equal. Assume this is not the case, and so there is some ||\Delta_i|| > ||\Delta_j||. We can restate \bar{H} as,

\bar{H} = \sum_{\forall k \neq (i,j)}[\log(||\Delta_k||)] + \log(||\Delta_i||) + \log(||\Delta_j||).

Now let L = ||\Delta_i|| + ||\Delta_j||, and let F(x) = \log(L - x) + \log(x). Note that if x = ||\Delta_j||, then F(x) = \log(||\Delta_i||) + \log(||\Delta_j||). Let us maximize F(x), which will in turn maximize \bar{H}, by taking the first derivative of F with respect to x, which yields,

F' = \frac{1}{x} - \frac{1}{L - x}.

Setting F' to zero, we find x = \frac{L}{2}, which in turn implies that F(x) has an extremal point when the arguments to the two logarithm functions are equal to each other. The second derivative is negative for any positive value of L, and because L is the sum of the norm of two vectors, L is always positive, which implies that F is maximized when ||\Delta_i|| = ||\Delta_j||. Since we assumed that \bar{H} is maximized for ||\Delta_i|| > ||\Delta_j||, we have a contradiction. And because this argument applies to any pair of vectors, it must be the case that all vectors have equal magnitudes.

For any set of vectors, the norm of the sum is maximized when all vectors point in the same direction, and taking the logarithm does not change the direction of the vector, which completes the proof. □
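The scalar step of the proof can be sanity-checked numerically. Holding the total L fixed, F(x) = \log(L - x) + \log(x) should peak when the two arguments are equal, i.e., at x = L/2 (base-2 logs assumed here; the base doesn't affect where the maximum falls):

```python
import math

L = 4.0  # the fixed total ||Delta_i|| + ||Delta_j||

def F(x):
    """The scalar objective from the proof: log(L - x) + log(x)."""
    return math.log2(L - x) + math.log2(x)

# Scan candidate splits of L between the two norms; the even split wins.
candidates = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
best = max(candidates, key=F)
print(best)  # 2.0, which is L / 2
```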

Note that this is basically the same formula I presented in a previous note on spatial diffusion, though in this case, we have a vector quantity of entropy, which is, as far as I know, a novel idea, but these ideas have been around for decades, so it’s possible someone independently discovered the same idea.

Compressing Data Over Time

In a previous article, I introduced an algorithm that can partition data observed over time, as an analog of my core image partitioning algorithm, which of course operates over space. I’ve refined the technique, and below is an additional algorithm that can partition a dataset (e.g., an audio wave file) given a fixed compression percentage, and it runs in about 0.5 seconds per 1 second of audio (44100 Hz mono).

Here are the facts:

If you fix compression at about 90% of the original data, leaving a file of 10% of the original size, you get a decent-sounding file that plainly preserves the visual structure of the underlying wave. Compression in this case takes about 0.5 seconds per 1 second of underlying mono audio, which is close to real time (run on an iMac). If you want the algorithm to solve for “ideal” compression, then that takes about 1 minute per 1 second of underlying mono audio, which is obviously not real time, but still not that bad.

What’s interesting is that not only is the audio quality pretty good, even when, e.g., compressing 98% of the underlying audio data, you also preserve the visual shape of the wave (see above). For a simple spoken audio dataset, my algorithm places the “ideal” compression percentage at around 98%, which is not keyed off of any normal notions of compression, because it’s not designed for humans, but is instead designed for a machine to be able to make sense of the actual underlying data, despite compression. So, even if you think the ostensibly ideal compression percentage sounds like shit, the reality is, you get a wave file that contains the basic structure of the original underlying wave, with radical compression, which I’ll admit I’ve yet to really unpack and apply to any real tasks (e.g., speech recognition, which I’m just starting). However, if your algorithms (e.g., classification or prediction) work on the underlying wave file, then it is at least at this point not yet absurd that they would also work on the compressed wave, and that’s what basically all of my work in A.I. is based upon:

Compression for machines, not people.

And so, any function that turns on macroscopic structure, and not particulars, which is obviously the case for many tasks in A.I., like classification, can probably be informed by a dataset that makes use of far more compression than a human being would like. Moreover, if you have fixed capacity for parallel computing, then these types of algorithms allow you to take an underlying signal, compress it radically, and then generate multiple threads, so that you can, e.g., apply multiple experiments to the same compressed signal. Note that this is not yet fully vectorized, though it plainly can be, because it relies upon the calculation of independent averages. So even if, e.g., Octave doesn’t allow for full vectorization in this case, as a practical matter, you can definitely make it happen, because the calculations are independent.

Attached is the applicable code, together with a simple audio dataset of me counting in English.

Counting Dataset

Partitioning Data Over Time

My basic image partition algorithm partitions an image into rectangular regions that maximize the difference between the average colors of adjacent regions. The same can be done over time, for example, when given a wave file that contains audio, other amplitude data, or any time-series generally. And the reasoning is the same, in that minor changes in timing prevent row by row comparison of two sets of observations over time, just like noise in an image prevents pixel by pixel comparison in space, making averaging useful in both cases, since you group observations together, blurring any idiosyncrasies due to small changes in observation.

Below is the result of this process as applied to a simple sine wave in the plane, with the partitions generated by the algorithm on the left, and the resultant square wave produced by replacing each value with the average associated with the applicable region. Also attached is the Octave command line code.
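For intuition, here is a simplified sketch of the averaging step in Python rather than Octave, using fixed-width partitions rather than the boundaries the actual algorithm solves for (which maximize the difference between adjacent region averages):

```python
import math

def partition_average(samples, width):
    """Replace each fixed-width region of a time-series with its average,
    producing a square-wave approximation of the original signal."""
    out = []
    for start in range(0, len(samples), width):
        region = samples[start:start + width]
        avg = sum(region) / len(region)
        out.extend([avg] * len(region))
    return out

# A sine with a 100-sample period, partitioned into 25-sample regions:
# the output is piecewise constant, a square wave tracking the sine.
wave = [math.sin(2 * math.pi * t / 100) for t in range(400)]
square = partition_average(wave, 25)
```

Since each region is replaced by its own mean, the total sum of the signal is preserved, which is one reason averaging blurs idiosyncrasies without destroying macroscopic structure.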

Note that I made what is arguably a mathematical error in the function that partitions the time-series. Attached is a standalone function that accomplishes exactly this. Also attached is an algorithm that does the same given a fixed compression percentage.