An Apparent Paradox In Probability

I believe it was Gödel who made the statement S = “S is false” famous (i.e., a statement that asserts its own falsehood), through his work on the foundations of mathematics, though I’m not completely certain of its history. In any case, it is an interesting statement to consider, because it plainly demonstrates the limitations of logic, limitations that are known elsewhere, e.g., in the case of Turing’s Halting Problem, which similarly demonstrates the limits of computation. You can resolve S by simply barring self-reference, and requiring ultimate reference to some physical system. This would connect logic and reality, and since reality cannot realistically be inconsistent, you shouldn’t have any paradoxical statements in such a system. The Halting Problem, however, does not have a resolution; it is instead a fundamental limit on computing using UTMs. I discovered something similar that relates to the Uniform Distribution in my work on uncertainty; specifically, this is Footnote 4 of my paper, Information, Knowledge, and Uncertainty:

“Imagine … you’re told there is no distribution [for some source], in that the distribution is unstable, and changes over time. Even in this case, you have no reason to assume that one [outcome is more] likely than any other. You simply have additional information, that over time, recording the frequency with which each [outcome occurs] will not produce any stable distribution. As a result, ex ante, your expectation is that each [outcome] is equally likely, [since you have no basis to differentiate between outcomes], producing a uniform distribution, despite knowing that the actual observed distribution cannot be a uniform distribution, since you are told beforehand that there is no stable distribution at all.”

From the perspective of pure logic, which is not addressed in the paper, this is an example where the only answer to a problem is known to be wrong. We can resolve this apparent paradox by saying that there is no answer to the problem, since the only possible answer is also known to be wrong. Nonetheless, it is a bit unnerving, because logic leaves you with exactly one possibility, which is wrong, suggesting the possibility of other problems that have superficially correct answers that are nonetheless wrong, because of non-obvious, and possibly logically independent, considerations. This particular case is not such a problem, because it is actually resolved by simply saying there is no distribution, which is perfectly consistent with the description of the problem, which assumes the distribution is not stable. The point being, that you could have superficially correct answers to some other problem that are ultimately wrong due to information that you simply don’t have access to.

Probability, Complexity, and Indicia of Sentience

The market tanked about 4.0% this Tuesday, and I naturally searched for causes, like everyone else does, because the event is unlikely, in the sense that most trading days don’t produce such large moves in either direction. But I also acknowledge that it could have been a random outcome, in the most literal sense, that the motion of the S&P 500 on that day was determined by repeated rolls of a random variable, and that’s not something you can simply rule out. Nonetheless, the intuition is there, that low probability events are connected to causation, in the sense that they shouldn’t happen without an intervening cause, simply because they’re so unlikely. This view is however incomplete, and it’s something I’ve commented on in the past: specifically, what you’re really looking at is a property that has a low probability, rather than the event itself. In the case of an asset price, we’d be looking at paths that go basically straight up or straight down, which are few in number compared to the rest of the paths, which generally meander in no particular direction. This can be formalized using one of the first original datasets I put together, which generates random paths that resemble asset prices. In this dataset, there will be exactly two extremal paths, one that goes maximally up and one that goes maximally down, and for that reason those two paths are special: from a simple counting perspective, there are exactly two of them out of many.
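To make the counting point concrete, here is a minimal sketch in Octave. This is not the original dataset, just an illustration with placeholder parameters: it generates random ±1-step paths and counts how often the two extremal paths appear.

    % Minimal illustration (not the original dataset): random +/-1 step paths,
    % of which only 2 of the 2^n possible paths are extremal (straight up or straight down).
    n = 20;                                     % steps per path
    num_paths = 100000;                         % number of randomly generated paths
    steps = 2 * (rand(num_paths, n) > 0.5) - 1; % each row is a sequence of +/-1 steps
    paths = cumsum(steps, 2);                   % cumulative sums resemble price paths
    is_extremal = all(steps == 1, 2) | all(steps == -1, 2);
    printf("Extremal paths observed: %d of %d (chance per path: 2/2^%d = %g)\n", ...
           sum(is_extremal), num_paths, n, 2 / 2^n);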

To continue this intuition with another example, consider sequences generated by repeated coin tosses, e.g., HTH, being the product of flipping heads, then tails, then heads. The probability of any particular sequence is simply 2^{-n}, where n is the length of the string, but this fails to capture the surprisal generated by, e.g., producing a sequence that is billions of entries long and comprised of only heads. If this really happened, you’d be astonished, but it is nonetheless no more or less likely than any other sequence of equal length. As a consequence, measuring surprisal using the Shannon Entropy produces the same result for every outcome, treating them all equally, because all sequences have equal probabilities. The intuition at work here can instead be formalized using the Kolmogorov Complexity. Specifically, consider all strings of a given length, and calculate the Kolmogorov Complexity of each such string, producing a distribution of complexities. Now your surprisal can be described objectively, since the probability of generating, e.g., an alternating sequence (HTHTHTH …) of any appreciable length is also low, just like generating a uniform sequence (HHHHHHH …) of any appreciable length. The point here is that what produces surprisal in at least some cases is the Kolmogorov Complexity of an observation, in that large, highly structured objects have a low probability over the distribution of complexities, since most strings are Kolmogorov-random (i.e., have high complexities).
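To see why, recall the standard counting argument: there are at most 2^{n-k} - 1 binary programs of length less than n - k, so at most 2^{n-k} - 1 of the 2^n strings of length n can have Kolmogorov Complexity below n - k. The fraction of strings that can be compressed by k or more bits is therefore less than 2^{-k}; e.g., fewer than one string in a thousand can be compressed by 10 or more bits, regardless of n. Highly patterned strings like (HHHHH …) and (HTHTHT …) sit in that thin, low-complexity tail, which is what formalizes the surprisal.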

There is, moreover, a plain connection between complexity and sentience, because anecdotally, that is generally how large and highly structured objects are produced, i.e., through deliberate action, though this intuition is at times erroneous, since, e.g., gravity produces simply gigantic, highly structured systems, like our Solar System, and gravity is not in any scientific sense sentient; it instead has a simple mechanical behavior. However, there is nonetheless a useful, intuitive connection between the Kolmogorov Complexity and sentience, in that as you increase Kolmogorov Complexity from the mundane (e.g., (HHHHH …) or (HTHTHT …)) to the elaborate, but nonetheless patterned, it becomes intuitively more difficult to dismiss the possibility that the sequence was produced by a sentient being, as opposed to being randomly generated. Just imagine, e.g., someone telling you that a randomly generated set of pixels produced a Picasso: you would refuse to believe it, justifiably, because highly structured macroscopic objects just don’t get generated that way.

And I’ve said many times, it is simply not credible in this view that life is the product of a random sequence, because that assumption produces probabilities so low that there’s simply not enough time in the Universe to generate systems as complex as living systems. At the same time, an intervening sentient creator only reproduces the same problem, because that sentience would in turn require another sentient creator, and so on. The article I linked to goes through some admittedly imprecise math, that is nonetheless impossible to argue against, but to get the intuition, there are about 3 billion base pairs in human DNA. Each base pair is comprised of two selections from four possible bases: adenine (A), cytosine (C), guanine (G), or thymine (T). Ignoring restrictions, basic combinatorics says there are 4^2 = 16 possible base pairs. Because there are 3 billion base pairs in human DNA, the total number of possible genetic sequences is 16^n, where n is 3 billion. This is a number so large it cannot be calculated on most machines (e.g., Google’s calculator cannot evaluate it), and for context, the number of seconds since the Big Bang is about 10^{17} (i.e., a number with 18 digits), whereas the number of possible DNA sequences has at least 3 billion digits. Note that while this is plainly rough arithmetic, the number of possible base pairs does not have to be 16 to produce this problem, since if the number of possible base pairs is anything greater than 1 (and it obviously is greater than 1), then you still have a number with roughly billions of digits.
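As a quick sanity check on the scale, you don’t need to compute 16^n itself; the digit count follows from logarithms alone (this snippet is my own illustration, not taken from the article):

    % Decimal digit count of 16^n for n = 3 billion, via logarithms, since the
    % number itself is far too large to represent directly.
    n = 3e9;
    num_digits = floor(n * log10(16)) + 1;
    printf("16^(3 billion) has %d decimal digits (roughly 3.6 billion)\n", num_digits);
    % Compare: the number of seconds since the Big Bang has only about 18 digits.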

This is a joke of an idea, and instead, I think it is more likely that we just don’t understand physics, and that certain conditions can produce giant molecules like DNA, just like stars produce giant atoms. There is also a branch of mathematics known as Ramsey Theory, which is simply astonishing, and which imposes structure on real-world systems that simply must be there as a function of scale. There could be unknown results of Ramsey Theory, there could be unknown physics, probably both, but I don’t need to know what’s truly at work, since I don’t think it’s credible to say that, e.g., DNA is “randomly generated”, as that event is so unlikely, as popularly stated, that the claim is unscientific.

Finally, in this view, we can make a distinct and additional connection between complexity and sentience, since we all know sentience is real, subjectively, and so it must have an objective cause, which could have something to do with complexity itself, since sentience seems to exist only in complex systems. Specifically, the idea is that once a system achieves a given level of complexity, it gives rise to sentience, as an objective phenomenon distinct from, e.g., the body itself. This is not unscientific thinking at all, since it should be measurable, and we already know that certain living systems give rise to poorly understood, yet measurable, fields. Sentience would in this view be a field generated by a sufficiently complex system, one that produces what we all know as a subjective experience of reality itself.

On Dissipating Charges

I noticed a long time ago that electrostatic attraction and repulsion seem fundamentally different from the dissipation of a charge in the form of a bolt, but I dropped the work (I have too much going on). Specifically, when you have a surplus of charges in one system and a deficiency in the other, you get the ordinary mutual acceleration of both systems (e.g., a balloon rubbed on someone’s hair will cause their hair to stand up towards the balloon).

Now consider, for contrast, a bolt dissipating from a cloud. This is definitely due to the accumulation of a large amount of charge in the cloud. But if it were an explosion, you would have dissipating charges moving in all directions, which is exactly what you get from a kinetic explosion (e.g., a bomb going off inside a container). Instead, what you see is a macroscopically contiguous system that we know is made of charges.

There are two obvious problems with this: the first is that explosions should cause dissipations that increase entropy, and the second is that the charges should be repelling each other, not following the same path. This suggests the possibility that a bolt is a fundamentally different state of a set of electrons, something along the lines of a macroscopic wave. This would solve both problems, since it would travel along a single path because it is a single system, and it wouldn’t repel itself, because it’s one gigantic charge. Intuitively, it’s like a tau particle, in that it’s a massive single charge that is obviously not stable.

There’s also the question as to why this would happen, and one simple explanation is that you have electrons leaving one configuration and entering another. In contrast, the current in a wire is a set of electrons all traveling through what is effectively a single orbital that extends through the wire, the “valence orbital”, which isn’t really particular to any given atom. Where you have a break in the wire, you have what is basically a lightning bolt, again consistent with the idea that when an electron moves within one configuration of charges, it behaves like a free electron, i.e., a single particle that changes position. When it changes configuration, you instead have a discrete change, e.g., jumping from a cloud to the ground, or from one valence to another, and it behaves like a bolt, which is just not the same as a free electron, since it is plainly comprised of more than one electron. If I had to guess, it travels at exactly c (i.e., the speed of light) when traveling as a bolt, whereas as a free electron it does not, and again, I think this is because it is simply not the same state of matter as a free electron. Though a “lightning bolt” is comprised of many individual components, which would be bolts in this view, jumping from one configuration to the next, its overall velocity could be, and is in fact, slower than c, for the simple reason that it travels as a free electron within any given configuration, and only as a bolt (i.e., at c) between configurations.

Final Optimization Algorithm

I’ve finalized the N-Dimensional optimization algorithm I’ve been writing about lately, and this instance of it is set up to sort a list, though it can do anything. You need only remove the sections of code that prevent selection with replacement, and change the “eval” function (i.e., the function being optimized) to your liking. The reason it is set up to sort a list is to demonstrate the power of the algorithm, since there are 13! ≈ 6.2 billion permutations of a list with 13 items, only two of which are sorted (ascending and descending), and this algorithm can successfully find the sorted solution. As a general matter, even if the probability of finding an answer is remote, it can find it. The algorithm is described in a short paper, Universal Optimization, which includes this code.
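As a rough sense of the size of the search space (my own back-of-the-envelope, not part of the linked code):

    % Size of the search space when sorting a 13-item list by search alone.
    perms_13 = factorial(13);   % 6,227,020,800 permutations
    p_sorted = 2 / perms_13;    % only the ascending and descending orders count as sorted
    printf("Permutations: %d; chance of hitting a sorted order at random: %g\n", perms_13, p_sorted);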

Another Updated N-Dimensional Optimization Algorithm

This is the final version, and as set up here, the optimization algorithm balances a set of weights on a beam, with corresponding symmetrical values intended to be equal. So far, it has found an exact solution every time I’ve run it. This same algorithm can also solve interpolation problems, and any other goal-based problem. For interpolations, I’ve run it with up to 12 variables, and the performance is excellent and fast.

Updated N-Dimensional Optimization

Here’s the finished product; it can optimize over N dimensions, and you can subdivide them however you like. In the attached example, the problem is broken into 3 x variables and 3 y variables, producing a curve in 3-space, but this could also be treated as a single six-variable function, and the algorithm is indifferent. The original curve is on the left, the interpolated curve is on the right, and this took just over two minutes to run. The results are awesome. Note you could also use this to find a generalized goal state, as I’ve simply set this example up to find the coefficients of a polynomial, but again, the method is general. As a consequence, it is a generalized N-Dimensional state space algorithm that runs very quickly and has a deterministic runtime, though you are of course not guaranteed any particular results.

N-Dimensional Optimization

I wrote an algorithm that models interference between multiple iterations of the same path. The basic idea is simple: you have a random path, and you generate it some number of times. However, once you traverse a particular location in the path, that point becomes more likely. As a consequence, over time, the probabilities become non-uniform (obviously they start out uniform). When I articulated this, I noted that this is plainly an optimization algorithm, since you can update the probabilities of a given point (treated as a domain value) using the distance to some goal state (in the range). This can be done in N dimensions, by simply having N paths all doing exactly this. Moreover, you can vectorize many of the steps for this N-dimensional case easily in Octave. Here’s the two dimensional code (on dropbox).
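For anyone who can’t open the Dropbox link, here is a minimal one-dimensional sketch of the reinforcement idea as described above; the grid, update rule, and parameters are my own placeholders, not the code from the link.

    % Minimal 1-D sketch of the reinforcement idea (my own reading of it; the grid,
    % update rule, and parameters are placeholders, not the Dropbox code).
    eval_f = @(x) abs(x - pi);               % example objective: distance to a goal value
    xs = linspace(-10, 10, 1001);            % discrete domain values
    p = ones(size(xs)) / numel(xs);          % probabilities start out uniform
    best_x = xs(1); best_err = eval_f(best_x);
    for iter = 1:5000
      idx = find(cumsum(p) >= rand(), 1);    % sample a domain value from the current distribution
      err = eval_f(xs(idx));
      p(idx) = p(idx) * (1 + 1 / (1 + err)); % traversed values that score well become more likely
      p = p / sum(p);                        % renormalize
      if err < best_err
        best_err = err; best_x = xs(idx);
      end
    end
    printf("Best x found: %g (error %g)\n", best_x, best_err);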

Sort-Based Classification

I’ve introduced three algorithms that will form the basis of Black Tree Massive: one is a more efficient but mathematically identical version of my Supervised Delta Algorithm, and the other two are sort-based classification algorithms with runtimes that completely eclipse even my prior work, which already has unprecedented runtimes, and is itself almost certainly the fastest on the market. The Supervised Delta Algorithm is rooted in my paper, Analyzing Dataset Consistency, in which I prove that polynomial time algorithms can produce literally perfect clustering and classification, under certain reasonable assumptions. This typically translates into very high, and at times perfect, accuracy for benchmark datasets. In another paper, Sorting, Information, and Recursion, I proved a surprising connection between sorting and the nearest neighbor algorithm: specifically, that if a list of vectors is sorted, then adjacent entries in the sorted list are nearest neighbors of each other (see Theorem 2.1). As a consequence, if you begin at a given entry in a sorted list, and proceed to adjacent entries in both directions, you will encounter the set of points that are included in some sphere in the Euclidean space of the dataset. This must be true, since the distance between two vectors in a sorted list can never exceed the sum of the distances over the adjacent pairs between them, because that sum is, by the triangle inequality, either equal to or greater than the straight-line distance between the two vectors. As a result, you can cluster data by simply collecting vectors in a sorted list, proceeding in both directions from a given index in the list. Using this method, you don’t have to calculate the norm of the difference between a given vector and all other vectors in order to find the vectors that are within a given sphere (i.e., within a given radius). This saves significant time during processing, leading to algorithms that can classify approximately 4,500 testing rows against 25,500 training rows in roughly 3 seconds (running on a MacBook Air).
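To illustrate the mechanic in isolation, here is a toy Octave sketch of collecting points from a sorted list. This is my own illustration under the assumption that rows are sorted by Euclidean norm; it is not the actual Black Tree code, which is in the notes linked below.

    % Toy sketch: walking outward from an anchor row of a sorted list, the running
    % sum of adjacent-pair distances upper-bounds the distance back to the anchor
    % (triangle inequality), so every row collected before the sum exceeds the
    % radius is guaranteed to lie within the radius.
    X_raw = rand(1000, 4);                          % placeholder dataset
    [~, order] = sort(sqrt(sum(X_raw.^2, 2)));      % sort rows by Euclidean norm
    X = X_raw(order, :);
    anchor = 500; radius = 0.25;
    cluster = anchor;
    total = 0;
    for j = anchor:(rows(X) - 1)                    % walk forward through the sorted list
      total = total + norm(X(j+1,:) - X(j,:));
      if total > radius, break; end
      cluster(end+1) = j + 1;
    end
    total = 0;
    for j = anchor:-1:2                             % walk backward
      total = total + norm(X(j,:) - X(j-1,:));
      if total > radius, break; end
      cluster(end+1) = j - 1;
    end
    % Every index in "cluster" is within "radius" of the anchor, found without
    % computing the norm of the difference against every other row.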

Because the runtimes are so fast, you can run the algorithms hundreds of times and still produce a practical, very fast overall runtime, which in turn allows you to calculate a meaningful measure of confidence that I introduced in yet another paper, Information, Knowledge, and Uncertainty. This process requires running classification a few hundred times in order to generate a distribution of confidence intervals, which are then applied to a final prediction step. Ideally, accuracy should increase as a function of confidence, and this is empirically what happens. This allows you to take the solid raw accuracies produced by the initial sort-based classification algorithms (typically about 80%), and filter predictions, increasing the required confidence until you achieve a maximum (or desired) accuracy (typically maximized between 90% and 100%). The net result is very high accuracy, making use of algorithms that can handle simply enormous datasets, running on a consumer device.
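In the abstract, the filtering step looks something like the following generic sketch, with simulated confidence scores standing in for the measure defined in the paper:

    % Generic sketch of confidence filtering (not the measure defined in the paper):
    % given a confidence score and a correctness flag per prediction, raise the
    % required confidence and measure accuracy over the predictions that survive.
    conf = rand(5000, 1);                               % placeholder per-prediction confidence
    correct = rand(5000, 1) < (0.6 + 0.35 * conf);      % placeholder: accuracy rises with confidence
    for threshold = 0:0.1:0.9
      keep = conf >= threshold;
      printf("threshold %.1f: %4d predictions kept, accuracy %.3f\n", ...
             threshold, sum(keep), mean(correct(keep)));
    end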

This is likely a fundamental breakthrough in computer science and A.I.

Here are links to notes regarding the algorithms in question, each of which includes the relevant Octave code:

Massive Supervised Delta Classification

Massive Supervised Classification

Massive Unsupervised Modal Classification