Black Tree AutoML Presentation

I’ll be presenting my AutoML software, Black Tree, during a Meetup event, which you can dial into here. I’ll be walking through basically everything in some detail, so it should be an interesting presentation, with an opportunity to ask questions at the end.




On the Probability of Independent Variables

Imagine you have N lightbulbs, and each lightbulb can be either on or off. Further, posit that you don’t know what it is that causes any of the lightbulbs to be on or off. For example, you don’t know whether there’s a single switch, or N switches, and moreover, we leave open the possibility that there’s a logic that determines the state of a given bulb, assuming the state of all other bulbs. We can express this formally, as a rule of implication F from subset C \subset S to S, such that given any C, the state of S is determined.

This is awfully formal, but it’s worth it, as you can now describe what you know to be true: there’s basically no way there are N switches for any N lightbulbs in the same room.

The rule of implication F is by definition a graph in that if the state of lightbulb A implies the state of lightbulb B, then you can draw an edge from A to B. There is only one graph that has no rule of implication, i.e., the empty graph. As a consequence, the least likely outcome, given no information ex ante, is that there is no correlation between the states of the lightbulbs. This would require N independent switches.

As a general matter, given N variables, the least likely outcome is that either all variables are perfectly independent, or all variables are perfectly dependent, since there is only one empty graph, and only one complete graph. It is therefore more reasonable to assume some correlation between a set of variables, than not, given no information ex ante. This is similar to Ramsey’s theorem on friends and strangers, except it’s probabilistic, and not limited to a single size graph, and is instead true in general.

More generally, this implies that the larger a set of random elements is, the less likely it is the elements are independent of each other. To test this empirically, I generated a dataset of 1,000 random vectors, with 1,000 randomly generated classifiers, 1, 2, 3, and 4, and tried to predict the classifier of each vector. I ran Black Tree‘s size-based confidence prediction algorithm on this dataset, which should produce an accuracy of 25%, since these are random vectors, unless something unusual is going on. Unfortunately, this is not included in the free version of Black Tree, but you can do something similar, since it pulls clusters near a given vector and then uses the modal class of the cluster as the predicted class of the given vector. The bigger the cluster, the higher the confidence metric. However, the bigger the cluster, the less likely the vectors in the cluster are to be independent of each other, according to this analysis. And as it turns out, the accuracy actually increases as a function of cluster size, starting out around 27%, and peaking around 70%. The average accuracy for the 90th percentile confidence is about 34%. This makes no sense, in the absence of a theory like the one above. Here’s the code I used to generate the dataset:

clear all

num_rows = 1000;
N = 1000;

dataset = 10*rand(num_rows,N);
dataset(:,N+1) = randi([1 4],[num_rows 1]);

file = “/Users/charlesdavi/Desktop/random_vectors.txt”;

Here are the results, with accuracy plotted as a function of confidence, together with the distribution of accuracies at 90% confidence. I ran this algorithm 5,000 times, over 5,000 random subsets of this dataset, and this is the result over all iterations, so it cannot be a fluke outcome. It seems instead the theory is true empirically.


Note that this is not causation. It is instead best thought of as a fact of reality, just like Ramsey’s friends and strangers theorem. Specifically, in this case, it means that information about some of the predictions is almost certainly in their related clusters, because the vectors in the cluster are so unlikely to be independent of one another, despite the fact they were randomly generated. In fact, because they were randomly generated, and not deliberately made to be independent of one another, the more likely outcome is that the vectors are not independent. Therefore, we can use the classifiers of the vectors in the cluster to predict the classifier of the input vector. The larger the cluster, the less likely the vectors are to be mutually independent, resulting in accuracy increasing as a function of cluster size.

As a general matter, if you deliberately cause a set of systems to be totally independent or totally dependent, then those systems are of course not random. If however, you allow the systems to be randomly generated, then the most likely case is that they’re partially dependent upon each other. This is most certainly counterintuitive, but it makes perfect sense, since allowing a random process to control the generation of the systems causes you to lose control of their relationships, and the most likely outcome is that their relationships are dependent upon one another.

Object Classification by Texture

I’ve already put the subject of single object classification to bed, however, it dawned on me that there’s an entirely different approach to object classification based upon the properties of the object. We can already say what an object is, that’s not interesting anymore. Instead, what I’m proposing is to describe the object based upon its properties, in particular, texture, which is discernible through image information. Specifically, you have a dictionary of objects, and each object in the dictionary has columns of descriptors such as metallic, matted, shiny, ceramic, and you can of course include color information as well. Then, once an observation has been made, you restrict the cluster returned to those objects that satisfy the observed properties. For example, if the object observed is shiny and ceramic, then less than all objects in the dictionary will be returned. The point is not to say necessarily what the object is, though that would be interesting as well, and instead, to build what is in essence a conceptual model of the object within the machine, beyond a single label.

The applications here are numerous, from unsupervised association between objects based upon visceral information, to potentially new ways to identify objects in a scene based upon small portions of the objects. This latter approach would probably allow you to quickly identify things like leaves, birds, bark, small stones, all of which have unusual textures. Though it would also allow you to make abstract associations between scenes based upon the textures found within them, which could be the seed of real creativity for machines, since it is by definition an abstract association, that is not necessarily dependent upon subject matter or any individual object.

Vectorized Edge Detection

I’m pretty sure I came up with this a long time ago, but I was reminded of it, so I decided to implement it again, and all it does is find edges in a scene. However, it’s a cheap and dirty version of the algorithm I describe here, and the gist is, rather than attempt to optimize the delimiter value that says whether or not an edge is found, you simply calculate the standard deviation of the colors in each row / column (i.e., horizontal and vertical) in the image. If two adjacent regions differ by more than the applicable standard deviation, a marker is placed as a delimiter, indicating part of an edge.

The runtime is insane, about 0.83 seconds running on an iMac, to process a full color image with 3,145,728 RGB pixels. The image on the left is the original image, the image in the center is a super-pixel image generated by the process (the one that is actually analyzed by the delimiter), and the image on the right shows the edges detected by the algorithm. This is obviously fast enough to process real-time information, which would allow for instant shape detection by robots and other machines. Keep in mind this is run on a consumer device, so specialized hardware could conceivably be orders of magnitude faster, which would allow for real-time, full HD video processing. The code is below, and you can simply download the image.

Andamanese MtDNA

Another one-off, I found a few Andamanese mtDNA genomes in the NIH Database, and the results are consistent with what I’ve read elsewhere, which is that they’re not closely related to Africans, despite being black – turns out being racist is not scientifically meaningful, shocking. However, one interesting twist, it looks like they are closely related to the Iberian Roma and the people of Papua New Guinea. You can find links to the database and the code that will allow you to generate this chart in my paper, A New Model of Computational Genomics.

Astonishingly, they’re monotheistic, and since they’ve been there for about 50,000 years, their religion could be the oldest monotheistic religion in the world, and I’ve never heard anyone even bother to note that they’re monotheistic, which is itself amazing. Moreover, their creation story involves two men and two women, thereby avoiding incest, suggesting they have a better grasp of genetics than Christians – shocking.

Munda People mtDNA

I’ve been doing a lot of one-off inquiries in my free time into genetic lineage, and I’m basically always astonished, though this time it’s really something. Specifically, I found this Munda Genome (an ethnic group in India), and it turns out they’re closely related to the Basque and the Swedes, as 50% of the Basque in my dataset and 41% of the Swedes in my dataset are a a 92% match with this genome. You can find links to the dataset and the code to produce this chart in my paper, A New Model of Computational Genomics [1]. The Northern Europeans really do have incredibly interesting maternal lineage, and you can read more about it in [1], as the Finns in particular are truly ancient people, with some of them closely related to Denisovans.

Modeling Sexual Reproduction and Inheritance

I’ve put together a very simple model of genetic inheritance that mimics sexual reproduction. The basic idea is that parents meet randomly, and if they satisfy each other’s mating criteria, they have children that then participate in the same population. Individuals die at a given age, and the age of mortality is a function of genetic fitness, and overall it’s already pretty good, though I have planned improvements.

However, that’s not what’s important, though I will follow up tonight with more code, and instead, I stumbled upon what looks like a theorem of maximizing fitness through sexual reproduction. Specifically, if individuals mate by setting a minimum total fitness that is approximately the same as theirs (i.e., my mate is roughly as fit, overall, as I am), and that mate is at the same time maximally genetically distinct from the individual in question, then you maximize the fitness of the population. So to be crude, if I’m short, you’re tall, but neither of us are notably short or tall. As a consequence, we have roughly the same total fitness, but maximally different individual traits.

Just imagine for example two roughly unfit people, that are nonetheless maximally distinct from one another, in that if an aspect of one individual is minimally fit, the other is maximally fit, though neither are fit in total. Their children could inherit the best of their respective genomes, provided they have a sufficiently large number of children. As a consequence, there is at least a chance they have a child that is categorically more fit than either of them. Of course most of the outcomes will not be the ideal case where the child inherits the best of both parent’s genes, but compare this to two identically unfit people (i.e., incest between twins). In that case, the chance the child will be superior to the parents is much lower, and there is also a chance the child will be inferior. As a consequence, the case where the parents are maximally different, creates at least the possibility of superior offspring.

This will be true of all levels of fitness in a population, and as a consequence, a population that seeks out diverse mates, should be categorically superior to a population that seeks out identical mates, provided the environment kills off the weakest of all populations. Moreover, without the possibility of upside, you leave open the possibility that you fail to beat entropy, and as a result, attempting to maintain the status quo could lead to annihilation. That is, there’s definitely error in genetic reproduction, and if the upside of evolution doesn’t outstrip the downside of error, it’s at least plausible that the species just dies due to failure to evolve.

You can also think about this in terms of financial markets, specifically, by maximizing diversity between mating partners, but being reasonable with respect to total fitness, you increase the spread of outcomes, creating both upside and downside. If the environment kills the downside (which it definitely does in Nature), you leave the upside. In contrast, a homogenous strategy would be at best where it started, creating no upside, and leaving the downside for dead. If there’s error during reproduction, which there definitely is, then that alone could wipe out a homogenous strategy. The net point being, risk is necessary for survival, because without it, you don’t produce the upside that is necessary for survival to beat error, which is analogous to beating inflation. Continuing with this analogy, there’s a window of risk, within which, you get free genetic money, because Nature is so skewed against weakness, risk creates net returns within that window.

We can make this more precise by considering two genomes A and B, and assuming that alleles are selected from either A or B with equal probability, and corresponding to traits in the child of A and B. If we list the genes in order, along the genomes, we can assign each a rank of fitness at each index. Ideally, at a given index, we select the gene that has the higher of the two rankings (e.g., if A(i) > B(i), the child has the trait associated with A(i)). Because this process is assumed to be random, the probability of success equals the probability of failure, in that the better of the two genes is just as likely to be selected as the lesser of the two genes. As a consequence, there will be an exact symmetry in the distribution of outcomes, in that every pattern of successes and failures has a corresponding compliment. This is basically the binomial distribution, except we don’t care about the actual probabilities, just the symmetry, because the point again above is that lesser outcomes are subject to culling by the environment. Therefore, Nature should produce a resultant distribution that is skewed to the upside.

Note that this is in addition to the accepted view regarding the benefits of diversity, which is that diseases can be carried by mutated bases or genes, and because genetic diseases are often common to individual populations, mating outside your population reduces the risk of passing on the genetic diseases to children. It is instead, at least for humanity, a reproductive strategy that made sense a long time ago, when we were still subject to the violence of Nature. Today, diversity is instead more sensibly justified by the fact that it reduces the risk of inheriting genetic diseases. Note that wars are temporary in the context of human history, which is the result of hundreds of thousands of years of selection, and as such, mating strategies predicated upon the culling effect of Nature or other violence don’t really make sense anymore.

Here’s the code that as I noted is not done yet, but gets you there on intuition:

Solomon Islands mtDNA

I read an article about recent upheaval in Melanesia, specifically, that the people of Bougainville are ethnically closer to the people of the Solomon Islands, than they are to the people of Papua New Guinea. The people of Papua New Guinea are a 99% match to the Iberian Roma, and both are in turn very closely related to Heidelbergensis. See Section 6.1 of A New Model of Computational Genomics [1]. This sounds superficially implausible, but the Roma are from India, and so it’s perfectly sensible that many of them settled in Papua New Guinea.

I thought I would look into the genetics of the people of the Solomon Islands, and I found this genome in the NIH Database. It turns out the maternal line of this Solomon Island individual is completely different from that of the people of Papua New Guinea, and in fact, the individual doesn’t match to a single genome from Papua New Guinea. These are completely different people, despite superficial similarities.

Above is a chart that shows the distribution of 99% matches by population for the Solomon Islands genome. The x-axis shows the population acronym (you can find the full population name at the end of [1]), and the y-axis shows the percentage of the population in question that is a 99% match to the Solomon Islands genome. You can plainly see that this individual is closely related to many Mexicans and Hungarians. This is a truly complex world, and our notions of race are just garbage. You can find the software to build this graph in [1], and there are links to the full mtDNA dataset as well.

Saqqaq mtDNA

I downloaded this Saqqaq genome from the NIH and built clusters of genomes that are a 99% match to the Saqqaq genome, and as you can plainly see, many modern Europeans are a 99% match. The Saqqaq lived in Greenland about 4,500 years ago. I’ve seen some reports lately questioning the origins of native peoples in the Western Hemisphere, and in my work, it’s clear, that many of them are of the same maternal line as Europeans. The exception is the Mayans, who are instead related to the Iberian Roma. Here’s the distribution of 99% matches to the single Saqqaq genome. The y-axis gives the percentage of genomes in the population that are a 99% match.

Global Alignments for Heidelbergensis

I ran an algorithm on the full dataset that finds the best global alignment when comparing two genomes. I applied this to a complete Heidelbergensis mtDNA genome, comparing it to all other mtDNA genomes in the dataset below (405 complete genomes), and it turns out, you get exactly the same population using the default NIH alignment. See Section 1.3 of Vectorized Computational Genomics [1], for a discussion of the default NIH alignment. Note that the acronyms for the population names in the graphs below can also be found at the end of that paper. Specifically, on the left below is the distribution of genomes that are at least a 96% match with Heidelbergensis using the default NIH alignment, and on the right is the distribution of genomes that are at least a 96% match with Heidelbergensis using the global alignment that maximizes the number of matching bases. The latter is achieved by shifting the genome one index at a time, and counting matching bases. Because mtDNA is circular, the bases that go past the end of the genome are pushed back to the beginning in a loop. The obvious conclusion is that these populations really are anomalously closely related to Heidelbergensis.

However, this is not true for lower threshold values below 96%, as the global alignment algorithm quickly produces a much more dense distribution for all populations. For example, below are the same two distributions produced using a minimum 80% match to Heidelbergensis. As you can plainly see, the global alignment (right) is much more dense, with nearly 100% of all populations at least an 80% match to Heidelbergensis. The plain takeaway here is that using the default NIH alignment is much more meaningful, because it filters the results, forcing acknowledgment of insertions and deletions, which again, can cause drastic changes to morphology and behavior.

Also, the nearest neighbor of the Heidelbergensis genome is unchanged, whether you use the default NIH alignment, or search for the globally best alignment, suggesting again, it’s more trouble than it’s worth to search for a globally best alignment, unless you’re deliberately searching for insertions and deletions within a pair of genomes. Specifically, it takes about an hour to find the nearest neighbor of every genome using the globally best alignment, whereas it takes about 25 seconds using the single default NIH alignment. Finally, I’ll note that using the default NIH alignment allows you to reliably predict ethnicity using mtDNA alone (i.e., only the maternal line). See [1] generally. This is actually astonishing, and though I haven’t tested the question, given the distribution above on the right, I would wager you’re not going to get good results using the best global alignment, since it causes all genomes to be roughly the same, precisely because it ignores insertions and deletions.

Here’s the dataset and the code: