Modeling Credit Using ML

My AutoML software, Black Tree AutoML, can already predict credit outcomes with no specialization at all. But it just dawned on me that, with a bit of work, you can use any clustering and classification system to model credit in a meaningful way. First, let’s define the relevant properties of a credit, which are its assets and its liabilities; for simplicity, we’ll include the equity capital of the credit in its liabilities. This allows us to express a credit, at a given moment in time t, as a vector h(t) = (a_1, \ldots, a_k; l_1, \ldots, l_m), where each a_i is the value of assets of type i owned by the credit, and each l_j is the value of liabilities of type j owed by the credit. Because this representation is so abstract, it covers not only corporates, but also SPVs and individuals.

Now let’s posit a dataset of credits S = \{h_1, \ldots, h_M\}, that were sampled over time. That is, each h_i is actually a time-series for a given credit, and we can evaluate h_i(t) for any t within some ordinal interval, though you could also consider specific periods of time as well. The overall gist is that we have observed and recorded the state of a given credit over time, in the form of the vector h(t) = (a_1, \ldots, a_k; l_1, \ldots, l_m). We can therefore pull all credits that are sufficiently similar to some new input credit h(t), which will produce a cluster of similar credits. Because our dataset contains time-series data for each of the credits returned in the cluster, we can form possible future paths for h(t). This will allow us to say, as a general matter, what the future of h will look like, given its present state h(t). Moreover, we can easily construct a probability of default, again using the cluster, since all of the credits in the cluster either paid or didn’t, though as a practical matter you could have some unknowns as well (i.e., those credits that are still outstanding).
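As a rough illustration of the clustering step, here is a minimal Python sketch. It assumes Euclidean distance with a fixed radius as the similarity test, and the dataset layout, the function names, and the toy numbers are all hypothetical; it is not how Black Tree AutoML forms clusters.

```python
from math import dist

def similar_credits(dataset, query, radius):
    """Return (credit, t) pairs whose recorded state at time t is within
    `radius` of `query`. Euclidean distance is an assumption of this sketch."""
    hits = []
    for credit in dataset:
        for t, state in enumerate(credit["states"]):
            if dist(state, query) <= radius:
                hits.append((credit, t))
    return hits

def default_probability(hits):
    """Share of distinct credits in the cluster that defaulted, ignoring
    credits whose outcome is still unknown (outcome is None)."""
    credits = {c["name"]: c for c, _ in hits}  # count each credit once
    known = [c["outcome"] for c in credits.values() if c["outcome"] is not None]
    if not known:
        return None
    return known.count("default") / len(known)

# Hypothetical toy data: each state is (asset_1, asset_2, liability_1).
dataset = [
    {"name": "A", "states": [(100, 50, 120), (90, 40, 125)], "outcome": "default"},
    {"name": "B", "states": [(100, 55, 118), (110, 60, 110)], "outcome": "paid"},
    {"name": "C", "states": [(500, 300, 200)], "outcome": "paid"},
    {"name": "D", "states": [(98, 52, 121)], "outcome": None},  # still outstanding
]

hits = similar_credits(dataset, query=(100, 52, 120), radius=10)
p_default = default_probability(hits)
```

Here the cluster contains credits A, B, and D; D is still outstanding, so the estimate rests on A and B alone. In practice the state vectors would need to be normalized before distances are meaningful.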

Applying this process repeatedly to the initial state of some credit h(t_0), we will construct a set of possible future paths for h(t_0), given that initial state. Specifically, first we find the cluster associated with h(t_0). Then, we find the next state for each credit in that cluster. So, e.g., if credit x(t_j) is in the cluster associated with h(t_0), we find the next state of x(t_j) in the cluster, which we can represent as x(t_{j+1}). We do this again, for all such x, and continue as desired, and this will produce a dataset of possible future paths for the credit, which will grow exponentially as a function of time, and at each ordinal interval of time, there will be some probability of default based upon the dataset.
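The repeated application described above can be sketched as follows. This is a minimal Python illustration under the same assumptions (Euclidean distance, a fixed radius); the function names and the toy series are hypothetical.

```python
from math import dist

def next_states(dataset, state, radius):
    """States that immediately followed any recorded state within
    `radius` of `state`, taken across every time series in the dataset."""
    out = []
    for series in dataset:
        for t in range(len(series) - 1):
            if dist(series[t], state) <= radius:
                out.append(series[t + 1])
    return out

def future_paths(dataset, start, radius, steps):
    """Extend each path by every observed successor of its last state.
    A path with no successors is kept as-is (a dead end)."""
    paths = [[start]]
    for _ in range(steps):
        extended = []
        for path in paths:
            successors = next_states(dataset, path[-1], radius)
            if successors:
                extended.extend(path + [s] for s in successors)
            else:
                extended.append(path)
        paths = extended
    return paths

# Hypothetical toy data: each state is (assets, liabilities).
dataset = [
    [(100, 120), (90, 125), (70, 130)],
    [(101, 119), (110, 110), (120, 100)],
    [(500, 200), (510, 190)],
]

paths = future_paths(dataset, start=(100, 120), radius=5, steps=2)
```

With this toy data, the two credits near the starting state diverge after one step, which gives two distinct paths. With a larger dataset, each step can multiply the number of paths, which is the growth noted above.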

My AutoML software is typically really accurate, so I would wager that if you use my software for the clustering step, you’re going to get great answers, and probably make a lot of money as a consequence, and so it’s another great reason to buy my software, which is comically better than everyone else’s.

Determining Paternal Ancestry Using mtDNA

It sounds superficially impossible, but my work shows unambiguously that paternal selection impacts mtDNA, for the simple reason that males appear to select females on the basis of maximizing the number of bases they have in common with their female mates. Therefore, if you replace an existing male population with a new male population, the original female population will be selected for on the basis of maximizing the number of bases they have in common with the new male population, which will cause the lower-matching females to die off. See Section 5 of A New Model of Computational Genomics [1]. This will over time cause the two maternal lines to converge (i.e., the maternal line of the new paternal line will converge with the existing maternal line).

Today I developed a new technique that looks at the gaps in matches that occur as a result of insertions and deletions. That is, if you align two genomes, and then shift one of them by a single base at index i, the match count along indexes i + 1 through N (i.e., the genome size) will be reduced to chance. The question is, is there selection in those gaps? The answer is plainly yes, and by testing for it, you can determine paternal ancestry. Specifically, start with a genome, and then compare that genome to every genome in a given population (e.g., a single Ancient Egyptian genome compared to every Norwegian genome in a dataset). If Norwegians played a role in selecting Ancient Egyptian women (probably not true), then in the gaps, we should find evidence of selection, which would manifest as a number of matches that exceeds chance. We can then measure the density of these above-chance matches within the gaps, which is consistent with selection by males from that population. Note that the criterion of materially exceeding chance provides an objective test, which is attractive, because you don’t have such a test in the non-gaps.
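The claim that a single-base shift reduces the downstream match count to chance is easy to check on synthetic data. The Python sketch below uses a uniformly random sequence, which is an assumption and not real mtDNA; the function name is hypothetical, and this is not the linked Octave/MATLAB script.

```python
import random

def match_fraction(a, b, start, stop):
    """Fraction of positions in [start, stop) where a and b have the same base."""
    same = sum(1 for i in range(start, stop) if a[i] == b[i])
    return same / (stop - start)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(10000))

i = 2000
shifted = genome[:i] + "A" + genome[i:]  # hypothetical insertion at index i

before = match_fraction(genome, shifted, 0, i)               # identical prefix
after = match_fraction(genome, shifted, i + 1, len(genome))  # shifted region
```

Before the insertion the two sequences agree everywhere, so `before` is 1.0. After it, each base is compared against its neighbor, and `after` lands near 0.25, the chance rate for four equally likely bases. A match rate in the gap materially above that baseline is the signal the test looks for.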

 

Doing exactly this for a Pre-Roman Ancient Egyptian genome, produces the distribution above, suggesting that the paternal line of the earliest Egyptians is from Asia, specifically, the ancestors of present day Javanese (JV) and Solomon Islands (SI) people. The full list of acronyms is included in [1]. The Ancient Egyptian maternal line is unquestionably from Asia, specifically the ancestors of modern day Thai people. See [1] generally. This is obvious when you look at the earliest Egyptians, who were visibly Asian, some of whom seem to have had straight hair. The code to produce the distribution above is attached below, and the dataset and any missing code is attached to [1]. Below are Menkaure and Khamerernebty II (circa 2532 BCE), courtesy of Boston MFA, both of whom are visibly Asian people. Overall, this algorithm provides even more support for the hypothesis that many modern Europeans and Africans are actually from Asia. Specifically, the Nigerians have Asian and Classical paternal lines (e.g., Sri Lankan and Phoenician), and the Classical people all also seem to be from Asia, not just the Egyptians. This is consistent with fairly recent academic results showing that there was an even earlier migration back to Africa, from Asia, about 70,000 years ago. My hypothesis is that this happened more than once, and that humanity began in Africa, then migrated to Eurasia and Asia, with many groups coming back to Africa, and some going to Europe, in particular Scandinavia.

Here’s the code:

https://www.dropbox.com/s/6gvcaxxkrskhqy0/Matching_Bases_Loc_Distribution.m?dl=0

Ancient mtDNA

I stumbled upon a simply incredible resource, specifically a database of ancient mtDNA maintained by Czech academics. All of the genomes have provenance files, with links to the source. Moreover, they all have dates and locations, which is simply fantastic. I’ve been running analysis all morning, and one thing is clear: ancient Europeans are from Asia, specifically, they are related to the people of modern day Sri Lanka. You can run my genetics software on, e.g., this genome, which is from Belgium, around 33,000 BCE, and it is a 99% match to the Sri Lankans, and the Phoenicians (who are obviously related to the Sri Lankans). You find exactly this throughout history, in Germany, Austria, and Italy, suggesting that people came to Europe from Asia a very long time ago. Moreover, many Africans, specifically the Nigerians, are also closely related to the Sri Lankans, and Asians more generally. This is all consistent with a migration-back hypothesis, where humanity began in Africa, migrated to Asia, and then migrated back to both Europe and Africa.

Moreover, the second find is that the Roma people are definitely Russian, not Indian, which I already suspected based upon the fact that nearly a majority of Russians are a 99% match to the Iberian Roma. I’m not doubting that they spent a lot of time in India, and it seems reasonable that their culture developed there, however, their ultimate genetic origin is Siberia. Specifically, there’s a Siberian genome from about 45,000 BCE that is a 99.5% match to the Iberian Roma, and many modern Russians. Moreover, I’ve found no evidence of Roma people in Europe even up to the Middle Ages, except in Iron Age Finland and Ötzi the Ice Man. Both suggest that the Roma were originally Arctic people, and moreover, suggesting as a general matter, that the Roma people were fairly recent arrivals in Europe, at least in the numbers present today, which are significant (i.e., you have many people that are definitely of Roma descent that don’t declare it in genetic studies).

Base Distribution, Cellular Function, and Selection

A while ago, I wrote an algorithm that calculates the distribution of bases at each index of a genome, over a given population. For mtDNA, there are 16,579 bases in each genome (in the dataset attached below), and the algorithm calculates the density of Adenine (A), Thymine (T), Guanine (G), and Cytosine (C), at each index, over an entire dataset. This associates each of the 16,579 indexes with 4 real numbers, each in [0,1], that sum to 1, for the simple reason that each base must be one of the four. However, note there are minor deviations due to missing bases. The distribution of bases over the entire dataset is unequal, and that’s interesting in and of itself. However, I just tested the distribution of bases within each population, to see if there’s meaningful variation in the distribution of bases. It turns out there is meaningful variation, suggesting at least the possibility that different populations select for different distributions of bases.

Specifically, the method I employed first calculates the distribution of bases at each index in a given population. It then calculates the overall distribution of bases within that population, producing four numbers, that represent the densities of the four bases, within that population, that should generally sum to 1, save for missing bases. This is done for each population, and there are 76 populations in this dataset, producing a 76 x 4 matrix. I then calculated the standard deviation in each column, as a measure of the variation in the densities of each base, across each of the 76 populations. The standard deviations of the distributions of A,T,G, and C, across all populations, are 0.0464%, 0.0367%, 0.0295%, and 0.0914%, respectively. These are small percentages, but they’re not uniform, and moreover, they’re not totally negligible quantities.
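The calculation described above can be sketched in a few lines of Python. This is an illustration on hypothetical toy sequences, not the linked Octave/MATLAB script. Two assumptions: missing bases (e.g., N) count toward the denominator, so the four densities can sum to slightly less than 1, and the sample standard deviation is used, which may differ from the original choice.

```python
from statistics import stdev

BASES = "ATGC"

def population_density(genomes):
    """Overall density of A, T, G, C across all genomes in one population.
    For equal-length genomes this equals the average of per-index densities."""
    counts = {b: 0 for b in BASES}
    total = 0
    for genome in genomes:
        for base in genome:
            total += 1              # missing bases still count toward the total
            if base in counts:
                counts[base] += 1
    return [counts[b] / total for b in BASES]

# Hypothetical toy populations (real genomes have 16,579 bases each).
populations = {
    "P1": ["ATGC", "ATGC"],
    "P2": ["AAGC", "ATGN"],   # one missing base
    "P3": ["ATCC", "ATGC"],
}

# One row per population, one column per base (a 3 x 4 matrix here).
matrix = [population_density(g) for g in populations.values()]

# Standard deviation of each column, i.e., of each base's density across populations.
column_sd = [stdev(row[j] for row in matrix) for j in range(len(BASES))]
```

With 76 populations the same code would produce the 76 x 4 matrix described above, and `column_sd` would hold the four standard deviations in the order A, T, G, C.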

One initial observation: the Guanine-Cytosine bond is the strongest, and C is plainly more variable than the other bases. This is consistent with selection around the density of Guanine and Cytosine. Why would the distribution of bases matter? Well, the cytoplasm contains at any given moment a fixed distribution of bases, upon which all genomes will draw to effectuate replication. As a consequence, there is by definition competition among genomes, and perhaps other parts of the cell, for the free-floating bases in the cytoplasm. As a consequence, the distribution of bases within a genome could impact cellular function and health. If the supply of bases in the cytoplasm is adequate to effectuate replication without any shortfalls, then an excess of a particular base in any genome would cause an excess of that base in the cytoplasm. This could in turn affect cellular function and health. If the supply of bases in the cytoplasm is not adequate to effectuate replication without any shortfalls, in light of an excess of a particular base in a given genome, then this could again impact cellular function and health. As a general matter, intuition suggests a parity between the supply of bases in the cytoplasm and demand for bases based upon the distribution of bases in the genomes generally. However, as noted, even in the absence of such a parity, the distribution of bases in the genomes and the cytoplasm could impact cellular function and health.

This however does not explain why there would be greater selection for Guanine-Cytosine bonds (i.e., the standard deviation is significantly higher for C), and moreover, why the standard deviation for G is the lowest of them all. One simple explanation is that there’s a handedness to the genome, and that the side being read by the sequencer is at least partially determinative of the distribution. Note that the opening sequence of bases in the dataset below is the same, and as a consequence, the sequencer is always reading the same side of the genome. One simple theory to explain the handedness of mtDNA is that it’s circular, and as a consequence, the genome has an “inside” and an “outside”. Now of course the other side of the genome must have a corresponding distribution (i.e., the other side of the genome must have a highly variable density of Guanine). However, the point is, the variability of Cytosine is much higher than the other bases, and as a consequence, it’s at least consistent with greater selection for Guanine-Cytosine bonds. The Guanine-Cytosine bonds happen to be the strongest (i.e., they are stronger than Adenine-Thymine bonds), and so it’s perfectly reasonable that they, e.g., impact the ease of replication, and the integrity of the genome generally, which could present both advantages and disadvantages.

Note that the distribution at a given index in the genome will not directly determine the total distribution of bases in the genome. That is, if you, e.g., swap two indexes in the genome, the overall distribution is unchanged, and therefore, all of the arguments presented above regarding the distribution in the cytoplasm are still true. However, this does not rule out selection at each index, since as noted, Guanine-Cytosine bonds are stronger than Adenine-Thymine bonds, and as a consequence, there could be advantages and disadvantages with respect to the integrity of the genome, based upon the distribution of bases along the genome indexes. This is a fascinating and, to my knowledge, unexplored corner of genetics, that relates to selection based upon the overall distribution of bases (for purposes of cellular function), and their location along the genome (for purposes of genome integrity). This could explain the empirical fact that statistical imputation for mtDNA is categorically superior to sequential imputation. See Section 7 of A New Model of Computational Genomics. That is, there is significant selection taking place based upon whole-genome functions (i.e., random bases), beyond protein production (i.e., sequential bases). This suggests at least the possibility that whole-genome functions are more aggressively selected for than protein production. At first this sounds counter-intuitive, but it makes perfect sense:

If the cell can’t function, and the genome’s not stable, then it doesn’t matter what proteins are coded for, because that’s like writing on a burning piece of paper.

Putting it all together, once a population-specific mutation occurs, the genome’s integrity as a whole is subject to the possibility of deterioration, for the simple reason that a complex set of bonds has by definition been disrupted. This should therefore, cause intense selection in the rest of the genome, that allows for the mutation to exist in a structurally sound context. This would cause whole-genome selection, and explain the empirical fact that statistical imputation is stronger than sequential imputation, at least with respect to mtDNA. Moreover, the arguments above regarding the cytoplasm support the same hypothesis. If e.g., a mutation drastically changes the distribution of bases in a genome, then replication will draw differently on the bases floating in the cytoplasm, potentially causing disruptions to cellular function and health. As a consequence, whole-genome selection could remedy any drastic changes to the draw on the distribution of bases in the cytoplasm. However, unlike the structural integrity of a given genome, which requires whole-genome selection, the overall distribution of bases in the cytoplasm could itself adjust to the mutation, or other genomes could adjust to the mutation.

Attached is the code and the dataset:

https://www.dropbox.com/s/lrgw7wn3im0zcri/Calc_Base_Density_CMNDLINE.m?dl=0

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0

mtDNA, Health, and Selection

I’ve shown empirically that mtDNA is selected for, in particular by male humans when selecting female mates. See A New Model of Computational Genomics, in particular, Section 5. The data suggests quite plainly that males are selecting females on the basis of having overall similar mtDNA genomes. I didn’t explain how it is that males or females would know this, though I now think the answer is, that mtDNA is deeply connected to overall health. Therefore, it makes perfect sense for a species to develop a means of communicating information about mtDNA, since doing so, would allow for better mating decisions, and therefore, healthier offspring. I don’t know what the specific mechanism is, but I’d wager mtDNA actually does impact appearance in some subtle way that informs attraction between mates. This would in turn allow males to select females on the basis of having a similar overall mtDNA genome.

The Holocaust, Communist Revolutions, and Intelligence

It’s not exactly a secret that both Nazis and Communists killed and persecuted intellectuals. The question is, did this have a meaningful impact on the demographics of countries occupied by Nazis and Communists? The answer is a clear yes, in my opinion. To understand why, the first component is that many modern humans are a 70% match on the maternal line with Denisovans, as measured using mtDNA. See my paper, A New Model of Computational Genomics [1], generally. This is not unique to Denisovans, and in fact, basically all Iberian Roma and all Papuans are a 96% match to Heidelbergensis, and many people globally are a 96% match to Neanderthals. See [1] generally. However, approximately a majority of people have no real relationship to any archaic humans, beyond chance (i.e., about 25% of the genome in common). All of this is based upon the dataset linked to in [1], which at this point consists of 600 complete human mtDNA genomes, all taken from the NIH Database. The most current dataset is linked to below, and an older version is linked to in [1].

The root observation is that many Jews (both Ashkenazi and Sephardic) are a 70% match to Denisovans, which you can see in the chart above. Note that all of the population acronyms (e.g., KN stands for Kenya) can be found at the end of [1]. Because so many Jews were killed during the Holocaust, and many Jews are technically Denisovan, we would expect countries occupied by Nazi Germany to have fewer Denisovans than countries that were not occupied by Nazi Germany. This is exactly what you find, in that Belarus and Poland have absolutely no Denisovan people, and Germany has almost none. Germany, Poland, and Belarus contain matches to Ashkenazi Jews (at 90% of the genome), which you can see in the chart below, suggesting that the Denisovan Jews were disproportionately impacted by the Holocaust. That is, even though some people in these countries are genetically Jewish, you don’t find any Denisovans, suggesting that Denisovans generally were nearly annihilated during the Holocaust. Moreover, Finland avoided occupation by both Germans and Russians, and did not actively persecute Jews, and as you can see in the first chart above, they still have a sizable Denisovan population. In contrast, Russia has no relationship to Denisovans, and the Russians also persecuted Jews, separate from the Holocaust. Though we cannot know using this data alone what the populations of Europe looked like prior to the Holocaust, it’s at least consistent with the assumption that Denisovan populations were disproportionately impacted by the Holocaust. This is not surprising, given that many Jews are related to Denisovans. However, Jewish populations seem to persist today genetically in all of these countries, which you can see below. This suggests a disproportionate impact on Denisovans that is independent of being Jewish.

Interestingly, Taiwan and Mongolia also have significant Denisovan populations, which you can see in the first chart above. China has no meaningful Denisovan population, and again, Russia has none at all. This suggests at least the possibility that people of Denisovan ancestry actively fled communist revolutions in China and Russia, heading to Taiwan and Mongolia. Now, it is simply not credible to claim that ethnically Taiwanese and Mongolian people are somehow Jewish, especially not in the numbers implied by the chart above. Moreover, it’s not credible to claim that they knew that they were of Denisovan ancestry either. The more rational explanation is that the persecution of intellectuals disproportionately impacted individuals of Denisovan heritage. One sensible explanation is that Denisovans are simply more intelligent than most people. If this is true, then the average IQ in Sweden and Norway (where there isn’t much of a Denisovan population) should be lower than it is in Finland (where there is a sizable Denisovan population).

 

It turns out this is true, and drastically so. Moreover, the average IQ’s in Norway and Sweden are almost exactly the same, whereas it’s drastically higher in Finland. See the chart above, where Sweden and Norway are at the bottom, just below the U.S., and Finland is ranked 8th, globally, at the top. Russia is slightly off-screen. You simply cannot explain this differential with economics, political dysfunction, or geography, as all three countries are rich, high-functioning, socialist, capitalist, democracies, that have plainly similar geographies, and very similar demographics, save for Finland’s anomalous Denisovan population. Moreover, the average IQ is also higher in Taiwan than it is in China. This all suggests the possibility that by simply killing intellectuals, the Nazis and Communists accidentally killed off our living ancestors, that are also more intelligent than them. The barbarism of the Nazis also sent a ton of scientists to the U.S., so thanks for that. The image above is courtesy of World Population Review.

Here’s the current dataset:

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0

Proofs in the Infinite Case

I’ve noted before that proofs start to fail when you have an infinite number of summands, e.g., in the case of the sum over all Fibonacci numbers, which produces -1. This is obviously wrong, and implies that algebra fails given an infinite number of terms. This suggests that there is a logical independence between limits, and the infinite case, proper.

That is, e.g., the limit \lim_{n \to \infty} \sum_{i=1}^n 9 \cdot 10^{-i} = 1 does not imply that the number 1 is in fact the sum over the infinite set of such terms. You can easily construct a proof by induction that you can always find a value of n that will cause the sum to be arbitrarily close to 1. As a consequence, the sum at infinity cannot be less than 1. An intuitive proof would simply note that all terms present in each finite sum must be present in the infinite sum, and since the finite sums get arbitrarily close to 1, the infinite sum cannot be less than 1. More formally, assume to the contrary that the sum in the infinite case is x < 1. Since there is always a value of n that will cause the sum to be arbitrarily close to 1, we can therefore always find a value of n that will cause the sum to exceed x. Therefore, the sum is not less than 1 in the infinite case. The sum can however be equal to 1, without contradicting such a proof by induction (though it could contradict other fundamental assumptions beyond the scope of this discussion), since that proof requires only that all finite sums are less than, yet arbitrarily close to 1. Interestingly, there is similar logical independence in the case that the sum exceeds 1, since that again does not contradict the proof by induction that the sum gets arbitrarily close to 1 in all finite cases.
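The finite part of this argument can be checked with exact arithmetic. The Python sketch below uses rational numbers to confirm that each finite sum equals 1 - 10^{-n}, so it is always less than 1, and that for a given x < 1 there is some n whose sum exceeds x. The function names are hypothetical, and the code says nothing about the infinite case itself, which is the point at issue.

```python
from fractions import Fraction

def partial_sum(n):
    """Exact value of 9/10 + 9/100 + ... + 9/10^n."""
    return sum(Fraction(9, 10**i) for i in range(1, n + 1))

def first_n_exceeding(x):
    """Smallest n whose partial sum is strictly greater than x (requires x < 1)."""
    n = 1
    while partial_sum(n) <= x:
        n += 1
    return n

# Each finite sum equals 1 - 10^(-n), hence is strictly less than 1.
for n in (1, 2, 5, 20):
    assert partial_sum(n) == 1 - Fraction(1, 10**n)

# For x = 999/1000, the sum with n = 3 equals x, so n = 4 is the first to exceed it.
n_needed = first_n_exceeding(Fraction(999, 1000))
```

Because `Fraction` is exact, no rounding is involved: every finite sum falls short of 1 by exactly 10^{-n}.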

I don’t think this is academic, and instead, I think it’s a potentially deep point about algebra in the infinite case, and non-Turing computing, since the sum in the infinite case is not computable, since all computable functions make use of a finite number of operations. Specifically, if infinite systems really exist in Nature, then there could be a correct answer to whether the sum is actually 1, or greater than 1. This reminds us that all of mathematics is ultimately rooted in reality itself, and if our assumptions are wrong, then our theorems will be physically meaningless. For combinatorics (e.g., graph theory, counting problems, etc.), it’s simply not credible to doubt the assumptions, since they’re plainly physically true. But when you get into this kind of mathematics, it’s not obvious what the right answer is.

A Potentially New Species

I’m in the process of unpacking the history of humanity using my machine learning software, and as part of that process, I decided to take a closer look at Denisovans. Specifically, many modern populations have individuals that are a 70% match to Denisovans. In particular, the Jews and Finns have large populations of people that are a 70% match. Below is a distribution that shows a normalized percentage of each population that is at least a 70% match to the Denisovans. The x-axis shows the acronym for the particular population, and all of the acronyms can be found at the end of my paper, A New Model of Computational Genomics [1]. You can also find all the software you need to run these experiments, in addition to the related technical information on alignment, process, etc., in [1].

The natural question is, are all Denisovans the same? Or do they have a unique history of their own? Denisovan remains are generally found in Asia. However, as you can see above, there are modern populations in Europe, Africa, the Middle East, and Asia, that all contain matches to Denisovans. This suggests at least the possibility that Denisovans have a unique history, one that could predate human language altogether. The process I used to test this question is straightforward: First, I found all genomes that are at least a 70% match to at least one Denisovan genome. Then, I constructed clusters, as a second test, like the one below for the Swedish, effectively counting what percentage of each Denisovan population matched to the Swedish Denisovans. As you can see, the Swedish Denisovans are plainly related to the Norwegian Denisovans (which is not surprising based upon geography), though they’re also related to the Chinese Denisovans. Why? Well, Denisovan fossils are generally found in Asia, so this is not surprising either.
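The first step of that process, finding every genome that is at least a 70% match to at least one reference genome, can be sketched in Python as follows. The sketch assumes the sequences are already aligned and compares them position by position; the actual alignment and matching procedure is the one described in [1]. The function names and the toy sequences are hypothetical.

```python
def match_fraction(a, b):
    """Fraction of aligned positions at which two sequences share a base."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def matches_any(genome, references, threshold):
    """True if the genome meets the threshold against at least one reference."""
    return any(match_fraction(genome, r) >= threshold for r in references)

def share_matching(populations, references, threshold=0.7):
    """Per population, the share of genomes that match at least one reference."""
    return {
        name: sum(matches_any(g, references, threshold) for g in genomes) / len(genomes)
        for name, genomes in populations.items()
    }

# Hypothetical toy data; real genomes are far longer.
references = ["ACGTACGTAC"]
populations = {
    "X": ["ACGTACGTAC", "ACGTACGAAA", "TTTTTTTTTT"],
    "Y": ["GGGGGGGGGG", "ACGTACGTTT"],
}

shares = share_matching(populations, references)
```

The second test has the same shape: replace `references` with the matching genomes from one population, and `populations` with the matching genomes from each of the others.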

 

This is all very interesting on its own, but what’s far more interesting is that when I tested the distribution of German Denisovans, they failed to match to any of the actual Denisovan genomes. That is, the German Denisovans (a modern population) failed to match to any of the actual ancient Denisovans in the second test. At first I thought I had made a mistake in the code, but I then isolated the Denisovan row of the dataset that the modern Germans match to, and it’s row 378 in the dataset attached below. However, this Denisovan genome itself does not match to any of the other Denisovan genomes, even at 30% of the genome. This suggests that row 378 of the dataset below is not Denisovan, and is instead an otherwise unknown species that seems to be most related to people in the Jharkhand region of India, based upon the chart below, which shows the distribution of matches at 30% of the genome. Note that all of these genomes are taken from the National Institutes of Health Database, and the dataset includes provenance files for all genomes, with links to the NIH Database.

It is of course possible that this genome would map to some other Denisovan genome not included in the dataset. However, I would instead wager that this genome is a very early Neanderthal, since it is a match to some Neanderthals at 30%. I think this find, at a minimum, suggests that archeology is limited in some sense, since it doesn’t look to the genome. As such, the label Denisovan is questionable in this case. Moreover, the methods introduced in [1] can predict ethnicity (including archaic humans) with an accuracy of about 80%. As a consequence, it’s at least worth looking into. As noted, all of the code you need to run these experiments is included in [1], save for the additional script attached below. The dataset is also attached below.

Code:

https://www.dropbox.com/s/n34niioi63apczf/Extract_Class_Rows.m?dl=0

Dataset:

https://www.dropbox.com/s/zwt1bcqqmqkleca/mtDNA.zip?dl=0