# Some More Methods in Computational Genomics

Here are two more methods in genomics I wanted to share. The first is a simple measure of signal versus noise, rooted in my work on epistemology. The second is a clustering algorithm I came up with a long time ago but never implemented, rooted in the Nearest Neighbor method.

Assume your population is stored as an $M \times N$ matrix, with $M$ sequences and $N$ bases per sequence. Now assume we measure the density of each of the four bases in column $i$. If it turns out that the densities are given by $(0.9, 0.05, 0.02, 0.03)$ for A, C, G, T, respectively, it’s reasonable to conclude that the modal base of A, with a density of 0.9, is signal and not noise, in the sense that basically all of the sequences in the dataset contain an A at that index. Moreover, the other three bases are roughly equally distributed, in context, again suggesting that the modal base of A is signal, and not noise.
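Here's a minimal sketch of the density measurement described above. It assumes the population is a list of equal-length strings over the alphabet A, C, G, T; the function name `column_densities` and the toy population are just for illustration.

```python
from collections import Counter

BASES = "ACGT"

def column_densities(sequences, i):
    """Return the fraction of A, C, G, T (in that order) at column i.

    Assumes every sequence has a base at index i and contains only
    the characters A, C, G, T.
    """
    counts = Counter(seq[i] for seq in sequences)
    total = len(sequences)
    return tuple(counts[b] / total for b in BASES)

# Toy population: M = 4 sequences, N = 4 bases per sequence.
population = ["ACGT", "ACGA", "ATGT", "ACGT"]
print(column_densities(population, 1))  # (0.0, 0.75, 0.0, 0.25)
```

In this toy case the modal base at index 1 is C, with a density of 0.75.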

When you have a continuous signal, noise is easier to think about, because if it’s both additive and subtractive, it should cancel itself out through repeated observation. A population drawn from a single species is analogous, because you are in effect repeatedly sampling some “true” underlying genetic sequence, subject to mutations / noise. Unfortunately, you cannot meaningfully take the average of a discrete sequence like a genome. However, this doesn’t change the fact that a uniform distribution is indicative of noise. So if, for example, the densities were instead $(0.24, 0.26, 0.25, 0.25)$, then we could reasonably dismiss this index as noise, since the bases are almost uniformly distributed at that index across the entire population. In contrast, if we’re given the densities $(0.9, 0.1, 0.0, 0.0)$, then we should be confident that the A is signal, but we should not be as confident that the C is noise, since the other two bases both have a density of 0.

We can formalize this by applying the measures of Information, Knowledge, and Uncertainty I presented in the paper of the same name. Information is in this case the applicable maximum entropy, which is $\log_2(4) = 2$ for all four bases, $\log_2(3)$ for three bases, etc. Uncertainty is the entropy of the distribution, and Knowledge is the balance of Information over Uncertainty, $K = I - U$. Interestingly, this suggests an equivalence between Knowledge and signal, which is not surprising, as I’ve defined Knowledge as, “information that reduces uncertainty.”
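As a rough sketch of $K = I - U$ for a single column, the code below takes Information to be $\log_2$ of the number of bases that actually appear in the column. That's my reading of "applicable maximum entropy" here, so treat it as an assumption. The function name `knowledge` and the optional `n_states` parameter are hypothetical; `n_states` lets you fix Information at, say, $\log_2(4)$ instead.

```python
import math

def knowledge(densities, n_states=None):
    """Return K = I - U in bits for one column's base densities.

    I is log2 of the number of applicable states. By default this is
    the number of bases with nonzero density (an assumption); pass
    n_states to fix it explicitly. U is the Shannon entropy.
    """
    present = [p for p in densities if p > 0]
    if not present:
        raise ValueError("at least one density must be positive")
    states = len(present) if n_states is None else n_states
    information = math.log2(states)
    uncertainty = -sum(p * math.log2(p) for p in present)
    return information - uncertainty

for d in [(0.9, 0.05, 0.02, 0.03),
          (0.24, 0.26, 0.25, 0.25),
          (0.9, 0.1, 0.0, 0.0)]:
    print(d, round(knowledge(d), 4))
```

A nearly uniform column comes out close to zero, and the column dominated by A comes out well above it. Note that under the default reading, a column containing only a single base scores $K = 0$, since both $I$ and $U$ are zero. Passing `n_states=4` gives that column $K = 2$ instead.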

I also wrote an interesting clustering algorithm that keeps applying Nearest Neighbor in a chain, until it hits a loop. So for example, first we find the Nearest Neighbor of row $i$, let’s say it’s row $j$. Then we find the Nearest Neighbor of row $j$, and keep doing this until we create a loop, by returning to a row we have already visited. I haven’t tested it too much, but it seems meaningful because the clusters are generally very small, suggesting that it’s not flying around the dataset, and is instead pulling rows that are proximate to each other.
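The chain described above can be sketched as follows. This is not my original implementation. It's a minimal version that assumes rows are equal-length strings, uses Hamming distance (an assumption, any distance would do), and breaks ties by taking the lowest row index. The names `hamming`, `nearest_neighbor`, and `chain_cluster` are illustrative.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_neighbor(rows, i):
    """Index of the row closest to row i, excluding i itself.

    Ties go to the lowest index. Requires at least two rows.
    """
    candidates = (j for j in range(len(rows)) if j != i)
    return min(candidates, key=lambda j: (hamming(rows[i], rows[j]), j))

def chain_cluster(rows, start):
    """Follow nearest neighbors from `start` until a row repeats.

    Returns the visited row indices in the order they were reached.
    """
    visited, seen = [], set()
    current = start
    while current not in seen:
        seen.add(current)
        visited.append(current)
        current = nearest_neighbor(rows, current)
    return visited

rows = ["AAAA", "AAAT", "AATT", "GGGG", "GGGC"]
print(chain_cluster(rows, 2))  # [2, 1, 0]
print(chain_cluster(rows, 3))  # [3, 4]
```

With a symmetric distance and no ties, the distance to the next row can only shrink along the chain, so the loop is always a pair of rows that are each other's Nearest Neighbor. That fits the observation that the clusters stay small.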