Supervised Genomic Classification

Attached is code that implements an analog of my core supervised classification algorithm for genetic sequences. The original algorithm operates on Euclidean data. Because genetic sequences are discrete, this version simply increases the minimum number of matching bases between two sequences, rather than increasing a spherical volume of Euclidean space. You can read about the original algorithm in my paper, Vectorized Deep Learning; this works exactly the same way, it’s just not Euclidean.

I tested it on three datasets from Kaggle that contain raw genetic sequences for dogs, humans, and chimpanzees. The classification task is to identify common ancestor classes separately for each species (i.e., these are three independent datasets). The accuracy was roughly 100% for all three datasets. The runtimes were 78 seconds (695 training rows, 122 testing rows, 18,907 columns); 371.314 seconds (1,424 training rows, 251 testing rows, 18,922 columns); and 1,752 seconds (3,085 training rows, 544 testing rows, 18,922 columns), respectively.

This is consistent with how the algorithm performs generally, as it is extremely efficient and highly accurate. The only difference here is that the data is not Euclidean. This is more empirical evidence for the claim that genetic data is locally consistent, which in turn implies that polynomial-time algorithms can be used to accurately classify genetic data.
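To make the idea of raising a minimum match count concrete, here is a minimal Python sketch. It is not the attached code, and it is not taken from the paper: the function names `encode` and `classify`, the `step` parameter, the stopping rule (raise the threshold until every training sequence that still qualifies has the same label), and the majority-vote fallback for ties are all assumptions made for illustration. It also assumes all sequences have equal length.

```python
from collections import Counter

import numpy as np


def encode(sequences):
    """Turn equal-length base strings into a 2-D uint8 array, one row per sequence."""
    return np.array([list(s.encode("ascii")) for s in sequences], dtype=np.uint8)


def classify(train, labels, query, step=1):
    """Predict a label for `query` by raising the minimum number of matching bases.

    Hypothetical stopping rule (an assumption, not the author's method):
    increase the threshold until all training rows that still meet it
    share a single label, then return that label.
    """
    # Number of positions at which each training row agrees with the query.
    counts = (train == query).sum(axis=1)
    best = int(counts.max())

    threshold = 0
    while threshold < best:
        inside = {labels[i] for i in np.flatnonzero(counts >= threshold)}
        if len(inside) == 1:
            return inside.pop()
        threshold += step

    # Fallback: majority vote among the rows with the highest match count.
    top = [labels[i] for i in np.flatnonzero(counts >= best)]
    return Counter(top).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy data, invented for this example only.
    train = encode(["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG"])
    labels = ["a", "a", "b", "b"]
    query = encode(["ACGTACGG"])[0]
    print(classify(train, labels, query))
```

In this sketch each query costs one vectorized comparison against the training matrix, so the work grows with the product of training rows, testing rows, and columns, which is polynomial in the size of the dataset.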

 
