Attached is a clustering algorithm specific to genetics. Given an input sequence, it pulls all genomes in a dataset that share at least K matching bases with that sequence. This algorithm is mentioned in this paper. There’s also an additional segment of code, based upon my unsupervised clustering algorithm, that generates a value of K for you.
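To make the matching rule concrete, here is a minimal Python sketch of the selection step. It is not the attached code: it assumes the genomes are aligned strings of equal length and that "matching bases" means equal bases at the same position, and the names `count_matches` and `cluster_by_matches` are hypothetical. The step that generates K automatically is not shown, so K is passed in by hand.

```python
from typing import List, Sequence


def count_matches(a: str, b: str) -> int:
    """Count positions at which two aligned sequences carry the same base."""
    # Assumption: sequences are compared position by position; zip stops
    # at the shorter sequence if the lengths differ.
    return sum(x == y for x, y in zip(a, b))


def cluster_by_matches(dataset: Sequence[str], query: str, k: int) -> List[int]:
    """Return the indices of all genomes with at least k bases matching the query."""
    return [i for i, genome in enumerate(dataset) if count_matches(genome, query) >= k]


# Toy data, made up for illustration only.
genomes = ["ACGTAC", "ACGTTT", "TTTTTT", "ACGAAC"]
cluster = cluster_by_matches(genomes, "ACGTAC", k=4)
print(cluster)  # [0, 1, 3]
```

Raising K shrinks the cluster: with `k=5` only genomes 0 and 3 remain in this toy dataset.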
Also attached is an algorithm that analyzes the interconnectivity of a cluster. Specifically, it’s hard to find circuits and other structures in graphs, so as a workaround, this algorithm simply takes the union of the rows in a sequence of clusters, one row per cluster, and tracks the rate of change in the number of elements. So for example, if the first cluster contains 400 elements, and the next cluster contains 410 elements, only 10 of which are novel, then the cardinality of the union grows by only 10, from 400 to 410. By tracking this rate of change we get a measure of the interconnectivity of a sequence of clusters: slow growth means the clusters overlap heavily, while fast growth means they share few elements.
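The union-tracking idea can be sketched in a few lines of Python. Again, this is an illustration rather than the attached code: it assumes each cluster is given as a collection of element indices, and the function name `union_growth` is hypothetical.

```python
from typing import Iterable, List, Tuple


def union_growth(clusters: Iterable[Iterable[int]]) -> Tuple[List[int], List[int]]:
    """Track the size of the running union over a sequence of clusters.

    Returns (sizes, novel): sizes[i] is the cardinality of the union of the
    first i + 1 clusters, and novel[i] is how many new elements cluster i added.
    """
    seen = set()
    sizes, novel = [], []
    for cluster in clusters:
        before = len(seen)
        seen.update(cluster)  # fold this cluster into the running union
        sizes.append(len(seen))
        novel.append(len(seen) - before)
    return sizes, novel


# The example from the text: 400 elements, then 410 of which only 10 are new,
# followed by a made-up third cluster that shares nothing with the first two.
first = range(400)
second = range(410)
third = range(1000, 1050)
sizes, novel = union_growth([first, second, third])
print(sizes)  # [400, 410, 460]
print(novel)  # [400, 10, 50]
```

Small entries in `novel` relative to the cluster sizes indicate a tightly interconnected sequence, while entries close to the full cluster size indicate clusters that barely overlap.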