Measuring Dataset Consistency

In a previous article, I showed that you can use the nearest neighbor algorithm to build a simple graph based upon which vectors in a dataset are nearest neighbors of one another. In simple terms, if two vectors are nearest neighbors, then you connect them with an edge. Once you do this for all vectors in the dataset, you have a graph. It dawned on me earlier today that you can define a measure of dataset consistency by analyzing the properties of the resultant graph. Specifically, take the set S of maximal paths in that graph, and label the vertices with their classifiers. Now, for each path in S, simply take the entropy of the distribution of classifier labels along the path. This is exactly the technique I introduced in a previous article, though I suppose I didn’t make it plain that it can apply to any dataset.

You can also perhaps use this to measure whether any errors are due to truly similar classes, by first converting the labels to measure-based classifiers, using the method I introduced in the article before this one. That is, if the errors appear only where two classes are truly similar, then maybe that’s the way it is and that’s the best you can do, or you can somehow tighten your lens on the dataset, which is something you can actually do in my model of machine learning.
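Here’s a minimal sketch of how I read the construction, in Python. The function names, the use of Euclidean distance, and the brute-force enumeration of maximal paths are my own choices for illustration, not a fixed part of the method:

```python
import numpy as np
from collections import defaultdict
from math import log2

def nearest_neighbor_graph(X):
    """Connect each vector to its nearest neighbor; edges are undirected.
    Euclidean distance is assumed here, but any metric would do."""
    n = len(X)
    adj = defaultdict(set)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                      # a vector is not its own neighbor
        j = int(np.argmin(d))
        adj[i].add(j)
        adj[j].add(i)
    return adj

def maximal_paths(adj, n):
    """Enumerate simple paths that cannot be extended at either end.
    Brute force (exponential in the worst case), but the graph is sparse."""
    paths = []
    def extend(path, visited):
        frontier = [v for v in adj[path[-1]] if v not in visited]
        if not frontier:                   # cannot extend past the last vertex
            paths.append(list(path))
            return
        for v in frontier:
            visited.add(v)
            path.append(v)
            extend(path, visited)
            path.pop()
            visited.remove(v)
    for start in range(n):
        extend([start], {start})
    # keep paths that also cannot be extended at the start, dropping
    # duplicates that were traversed in the opposite direction
    kept = [p for p in paths if all(v in p for v in adj[p[0]])]
    return [list(t) for t in {min(tuple(p), tuple(reversed(p))) for p in kept}]

def path_entropy(path, labels):
    """Shannon entropy of the classifier labels along one path;
    zero exactly when the path carries a single label."""
    counts = defaultdict(int)
    for i in path:
        counts[labels[i]] += 1
    total = len(path)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy dataset: two tight clusters, one label each, so every maximal
# path has entropy zero and the dataset is locally consistent.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = ["a", "a", "b", "b"]
adj = nearest_neighbor_graph(X)
for p in maximal_paths(adj, len(X)):
    print(p, path_entropy(p, y))
```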

To be perfectly clear, if the vertices of any given path in S carry more than one classifier label, then that path by definition implies that the nearest neighbor algorithm will generate an error, and therefore, the dataset is what I would call not locally consistent. Here’s a brief proof:

Assume that v_i and v_j are both vertices in some path P within the set S. Because they are both part of the same path, there must be a subpath of P that connects v_i to v_j, which we’ll call P_{i,j}. Assume that the classifier of v_i is m and that the classifier of v_j is k, for m \neq k. It follows that as we traverse P_{i,j}, beginning at v_i, the classifiers of the vertices along that path must at some point change, ultimately from m to k; let e = (v_p, v_l) be the first edge across which a change occurs. Note that it could be the case that v_l = v_j. Because e is an edge of the graph, one of its endpoints is the nearest neighbor of the other; assume without loss of generality that v_l is the nearest neighbor of v_p (otherwise, simply swap their roles). Since the classifiers of v_p and v_l are by construction unequal, the nearest neighbor algorithm generated an error when given the input v_p.
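The proof translates directly into a test, since every edge of the graph lies on some maximal path: the dataset is locally consistent exactly when no edge joins two differently labeled vertices. Here’s a minimal check, reusing the nearest_neighbor_graph sketch above:

```python
def is_locally_consistent(X, labels):
    """Per the proof: the dataset is locally consistent exactly when no
    edge of the nearest neighbor graph joins two differently labeled
    vectors, i.e., the nearest neighbor algorithm makes no errors."""
    adj = nearest_neighbor_graph(X)        # from the sketch above
    return all(labels[i] == labels[j] for i in adj for j in adj[i])
```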
