While working on my embedding algorithm, I realized it’s probably fairer to report the error for my core clustering algorithm (not the others) differently than I do in my deck. I disclosed exactly how I calculate the error, so it’s not a lie, but it’s not right either: even though the clusters are not mutually exclusive, the better answer is to say that the error is given as follows:
,
whereas in the deck, I say the error is given by,
.
The latter measure does capture something, since the number of rows is the number of opportunities to make errors. But that’s not what you want in this context: what matters is the accuracy of the cluster itself, regardless of how many rows there are.
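To make the distinction concrete, here is a minimal Python sketch of the two normalizations. It rests on assumptions that are mine for illustration only: that one cluster is generated per row, that an "error" is a cluster member whose label differs from the label of the row that generated the cluster, that the preferred measure divides the error count by the total number of cluster members, and that the deck's measure divides it by the number of rows. The function name `cluster_error_rates` and the toy data are hypothetical.

```python
def cluster_error_rates(clusters, labels):
    """Return (per_member_error, per_row_error) for per-row clusters.

    clusters[i] is the list of row indices in the cluster generated for row i
    (clusters may overlap). labels[i] is the class label of row i.
    Both error definitions are assumptions made for illustration.
    """
    total_errors = 0
    total_members = 0
    for row, members in enumerate(clusters):
        total_members += len(members)
        # Count members whose label disagrees with the generating row's label.
        total_errors += sum(1 for m in members if labels[m] != labels[row])

    num_rows = len(clusters)
    # Assumed preferred measure: errors relative to everything placed in a cluster.
    per_member = total_errors / total_members if total_members else 0.0
    # Assumed deck measure: errors relative to the number of rows.
    per_row = total_errors / num_rows if num_rows else 0.0
    return per_member, per_row


# Toy data: four rows, two classes, overlapping clusters.
labels = ["a", "a", "b", "b"]
clusters = [[0, 1, 2], [0, 1], [2, 3], [1, 2, 3]]
per_member, per_row = cluster_error_rates(clusters, labels)
print(per_member, per_row)
```

In this toy case there are 2 mislabeled members out of 10 cluster members across 4 rows, so the same error count yields quite different figures depending on the denominator, which is the point of the distinction.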