I’m of course working on my AutoML software, and in trying to figure out what to offer at what price, I’ve rediscovered Nearest Neighbor, which I wrote about in my paper, “Analyzing Dataset Consistency“, where I proved a few results about its accuracy. Specifically, I showed that if your dataset is what I call “locally consistent”, in that classifications don’t change over some fixed distance, then the accuracy of the Nearest Neighbor algorithm will be perfect. As a practical matter, this means that for many real-world datasets, its accuracy is very high.

As a consequence, I think it makes sense to offer only Nearest Neighbor and data normalization in the free and $5 per month versions, with no clustering. Then, for significantly more money, I think about $100 per month, you get clustering, plus my confidence software, which allows you to “magically” increase accuracy, at the expense of the number of rows that satisfy the stated confidence criteria. I think this is both economically fair and practical, because it gives people commercially viable software based upon known technology, at a low price per month, in a convenient GUI format. Then, for real money, you get something geared towards an audience that is trying to capitalize directly on predictions (e.g., making credit decisions), as opposed to making routine use of basic machine learning as an expense, not a driver of revenue.

Stepping back, I think the big picture is that my software imposes efficiency on the market for A.I., because routine machine learning can be totally commoditized, allowing an admin to simply process CSV files all day. And when problem datasets arise (i.e., they produce low accuracy), even if you’re too cheap to buy the better version of my software, you can kick that dataset up to a real data scientist, who will then spend more time on the hard problems, and basically no time on the easy ones.
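To make the local consistency idea concrete, here’s a minimal sketch (not my actual product code, just an illustration using NumPy): 1-Nearest Neighbor with min-max normalization, run leave-one-out on a toy dataset whose labels don’t change within each cluster’s radius. Because the dataset is locally consistent in that sense, the accuracy comes out perfect.

```python
import numpy as np

def min_max_normalize(X):
    # Scale each feature to [0, 1] so no single dimension dominates the distance.
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def nearest_neighbor_predict(X_train, y_train, X_test):
    # Classify each test row with the label of its closest training row (Euclidean).
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        preds.append(y_train[np.argmin(d)])
    return np.array(preds)

# Toy locally consistent dataset: two tight, well-separated clusters, so
# classifications never change within the distance between same-class points.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=0.3, size=(20, 2))
X1 = rng.normal(loc=5.0, scale=0.3, size=(20, 2))
X = min_max_normalize(np.vstack([X0, X1]))
y = np.array([0] * 20 + [1] * 20)

# Leave-one-out evaluation: predict each row from all the others.
correct = 0
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    correct += nearest_neighbor_predict(X[mask], y[mask], X[i:i + 1])[0] == y[i]
acc = correct / len(X)
print(acc)  # 1.0 on this well-separated toy data
```

On messier real-world data the accuracy won’t be exactly perfect, but the same mechanism is why it stays high whenever same-class rows sit close together.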
I’m not saying I’ll change my mind, but if you think that this is a bad idea commercially, shoot me an email at charles dot cd dot davi at [gmail dot com].
The attached code demonstrates this: I’ve run it on the UCI Wine, Iris, and Parkinson’s datasets, all of which produce 90%+ accuracy.
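As for the accuracy-for-coverage trade-off behind the confidence software: my actual implementation presumably differs, but here’s one simple stand-in for the idea. Flag a row as “confident” only when its k nearest neighbors all agree on the label, then measure accuracy on just those rows. Tightening the criterion shrinks the number of qualifying rows while tending to raise the accuracy on what remains.

```python
import numpy as np

# Two deliberately overlapping clusters, so some rows are genuinely ambiguous.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

def neighbor_labels(X, y, i, k=5):
    # Labels of the k nearest neighbors of row i, excluding row i itself.
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return y[np.argsort(d)[:k]]

preds = np.empty(len(X), dtype=int)
confident = np.zeros(len(X), dtype=bool)
for i in range(len(X)):
    votes = neighbor_labels(X, y, i)
    preds[i] = np.bincount(votes).argmax()      # majority-vote prediction
    confident[i] = (votes == votes[0]).all()    # unanimous neighborhood only

overall = (preds == y).mean()                   # accuracy on all rows
coverage = confident.mean()                     # fraction of rows that qualify
acc_confident = (preds[confident] == y[confident]).mean()
print(overall, coverage, acc_confident)
```

The unanimity threshold here is an illustrative confidence criterion, not the stated one from my software; the point is just that restricting attention to unambiguous rows trades coverage for accuracy.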