Recently, I was reading about the "Minimum viable dataset". The name is, of course, a nod to the concept of the MVP ("minimum viable product"). An MVP has the bare minimum of features necessary to be a product and be valuable to a customer, so that the customer is willing to pay for it. Building an MVP is a good way to test out your idea without committing to a full product. In the context of datasets, "minimum viable" relates to the speed of iteration: what is the smallest possible dataset that lets you move quickly, but still contains everything you need to determine whether your models will work in practice?
In computer vision and handwriting recognition, a common benchmark dataset is the MNIST dataset. I've never been fond of it, because it is over-used. When everyone reports accuracies above 99%, it starts to look like an overfitting problem: the community itself is overfitting to the benchmark. A benchmark can be very useful to stimulate new ideas and techniques (ImageNet has been instrumental in the rise of Deep Learning), but the static nature of a benchmark has its downsides. At some point the rate of improvement slows down, and the improvements seem to apply only to this static, clean and tidy dataset, not to a realistic problem. Is it still a "viable" dataset then?
On the other hand, a small, fixed dataset such as MNIST has a big advantage in terms of speed of iteration. Even though it's not representative of actual real-world OCR problems, you can still use MNIST to quickly try out a new idea. For example, when studying hidden Markov models, I wanted to know whether I could train the models in a discriminative way. Hidden Markov models are generative models, and if they are used in a Bayesian way[^1], the models for different classes might end up very close together. This is not desirable, because it then becomes easy to confuse these classes in production. A discriminative training method would take all the data into account and push the classes apart during training; a minimal sketch of both ideas follows below. Using a small, well-known dataset, I could test some of my hypotheses without worrying about compute power or about feasibility in a real-world case.
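To make the two training philosophies concrete, here is a minimal sketch. It assumes the hmmlearn library and a simple column-wise feature extraction that I made up for illustration; it is not the original experiment. One Gaussian HMM is fit per digit class (the generative recipe from the footnote), classification is the arg-max over per-class log-likelihoods, and an MMI-style criterion shows the kind of objective that would push the classes apart.

```python
# Sketch only: per-class generative HMMs vs. a discriminative (MMI-style) criterion.
# The feature extraction and dataset handling are illustrative assumptions.
import numpy as np
from hmmlearn import hmm
from scipy.special import logsumexp


def image_to_sequence(image):
    """Turn a 28x28 digit image into a left-to-right sequence of column features."""
    img = np.asarray(image, dtype=float).reshape(28, 28)
    return img.T  # one 28-dimensional observation per image column


def train_generative_hmms(train_images, train_labels, n_states=5, n_classes=10):
    """Fit one HMM per class on that class's data only (the generative recipe)."""
    models = []
    for c in range(n_classes):
        seqs = [image_to_sequence(x) for x, y in zip(train_images, train_labels) if y == c]
        X = np.concatenate(seqs)
        lengths = [len(s) for s in seqs]
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        model.fit(X, lengths)
        models.append(model)
    return models


def classify(models, image):
    """Arg-max over per-class log-likelihoods (the decision rule from the footnote)."""
    seq = image_to_sequence(image)
    scores = np.array([m.score(seq) for m in models])
    return int(np.argmax(scores))


def mmi_objective(models, image, label):
    """Discriminative criterion: log-posterior of the correct class (uniform priors assumed).

    Maximising this over the training set would raise the correct class's likelihood
    relative to the competing classes, which is what pushes the classes apart."""
    seq = image_to_sequence(image)
    scores = np.array([m.score(seq) for m in models])
    return scores[label] - logsumexp(scores)
```

Actually optimising `mmi_objective` with respect to the HMM parameters needs something like extended Baum-Welch or gradient-based updates, which is beyond this sketch; the point is only to contrast the per-class likelihood training with an objective that compares classes against each other.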
It is a delicate balance: we want to be able to quickly train and test our new models, but we also want to know about performance in the real world. Generally, people outside of academia worry more about the data than about the actual models and algorithms (or at least they should; see also this blog post by Pete Warden on why you need to improve your training data). Using a benchmark dataset isn't wrong in itself, but relying on one to make decisions for your production pipeline is a bad idea.
What minimum viable datasets do you like to use? Let me know on Twitter or LinkedIn.
[^1]: That is, you train a model per class and use the arg-max over the likelihoods to figure out the best matching class.