

Poster in Workshop: Shift happens: Crowdsourcing metrics and test datasets beyond ImageNet

Classifiers Should Do Well Even on Their Worst Classes

Julian Bitterwolf · Alexander Meinke · Valentyn Boreiko · Matthias Hein


Abstract:

The performance of a vision classifier on a given test set is usually measured by its accuracy. For reliable machine learning systems, however, it is important to avoid the existence of areas of the input space where they fail severely. To reflect this, we argue that a single number does not provide a complete enough picture even for a fixed test set, as there might be particular classes or subtasks on which a generally accurate model performs unexpectedly poorly. Without using new data, we motivate and establish a wide selection of worst-case performance metrics that can be evaluated alongside accuracy on a given test set. Some of these metrics can be extended when a grouping of the original classes into superclasses is available, indicating whether the model is exceptionally bad at handling inputs from one superclass.
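The abstract does not spell out the metric definitions, so the following is only a minimal illustrative sketch of two natural worst-case quantities of this kind: the accuracy on the worst-performing class and on the worst-performing superclass. The function names and the class-to-superclass mapping array `superclass_of` are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def worst_class_accuracy(labels: np.ndarray, preds: np.ndarray, num_classes: int) -> float:
    """Accuracy of the classifier on the class where it performs worst."""
    accs = []
    for c in range(num_classes):
        mask = labels == c
        if mask.any():  # skip classes absent from the test set
            accs.append(float((preds[mask] == c).mean()))
    return min(accs)

def worst_superclass_accuracy(labels: np.ndarray, preds: np.ndarray,
                              superclass_of: np.ndarray) -> float:
    """Accuracy on the worst superclass, given an array mapping each
    class index to its superclass index."""
    super_labels = superclass_of[labels]
    accs = []
    for s in np.unique(super_labels):
        mask = super_labels == s
        accs.append(float((preds[mask] == labels[mask]).mean()))
    return min(accs)
```

Both quantities can only be lower than (or equal to) overall accuracy, which is why reporting them alongside accuracy exposes classes or superclasses where an otherwise accurate model fails badly.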
