Oral
Do ImageNet Classifiers Generalize to ImageNet?
Benjamin Recht · Rebecca Roelofs · Ludwig Schmidt · Vaishaal Shankar
Generalization is a central goal in machine learning, yet researchers rarely investigate systematically how well models perform on truly unseen data. This raises the danger that the community is overfitting to excessively re-used test sets. To investigate this question, we conduct a reproducibility experiment on CIFAR-10 and ImageNet: we assemble new test sets by closely following the original dataset creation processes and then evaluate a broad range of classification models. Despite our careful efforts to match the distributions of the original datasets, we observe accuracy drops of 3% to 15% on CIFAR-10 and 11% to 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptive overfitting, but by the models' inability to generalize reliably to slightly "harder" images than those found in the original test sets.
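The evaluation protocol the abstract describes reduces to scoring the same pretrained classifier on the original test set and on the reproduced one, then comparing top-1 accuracies. Below is a minimal sketch (not the authors' released code), assuming torchvision ≥ 0.13 and ImageFolder-style local copies of both test sets; the directory paths and the choice of ResNet-50 are placeholders for illustration.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing for torchvision pretrained models.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def top1_accuracy(model, image_dir, device="cuda"):
    """Top-1 accuracy of `model` over an ImageFolder-style directory.

    Assumes class subfolders sort into the same 0-999 label order the
    pretrained model was trained with (true for standard WNID layouts).
    """
    loader = DataLoader(datasets.ImageFolder(image_dir, preprocess),
                        batch_size=256, num_workers=8)
    correct = total = 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

model = models.resnet50(weights="IMAGENET1K_V1").eval().to("cuda")
orig = top1_accuracy(model, "imagenet/val")                 # original test set (hypothetical path)
new = top1_accuracy(model, "imagenetv2/matched-frequency")  # new test set (hypothetical path)
print(f"original: {orig:.3f}  new: {new:.3f}  drop: {orig - new:.3f}")
```

Under this protocol, the gap `orig - new`, tabulated across many models, is the quantity the abstract summarizes.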