Oral
The advantages of multiple classes for reducing overfitting from test set reuse
Vitaly Feldman · Roy Frostig · Moritz Hardt

Wed Jun 12 03:00 PM -- 03:05 PM (PDT) @ Room 103
Excessive reuse of holdout data can lead to overfitting. Yet, there is no concrete evidence of significant overfitting due to holdout reuse in popular multiclass benchmarks. Known results show that, in the worst case, revealing the accuracy of $k$ adaptively chosen classifiers on a data set of size $n$ allows one to create a classifier with a bias of $\Theta(\sqrt{k/n})$ for any binary prediction problem. We show a new upper bound of $\tilde O(\max\{\sqrt{k\log(n)/(mn)},k/n\})$ on the bias that any attack with $k \geq \tilde\Omega(m)$ queries can achieve in a prediction problem with $m$ classes. Moreover, we show a natural attack that, under a plausible technical condition, achieves the nearly matching bias of $\Omega(\sqrt{k/(mn)})$. Complementing our theoretical work, we give new practical attacks to stress test multiclass benchmarks by aiming to create as large a bias as possible with a given number of queries. Through extensive experiments, we show that the additional uncertainty of prediction with a large number of classes indeed mitigates the effect of our best attacks. Our work extends important developments in the understanding of overfitting in adaptive data analysis to multiclass prediction problems. In addition, it bears out the surprising fact that multiclass prediction problems are significantly more robust to overfitting from reusing the test set. This helps to explain why popular multiclass prediction benchmarks, such as ImageNet, may enjoy a longer lifespan than intuition from the binary case would have suggested.
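
To give a feel for how the rates in the abstract compare, here is a minimal Python sketch that evaluates them numerically. It is an illustration only: the function names are hypothetical, all constants and the logarithmic factors hidden by the $\tilde O$ and $\tilde\Omega$ notation are dropped, and the values of $k$, $n$, and $m$ are assumed for the example rather than taken from the paper's experiments.

```python
import math


def binary_bias_rate(k: int, n: int) -> float:
    """sqrt(k/n): the worst-case binary rate, with constants dropped."""
    return math.sqrt(k / n)


def multiclass_attack_rate(k: int, n: int, m: int) -> float:
    """sqrt(k/(m*n)): the rate of the multiclass attack, constants dropped."""
    return math.sqrt(k / (m * n))


def multiclass_upper_rate(k: int, n: int, m: int) -> float:
    """max(sqrt(k*log(n)/(m*n)), k/n): the multiclass upper bound,
    with constants and hidden logarithmic factors dropped.
    The abstract states this bound for k at least on the order of m."""
    return max(math.sqrt(k * math.log(n) / (m * n)), k / n)


if __name__ == "__main__":
    # Assumed, illustrative values: a test set of 50,000 points,
    # 1,000 classes, and 5,000 adaptive accuracy queries.
    k, n, m = 5_000, 50_000, 1_000
    print("binary rate:           ", binary_bias_rate(k, n))
    print("multiclass attack rate:", multiclass_attack_rate(k, n, m))
    print("multiclass upper rate: ", multiclass_upper_rate(k, n, m))
```

With constants ignored, the binary rate exceeds the multiclass attack rate by a factor of $\sqrt{m}$ for the same $k$ and $n$, which is the sense in which more classes make the test set harder to overfit.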