

Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

In Search of Forgotten Domain Generalization

Prasanna Mayilvahanan · Roland S. Zimmermann · Thaddäus Wiedemer · Evgenia Rusak · Attila Juhos · Matthias Bethge · Wieland Brendel


Abstract:

Out-of-Domain (OOD) generalization is the ability of a model trained on one or more domains to generalize to unseen domains. In the ImageNet era of computer vision, evaluation sets for measuring a model's OOD performance were designed to be strictly OOD with respect to their style. However, the emergence of foundation models and expansive web-scale datasets has obfuscated this evaluation process, as datasets cover a broad range of domains and risk test data contamination. In search of the forgotten domain generalization, we create large-scale datasets subsampled from LAION—LAION-Natural and LAION-Rendition—that are strictly OOD to corresponding ImageNet and DomainNet test sets in terms of style. By training CLIP models on these datasets, we find that OOD generalization challenges from the ImageNet era still prevail. Furthermore, through a systematic exploration of combining natural and stylistic datasets at varying proportions, we identify optimal ratios for model generalization across several style domains. Our datasets and results re-enable meaningful assessment of OOD robustness at scale—a crucial prerequisite for improving model robustness. Overall, we make the sobering point that large-scale data merely obscures the issue of OOD generalization, which remains an unsolved problem.
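The "varying proportions" experiment can be pictured as subsampling a fixed-size training pool that mixes natural and rendition images at a chosen ratio. The sketch below is only an illustration of that idea, not the authors' code; the function and argument names (e.g. `natural_ids`, `rendition_ids`, `rendition_fraction`) are hypothetical.

```python
import random

def mix_domains(natural_ids, rendition_ids, rendition_fraction, total, seed=0):
    """Hypothetical helper: build a training index list of size `total`
    that combines natural and rendition samples at a fixed proportion."""
    rng = random.Random(seed)
    n_rendition = int(round(total * rendition_fraction))
    n_natural = total - n_rendition
    mixed = rng.sample(rendition_ids, n_rendition) + rng.sample(natural_ids, n_natural)
    rng.shuffle(mixed)  # interleave the two domains before training
    return mixed

# Example: a 10% rendition / 90% natural mixture of one million samples.
# subset = mix_domains(natural_ids, rendition_ids, rendition_fraction=0.1, total=1_000_000)
```

Sweeping `rendition_fraction` over a grid and training one CLIP model per mixture is one way such a proportion study could be run; the paper's actual training setup should be taken from the authors' released datasets and code.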
