## Shift happens: Crowdsourcing metrics and test datasets beyond ImageNet

### Roland S. Zimmermann · Julian Bitterwolf · Evgenia Rusak · Steffen Schneider · Matthias Bethge · Wieland Brendel · Matthias Hein

##### Ballroom 4

Abstract:

Deep vision models are prone to shortcut learning and vulnerable to adversarial attacks as well as to natural and synthetic image corruptions. While OOD test sets have been proposed to measure the vulnerability of DNNs to distribution shifts of different kinds, it has been shown that performance on popular OOD test sets such as ImageNet-C or ObjectNet is strongly correlated with performance on clean ImageNet. Since performance on clean ImageNet clearly tests IID but not OOD generalization, this calls for new, challenging OOD datasets that test different aspects of generalization.

Our goal is to bring the robustness, domain adaptation, and out-of-distribution detection communities together to work on a new broad-scale benchmark that tests diverse aspects of current computer vision models and guides the way towards the next generation of models. Submissions to this workshop will contain novel datasets, metrics and evaluation settings.

Timezone: America/Los_Angeles

### Schedule

Fri 6:00 a.m. - 6:10 a.m.
Introduction and opening remarks (Talk)
Julian Bitterwolf · Roland S. Zimmermann · Steffen Schneider · Evgenia Rusak

Fri 6:10 a.m. - 6:25 a.m.
Contributed Talk 1: When does dough become a bagel? Analyzing the remaining mistakes on ImageNet (Oral)

Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community, yet innovations continue to contribute gains to performance, with today's largest models achieving 90%+ top-1 accuracy. To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make and provide insights into the long tail of errors on one of the most benchmarked datasets in computer vision. We focus on the multi-label subset evaluation of ImageNet, where today's best models achieve upwards of 97% top-1 accuracy. Our analysis reveals that nearly half of the supposed mistakes are not mistakes at all, and we uncover new valid multi-labels, demonstrating that, without careful review, we are significantly underestimating the performance of these models. On the other hand, we also find that today's best models still make a significant number of mistakes (40%) that are obviously wrong to human reviewers. To calibrate future progress on ImageNet, we provide an updated multi-label evaluation set, and we curate ImageNet-Major: a 68-example "major error" slice of the obvious mistakes made by today's top models, a slice where models should achieve near perfection, but today are far from doing so.

Vijay Vasudevan · Benjamin Caine · Raphael Gontijo Lopes · Sara Fridovich-Keil · Rebecca Roelofs

Fri 6:25 a.m. - 7:05 a.m.
Invited Talk 1: Aleksander Mądry (Talk)
Aleksander Madry

Fri 7:05 a.m. - 7:35 a.m.
Coffee Break (Break)

Fri 7:35 a.m. - 7:50 a.m.
Contributed Talk 2: MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts (Oral)

Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Motivated by this, there is a growing focus on curating benchmark datasets that capture distribution shifts. In this work, we present MetaShift, a collection of 12,868 sets of natural images across 410 classes, to address this challenge. We leverage the natural heterogeneity of Visual Genome and its annotations to construct MetaShift. The key construction idea is to cluster images using their metadata, which provides context for each image (e.g., cats with cars or cats in bathroom); each such cluster represents a distinct data distribution. MetaShift has two important benefits: first, it contains orders of magnitude more natural data shifts than previously available. Second, it provides explicit explanations of what is unique about each of its data sets and a distance score that measures the amount of distribution shift between any two of its data sets. Importantly, to support evaluating ImageNet-trained models on MetaShift, we match MetaShift with the ImageNet hierarchy. The matched version covers 867 out of 1,000 classes in ImageNet-1k. Each class in the ImageNet-matched MetaShift contains, on average, 19.3 subsets capturing images in different contexts.

Weixin Liang · Xinyu Yang · James Zou

Fri 7:50 a.m. - 8:30 a.m.
Invited Talk 2: Lucas Beyer (Talk)

Fri 8:30 a.m. - 9:10 a.m.
Invited Talk 3: Chelsea Finn (Talk)
Chelsea Finn

Fri 9:10 a.m. - 10:10 a.m.
Lunch Break (Break)

Fri 10:10 a.m. - 10:50 a.m.
Invited Talk 4: Alexei Efros (Talk)
Alexei Efros

Fri 10:50 a.m. - 10:52 a.m.
OOD-CV: A Benchmark for Robustness to Individual Nuisances in Real-World Out-of-Distribution Shifts (Oral)

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce ROBIN, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation. Our experiments using popular baseline methods reveal that: 1) Some nuisance factors have a much stronger negative effect on performance than others, also depending on the vision task. 2) Current approaches to enhance robustness have only marginal effects, and can even reduce robustness. 3) We do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich testbed to study robustness and will help push forward research in this area.

Bingchen Zhao · Shaozuo Yu · Wufei Ma · Mingxin Yu · Shenxiao Mei · Angtian Wang · Ju He · Alan Yuille · Adam Kortylewski

Fri 10:52 a.m. - 10:54 a.m.
Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time (Oral)

Distribution shifts occur when the test distribution differs from the training distribution, and can considerably degrade performance of machine learning models deployed in the real world. While recent works have studied robustness to distribution shifts, distribution shifts arising from the passage of time have the additional structure of timestamp metadata. Real-world examples of such shifts are underexplored, and it is unclear whether existing models can leverage trends in past distribution shifts to reliably extrapolate into the future. To address this gap, we curate Wild-Time, a benchmark of 7 datasets that reflect temporal distribution shifts arising in a variety of real-world applications. On these datasets, we systematically benchmark 9 approaches with various inductive biases. Our experiments demonstrate that existing methods are limited in tackling temporal distribution shift: across all settings, we observe an average performance drop of 21% from in-distribution to out-of-distribution data.

Huaxiu Yao · Caroline Choi · Yoonho Lee · Pang Wei Koh · Chelsea Finn

Fri 10:54 a.m. - 10:56 a.m.
Growing ObjectNet: Adding speech, VQA, occlusion, and measuring dataset difficulty (Oral)

Building more difficult datasets is largely an ad-hoc enterprise, generally relying on scale from the web or focusing on particular domains thought to be challenging. ObjectNet is an attempt to create a more difficult dataset, one that eliminates biases that may artificially inflate machine performance, in a systematic way. ObjectNet images are meant to decorrelate objects from their backgrounds, have randomized object orientations, and randomized viewpoints. ObjectNet appears to be much more difficult for machines. Spoken ObjectNet is a retrieval benchmark constructed from spoken descriptions of ObjectNet images. These descriptions are being used to create a captioning and VQA benchmark. In each case large performance drops were seen. The next variant of ObjectNet will focus on real-world occlusions since it is suspected that models are brittle when shown partially-occluded objects. Using large-scale psychophysics on ObjectNet we have constructed a new objective difficulty benchmark applicable to any dataset: the minimum presentation time for an image before the object contained within it can be reliably recognized by humans. This difficulty metric is well predicted by quantities computable from the activations of models, although not necessarily their ultimate performance. We hope that this suite of benchmarks will enable more robust models, provide better images for neuroscientific and behavioral experiments, and contribute to a systematic understanding of dataset difficulty and progress in computer vision.

David Mayo · David Lu · Chris Zhang · Jesse Cummings · Xinyu Lin · Boris Katz · James Glass · Andrei Barbu

Fri 10:56 a.m. - 10:58 a.m.
Classifiers Should Do Well Even on Their Worst Classes (Oral)

The performance of a vision classifier on a given test set is usually measured by its accuracy. For reliable machine learning systems, however, it is important to avoid the existence of areas of the input space where they fail severely. To reflect this, we argue that a single number does not provide a complete enough picture even for a fixed test set, as there might be particular classes or subtasks where a model that is generally accurate performs unexpectedly poorly. Without using new data, we motivate and establish a wide selection of interesting worst-case performance metrics which can be evaluated besides accuracy on a given test set. Some of these metrics can be extended when a grouping of the original classes into superclasses is available, indicating if the model is exceptionally bad at handling inputs from one superclass.

Julian Bitterwolf · Alexander Meinke · Valentyn Boreiko · Matthias Hein

Fri 10:58 a.m. - 11:00 a.m.
Towards Systematic Robustness for Scalable Visual Recognition (Oral)

There is widespread interest in developing robust classification models that can handle challenging object, scene, or image properties. While work in this area targets diverse kinds of robust behaviour, we argue in this work in favour of a requirement that should apply more generally: For robust behaviour to be scalable, it should transfer flexibly across familiar object classes, and not be separately learned for every class of interest. To this end, we propose the systematic robustness setting, in which certain combinations of classes and attributes are systematically excluded during training. Unlike prior work which studies systematic generalisation in DNNs or their susceptibility to spurious correlations, we use synthetic operations and data sampling to scale such experiments up to large-scale naturalistic datasets. This allows for a compromise between ecological validity of the task and strict experimental controls. We analyse a variety of models and learning objectives, and find that robustness to different shifts such as image corruptions, image rotations, and abstract object depictions is perhaps harder to achieve than previous results would suggest. This extended abstract describes the general experimental setting, our specific instantiations, and a metric to measure systematic robustness.

Mohamed Omran · Bernt Schiele

Fri 11:00 a.m. - 11:02 a.m.
Lost in Translation: Modern Image Classifiers still degrade even under simple Translations (Oral)

Modern image classifiers are potentially used in safety-critical applications and thus should not be vulnerable to natural transformations of the image, as can happen due to variations in the image acquisition. While it is known that image classifiers can degrade significantly in performance with respect to translations and rotations, the corresponding works did not ensure that the object of interest is fully contained in the image and also introduce boundary artefacts so that the input is not a natural image. In this paper we leverage pixelwise segmentations of the ImageNet-S dataset in order to search for the translation and rotation which ensures that the object is i) fully contained in the image (potentially together with a zoom) and ii) the image is natural (no padding with black pixels) such that the resulting natural image is misclassified. We observe a consistent drop in accuracy over a large set of image classifiers showing that natural adversarial changes are an important threat model which deserves more attention.

Leander Kurscheidt · Matthias Hein

Fri 11:02 a.m. - 11:04 a.m.
Evaluating Model Robustness to Patch Perturbations (Oral)

Recent advances in Vision Transformer (ViT) have demonstrated its impressive performance in image classification, which makes it a promising alternative to Convolutional Neural Network (CNN). Unlike CNNs, ViT represents an input image as a sequence of image patches. The patch-based input image representation makes the following question interesting: How does ViT perform when individual input image patches are perturbed with natural corruptions or adversarial perturbations, compared to CNNs? In this submission, we propose to evaluate model robustness to patch-wise perturbations. Two types of patch perturbations are considered to model robustness. One is natural corruptions, which is to test models' robustness under distributional shifts. The other is adversarial perturbations, which are created by an adversary to specifically fool a model to make a wrong prediction. The experimental results on the popular CNNs and ViTs are surprising. We find that ViTs are more robust to naturally corrupted patches than CNNs, whereas they are more vulnerable to adversarial patches. Given the architectural traits of state-of-the-art ViTs and the interesting results above, we propose to add the robustness to natural patch corruption and adversarial patch attack into the robustness benchmark.

Jindong Gu · Volker Tresp · Yao Qin

Fri 11:04 a.m. - 11:06 a.m.
ImageNet-Cartoon and ImageNet-Drawing: two domain shift datasets for ImageNet (Oral)

Benchmarking the robustness to distribution shifts traditionally relies on dataset collection which is typically laborious and expensive, in particular for datasets with a large number of classes like ImageNet. An exception to this procedure is ImageNet-C (Hendrycks & Dietterich, 2019), a dataset created by applying common real-world corruptions at different levels of intensity to the (clean) ImageNet images. Inspired by this work, we introduce ImageNet-Cartoon and ImageNet-Drawing, two datasets constructed by converting ImageNet images into cartoons and colored pencil drawings, using a GAN framework (Wang & Yu, 2020) and simple image processing (Lu et al., 2012), respectively.

Tiago Salvador · Adam Oberman

Fri 11:06 a.m. - 11:08 a.m.
CCC: Continuously Changing Corruptions (Oral)

Many existing datasets for robustness and adaptation evaluation are limited to static distribution shifts. We propose a well-calibrated dataset for continuously changing image corruptions at ImageNet scale. Our benchmark builds on the established common corruptions of ImageNet-C and extends them by applying two corruptions at the same time with finer-grained severities to allow for smooth transitions between corruptions. The benchmark contains random walks through different corruption types with different controlled difficulties and speeds of domain shift. Our dataset can be used to benchmark test-time and domain adaptation algorithms in challenging settings that are closer to real-world applications than typically used static adaptation benchmarks.

Ori Press · Steffen Schneider · Matthias Kuemmerer · Matthias Bethge

Fri 11:08 a.m. - 11:10 a.m.
SI-Score (Oral)

Before deploying machine learning models it is critical to assess their robustness. In the context of deep neural networks for image understanding, changing the object location, rotation and size may affect the predictions in non-trivial ways. SI-Score is a synthetic image dataset that allows one to do fine-grained analysis of robustness to object location, rotation and size.

Jessica Yung · Rob Romijnders · Alexander Kolesnikov · Lucas Beyer · Josip Djolonga · Neil Houlsby · Sylvain Gelly · Mario Lucic · Xiaohua Zhai

Fri 11:10 a.m. - 11:12 a.m.
ImageNet-D: A new challenging robustness dataset inspired by domain adaptation (Oral)

We propose a new challenging dataset to benchmark robustness of ImageNet-trained models: ImageNet-D. ImageNet-D has six different domains ("Real", "Painting", "Clipart", "Sketch", "Infograph" and "Quickdraw"). We show that even state-of-the-art models struggle on this dataset and find that they make well-interpretable errors.

Evgenia Rusak · Steffen Schneider · Peter V Gehler · Oliver Bringmann · Wieland Brendel · Matthias Bethge

Fri 11:12 a.m. - 11:14 a.m.
The Semantic Shift Benchmark (Oral)

Most benchmarks for detecting semantic distribution shift do not consider how the semantics of the training set are defined. In other words, it is often unclear whether the 'unseen' images contain semantically different objects from the same distribution (e.g., 'birds' for a model trained on 'cats' and 'dogs') or come from a different distribution entirely (e.g., Gaussian noise for a model trained on 'cats' and 'dogs'). In this work, we propose 'open-set' class splits for models trained on ImageNet-1K which come from ImageNet-21K. Critically, we structure the open-set classes based on semantic similarity to the closed-set using the WordNet hierarchy: we create 'Easy' and 'Hard' open-set splits to allow more principled analysis of the semantic shift phenomenon. Together with similar challenges based on FGVC datasets, these evaluations comprise the 'Semantic Shift Benchmark'.

Sagar Vaze · Kai Han · Andrea Vedaldi · Andrew Zisserman

Fri 11:14 a.m. - 11:16 a.m.
3D Common Corruptions for Object Recognition (Oral)

We introduce a set of image transformations that can be used as corruptions to evaluate the robustness of models. The primary distinction of the proposed transformations is that, unlike existing approaches such as Common Corruptions, the geometry of the scene is incorporated in the transformations, thus leading to corruptions that are more likely to occur in the real world. We apply these corruptions to the ImageNet validation set to create the 3D Common Corruptions (ImageNet-3DCC) benchmark. The evaluations on recent ImageNet models with robustness mechanisms show that ImageNet-3DCC is a challenging benchmark for the object recognition task. Furthermore, it exposes vulnerabilities that are not captured by Common Corruptions, which can be informative during model development.

Oguzhan Fatih Kar · Teresa Yeo · Amir Zamir

Fri 11:50 a.m. - 12:50 p.m.
Poster session

Fri 12:50 p.m. - 1:20 p.m.
Tea break (Break)

Fri 1:20 p.m. - 2:00 p.m.
Invited Talk 5: Ludwig Schmidt (Talk)

Fri 2:00 p.m. - 3:00 p.m.
Panel discussion (Panel)

Please ask your questions here: https://app.sli.do/event/4vwRK9oVTL7Pzby8eZzAUH/

Steffen Schneider · Aleksander Madry · Alexei Efros · Chelsea Finn · Soheil Feizi

Fri 3:00 p.m. - 3:15 p.m.
Community presentation 1: Robust Vision Challenge (Talk)
Adam Kortylewski

Fri 3:15 p.m. - 3:30 p.m.
Community presentation 2: Challenge on Out-of-Distribution Generalization in Computer Vision (Talk)
Adam Kortylewski

Fri 3:30 p.m. - 3:45 p.m.
Community presentation 3: Shifts Challenge 2.0 (Talk)
Andrey Malinin

Fri 3:45 p.m. - 4:00 p.m.
Contributed Talk 3: ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches (Oral) []   link »    Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it.However, their optimization is computationally demanding and requires careful hyperparameter tuning.To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches.It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations.This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations. Link » Maura Pintor · Daniele Angioni · Angelo Sotgiu · Luca Demetrio · Ambra Demontis · Battista Biggio · Fabio Roli 🔗 Fri 4:00 p.m. - 4:15 p.m. Closing remarks (Talk) Evgenia Rusak · Roland S. Zimmermann · Julian Bitterwolf · Steffen Schneider 🔗 - OOD-CV: A Benchmark for Robustness to Individual Nuisances in Real-World Out-of-Distribution Shifts (Poster) []   link » Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce ROBIN, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and the weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation. Our experiments using popular baseline methods reveal that: 1) Some nuisance factors have a much stronger negative effect on the performance compared to others, also depending on the vision task. 2) Current approaches to enhance robustness have only marginal effects, and can even reduce robustness. 
3) We do not observe significant differences between convolutional and transformer architectures. We believe our dataset provides a rich testbed to study robustness and will help push forward research in this area. Link » Bingchen Zhao · Shaozuo Yu · Wufei Ma · Mingxin Yu · Shenxiao Mei · Angtian Wang · Ju He · Alan Yuille · Adam Kortylewski 🔗 - Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time (Poster) []   link » Distribution shifts occur when the test distribution differs from the training distribution, and can considerably degrade performance of machine learning models deployed in the real world. While recent works have studied robustness to distribution shifts, distribution shifts arising from the passage of time have the additional structure of timestamp metadata. Real-world examples of such shifts are underexplored, and it is unclear whether existing models can leverage trends in past distribution shifts to reliably extrapolate into the future. To address this gap, we curate Wild-Time, a benchmark of 7 datasets that reflect temporal distribution shifts arising in a variety of real-world applications. On these datasets, we systematically benchmark 9 approaches with various inductive biases. Our experiments demonstrate that existing methods are limited in tackling temporal distribution shift: across all settings, we observe an average performance drop of 21\% from in-distribution to out-of-distribution data. Link » Huaxiu Yao · Caroline Choi · Yoonho Lee · Pang Wei Koh · Chelsea Finn 🔗 - Growing ObjectNet: Adding speech, VQA, occlusion, and measuring dataset difficulty (Poster) []   link » Building more difficult datasets is largely an ad-hoc enterprise, generally relying on scale from the web or focusing on particular domains thought to be challenging. ObjectNet is an attempt to create a more difficult dataset, one that eliminates biases that may artificially inflate machine performance, in a systematic way. 
ObjectNet images are meant to decorrelate objects from their backgrounds, have randomized object orientations, and randomized viewpoints. ObjectNet appears to be much more difficult for machines. Spoken ObjectNet is a retrieval benchmark constructed from spoken descriptions of ObjectNet images. These descriptions are being used to create a captioning and VQA benchmark. In each case large performance drops were seen. The next variant of ObjectNet will focus on real-world occlusions since it is suspected that models are brittle when shown partially-occluded objects. Using large-scale psychophysics on ObjectNet we have constructed a new objective difficulty benchmark applicable to any dataset: the minimum presentation time for an image before the object contained within it can be reliably recognized by humans. This difficulty metric is well predicted by quantities computable from the activations of models, although not necessarily their ultimate performance. We hope that this suite of benchmarks will enable more robust models, prove better images for neuroscientific and behavioral experiments, and contribute to a systematic understanding of the dataset difficulty and progress in computer vision. Link » David Mayo · David Lu · Chris Zhang · Jesse Cummings · Xinyu Lin · Boris Katz · James Glass · Andrei Barbu 🔗 - MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts (Poster) []  []   link » Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Motivated by this, there is a growing focus on curating benchmark datasets that capture distribution shifts. In this work, we present MetaShift---a collection of 12,868 sets of natural images across 410 classes---to address this challenge. We leverage the natural heterogeneity of Visual Genome and its annotations to construct MetaShift. 
The key construction idea is to cluster images using its metadata, which provides context for each image (e.g. cats with cars or cats in bathroom) that represent distinct data distributions. MetaShift has two important benefits: first, it contains orders of magnitude more natural data shifts than previously available. Second, it provides explicit explanations of what is unique about each of its data sets and a distance score that measures the amount of distribution shift between any two of its data sets. Importantly, to support evaluating ImageNet trained models on MetaShift, we match MetaShift with ImageNet hierarchy. The matched version covers 867 out of 1,000 classes in ImageNet-1k. Each class in the ImageNet-matched Metashift contains 19.3 subsets capturing images in different contexts. Link » Weixin Liang · Xinyu Yang · James Zou 🔗 - Classifiers Should Do Well Even on Their Worst Classes (Poster) []   link » The performance of a vision classifier on a given test set is usually measured by its accuracy. For reliable machine learning systems, however, it is important to avoid the existence of areas of the input space where they fail severely. To reflect this, we argue, that a single number does not provide a complete enough picture even for a fixed test set, as there might be particular classes or subtasks where a model that is generally accurate performs unexpectedly poorly. Without using new data, we motivate and establish a wide selection of interesting worst-case performance metrics which can be evaluated besides accuracy on a given test set. Some of these metrics can be extended when a grouping of the original classes into superclasses is available, indicating if the model is exceptionally bad at handling inputs from one superclass. 
Link » Julian Bitterwolf · Alexander Meinke · Valentyn Boreiko · Matthias Hein 🔗 - Lost in Translation: Modern Image Classifiers still degrade even under simple Translations (Poster) []   link » Modern image classifiers are used potentially in safety-critical applications and thus should not be vulnerable to natural transformations of the image as it can happen due to variations in the image acquisition.While it is known that image classifiers can degrade significantly in performance with respect to translations and rotations, the corresponding works did not ensure that the object of interest is fully contained in the image and also introduce boundary artefacts so that the input is not a natural image. In this paper we leverage pixelwise segmentations of the ImageNet-S dataset in order to search for the translation and rotation which ensures that the object is i) fully contained in the image (potentially together with a zoom) and ii) the image is natural (no padding with black pixels) such that the resulting natural image is misclassified. We observe a consistent drop in accuracy over a large set of image classifiers showing that natural adversarial changes are an important threat model which deserves more attention. Link » Leander Kurscheidt · Matthias Hein 🔗 - Towards Systematic Robustness for Scalable Visual Recognition (Poster) []   link » There is widespread interest in developing robust classification models, that can handle challenging object, scene, or image properties. While work in this area targets diverse kinds of robust behaviour, we argue in this work in favour requirement that should apply more generally: For robust behaviour to be scalable, it should transfer flexibly across familiar object classes, and not be separately learned for every class of interest. To this end, we propose the systematic robustness setting, in which certain combinations of classes and attributes are systematically excluded during training. 
Unlike prior work which studies systematic generalisation in DNNs or their susceptibility to spurious correlations, we use synthetic operations and data sampling to scale such experiments up to large-scale naturalistic datasets. This allows for a compromise between ecological validity of the task and strict experimental controls. We analyse a variety of models and learning objectives, and find that robustness to different shifts such as image corruptions, image rotations, and abstract object depictions are perhaps harder to deal with than previous results would suggest. This extended abstract describes the general experimental setting, our specific instantiations, and a metric to measure systematic robustness. Link » Mohamed Omran · Bernt Schiele 🔗 - Evaluating Model Robustness to Patch Perturbations (Poster) []   link » Recent advances in Vision Transformer (ViT) have demonstrated its impressive performance in image classification, which makes it a promising alternative to Convolutional Neural Network (CNN). Unlike CNNs, ViT represents an input image as a sequence of image patches. The patch-based input image representation makes the following question interesting: How does ViT perform when individual input image patches are perturbed with natural corruptions or adversarial perturbations, compared to CNNs? In this submission, we propose to evaluate model robustness to patch-wise perturbations. Two types of patch perturbations are considered to model robustness. One is natural corruptions, which is to test models' robustness under distributional shifts. The other is adversarial perturbations, which are created by an adversary to specifically fool a model to make a wrong prediction. The experimental results on the popular CNNs and ViTs are surprising. We find that ViTs are more robust to naturally corrupted patches than CNNs, whereas they are more vulnerable to adversarial patches. 
Given the architectural traits of state-of-the-art ViTs and the interesting results above, we propose to add the robustness to natural patch corruption and adversarial patch attack into the robustness benchmark. Link » Jindong Gu · Volker Tresp · Yao Qin 🔗 - ImageNet-Cartoon and ImageNet-Drawing: two domain shift datasets for ImageNet (Poster) []   link » Benchmarking the robustness to distribution shifts traditionally relies on dataset collection which is typically laborious and expensive, in particular for datasets with a large number of classes like ImageNet. An exception to this procedure is ImageNet-C (Hendrycks & Dietterich, 2019), a dataset created by applying common real-world corruptions at different levels of intensity to the (clean) ImageNet images. Inspired by this work, we introduce ImageNet-Cartoon and ImageNet-Drawing, two datasets constructed by converting ImageNet images into cartoons and colored pencil drawings, using a GAN framework (Wang & Yu, 2020) and simple image processing (Lu et al., 2012), respectively. Link » Tiago Salvador · Adam Oberman 🔗 - CCC: Continuously Changing Corruptions (Poster) []   link » Many existing datasets for robustness and adaptation evaluation are limited to static distribution shifts. We propose a well-calibrated dataset for continuously changing image corruptions on ImageNet scale. Our benchmark builds on the established common corruptions of ImageNet-C and extends them by applying two corruptions at the same time with finer-grained severities to allow for smooth transitions between corruptions. The benchmark contains random walks through different corruption types with different controlled difficulties and speeds of domain shift. Our dataset can be used to benchmark test-time and domain adaptation algorithms in challenging settings that are closer to real-world applications than typically used static adaptation benchmarks. 
Ori Press · Steffen Schneider · Matthias Kuemmerer · Matthias Bethge

- SI-Score (Poster)

Before deploying machine learning models it is critical to assess their robustness. In the context of deep neural networks for image understanding, changing the object location, rotation and size may affect the predictions in non-trivial ways. SI-Score is a synthetic image dataset that allows one to do fine-grained analysis of robustness to object location, rotation and size.

Jessica Yung · Rob Romijnders · Alexander Kolesnikov · Lucas Beyer · Josip Djolonga · Neil Houlsby · Sylvain Gelly · Mario Lucic · Xiaohua Zhai

- ImageNet-D: A new challenging robustness dataset inspired by domain adaptation (Poster)

We propose a new challenging dataset to benchmark the robustness of ImageNet-trained models: ImageNet-D. ImageNet-D has six different domains ("Real", "Painting", "Clipart", "Sketch", "Infograph" and "Quickdraw"). We show that even state-of-the-art models struggle on this dataset and find that they make well-interpretable errors.

Evgenia Rusak · Steffen Schneider · Peter V Gehler · Oliver Bringmann · Wieland Brendel · Matthias Bethge

- The Semantic Shift Benchmark (Poster)

Most benchmarks for detecting semantic distribution shift do not consider how the semantics of the training set are defined. In other words, it is often unclear whether the 'unseen' images contain semantically different objects from the same distribution (e.g. 'birds' for a model trained on 'cats' and 'dogs') or come from a different distribution entirely (e.g. Gaussian noise for a model trained on 'cats' and 'dogs'). In this work, we propose 'open-set' class splits for models trained on ImageNet-1K which come from ImageNet-21K.
Critically, we structure the open-set classes based on semantic similarity to the closed set using the WordNet hierarchy: we create 'Easy' and 'Hard' open-set splits to allow more principled analysis of the semantic shift phenomenon. Together with similar challenges based on FGVC datasets, these evaluations comprise the 'Semantic Shift Benchmark'.

Sagar Vaze · Kai Han · Andrea Vedaldi · Andrew Zisserman

- When does dough become a bagel? Analyzing the remaining mistakes on ImageNet (Poster)

Image classification accuracy on the ImageNet dataset has been a barometer for progress in computer vision over the last decade. Several recent papers have questioned the degree to which the benchmark remains useful to the community, yet innovations continue to contribute gains to performance, with today's largest models achieving 90%+ top-1 accuracy. To help contextualize progress on ImageNet and provide a more meaningful evaluation for today's state-of-the-art models, we manually review and categorize every remaining mistake that a few top models make and provide insights into the long tail of errors on one of the most benchmarked datasets in computer vision. We focus on the multi-label subset evaluation of ImageNet, where today's best models achieve upwards of 97% top-1 accuracy. Our analysis reveals that nearly half of the supposed mistakes are not mistakes at all, and we uncover new valid multi-labels, demonstrating that, without careful review, we are significantly underestimating the performance of these models. On the other hand, we also find that today's best models still make a significant number of mistakes (40%) that are obviously wrong to human reviewers. To calibrate future progress on ImageNet, we provide an updated multi-label evaluation set, and we curate ImageNet-Major: a 68-example "major error" slice of the obvious mistakes made by today's top models, a slice where models should achieve near perfection, but today are far from doing so.
Vijay Vasudevan · Benjamin Caine · Raphael Gontijo Lopes · Sara Fridovich-Keil · Rebecca Roelofs

- 3D Common Corruptions for Object Recognition (Poster)

We introduce a set of image transformations that can be used as corruptions to evaluate the robustness of models. The primary distinction of the proposed transformations is that, unlike existing approaches such as Common Corruptions, the geometry of the scene is incorporated in the transformations, thus leading to corruptions that are more likely to occur in the real world. We apply these corruptions to the ImageNet validation set to create the 3D Common Corruptions (ImageNet-3DCC) benchmark. The evaluations on recent ImageNet models with robustness mechanisms show that ImageNet-3DCC is a challenging benchmark for the object recognition task. Furthermore, it exposes vulnerabilities that are not captured by Common Corruptions, which can be informative during model development.

Oguzhan Fatih Kar · Teresa Yeo · Amir Zamir

- ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches (Poster)

Adversarial patches are optimized contiguous pixel blocks in an input image that cause a machine-learning model to misclassify it. However, their optimization is computationally demanding and requires careful hyperparameter tuning. To overcome these issues, we propose ImageNet-Patch, a dataset to benchmark machine-learning models against adversarial patches. It consists of a set of patches optimized to generalize across different models and applied to ImageNet data after preprocessing them with affine transformations. This process enables an approximate yet faster robustness evaluation, leveraging the transferability of adversarial perturbations.

Maura Pintor · Daniele Angioni · Angelo Sotgiu · Luca Demetrio · Ambra Demontis · Battista Biggio · Fabio Roli
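To make the idea in the ImageNet-Patch abstract concrete, the following is a minimal sketch of pasting a fixed patch onto images after a random rotation and translation, then measuring accuracy on the patched images. It is an illustration under stated assumptions, not the dataset's actual pipeline: the function names `apply_patch` and `patched_accuracy`, the `predict` callable, the rotation range, and the nearest-neighbour sampling are all hypothetical choices, and the patch here is arbitrary rather than adversarially optimized.

```python
import numpy as np


def apply_patch(image, patch, angle_deg, centre):
    """Paste `patch` onto a copy of `image` (H x W x C arrays).

    The patch is rotated by `angle_deg` about its own centre and placed so
    that its centre lands on `centre` = (row, col). Nearest-neighbour
    sampling is used. Returns the patched image and a boolean mask of the
    pixels that were overwritten.
    """
    out = image.copy()
    ph, pw = patch.shape[:2]
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)

    # Inverse mapping: for every image pixel, find the patch pixel it samples.
    rows, cols = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    dr = rows - centre[0]
    dc = cols - centre[1]
    pr = np.floor(cos_t * dr + sin_t * dc + ph / 2).astype(int)
    pc = np.floor(-sin_t * dr + cos_t * dc + pw / 2).astype(int)

    # Only pixels that map inside the patch are overwritten.
    mask = (pr >= 0) & (pr < ph) & (pc >= 0) & (pc < pw)
    out[mask] = patch[pr[mask], pc[mask]]
    return out, mask


def patched_accuracy(predict, images, labels, patch, rng, max_angle=20.0):
    """Accuracy of `predict` (hypothetical: image -> class id) on images
    with the patch applied at a random angle and position."""
    correct = 0
    for image, label in zip(images, labels):
        h, w = image.shape[:2]
        angle = rng.uniform(-max_angle, max_angle)
        centre = (rng.integers(0, h), rng.integers(0, w))
        patched, _ = apply_patch(image, patch, angle, centre)
        correct += int(predict(patched) == label)
    return correct / len(labels)
```

Because the patch is precomputed, each evaluation only costs one forward pass per image, which is the speed advantage the abstract describes; the trade-off is that the result approximates, rather than replaces, an attack optimized against the specific model.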