Machine learning models often break when deployed in the wild, despite excellent performance on benchmarks. In particular, models can learn to rely on apparently unnatural or irrelevant features. For instance, 1) in detecting lung disease from chest X-rays, models rely on the type of scanner rather than physiological signals, 2) in natural language inference, models rely on the number of shared words rather than the subject’s relationship with the object, 3) in precision medicine, polygenic risk scores for diseases like breast cancer rely on genes prevalent mainly in European populations, and predict poorly in other populations. In examples like these and others, the undesirable behavior stems from the model exploiting a spurious correlation. Improper treatment of spurious correlations can discourage the use of ML in the real world and lead to catastrophic consequences in extreme cases. The recent surge of interest in this issue is accordingly welcome and timely: more than 50 closely related papers have been published just in ICML 2021, NeurIPS 2021, and ICLR 2022. However, the most fundamental questions remain unanswered. For example, how should the notion of spurious correlations be made precise? How should one evaluate models in the presence of spurious correlations? In which situations can a given method be expected to work, or fail? Which notions of invariance are fruitful and tractable? Further, relevant work has sprung up ad hoc from several distinct communities, with limited interplay between them: invariance and independence-constrained learning in causality-inspired ML, methods to decorrelate predictions and protected features (e.g. race) in algorithmic fairness, and stress testing procedures to discover unexpected model dependencies in reliable ML. This workshop will bring together these different communities to make progress on common foundational problems, and facilitate their interaction with domain experts to build impactful collaborations.
Fri 5:45 a.m. - 6:00 a.m.

Introductory Remarks
(Presentation)

Fri 6:00 a.m. - 6:25 a.m.

Invited Talks 1, Bernhard Schölkopf and David Lopez-Paz
(Invited Talk)
Bernhard Schölkopf - What is a causal representation? David Lopez-Paz - On invariance
Bernhard Schölkopf · David Lopez-Paz

Fri 6:55 a.m. - 7:10 a.m.

Invited talks 1, Q/A
(Q/A session)
Bernhard Schölkopf · David Lopez-Paz

Fri 7:10 a.m. - 7:30 a.m.

Break

Fri 7:30 a.m. - 8:25 a.m.

Invited talks 2, Christina Heinze-Deml and Marzyeh Ghassemi
(Invited talk)
Christina Heinze-Deml · Marzyeh Ghassemi

Fri 8:25 a.m. - 8:40 a.m.

Invited talks 2 Q/A, Christina and Marzyeh
(Q/A)
Christina Heinze-Deml · Marzyeh Ghassemi

Fri 8:40 a.m. - 9:30 a.m.

Spotlights
Pratyush Maini · JIVAT NEET KAUR · Anil Palepu · Polina Kirichenko · Revant Teotia

Fri 9:30 a.m. - 10:30 a.m.

Lunch break
(Break)

Fri 10:30 a.m. - 11:50 a.m.

Poster Session (in-person only)
(In-person poster session)

Fri 11:50 a.m. - 1:10 p.m.

Invited talks 3, Amy Zhang, Rich Zemel and Liting Sun
(Invited talk)
Amy Zhang · Richard Zemel · Liting Sun

Fri 1:10 p.m. - 1:30 p.m.

Invited talks 3, Q/A, Amy, Rich and Liting
(Live Q/A session)
Liting Sun · Amy Zhang · Richard Zemel

Fri 1:35 p.m. - 2:35 p.m.

SCIS 2022 Panel
(Live panel over Zoom)

Fri 2:40 p.m. - 2:45 p.m.

Closing remarks
(Presentation)

Fri 2:45 p.m. - 4:30 p.m.

Poster Session (in-person only)
(In-person poster session)

Fri 2:45 p.m. - 3:30 p.m.

Breakout sessions
(Breakout sessions (in-person and virtual))


Towards Better Understanding of Self-Supervised Representations
(Poster)
Self-supervised learning methods have shown impressive results in downstream classification tasks. However, there is limited work in understanding and interpreting their learned representations. In this paper, we study the representation space of several state-of-the-art self-supervised models including SimCLR, SwAV, MoCo V2 and BYOL. Without the use of class label information, we first discover discriminative features that are highly active for various subsets of samples and correspond to unique physical attributes in images. We show that, using such discriminative features, one can compress the representation space of self-supervised models by up to 50% without affecting downstream linear classification significantly. Next, we propose a sample-wise Self-Supervised Representation Quality Score (or Q-Score) that can be computed without access to any label information. Q-Score utilizes discriminative features to reliably predict if a given sample is likely to be misclassified in the downstream classification task, achieving an AUPRC of 0.91 on SimCLR and BYOL trained on ImageNet-100. Q-Score can also be used as a regularization term to remedy low-quality representations, leading to up to 8% relative improvement in accuracy on all 4 self-supervised baselines on ImageNet-100, CIFAR-10, CIFAR-100 and STL-10. Moreover, through heatmap analysis, we show that Q-Score regularization enhances discriminative features and reduces feature noise, thus improving model interpretability.
Neha Mukund Kalibhat · Kanika Narang · Hamed Firooz · Maziar Sanjabi · Soheil Feizi


Causal Balancing for Domain Generalization
(Poster)
While machine learning models rapidly advance the state-of-the-art on various real-world tasks, out-of-domain (OOD) generalization remains a challenging problem given the vulnerability of these models to spurious correlations. We propose a balanced mini-batch sampling strategy to reduce the domain-specific spurious correlations in the observed training distributions. More specifically, we propose a two-phased method that 1) identifies the source of spurious correlations, and 2) builds balanced mini-batches free from spurious correlations by matching on the identified source. We provide an identifiability guarantee of the source of spuriousness and show that our proposed approach samples from a balanced, spurious-free distribution under an ideal scenario. Experiments are conducted on three domain generalization datasets, demonstrating empirically that our balanced mini-batch sampling strategy improves the performance of four different established domain generalization model baselines compared to the random mini-batch sampling strategy.
Xinyi Wang · Michael Saxon · Jiachen Li · Hongyang Zhang · Kun Zhang · William Wang
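
The idea of building balanced mini-batches can be illustrated with a minimal sketch. It is not the authors' method: it assumes the source of spuriousness is already available as a discrete attribute per example (the abstract instead identifies it in a first phase), and the function name `balanced_minibatch` and its parameters are hypothetical. Each batch draws the same number of examples from every (label, spurious attribute) cell, so the two are uncorrelated within the batch.

```python
import numpy as np

def balanced_minibatch(labels, spurious, batch_size, rng):
    """Return indices for one mini-batch with equal counts per (label, spurious) cell.

    Assumption: `spurious` is a known discrete attribute for every example.
    """
    labels = np.asarray(labels)
    spurious = np.asarray(spurious)

    # Group example indices by their (label, spurious attribute) cell.
    cells = {}
    for i, key in enumerate(zip(labels.tolist(), spurious.tolist())):
        cells.setdefault(key, []).append(i)

    # Every observed cell contributes the same number of examples.
    per_cell = max(1, batch_size // len(cells))
    batch = []
    for idx in cells.values():
        # Sample with replacement so that rare cells can still fill their quota.
        batch.extend(rng.choice(idx, size=per_cell, replace=True).tolist())
    return np.array(batch)
```

Sampling with replacement is a design choice made here so that minority cells are not exhausted; it means rare examples are repeated more often than under random mini-batch sampling.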


In the Eye of the Beholder: Robust Prediction with Causal User Modeling
(Poster)
Accurately predicting the relevance of items to users is crucial to the success of many social platforms. Conventional approaches train models on logged historical data, but recommendation systems, media services, and online marketplaces all exhibit a constant influx of new content, making relevancy a moving target to which standard predictive models are not robust. In this paper, we propose a learning framework for relevance prediction that is robust to changes in the data distribution. Our key observation is that robustness can be obtained by accounting for how users causally perceive the environment. We model users as boundedly-rational decision makers whose causal beliefs are encoded by a causal graph, and show how minimal information regarding the graph can be used to contend with distributional changes. Experiments in multiple settings demonstrate the effectiveness of our approach.
Amir Feder · Guy Horowitz · Yoav Wald · Roi Reichart · Nir Rosenfeld


Causal Prediction Can Induce Performative Stability
(Poster)
Predictive models affect the world through inducing a strategic response or reshaping the environment in which they are deployed, a property called performativity. This results in the need to constantly adapt and redesign the model. We show that prediction using only causal features (those that directly affect the prediction target, and not those that are otherwise correlated with the target) can achieve an equilibrium and close this feedback loop. Thus, a causal predictive model does not require any further adaptation after deployment even if it changes its environment.
Bogdan Kulynych


Evaluating and Improving Robustness of Self-Supervised Representations to Spurious Correlations
(Poster)
Recent empirical studies have found inductive biases in supervised learning toward simple features that may be spuriously correlated with the label, resulting in suboptimal performance on certain subgroups. In this work, we explore whether recent Self-Supervised Learning (SSL) methods would produce representations which exhibit similar behaviour. First, we show that classical approaches in combating spurious correlations, such as re-sampling the dataset, do not necessarily lead to invariant representations during SSL. Second, we discover that spurious information is represented disproportionately heavily in the later layers of the encoder. Motivated by these findings, we propose a method to remove spurious information from SSL representations during pretraining, by pruning or re-initializing later layers of the encoder. We find that our method produces representations which outperform the baseline on 5 datasets, without the need for group or label information during SSL.
Kimia Hamidieh · Haoran Zhang · Marzyeh Ghassemi


Domain Adaptation under Open Set Label Shift
(Poster)
We introduce the problem of domain adaptation under Open Set Label Shift (OSLS) where the label distribution can change arbitrarily and a new class may arrive during deployment, but the class-conditional distributions $p(x|y)$ are domain-invariant. The learner's goals here are twofold: (a) estimate the target label distribution, including the novel class; and (b) learn a target classifier. First, we establish necessary and sufficient conditions for identifying these quantities. Second, we propose practical methods for both tasks. Unlike typical Open Set Domain Adaptation (OSDA) problems, which tend to be ill-posed and amenable only to heuristics, OSLS offers a well-posed problem amenable to more principled machinery. Experiments across numerous semi-synthetic benchmarks on vision, language, and medical datasets demonstrate that our methods consistently outperform OSDA baselines, achieving $10$-$25\%$ improvements in target domain accuracy. Finally, we analyze the proposed methods, establishing finite-sample convergence to the true label marginal and convergence to the optimal classifier for linear models in a Gaussian setup.
Saurabh Garg · Sivaraman Balakrishnan · Zachary Lipton


Towards Domain Adversarial Methods to Mitigate Texture Bias
(Poster)
Shape-texture conflict is key to our understanding of the behavior of Convolutional Neural Networks (CNNs) and their observably good performance. This work proposes a domain adversarial training-inspired technique as a novel approach to mitigate texture bias. In our work, instead of looking at the domains as the source from which the images come, we look at the domains as inherent features of the image. The model is trained in a manner similar to Domain Adversarial training, where we define the source and target domains as the dataset and its augmented versions with minimal texture information (edge maps and stylized images), respectively. We show that using domain-invariant learning to capture a prior based on the shape-texture information helps models learn robust representations. We perform extensive experiments on three subsets of ImageNet, namely ImageNet-20, ImageNet-200, and ImageNet-9. The results show that the proposed method outperforms standard Empirical Risk Minimization (ERM) in terms of test accuracy, as also evidenced by the high accuracy on the Out-Of-Distribution (OOD) datasets ImageNet-R and NICO.
Dhruva Kashyap · Sumukh K Aithal · Rakshith C · Natarajan Subramanyam


Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization
(Poster)
Real-world data collected from multiple domains can have multiple, distinct distribution shifts over multiple attributes. However, state-of-the-art advances in domain generalization (DG) algorithms focus only on specific shifts over a single attribute. We introduce datasets with multi-attribute distribution shifts and find that existing DG algorithms fail to generalize. Using causal graphs to characterize the different types of shifts, we show that each multi-attribute causal graph entails different constraints over observed variables, and therefore any algorithm based on a single, fixed independence constraint cannot work well across all shifts. We present Causally Adaptive Constraint Minimization (CACM), an algorithm for identifying the correct independence constraints for regularization. Experiments confirm our theoretical claim: correct independence constraints lead to the highest accuracy on unseen domains. Our results demonstrate the importance of modeling the causal relationships inherent in a data-generating process, without which it can be impossible to know the correct regularization constraints for a dataset.
JIVAT NEET KAUR · Emre Kiciman · Amit Sharma


Invariance Discovery for Systematic Generalization in Reinforcement Learning
(Poster)
In the sequential decision making setting, an agent aims to achieve systematic generalization over a large, possibly infinite, set of environments. Such environments are modeled as discrete Markov decision processes with both states and actions represented through a feature vector. The underlying structure of the environments allows the transition dynamics to be factored into two components: one that is environment-specific and another one that is shared. Consider a set of environments that share the laws of motion as an illustrative example. In this setting, the agent can take a finite number of reward-free interactions from a subset of these environments. The agent then must be able to approximately solve any planning task defined over any environment in the original set, relying on the above interactions only. Can we design a provably efficient algorithm that achieves this ambitious goal of systematic generalization? In this paper, we give a partially positive answer to this question. First, we provide the first tractable formulation of systematic generalization by employing a causal viewpoint. Then, under specific structural assumptions, we provide a simple learning algorithm that allows us to guarantee any desired planning error up to an unavoidable sub-optimality term, while showcasing a polynomial sample complexity.
Mirco Mutti · Riccardo De Santi · Emanuele Rossi · Juan Calderon · Michael Bronstein · Marcello Restelli


Probing Classifiers are Unreliable for Concept Removal and Detection
(Poster)
Neural network models trained on text data have been found to encode undesired linguistic or sensitive attributes in their representation. Removing such attributes is non-trivial because of a complex relationship between the attribute, text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted attributes from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the attributes entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the attribute, which we prove is difficult to train correctly in the presence of spurious correlation.
Abhinav Kumar · Chenhao Tan · Amit Sharma


Are We Viewing the Problem of Robust Generalisation through the Appropriate Lens?
(Poster)
We discuss different approaches to the challenge of robust object recognition under distribution shifts. We advocate a view of this challenge that is more closely informed by the problem of visual recognition, and which emphasizes dynamic model behaviour as opposed to centering the statistical properties of training and test distributions. We introduce an experimental setting geared towards developing models that can exhibit robust behaviour in a reliable and scalable manner. We refer to this setting as "systematic robustness", in which certain combinations of classes and image attributes are systematically excluded during training. Unlike prior work which studies systematic generalisation in DNNs or their susceptibility to spurious correlations, we use synthetic operations and data sampling to scale such experiments up to large-scale naturalistic datasets.
Mohamed Omran · Bernt Schiele


Detecting Shortcut Learning using Mutual Information
(Poster)
The failure of deep neural networks to generalize to out-of-distribution data is a well-known problem and raises concerns about the deployment of trained networks in safety-critical domains such as healthcare, finance, and autonomous vehicles. We study a particular kind of distribution shift: shortcuts or spurious correlations in the training data. Shortcut learning is often only exposed when models are evaluated on real-world data that does not contain the same spurious correlations, posing a serious dilemma for AI practitioners to properly assess the effectiveness of a trained model for real-world applications. In this work, we propose to use the mutual information (MI) between the learned representation and the input as a metric to find where in training the network latches onto shortcuts. Experiments demonstrate that MI can be used as a domain-agnostic metric for detecting shortcut learning.
Mohammed Adnan · Yani Ioannou · Chuan-Yung Tsai · Angus Galloway · Hamid Tizhoosh · Graham Taylor


Selection Bias Induced Spurious Correlations in Large Language Models
(Poster)
In this work we explore the role of dataset selection bias in inducing and amplifying spurious correlations in large language models (LLMs). To highlight known discrepancies in gender representation between what exists in society and what is recorded in datasets, we developed a gender pronoun prediction task. We demonstrate and explain a dose-response relationship in the magnitude of the correlation between gender pronoun prediction and a variety of seemingly gender-neutral variables like date and location on pre-trained (unmodified) BERT, DistilBERT, and XLM-RoBERTa models. We also fine-tune several models with the gender pronoun prediction task to further highlight the spurious correlation mechanism, and make an argument about its generalizability to far more datasets. Finally, we provide an online demo, inviting readers to experiment with their own interventions.
Emily McMilin


Invariance Principle Meets Out-of-Distribution Generalization on Graphs
(Poster)
Despite recent success in using the invariance principle for out-of-distribution (OOD) generalization on Euclidean data (e.g., images), studies on graph data are still limited. Unlike images, the complex nature of graphs poses unique challenges to adopting the invariance principle. In particular, distribution shifts on graphs can appear in a variety of forms such as attributes and structures, making it difficult to identify the invariance. Moreover, domain or environment partitions, which are often required by OOD methods on Euclidean data, could be highly expensive to obtain for graphs. To bridge this gap, we propose a new framework, called Graph Out-Of-Distribution Generalization (GOOD), to capture the invariance of graphs for guaranteed OOD generalization under various distribution shifts. Specifically, we characterize potential distribution shifts on graphs with causal models, concluding that OOD generalization on graphs is achievable when models focus only on subgraphs containing the most information about the causes of labels. Accordingly, we propose an information-theoretic objective to extract the desired subgraphs that maximally preserve the invariant intra-class information. Learning with these subgraphs is immune to distribution shifts. Extensive experiments on both synthetic and real-world datasets, including a challenging setting in AI-aided drug discovery, validate the superior OOD generalization ability of GOOD.
Yongqiang Chen · Yonggang Zhang · Yatao Bian · Han Yang · Kaili MA · Binghui Xie · Tongliang Liu · Bo Han · James Cheng


Latent Variable Models for Bayesian Causal Discovery
(Poster)
Learning predictors that do not rely on spurious correlations involves building causal representations. However, learning such a representation is very challenging. We therefore formulate the problem of learning a causal representation from high-dimensional data and study causal recovery with synthetic data. This work introduces a latent variable decoder model, Decoder BCD, for Bayesian causal discovery and performs experiments in mildly supervised and unsupervised settings. We present a series of synthetic experiments to characterize important factors for causal discovery.
Jithendaraa Subramanian · Yashas Annadani · Ivaxi Sheth · Stefan Bauer · Derek Nowrouzezahrai · Samira Ebrahimi Kahou


Understanding Generalization and Robustness of Learned Representations of Chaotic Dynamical Systems
(Poster)
We investigate the generalization capabilities of different methods of learning representations via an extensible synthetic dataset of real-world chaotic dynamical systems introduced by Gilpin (2021). We propose an evaluation framework built on top of this dataset, called ValiDyna, which uses probes and multi-task learning to study robustness and out-of-distribution (OOD) generalization of learned representations across a range of settings, including changes in losses, architecture, etc., as well as changes in the distribution of the dynamical systems' initial conditions and parameters. Our evaluation framework is of interest for generalization and robustness broadly, but we focus our assessment here on evaluating learned representations of ecosystem dynamics, with the goal of using these representations in ecological impact assessments, with applications to biodiversity conservation and climate change mitigation.
Luã Streit · Vikram Voleti · Tegan Maharaj


Policy Architectures for Compositional Generalization in Control
(Poster)
Several tasks in control, robotics, and planning can be specified through desired goal configurations for entities in the environment. Learning goal-conditioned policies is a natural paradigm to solve such tasks. Current approaches, however, struggle to learn and generalize as task complexity increases, such as due to variations in number of entities or compositions of goals. To overcome these challenges, we first introduce the Entity-Factored Markov Decision Process (EFMDP), a formal framework for modeling the entity-based compositional structure in control tasks. Subsequently, we outline policy architecture choices that can successfully leverage the geometric properties of the EFMDP model. Our framework theoretically motivates the use of Self-Attention and Deep Set architectures for control, and results in flexible policies that can be trained end-to-end with standard reinforcement and imitation learning algorithms. On a suite of simulated robot manipulation tasks, we find that these architectures achieve significantly higher success rates with less data, compared to the standard multilayer perceptron. Our structured policies also enable broader and more compositional generalization, producing policies that extrapolate to different numbers of entities than seen in training, and stitch together (i.e. compose) learned skills in novel ways. Video results can be found at https://sites.google.com/view/compgenanon.
Allan Zhou · Vikash Kumar · Chelsea Finn · Aravind Rajeswaran


Representation Learning as Finding Necessary and Sufficient Causes
(Poster)
Representation learning constructs low-dimensional representations to summarize essential features of high-dimensional data. This learning problem is often approached by describing various desiderata associated with learned representations; e.g., that they be non-spurious or efficient. It can be challenging, however, to turn these intuitive desiderata into formal criteria that can be measured and enhanced based on observed data. In this paper, we take a causal perspective on representation learning, formalizing non-spuriousness and efficiency (in supervised representation learning) using counterfactual quantities and observable consequences of causal assertions. This yields computable metrics that can be used to assess the degree to which representations satisfy the desiderata of interest and learn non-spurious representations from single observational datasets.
Yixin Wang · Michael Jordan


Unsupervised Learning under Latent Label Shift
(Poster)
What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional unsupervised learning approaches risk recovering incorrect classes based on spurious data-space similarity. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where the label marginals $p_d(y)$ shift but the class conditionals $p(\mathbf{x}|y)$ do not. This setting suggests a new principle for identifying classes: elements that shift together across domains belong to the same true class. For finite input spaces, we establish an isomorphism between LLS and topic modeling; for continuous data, we show that if each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine recovered $p(y|d)$ with discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|\mathbf{x}) \; \forall d$. In semi-synthetic experiments, we show that our algorithm can use domain information to overcome a failure mode of standard unsupervised classification in which data-space similarity does not indicate true groupings.
Pranav Mani · Manley Roberts · Saurabh Garg · Zachary Lipton
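
The connection to topic modeling can be made concrete with a short identity that follows directly from the stated assumption (label marginals shift across domains, class conditionals do not). The matrix names $Q$, $A$ and $B$ are notation introduced here for illustration only and are not taken from the abstract. For a finite input space, the per-domain input distribution is a mixture of fixed class conditionals, which in matrix form is a non-negative factorization with domains playing the role of documents, inputs the role of words, and classes the role of topics:

```latex
p_d(\mathbf{x}) \;=\; \sum_{y} p(\mathbf{x} \mid y)\, p_d(y)
\qquad\text{i.e.}\qquad
Q = A B,
\quad
Q_{x,d} = p_d(\mathbf{x} = x),\;\;
A_{x,y} = p(\mathbf{x} = x \mid y),\;\;
B_{y,d} = p_d(y).
```

All entries of $A$ and $B$ are non-negative and each of their columns sums to one, which is why non-negative matrix factorization is a natural tool in step (iii).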


A Study of Causal Confusion in Preference-Based Reward Learning
(Poster)
There has been a recent growth of anecdotal evidence that learning reward functions from preferences is prone to spurious correlations, leading to reward hacking behaviors. While there is much empirical and theoretical analysis of causal confusion and reward gaming behaviors in reinforcement learning and behavioral cloning approaches, we provide the first systematic study of causal confusion in the context of learning reward functions from preferences. We identify a set of three benchmark domains where we observe causal confusion when learning reward functions from offline datasets of pairwise trajectory preferences: a simple reacher domain, an assistive feeding domain, and an itch-scratching domain. To gain insight into this observed causal confusion, we perform a sensitivity analysis on the effect of different factors (the reward model capacity and feature dimensionality) on the robustness of rewards learned from preferences. We find evidence that learning rewards from preferences is highly sensitive and non-robust to spurious features and increasing model capacity.
Jeremy Tien · Zhiyang He · Zackory Erickson · Anca Dragan · Daniel S Brown


Learning to induce causal structure
(Poster)
The fundamental challenge in causal induction is to infer the underlying graph structure given observational and/or interventional data. Most existing causal induction algorithms operate by generating candidate graphs and evaluating them using either score-based methods (including continuous optimization) or independence tests. In our work, we instead treat the inference process as a black box and design a neural network architecture that learns the mapping from both observational and interventional data to graph structures via supervised training on synthetic graphs. The learned model generalizes to new synthetic graphs, is robust to train-test distribution shifts, and achieves state-of-the-art performance on naturalistic graphs for low sample complexity.
Rosemary Nan Ke · Silvia Chiappa · Jane Wang · Jorg Bornschein · Anirudh Goyal · Melanie Rey · Matthew Botvinick · Theophane Weber · Michael Mozer · Danilo J. Rezende


Repeated Environment Inference for Invariant Learning
(Poster)
We study the problem of invariant learning when the environment labels are unknown. We focus on the invariant representation notion in which the Bayes-optimal conditional label distribution is the same across different environments. Previous work conducts Environment Inference (EI) by maximizing the penalty term in the Invariant Risk Minimization (IRM) framework. The EI step uses a reference model which focuses on spurious correlations to efficiently reach a good environment partition. However, it is not clear how to find such a reference model. In this work, we propose to repeat the EI process and retrain an ERM model on the majority environment inferred by the EI step in the previous iteration. Under mild assumptions, we find that this iterative process helps learn a representation capturing the spurious correlation better than the single step. This results in better Environment Inference and better Invariant Learning. We show that this method outperforms baselines on both synthetic and real-world datasets.
Aayush Mishra · Anqi Liu


Finding Spuriously Correlated Visual Attributes
(Poster)
Deep neural models learn to use spurious features present in image datasets, which hurts their out-of-distribution performance and makes them unreliable for critical applications like medical imaging. To help develop robust models, it becomes essential to find spurious features in training datasets. Existing methods to find spurious features do not give any semantic meaning to the features and rely on human interpretation of the discovered correlated features to determine whether they are spurious. In this paper, we propose to first rotate the latent features into visual attributes and then learn the correlation between the attributes and object classes by training a simple linear classifier. Correlated visual attributes are easily interpretable because they have well-defined semantic meaning, which makes it easier to determine whether they are spurious. Through visualizations and experiments, we show how to find spurious visual attributes, their extent in existing datasets, and failure mode examples showing the negative impact of learned spurious correlations on out-of-distribution generalization.
Revant Teotia · Chengzhi Mao · Carl Vondrick


BARACK: Partially Supervised Group Robustness With Guarantees
(Poster)
link »
SlidesLive Video » While neural networks have shown remarkable success on classification tasks in terms of averagecase performance, they often fail to perform well on certain groups of the data, for instance when spurious correlations are present. Unfortunately, group information may be expensive to obtain; thus, recent works in robustness and fairness have proposed ways to improve worstgroup performance even when group labels are unavailable. However, these methods generally underperform methods that utilize group information at training time. In this work, we assume access to a small number of group labels alongside a larger dataset without group labels. We propose BARACK, a simple twostep framework to utilize this partial group information to improve worstgroup performance: train a model to predict the missing group labels for the training data, and then use these predicted group labels in a robust optimization objective. Theoretically, we provide generalization bounds for our approach in terms of the worstgroup performance, which scale with respect to both the total number of training points and the number of training points with group labels. Empirically, across four spurious correlation and robustness benchmark tasks, our method outperforms the baselines that do not use group information, even when only 133% of points have group labels. 
Nimit Sohoni · Maziar Sanjabi · Nicolas Ballas · Aditya Grover · Shaoliang Nie · Hamed Firooz · Christopher Re


Towards Environment-Invariant Representation Learning for Robust Task Transfer
(Poster)
To train a classification model that is robust to distribution shifts upon deployment, auxiliary labels indicating the various ``environments'' of data collection can be leveraged to mitigate reliance on environment-specific features. This paper investigates how to evaluate whether a model has formed environment-invariant representations, and proposes an objective that encourages learning such representations, as opposed to an invariant classifier. We also introduce a novel paradigm for evaluating environment-invariant performance, to determine if learned representations can robustly transfer to a new task.
Benjamin Eyre · Richard Zemel · Elliot Creager


Doubly Right Object Recognition
(Poster)
Existing deep neural networks are optimized to predict the right thing, yet they may rely on the wrong evidence. Using the wrong evidence for prediction undermines out-of-distribution generalization, underscoring the gap between machine perception and human perception. In this paper, we introduce an overlooked but important problem: ``doubly right object recognition,'' which requires the model not only to predict the right outcome, but also to use the right reasons that are aligned with human perception. The existing benchmark fails to learn or evaluate the doubly right object recognition task, because both the right reason and spurious correlations are predictive of the final outcome. Without additional supervision and annotation for what is the right reason for recognition, doubly right object recognition is impossible. To address this, we collect a dataset, which contains annotated right reasons that are aligned with human perception, and train a fully interpretable model that only uses the attributes from our collected dataset for object prediction. Through empirical experiments, we demonstrate that our method can train models that are more likely to predict the right thing with the right reason, providing additional generalization ability on ObjectNet, and demonstrating zero-shot learning ability.
Revant Teotia · Chengzhi Mao · Carl Vondrick


SimpleSpot and Evaluating Systemic Errors using Synthetic Image Datasets
(Poster)
We introduce SynthSpot, a framework for generating synthetic datasets to use for evaluating methods for discovering blindspots (i.e., systemic errors) in image classifiers, and SimpleSpot, a method for discovering such blindspots. We use SynthSpot to run controlled studies of how various factors influence blindspot discovery method performance. Our experimental results reveal several important shortcomings of existing methods, such as their relatively poor performance in settings with multiple model blindspots and their sensitivity to hyperparameters. Further, we find that SimpleSpot is competitive with existing methods, which has promising implications for developing an interactive tool based on it. 
Gregory Plumb · Nari Johnson · Ángel Alexander Cabrera · Marco Ribeiro · Ameet Talwalkar


"Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts
(Poster)
The performance of machine learning models may differ significantly in novel environments compared to training, due to shifts in the underlying data distribution. Attributing performance changes to specific data shifts is critical for identifying sources of model failures and designing stable models. In this work, we design a novel method for attributing the performance difference between environments to shifts in the underlying causal mechanisms. To this end, we construct a cooperative game where the contribution of each mechanism is quantified as its Shapley value. We demonstrate the ability of the method to identify sources of spurious correlation and attribute performance drops to shifts in label and/or feature distributions on synthetic and real-world datasets.
Haoran Zhang · Harvineet Singh · Shalmali Joshi
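
The Shapley-value attribution mentioned in the abstract above can be illustrated with a minimal sketch. It computes exact Shapley values for a tiny cooperative game by averaging marginal contributions over all player orderings. The mechanism names, the `TOY_DROP` numbers, and the function names are hypothetical and made up for illustration; the authors' actual game construction and estimator are not described in the abstract and may differ.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average each player's marginal contribution
    over every ordering in which the players can join the coalition."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical toy game (made-up numbers): the value of a coalition is the
# accuracy drop observed when only the mechanisms in it are shifted from the
# training environment to the new environment.
TOY_DROP = {
    frozenset(): 0.0,
    frozenset({"P(X)"}): 0.02,
    frozenset({"P(Y|X)"}): 0.10,
    frozenset({"P(X)", "P(Y|X)"}): 0.15,
}


def toy_value(coalition):
    return TOY_DROP[frozenset(coalition)]


attribution = shapley_values(["P(X)", "P(Y|X)"], toy_value)
for mechanism, share in attribution.items():
    print(f"{mechanism}: {share:.3f}")
```

By construction the shares add up to the total drop of the full coalition, so each mechanism receives a well-defined portion of the performance change. Enumerating all orderings is only feasible for a handful of mechanisms; larger games would need sampling.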


Characterizing Datapoints via Second-Split Forgetting
(Poster)
The dynamics by which neural networks learn and forget examples throughout training have emerged as an object of interest along several threads of research. In particular, researchers have proposed metrics of example hardness based on these dynamics, including (i) the epoch at which examples are first correctly classified; (ii) the number of times their predictions flip during training; and (iii) whether their prediction flips if they are held out. However, an example might be considered hard for several distinct reasons, such as being a member of a rare subpopulation, being mislabeled, or being fundamentally ambiguous in its class. In this paper, we focus on the second-split forgetting time (SSFT): the epoch (if any) after which an original training example is forgotten as the network is fine-tuned on a randomly held out partition of the data. Across multiple benchmark datasets and modalities, we demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly. By contrast, metrics only considering the first-split learning dynamics struggle to differentiate the two. Additionally, the SSFT tends to be robust to the choice of architecture, optimizer, and random seed. From a practical standpoint, the SSFT (i) can help to identify mislabeled samples, the removal of which improves generalization; and (ii) can provide insights about failure modes.
Pratyush Maini · Saurabh Garg · Zachary Lipton · Zico Kolter
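
As a rough illustration of the SSFT definition in the abstract above, the sketch below takes a per-epoch record of whether one first-split example is still classified correctly during second-split fine-tuning, and returns the epoch from which it stays misclassified. This is one plausible reading of "the epoch (if any) after which an example is forgotten"; the function name and the input format are assumptions, and the paper's exact definition may differ.

```python
def second_split_forgetting_time(correct_by_epoch):
    """Return the first second-split epoch (1-indexed) from which the example
    is misclassified in every remaining epoch, or None if it is still
    classified correctly at the end of fine-tuning (never forgotten)."""
    forgotten_from = None
    for epoch, correct in enumerate(correct_by_epoch, start=1):
        if correct:
            # Classified correctly again, so any earlier lapse does not count.
            forgotten_from = None
        elif forgotten_from is None:
            forgotten_from = epoch
    return forgotten_from


# Made-up records for three first-split training examples.
quickly_forgotten = [False, False, False, False]     # e.g. a mislabeled example
slowly_forgotten = [True, True, False, True, False]  # forgotten only at the end
never_forgotten = [True, True, True, True]

print(second_split_forgetting_time(quickly_forgotten))
print(second_split_forgetting_time(slowly_forgotten))
print(second_split_forgetting_time(never_forgotten))
```

Under this reading, a small value suggests an example the network gives up on quickly, which the abstract associates with mislabeled data, while larger values or `None` correspond to examples that are retained longer.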


Invariant and Transportable Representations for Anti-Causal Domain Shifts
(Poster)
Real-world classification problems must contend with domain shift, the (potential) mismatch between the domain where a model is deployed and the domain(s) where the training data was gathered. Methods to handle such problems must specify what structure is held in common between the domains and what is allowed to vary. A natural assumption is that causal (structural) relationships are invariant in all domains. Then, it is tempting to learn a predictor for label $Y$ that depends only on its causal parents. However, many real-world problems are ``anti-causal'' in the sense that $Y$ is a cause of the covariates $X$; in this case, $Y$ has no causal parents and the naive causal invariance is useless. In this paper, we study representation learning under a particular notion of domain shift that both respects causal invariance and that naturally handles the ``anti-causal'' structure. We show how to leverage the shared causal structure of the domains to learn a representation that both admits an invariant predictor and that also allows fast adaptation in new domains. The key is to translate causal assumptions into learning principles that disentangle ``invariant'' and ``non-stable'' features. Experiments on both synthetic and real-world data demonstrate the effectiveness of the proposed learning algorithm.

Yibo Jiang · Victor Veitch


Contrastive Adapters for Foundation Model Group Robustness
(Poster)
While large pretrained foundation models (FMs) have shown remarkable zero-shot classification robustness to dataset-level distribution shifts, their robustness to group shifts is relatively underexplored. We study this problem, and first find that popular FMs such as CLIP may not be robust to various group shifts. On prior robustness benchmarks, they achieve up to an 80.7 percentage point (pp) gap between average and worst-group accuracy. Unfortunately, current methods to improve robustness require retraining, which can be prohibitively expensive for large FMs. We find existing ways to efficiently improve large model inference, e.g., by training adapters (lightweight MLPs) on top of FM embeddings, can also hurt group robustness compared to zero-shot. We thus propose a first adapter training method designed to improve FM robustness to group shifts. While prior work only trains adapters with class labels, we add a contrastive objective to explicitly learn similar embeddings for initially dissimilar FM embeddings. Across the same benchmarks, contrastive adapting effectively and efficiently improves group robustness, raising worst-group accuracy by 16.0 to 56.0 pp over zero-shot without any FM fine-tuning. Beyond FM robustness, contrastive adapting achieves near-state-of-the-art robustness on Waterbirds and CelebA, while only training 1% of other methods' model parameters.
Michael Zhang · Christopher Re
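
The gap between average and worst-group accuracy quoted in the abstract above is a simple metric, and a minimal sketch of how it could be computed is given here. The function names and the toy labels, predictions, and group assignments are made up for illustration and are not taken from the paper.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Accuracy computed separately within each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def average_vs_worst_group(y_true, y_pred, groups):
    """Return (average accuracy, worst-group accuracy, gap in percentage points)."""
    average = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
    worst = min(group_accuracies(y_true, y_pred, groups).values())
    return average, worst, 100.0 * (average - worst)


# Made-up toy data: the classifier always predicts class 1, which is right for
# the 8 majority-group examples and wrong for the 2 minority-group examples.
y_true = [1] * 8 + [0] * 2
y_pred = [1] * 10
groups = ["majority"] * 8 + ["minority"] * 2

average, worst, gap_pp = average_vs_worst_group(y_true, y_pred, groups)
print(f"average={average:.2f} worst-group={worst:.2f} gap={gap_pp:.1f} pp")
```

A high average can hide a group on which the model fails entirely, which is why group-robustness work reports the minimum over groups rather than the mean.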


HyperInvariances: Amortizing Invariance Learning
(Poster)
Providing invariances in a given learning task conveys a key inductive bias that can lead to sample-efficient learning and good generalisation, if correctly specified. However, the ideal invariances for many problems of interest are often not known, which has led both to a body of engineering lore and to attempts to provide frameworks for invariance learning. Moreover, invariance learning is expensive and data intensive for popular neural architectures. We introduce the notion of amortizing invariance learning. In an up-front learning phase, we learn a low-dimensional manifold of feature extractors spanning invariance to different transformations using a hyper-network. Then, for any problem of interest, both model and invariance learning are rapid and efficient by fitting a low-dimensional invariance descriptor and an output head. Empirically, this framework can identify appropriate invariances in different downstream tasks and lead to comparable or better test performance than conventional approaches. Our HyperInvariance framework is also theoretically appealing as it enables generalisation bounds that provide an interesting new operating point in the trade-off between model fit and complexity.
Ruchika Chavhan · Henry Gouk · Jan Stuehmer · Timothy Hospedales


Conditional Distributional Invariance through Implicit Regularization
(Poster)
A significant challenge faced by models trained via standard Empirical Risk Minimization (ERM) is that they might learn features of the input X that help predict the label Y in the training set but should not matter, i.e., associations that might not hold in test data. Causality lends itself very well to separating such spurious correlations from genuine, causal ones. In this paper, we present a simple causal model for data and a method with which we can train a classifier to predict a category Y from an input X, while being invariant to a variable Z that is spuriously associated with Y. Notably, this method is just a slightly modified ERM problem without any explicit regularization. We empirically demonstrate that our method does better than regular ERM on standard metrics on benchmark datasets.
Tanmay Gupta


Enhancing Unit-tests for Invariance Discovery
(Poster)
Recently, Aubin et al. (2021) proposed a set of linear low-dimensional problems to precisely evaluate different types of out-of-distribution generalization. In this paper, we show that one of these problems can already be solved by established algorithms, simply by better hyper-parameter tuning. We then propose an enhanced version of the linear unit-tests. To the best of our hyper-parameter search and within the set of algorithms evaluated, AND-mask is the best performing algorithm on this new suite of tests. Our findings on synthetic data are further reinforced by experiments on an image classification task where we introduce spurious correlations.
Piersilvio De Bartolomeis · Antonio Orvieto · Giambattista Parascandolo


Diversify and Disambiguate: Learning from Underspecified Data
(Poster)
Many datasets are underspecified, meaning that there are several equally viable solutions to a given task. Underspecified datasets can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus have widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a collection of diverse hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find robust hypotheses in image classification and natural language processing problems with underspecification.
Yoonho Lee · Huaxiu Yao · Chelsea Finn


Unsupervised Causal Generative Understanding of Images
(Poster)
We present a novel causal generative model for unsupervised object-centric 3D scene understanding that generalizes robustly to out-of-distribution images. This model is trained to reconstruct multi-view images via a latent representation describing the shapes, colours and positions of the 3D objects they show. We then propose an inference algorithm that can infer this latent representation given a single out-of-distribution image as input. We conduct extensive experiments applying our approach to test datasets that have zero probability under the training distribution. Our approach significantly outperforms baselines that do not capture the true causal image generation process.
Titas Anciukevičius · Patrick Fox-Roberts · Edward Rosten · Paul Henderson


Causal Discovery using Model Invariance through Knockoff Interventions
(Poster)
Cause-effect analysis is crucial to understand the underlying mechanism of a system. We propose to exploit model invariance through interventions on the predictors to infer causality in a nonlinear multivariate system of time series. We model nonlinear interaction in time series using DeepAR and then expose the model to different environments using knockoff interventions to test model invariance. Knockoffs are in-distribution null variables generated without knowing the response. We test model invariance by showing that the distribution of the response residual does not change significantly upon interventions on non-causal features. We use synthetically generated time series to evaluate and compare our approach with other causality methods. Overall, our proposed method outperforms other widely used methods.
Wasim Ahmad · Maha Shadaydeh · Joachim Denzler


Using causal modeling to analyze generalization of biomarkers in high-dimensional domains: a case study of adaptive immune repertoires
(Poster)
Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we discuss building a diagnostic based on a specific, recently established high-dimensional biomarker, adaptive immune receptor repertoires (AIRRs), and investigate how causal modeling may improve the robustness and generalization of developed diagnostics. We examine how the main biological and experimental factors of the AIRR domain may influence the learned biomarkers, especially in the presence of dataset shifts, and provide simulations of such effects. We conclude that causal modeling could improve AIRR-based diagnostics, but also that causal modeling itself might find a powerful testbed with complex, high-dimensional variables in the AIRR field.
Milena Pavlović · Ghadi S. Al Hajj · Victor Greiff · Johan Pensar · Geir Kjetil Sandve


The Importance of Background Information for Out of Distribution Generalization
(Poster)
Domain generalization in medical image classification is an important problem for trustworthy machine learning to be deployed in healthcare. We find that existing approaches which utilize ground-truth abnormality segmentations to control feature attributions have poor out of distribution (OOD) performance relative to the standard baseline of empirical risk minimization (ERM). We investigate what regions of an image are important for the task and show that parts of the background, i.e., regions not contained in the abnormality segmentation, provide helpful signal. We then develop a new task-specific mask which covers all relevant regions. Utilizing this new segmentation mask significantly improves the performance of the methods on the OOD test sets. To obtain better generalization results than ERM, we find it necessary to scale up the training data size in addition to using these task-specific masks.
Jupinder Parmar · Khaled Saab · Brian Pogatchnik · Daniel Rubin · Christopher Ré


Self-Supervision on Images and Text Reduces Reliance on Visual Shortcut Features
(Poster)
Deep learning models trained in a fully supervised manner have been shown to rely on so-called "shortcut" features. Shortcut features are inputs that are associated with the outcome of interest in the training data, but are either no longer associated or not present in testing or deployment settings. Here we provide experiments that show recent self-supervised models trained on images and text provide more robust image representations and reduce the model's reliance on visual shortcut features on a realistic medical imaging example. Additionally, we find that these self-supervised models "forget" shortcut features more quickly than fully supervised ones when fine-tuned on labeled data. Though not a complete solution, our experiments provide compelling evidence that self-supervised models trained on images and text provide some resilience to visual shortcut features.
Anil Palepu · Andrew Beam


Out-of-Distribution Failure through the Lens of Labeling Mechanisms: An Information Theoretic Approach
(Poster)
Machine learning models typically fail in deployment environments where the distribution of data does not perfectly match that of the training domains. This phenomenon is believed to stem from networks' failure to capture the invariant features that generalize to unseen domains. However, we attribute this phenomenon to the limitations that the labeling mechanism employed by humans imposes on the learning algorithm. We conjecture that providing multiple labels for each datapoint, where each label could describe the existence of particular objects/concepts in the datapoint, decreases the risk of the model capturing non-generalizable correlations. We theoretically show that learning over a multi-label regime, where $K$ labels for each datapoint are present, tightens the expected generalization gap by a factor of $1/\sqrt{K}$ compared to a similar case where only one label for each datapoint is in hand. Also, we show that learning under this regime is much more sample efficient and requires a fraction of training data to provide competitive results.

Soroosh Shahtalebi · Zining Zhu · Frank Rudzicz


How much Data is Augmentation Worth?
(Poster)
Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that augmentations can provide nearly the same performance gains as additional data samples for in-domain generalization and even greater performance gains for out-of-distribution test sets. We also find that neural networks with hard-coded invariances underperform those with invariances learned via data augmentations. Our experiments suggest that these benefits to generalization arise from the additional stochasticity conferred by randomized augmentations, leading to flatter minima.
Jonas Geiping · Gowthami Somepalli · Ravid Shwartz-Ziv · Andrew Wilson · Tom Goldstein · Micah Goldblum


On the Generalization and Adaption Performance of Causal Models
(Poster)
Learning models that offer robust out-of-distribution generalization and fast adaptation is a key challenge in modern machine learning. Modelling causal structure into neural networks holds the promise to accomplish robust zero- and few-shot adaptation. Recent advances in differentiable causal discovery have proposed to factorize the data generating process into a set of modules, i.e., one module for the conditional distribution of every variable where only causal parents are used as predictors. Such a modular decomposition of knowledge allows adapting to distribution shifts by only updating a subset of parameters. In this work, we systematically study the generalization and adaption performance of such causal models by comparing them to monolithic models and structured models where the set of predictors is not constrained to causal parents. Our analysis shows that causal models outperform other models on both zero- and few-shot adaptation in low data regimes and offer robust generalization. We also found that the effects are more significant for sparser graphs as compared to denser graphs.
Nino Scherrer · Anirudh Goyal · Stefan Bauer · Yoshua Bengio · Rosemary Nan Ke


Learning Switchable Representation with Masked Decoding and Sparse Encoding
(Poster)
In this study, we explore unsupervised learning based on private/shared factor decomposition, which decomposes the latent space into private factors that vary only in a specific domain and shared factors that vary in all domains. We study when and how we can force the model to respect the true private/shared factor decomposition that underlies the dataset. We show that, when we train a masked decoder and an encoder with sparseness regularization in the latent space, we can identify the true private/shared decomposition up to mixing within each component. We empirically confirm this result and study the efficacy of this training strategy as a representation learning method.
Kohei Hayashi · Masanori Koyama


Improving Group-based Robustness and Calibration via Ordered Risk and Confidence Regularization
(Poster)
Neural networks trained via empirical risk minimization achieve high accuracy on average but low accuracy on certain groups, especially when there is a spurious correlation. To construct a model that is unbiased with respect to the spurious correlation, we hypothesize that inference on samples without spurious correlation should take relative precedence over inference on spuriously biased samples. Based on this hypothesis, we propose a relative regularization that induces the training risk of each group to follow a specific order, sorted according to the degree of spurious correlation in each group. In addition, we introduce an ordering regularization based on the predictive confidence of each group to improve model calibration, an area where other robust models still suffer from large calibration errors. Together, these yield our complete algorithm, Ordered Risk and Confidence regularization (ORC). Our experiments demonstrate that ORC improves both group robustness and calibration performance against various types of spurious correlation in both synthetic and real-world datasets.
Seungjae Shin · Byeonghu Na · HeeSun Bae · JoonHo Jang · Hyemi Kim · Kyungwoo Song · Youngjae Cho · IL CHUL MOON


Towards Group Robustness in the Presence of Partial Group Labels
(Poster)
Learning invariant representations is a fundamental requirement for training machine learning models that are influenced by spurious correlations. These spurious correlations, present in the training datasets, wrongly direct the neural network predictions, resulting in reduced performance on certain groups, especially the minority groups. Robust training against such correlations requires knowledge of group membership for every training sample. This requirement is impractical in situations where the data labeling efforts for minority or rare groups are significantly laborious, or where the individuals comprising the dataset choose to conceal sensitive information pertaining to the groups. On the other hand, the presence of data collection efforts often results in datasets that contain partially labeled group information. Recent works addressing the problem have tackled fully unsupervised scenarios where no labels for groups are available. We aim to fill a gap in the literature by addressing a more realistic setting that leverages partially available group information during training. First, we construct a constraint set and derive a high probability bound for the group assignment to belong to the set. Second, we propose an algorithm that optimizes for a worst-off group assignment from the constraint set. Through experiments on image and tabular datasets, we show improvements in the minority group's performance while preserving overall accuracy across groups.
Vishnu Lokhande · Kihyuk Sohn · Jinsung Yoon · Madeleine Udell · Chen-Yu Lee · Tomas Pfister


Towards Multi-level Fairness and Robustness on Federated Learning
(Poster)
Federated learning (FL) has emerged as an important machine learning paradigm where a global model is trained based on the private data from distributed clients. However, a federated model can be biased due to spurious correlation or distribution shift over subpopulations, and it may disproportionately advantage or disadvantage some of the subpopulations, leading to the problem of unfairness and non-robustness. In this paper, we formulate the problem of multi-level fairness and robustness on FL to train a global model performing well on existing clients, different subgroups formed by sensitive attribute(s), and newly added clients at the same time. To solve this problem, we propose a unified optimization objective from the view of a federated uncertainty set, with theoretical analyses. We also develop an efficient federated optimization algorithm named Federated Mirror Descent Ascent with Momentum Acceleration (FMDA-M) with a convergence guarantee. Extensive experimental results show that FMDA-M outperforms the existing FL algorithms on multi-level fairness and robustness.
Fengda Zhang · Kun Kuang · Yuxuan Liu · Long Chen · Jiaxun Lu · Yunfeng Shao · Fei Wu · Chao Wu · Jun Xiao


Learning Debiased Classifier with Biased Committee
(Poster)
This paper proposes a new method for training debiased classifiers with no bias supervision. The key idea of the method is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms prior methods that, like ours, use no bias labels, and occasionally even surpasses those relying on bias labels.
Nayeong Kim · SEHYUN HWANG · Sungsoo Ahn · Jaesik Park · Suha Kwak


Causal Omnivore: Fusing Noisy Estimates of Spurious Correlations
(Poster)
Spurious correlations are one of the biggest pain points for users of modern machine learning. To handle this issue, many approaches attempt to learn features that are causally linked to the prediction variable. Such techniques, however, suffer from various flaws: they are often prohibitively complex or based on heuristics and strong assumptions that may fail in practice. There is no one-size-fits-all causal feature identification approach. To address this challenge, we propose a simple way to fuse multiple noisy estimates of causal features. Our approach treats the underlying causal structure as a latent variable and exploits recent developments in estimating latent structures without any access to ground truth. We propose new sources, including an automated way to extract causal insights from existing ontologies or foundation models. On multiple benchmark environmental shift datasets, our discovered features can train a model via vanilla empirical risk minimization that outperforms multiple baselines, including automated causal feature discovery techniques such as invariant risk minimization on three benchmark datasets.
Dyah Adila · Sonia Cromp · SICHENG MO · Frederic Sala


Robust Calibration with Multi-domain Temperature Scaling
(Poster)
Uncertainty quantification is essential for the reliable deployment of machine learning models to high-stakes application domains. Uncertainty quantification is all the more challenging when the training distribution and test distribution are different, even when the distribution shifts are mild. Despite the ubiquity of distribution shifts in real-world applications, existing uncertainty quantification approaches mainly study the in-distribution setting where the train and test distributions are the same. In this paper, we develop a systematic calibration model to handle distribution shifts by leveraging data from multiple domains. Our proposed method, multi-domain temperature scaling, uses the heterogeneity in the domains to improve calibration robustness under distribution shift. Through experiments on three benchmark data sets, we find our proposed method outperforms existing methods as measured on both in-distribution and out-of-distribution test sets.
Yaodong Yu · Stephen Bates · Yi Ma · Michael Jordan
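
For readers unfamiliar with temperature scaling, the sketch below shows the standard single-temperature version fitted separately on each domain: logits are divided by a scalar T before the softmax, and T is chosen to minimize negative log-likelihood on held-out data. It is not the authors' algorithm; the abstract does not say how per-domain information is combined at test time, and the grid search, function names, and toy logits here are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (numerically stabilized)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out NLL.
    A grid search stands in for the usual gradient-based fit."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def fit_per_domain(data_by_domain):
    """Fit one temperature per domain from {domain: (logit_rows, labels)}."""
    return {domain: fit_temperature(rows, labels)
            for domain, (rows, labels) in data_by_domain.items()}


# Made-up held-out data for two hypothetical domains.
data_by_domain = {
    # Overconfident: large logit gaps, but only 3 of 4 predictions are right.
    "domain_A": ([[4.0, 0.0]] * 4, [0, 0, 0, 1]),
    # Underconfident: small logit gaps, yet 9 of 10 predictions are right.
    "domain_B": ([[1.0, 0.0]] * 10, [0] * 9 + [1]),
}

temperatures = fit_per_domain(data_by_domain)
print(temperatures)
```

A fitted temperature above 1 softens overconfident predictions and a value below 1 sharpens underconfident ones. The fact that different domains can call for different temperatures is the kind of heterogeneity the abstract refers to.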


A Unified Causal View of Domain Invariant Representation Learning
(Poster)
Machine learning methods can be unreliable when deployed in domains that differ from the domains on which they were trained. To address this, we may wish to learn representations of data that are domain-invariant in the sense that we preserve data structure that is stable across domains, but throw out spuriously-varying parts. There are many representation-learning approaches of this type, including methods based on data augmentation, distributional invariances, and risk invariance. Unfortunately, when faced with any particular real-world domain shift, it is unclear which, if any, of these methods might be expected to work. The purpose of this paper is to show how the different methods relate to each other, and clarify the real-world circumstances under which each is expected to succeed. The key tool is a new notion of domain shift relying on the idea that causal relationships are invariant, but non-causal relationships (e.g., due to confounding) may vary.
Zihao Wang · Victor Veitch


Last Layer Re-Training is Sufficient for Robustness to Spurious Correlations
(Poster)
Neural network classifiers can largely rely on simple spurious features, such as backgrounds, to make predictions. However, even in these cases, we show that they still often learn core features associated with the desired attributes of the data, contrary to recent findings. Inspired by this insight, we demonstrate that simple last layer retraining can match or outperform state-of-the-art approaches on spurious correlation benchmarks, but with profoundly lower complexity and computational expenses. Moreover, we show that last layer retraining on large ImageNet-trained models can also significantly reduce reliance on background and texture information, improving robustness to covariate shift, after only minutes of training on a single GPU.
Polina Kirichenko · Pavel Izmailov · Andrew Wilson


Evaluating Robustness to Dataset Shift via Parametric Robustness Sets
(Poster)
We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. We construct a local approximation to the loss under shift, and show that the problem of finding worst-case shifts can be solved efficiently.
Michael Oberst · Nikolaj Thams · David Sontag


Causally motivated multi-shortcut identification and removal
(Poster)
For predictive models to provide reliable guidance in decision-making processes, they are often required to be accurate and robust to distribution shift. Shortcut learning, where a model relies on spurious correlations or shortcuts to predict the target label, undermines the robustness property, leading to models with poor out-of-distribution accuracy despite good in-distribution performance. Existing work on shortcut learning either assumes that the set of possible shortcuts is known a priori or is discoverable using interpretability methods such as saliency maps. Instead, we propose a two-step approach to (1) efficiently identify relevant shortcuts, and (2) leverage the identified shortcuts to build models that are robust to distribution shifts. Our approach relies on having access to a (possibly) high-dimensional set of auxiliary labels at training time, some of which correspond to possible shortcuts. We show both theoretically and empirically that our approach is able to identify a small sufficient set of shortcuts leading to more efficient predictors in finite samples.
Jiayun Zheng · Maggie Makar


How robust are pretrained models to distribution shift?
(Poster)
The vulnerability of machine learning models to spurious correlations has mostly been discussed in the context of supervised learning (SL). However, there is a lack of insight on how spurious correlations affect the performance of popular self-supervised learning (SSL) and autoencoder-based models (AE). In this work, we shed light on this by evaluating the performance of these models on both real-world and synthetic distribution shift datasets. Following observations that the linear head itself can be susceptible to spurious correlations, we develop a new evaluation scheme with the linear head trained on out-of-distribution (OOD) data, to isolate the performance of the pretrained models from a potential bias of the linear head used for evaluation. With this new methodology, we show that SSL models are consistently more robust to distribution shifts and thus better at OOD generalisation than AE and SL models.
Yuge Shi · Imant Daunhawer · Julia Vogt · Phil Torr · Amartya Sanyal


Understanding Rare Spurious Correlations in Neural Networks
(Poster)
Neural networks are known to use spurious correlations such as background information for classification. While prior work has looked at spurious correlations that are widespread in the training data, in this work, we investigate how sensitive neural networks are to rare spurious correlations, which may be harder to detect and correct, and may lead to privacy leaks. We introduce spurious patterns correlated with a fixed class to a few training examples and find that it takes only a handful of such examples for the network to learn the correlation. Furthermore, these rare spurious correlations also impact accuracy and privacy. We empirically and theoretically analyze different factors involved in rare spurious correlations and propose mitigation methods accordingly. Specifically, we observe that $\ell_2$ regularization and adding Gaussian noise to inputs can reduce the undesirable effects.
Yao-Yuan Yang · Chi-Ning Chou · Kamalika Chaudhuri


Optimization-based Causal Estimation from Heterogenous Environments
(Poster)
This paper presents an optimization approach to causal estimation. In classical machine learning, the goal of optimization is to maximize predictive accuracy. However, some covariates might exhibit non-causal association to the outcome. Such spurious associations provide predictive power for classical ML, but prevent us from interpreting the result causally. This paper proposes CoCo, an optimization algorithm that bridges the gap between pure prediction and causal inference. CoCo leverages the recently proposed idea of environments. Given datasets from multiple environments (and ones that exhibit enough heterogeneity), CoCo maximizes an objective for which the only solution is the causal solution. We describe the theoretical foundations of this approach and demonstrate its effectiveness on simulated and real datasets. Compared to classical ML and the recently proposed IRMv1, CoCo provides more accurate estimates of the causal model.
Mingzhang Yin · Yixin Wang · David Blei


Automated Invariance Testing for Machine Learning Models Using Sparse Linear Layers
(Poster)
Machine learning testing and evaluation are largely overlooked by the community. In many cases, the only way to conduct testing is through formula-based scores, e.g., accuracy, F1, etc. However, these simple statistical scores cannot fully represent the performance of an ML model. Therefore, new testing frameworks are attracting more attention. In this work, we propose a novel invariance testing approach that does not utilise traditional statistical scores. Instead, we train a series of sparse linear layers, which are easier to compare due to their sparsity. We then use different divergence functions to numerically compare them and fuse the difference scores into a visual matrix. Additionally, testing using sparse linear layers allows us to introduce a novel testing oracle, associativity, by comparing merged weights and weights obtained by combined augmentation. Finally, we assess whether a model is invariant by checking the visual matrix, the associativity, and its sparse layers. We show that by using our testing framework, inter-rater reliability can be significantly improved.
Zukang Liao · Michael Cheung


Fairness and robustness in anti-causal prediction
(Poster)
Robustness to distribution shift and fairness have independently emerged as two important desiderata required of modern machine learning models. Here, we discuss the connections between them through a causal lens, focusing on anti-causal prediction tasks, where the input to a classifier (e.g., an image) is assumed to be generated as a function of the target label and the protected attribute. By taking this perspective, we draw explicit connections between a common fairness criterion, separation, and a common notion of robustness, risk invariance. These connections provide new motivation for applying the separation criterion in anti-causal settings, and show that fairness can be motivated entirely on the basis of achieving better performance. In addition, our findings suggest that robustness-motivated approaches can be used to enforce separation, and that they often work better in practice than methods designed to directly enforce separation. Using a medical dataset, we empirically validate our findings on the task of detecting pneumonia from X-rays, in a setting where differences in prevalence across sex groups motivate a fairness mitigation. Our findings highlight the importance of considering causal structure when choosing and enforcing fairness criteria.
Maggie Makar · Alexander D'Amour
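
The separation criterion mentioned above requires the prediction to be independent of the protected attribute given the true label. As a rough illustration, assuming binary labels and hard predictions, one way to measure how far a classifier is from separation is the largest between-group gap in the rate of positive predictions within each true-label stratum. The function name and toy data are hypothetical and do not come from the paper.

```python
import numpy as np

def separation_gap(y_true, y_pred, group):
    """Largest between-group gap in P(y_pred = 1 | y_true = y, group), over y in {0, 1}."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    gaps = []
    for y in (0, 1):
        rates = []
        for g in np.unique(group):
            mask = (y_true == y) & (group == g)
            if mask.any():                       # skip empty (label, group) cells
                rates.append(y_pred[mask].mean())
        if len(rates) > 1:
            gaps.append(max(rates) - min(rates))
    return float(max(gaps, default=0.0))

# Toy data: group 1 has a lower true-positive rate than group 0.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
group  = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
gap = separation_gap(y_true, y_pred, group)
```

A gap of zero means that, on this sample, true-positive and false-positive rates match across groups.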


Are Vision Transformers Robust to Spurious Correlations?
(Poster)
Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. With the recent emergence of vision transformer (ViT) models, it remains underexplored how spurious correlations are manifested in such architectures. In this paper, we systematically investigate the robustness of vision transformers to spurious correlations on three challenging benchmark datasets and compare their performance with popular CNNs. Our study reveals that when pretrained on a sufficiently large dataset, ViT models are more robust to spurious correlations than CNNs. Key to their success is the ability to generalize better from the examples where spurious correlations do not hold.
Soumya Suvra Ghosal · Yifei Ming · Yixuan Li


DAFT: Distilling Adversarially Fine-tuned teachers for OOD Robustness
(Poster)
We consider the problem of OOD generalization, where the goal is to train a model that performs well on test distributions that are different from the training distribution. Deep learning models are known to be fragile to such shifts and can suffer large accuracy drops even for slightly different test distributions (Hendrycks & Dietterich, 2019). We propose a new method, DAFT, based on the intuition that an adversarially robust combination of a large number of rich features should provide OOD robustness. Our method carefully distills the model from a powerful teacher that learns several discriminative features using standard training while combining them using adversarial training. The standard adversarial training procedure is modified to produce teachers which can guide the student better. We evaluate DAFT on standard benchmarks in the DomainBed framework, and find that DAFT consistently outperforms well-tuned ERM and distillation baselines by up to 6%, with more pronounced gains for smaller networks.
Anshul Nasery · Sravanti Addepalli · Praneeth Netrapalli · Prateek Jain


On the nonlinear correlation of ML performance across data subpopulations
(Poster)
Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Recent empirical work finds that there is a strong linear relationship between in-distribution (ID) and out-of-distribution (OOD) performance, but we show that this is not necessarily true if there are subpopulation shifts. In this paper, we empirically show that out-of-distribution performance often has a nonlinear correlation with in-distribution performance under subpopulation shifts. To understand this phenomenon, we decompose the model's performance into performance on each subpopulation. We show that there is a "moon shape" correlation (parabolic uptrend curve) between the test performance on the majority subpopulation and the minority subpopulation. These nonlinear correlations hold across model architectures, training durations and hyperparameters, and the imbalance between subpopulations. Moreover, we show that the nonlinearity increases in the presence of spurious correlations in the training data. We provide complementary theoretical and experimental analyses for this interesting phenomenon of nonlinear performance correlation across subpopulations. Finally, we discuss the implications of our findings for ML reliability and fairness.
Weixin Liang · Yining Mao · Yongchan Kwon · Xinyu Yang · James Zou
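
The decomposition of performance by subpopulation can be illustrated with a small helper that reports accuracy separately for each subpopulation. This is a sketch under the assumption that each test example carries a subpopulation tag; the function name and the "maj"/"min" tags are hypothetical.

```python
import numpy as np

def subpopulation_accuracy(y_true, y_pred, subpop):
    """Accuracy computed separately on each subpopulation."""
    y_true, y_pred, subpop = map(np.asarray, (y_true, y_pred, subpop))
    correct = (y_true == y_pred)
    return {str(s): float(correct[subpop == s].mean()) for s in np.unique(subpop)}

# Toy data: four majority examples and two minority examples.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
subpop = ["maj", "maj", "maj", "maj", "min", "min"]
per_group = subpopulation_accuracy(y_true, y_pred, subpop)
```

Computing one such (majority, minority) accuracy pair per trained model and plotting the pairs against each other is the kind of comparison in which a nonlinear relationship would become visible.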


Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty
(Poster)
A recent line of work has identified a so-called ‘lazy regime’ where a deep network can be well approximated by its linearization around initialization throughout training. Here we investigate the comparative effect of the lazy (linear) and feature-learning (nonlinear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in feature-learning mode, resulting in faster training compared to more difficult ones. We illustrate this phenomenon across different ways to quantify example difficulty, including C-score, label noise, and the presence of spurious correlations.
Thomas George · Guillaume Lajoie · Aristide Baratin
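
For reference, the linearization around initialization is the first-order Taylor expansion of the network output in its parameters. The symbols $f$, $\theta$, and $\theta_0$ are notation assumed here and do not appear in the abstract:

```latex
\[
  f_{\mathrm{lin}}(x;\theta) \;=\; f(x;\theta_0) \;+\; \nabla_\theta f(x;\theta_0)^{\top}\,(\theta - \theta_0)
\]
```

In the lazy regime, $f(x;\theta_t) \approx f_{\mathrm{lin}}(x;\theta_t)$ for the parameters $\theta_t$ visited during training, so the model behaves like a linear model in $\theta$ with fixed features $\nabla_\theta f(x;\theta_0)$.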


OOD-Probe: A Neural Interpretation of Out-of-Domain Generalization
(Poster)
The ability to generalize out-of-domain (OOD) is an important goal for deep neural network development, and researchers have proposed many high-performing OOD generalization methods from various foundations. While many OOD algorithms perform well in various scenarios, these systems are evaluated as "black boxes". Instead, we propose a flexible framework that evaluates OOD systems with finer granularity using a probing module that predicts the originating domain from intermediate representations. We find that representations always encode some information about the domain. While the layer-wise encoding patterns remain largely stable across different OOD algorithms, they vary across the datasets. For example, the information about rotation (on RotatedMNIST) is the most visible on the lower layers, while the information about style (on VLCS and PACS) is the most visible on the middle layers. In addition, high probing results correlate with domain generalization performance, leading to further directions in developing OOD generalization systems.
Zining Zhu · Soroosh Shahtalebi · Frank Rudzicz


Linear Connectivity Reveals Generalization Strategies
(Poster)
It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space over which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of fine-tuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models which are linearly connected on the test loss surface, but are disconnected from models outside the cluster, that is, models that occupy separate basins on the surface. By measuring performance on existing diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag-of-words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models towards different heuristic functions.
Jeevesh Juneja · Rachit Bansal · Kyunghyun Cho · João Sedoc · Naomi Saphra
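
A loss barrier on the linear path between two models can be sketched as follows, assuming each model is flattened into a parameter vector and that a scalar loss function is available. The barrier definition used here (maximum excess of the loss over the straight line joining the endpoint losses) is one common convention and an assumption of this sketch; the one-dimensional double-well loss is a toy stand-in, not a text classifier.

```python
import numpy as np

def linear_path_losses(theta_a, theta_b, loss_fn, n_points=11):
    """Evaluate loss_fn at evenly spaced points on the segment from theta_a to theta_b."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, losses

def loss_barrier(theta_a, theta_b, loss_fn, n_points=11):
    """Maximum excess of the path loss over the linear interpolation of the endpoint losses."""
    alphas, losses = linear_path_losses(theta_a, theta_b, loss_fn, n_points)
    baseline = (1 - alphas) * losses[0] + alphas * losses[-1]
    return float(np.max(losses - baseline))

def toy_loss(theta):
    """Double-well loss with minima at theta = -1 and theta = +1 (two separate basins)."""
    return float(np.sum((theta ** 2 - 1.0) ** 2))

barrier_across = loss_barrier(np.array([-1.0]), np.array([1.0]), toy_loss)
barrier_within = loss_barrier(np.array([1.0]), np.array([1.0]), toy_loss)
```

In this toy setting the two minima are separated by a barrier, while a model compared with itself shows none; clusters of linearly connected models would correspond to pairs whose barrier is close to zero.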


SelecMix: Debiased Learning by Mixing up Contradicting Pairs
(Poster)
Neural networks trained with ERM (empirical risk minimization) sometimes learn unintended decision rules, in particular when their training data is biased, i.e., when training labels are correlated with undesirable features. Techniques have been proposed to prevent a network from learning such features, using the heuristic that spurious correlations are "too simple" and learned preferentially during training by SGD. Recent methods resample or augment training data such that examples displaying spurious correlations (a.k.a. bias-aligned examples) become a minority, whereas the other, bias-conflicting examples become prevalent. These approaches are difficult to train and scale to real-world data, e.g., because they rely on disentangled representations. We propose an alternative based on mixup that augments the available bias-conflicting training data with convex combinations of existing examples and their labels. Our method, named SelecMix, applies mixup to selected pairs of examples, which show either (i) the same label but dissimilar biased features, or (ii) a different label but similar biased features. To compare examples along biased features, we use an auxiliary model relying on the heuristic that biased features are learned preferentially during training by SGD. On semi-synthetic benchmarks where this heuristic is valid, we obtain results superior to existing methods, in particular in the presence of label noise, which complicates the identification of bias-conflicting examples.
Inwoo Hwang · Sangjun Lee · Yunhyeok Kwak · Seong Joon Oh · Damien Teney · Jin-Hwa Kim · Byoung-Tak Zhang
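
The two pairing rules and the mixup step can be sketched as below. This is an illustration under assumptions, not the SelecMix implementation: the bias features are taken as given vectors (in the paper they come from an auxiliary model), partners are chosen by a simple argmax or argmin over Euclidean distance, each example is assumed to have at least one valid candidate, and plain mixup is applied to both inputs and one-hot labels.

```python
import numpy as np

def select_partner(i, labels, bias_feats, same_label=True):
    """Pick a partner for example i under rule (i) or rule (ii)."""
    dists = np.linalg.norm(bias_feats - bias_feats[i], axis=1)
    if same_label:
        # Rule (i): same label, most dissimilar bias features.
        candidates = np.flatnonzero(labels == labels[i])
        candidates = candidates[candidates != i]
        return int(candidates[np.argmax(dists[candidates])])
    # Rule (ii): different label, most similar bias features.
    candidates = np.flatnonzero(labels != labels[i])
    return int(candidates[np.argmin(dists[candidates])])

def mixup(x_i, y_i, x_j, y_j, lam):
    """Convex combination of two examples and of their (one-hot) labels."""
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

# Toy data: four examples, two classes, one-dimensional hypothetical bias feature.
labels = np.array([0, 0, 1, 1])
bias_feats = np.array([[0.0], [5.0], [0.1], [5.2]])
partner_same = select_partner(0, labels, bias_feats, same_label=True)
partner_diff = select_partner(0, labels, bias_feats, same_label=False)

x_mix, y_mix = mixup(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                     np.array([0.0, 1.0]), np.array([0.0, 1.0]), lam=0.75)
```

Both rules pair an example with one that breaks the association between label and biased feature, so the mixed samples act as additional bias-conflicting data.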


Optimizing maintenance by learning individual treatment effects
(Poster)
The goal in maintenance is to avoid machine failures and overhauls, while simultaneously minimizing the cost of preventive maintenance. Maintenance policies aim to optimally schedule maintenance by modeling the effect of preventive maintenance on machine failures and overhauls. Existing work assumes the effect of preventive maintenance is (1) deterministic or governed by a known probability distribution, and (2) machine-independent. Conversely, this work proposes to relax both assumptions by learning the effect of maintenance conditional on a machine's characteristics from observational data on similar machines using existing methodologies for causal inference. This way, we can estimate the number of overhauls and failures for different levels of maintenance and, consequently, optimize the preventive maintenance frequency. We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner. Empirical results show that our novel, causal approach accurately predicts the maintenance effect and results in individualized maintenance schedules that are more accurate and cost-effective than supervised or non-individualized approaches.
Toon Vanderschueren · Robert Boute · Tim Verdonck · Bart Baesens · Wouter Verbeke