The importance of robust predictions continues to grow as machine learning models are increasingly relied upon in high-stakes settings. Ensuring reliability in real-world applications remains an enormous challenge, particularly because data in the wild frequently differs substantially from the data on which models were trained. This phenomenon, broadly known as “distribution shift”, has become a major recent focus of the research community.
With the growing interest in addressing this problem has come growing awareness of the multitude of possible meanings of “distribution shift” and the importance of understanding the distinctions between them: which types of shift occur in the real world, and under which of these is generalization feasible? Negative results seem just as common as positive ones; where provable generalization is possible, it often depends on strong structural assumptions whose likelihood of holding in reality is questionable. Existing approaches often lack rigor and clarity with regard to the precise problem they are trying to solve. Some work has been done to precisely define distribution shift and to produce benchmarks which properly reflect real-world distribution shift, but overall there seems to be little communication between the communities tackling foundations and applications respectively. Recent strides have been made to move beyond tinkering, bringing much-needed rigor to the field, and we hope to encourage this effort by opening a dialogue to share ideas between these communities.
Sat 6:00 a.m. - 6:10 a.m.

Introduction
Sat 6:10 a.m. - 6:50 a.m.

Distribution Shifts in Healthcare—A Key Barrier to Safe Deployment of Machine Learning Algorithms in the Clinic
(Invited Talk; In-person)
Deep learning approaches are increasingly used in healthcare due to their seemingly remarkable performance. However, they can be notoriously brittle, often with little ability to generalize outside their training data. Using real-life examples from ophthalmology, oncology and radiology, we will first discuss practical examples of distribution shifts. We will then highlight how even seemingly subtle distribution shifts can lead to catastrophic failures of models. We will highlight the need for constant vigilance over the input data and better metrics to quantify distribution shifts. We will conclude with a plea to the ICML/PODS community to work with the clinical community on this critically important topic.
Jayashree Kalpathy-Cramer
Sat 6:50 a.m. - 7:10 a.m.

Extending the WILDS Benchmark for Unsupervised Adaptation
(Invited Talk; In-person)
Machine learning models deployed in the real world constantly face distribution shifts, and these distribution shifts can significantly degrade model performance. In this talk, I will present the WILDS benchmark of real-world distribution shifts, focusing on the version 2.0 update that adds curated unlabeled data. Unlabeled data can be a powerful source of leverage for improving out-of-distribution performance, but existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. To this end, we provide unlabeled data to 8 out of 10 datasets in WILDS, spanning diverse applications and modalities. We observe that existing methods fail to improve out-of-distribution performance on WILDS, even though these methods have been successful on existing benchmarks with different types of distribution shifts. This underscores the importance of developing and evaluating methods on diverse types of distribution shifts, including directly on shifts that arise in practice.
Shiori Sagawa
Sat 7:10 a.m. - 7:30 a.m.

Coffee Break
(Break)
Sat 7:30 a.m. - 8:10 a.m.

Distribution Shift Through the Lens of Explanations
(Invited Talk; In-person)
Machine learning models often perform poorly under distribution shift. But can we understand how a particular distribution shift will affect a model? We approach this in two parts: (1) explaining the shift itself, and (2) explaining the model's behavior. First, we train a language model to describe the difference between two distributions. The model produces natural language explanations that allow humans to distinguish random draws from the two distributions. This helps reveal subtle but important shifts that may not be apparent from manual inspection, and can also be used to uncover spurious cues. We use this to identify "shortcuts" that models rely on, and construct a distribution shift that breaks the shortcut and decreases model performance. Having built tools to understand how the data is shifted, we next investigate whether model explanations (such as Grad-CAM) can be used to predict the behavior of models under distribution shift. Here, the results are largely negative. We construct models with specific defects (such as backdoors or spurious cues) that affect out-of-distribution performance, and measure whether model explanations can distinguish these from regular, non-defective models. Detection rates are typically low and in some cases trivial. This underscores the need to improve model explanations if they are to be used as a reliable tool for model debugging.
Jacob Steinhardt
Sat 8:10 a.m. - 8:50 a.m.

Can Fairness be Retained Over Distribution Shifts?
(Invited Talk; Livestreamed)
Given the inherent difficulty of learning a model that is robust to data distribution shifts, much research focus has 'shifted' to learning data representations that are useful for learning good models for downstream, yet unknown, data distributions. The primary aim of such models is accuracy generalization. In this talk I wish to address an additional desideratum: model fairness. At a high level, the question I am interested in is: to what extent and under what assumptions can one come up with data representations that are both “fair” and allow accurate predictions when applied to downstream tasks about which one has only limited information? I will address different possible fairness requirements and provide some initial insights on what can, and more often, what cannot be achieved along this line.
Shai Ben-David
Sat 8:50 a.m. - 9:30 a.m.

Poster Session 1
(In-person poster session)
Sat 9:30 a.m. - 10:45 a.m.

Lunch Break
(Break)
Sat 10:45 a.m. - 12:00 p.m.

Discussion Panel
(Discussion Panel; In-person and on Zoom)
Percy Liang · Léon Bottou · Jayashree Kalpathy-Cramer · Alex Smola
Sat 12:00 p.m. - 12:15 p.m.

Coffee Break
(Break)
Sat 12:15 p.m. - 12:55 p.m.

Causal Structure Learning with Unknown Mechanism Shifts
(Invited Talk; Livestreamed)
The formalism of structural causal models provides a precise approach for describing certain types of distribution shifts, via the notion of a soft intervention or mechanism change. Popular approaches to learning causal structure from data rely on the availability of distribution shifts in order to distinguish between otherwise indistinguishable models. However, many approaches rely on prior knowledge of which variables have been shifted between settings, called intervention targets. When this information is not available, one must simultaneously learn both intervention targets and the causal structure. We introduce the Unknown-Target Interventional Greedy Sparsest Permutation algorithm, a nonparametric, hybrid approach for this learning task. We prove the consistency of the algorithm, and demonstrate its performance on synthetic and biological datasets.
Sat 12:55 p.m. - 1:35 p.m.

Algorithmic Robust Statistics
(Invited Talk; Livestreamed)
Over the past few years, there has been exciting progress on algorithmic robust statistics in unsupervised, supervised and online learning. Much of this progress has been fueled by new algorithmic tools for detecting portions of the samples that have different distributional profiles. We will survey some of these tools as well as discuss prospects for building theories of coping with distribution shift from them.
Ankur Moitra
Sat 1:35 p.m. - 1:55 p.m.

A Causal Graphical Framework for Understanding Stability to Dataset Shifts
(Invited Talk; In-person)
Growing interest in the external validity of prediction models has produced many methods for finding predictive distributions that are invariant to dataset shifts and can be used for prediction in new, unseen environments. However, these methods consider different types of shifts and have been developed under disparate frameworks, making it difficult to theoretically analyze how solutions differ with respect to stability and accuracy. Taking a causal graphical view, in this talk I will discuss three graphical operators for removing unstable parts of the data-generating process (DGP) that correspond to three types of stable distributions. This clarifies the relationship between the types of “invariances” sought by many existing methods. Then, using an example from healthcare, I will demonstrate the tradeoff between minimax and average performance, highlighting the need for model developers to carefully determine when and how they achieve invariance.
Adarsh Subbaswamy
Sat 1:55 p.m. - 2:40 p.m.

Poster Session 2
(In-person poster session)


Simple and near-optimal algorithms for hidden stratification and multi-group learning
(Poster)
Multi-group agnostic learning is a formal learning criterion that is concerned with the conditional risks of predictors within subgroups of a population. The criterion addresses recent practical concerns such as subgroup fairness and hidden stratification. This paper studies the structure of solutions to the multi-group learning problem, and provides simple and near-optimal algorithms for the learning problem.
Christopher Tosh · Daniel Hsu


GAPX: Generalized Autoregressive Paraphrase-Identification X
(Poster)
Paraphrase Identification is a fundamental task in Natural Language Processing. While much progress has been made in the field, the performance of many state-of-the-art models often suffers from distribution shift during inference time. We verify that a major source of this performance drop comes from biases introduced by negative examples. To overcome these biases, we propose in this paper to train two separate models, one that only utilizes the positive pairs and the other the negative pairs. This gives us the option of deciding how much to utilize the negative model, for which we introduce a perplexity-based out-of-distribution metric that we show can effectively and automatically determine how much weight it should be given during inference. We support our findings with strong empirical results.
Yifei Zhou · Renyu Li · Hayden Housen · Ser-Nam Lim


Generative Gradual Domain Adaptation with Optimal Transport
(Poster)
Existing unsupervised domain adaptation (UDA) algorithms adapt a model from a labeled source domain to an unlabeled target domain in a one-off way. While these algorithms have been applied widely, they face a great challenge whenever the distribution distance between the source and the target is large. One natural idea to overcome this issue is to divide the original problem into smaller pieces so that each sub-problem only deals with a small shift. Following this idea and inspired by existing theory on gradual domain adaptation (GDA), we propose Generative Gradual Domain Adaptation with Optimal Transport (GOAT), a novel divide-and-conquer framework for UDA that automatically generates the intermediate domains connecting the source and the target in order to reduce the original UDA problem to GDA. Concretely, we first determine a Wasserstein geodesic under the Euclidean metric between the source and target in an embedding space, and then generate embeddings of intermediate domains along the geodesic by solving an optimal transport problem. Given the sequence of generated intermediate domains, we then apply gradual self-training, a standard GDA algorithm, to adapt the source-learned classifier sequentially to the target. Empirically, by using embeddings from modern generative models, we show that our algorithmic framework can utilize the power of existing generative models for UDA, which we believe makes the proposed algorithm widely applicable in many settings. We also conduct experiments on modern UDA datasets such as Rotated CIFAR-10, Office-31, and Office-Home. The results show superior performance of GOAT over conventional UDA approaches, which further demonstrates the effectiveness of GOAT in addressing the large distribution shifts present in many UDA problems.
Yifei He · Haoxiang Wang · Han Zhao


Pareto Invariant Risk Minimization
(Poster)
Despite the success of invariant risk minimization (IRM) in tackling the Out-of-Distribution generalization problem, IRM can compromise optimality when applied in practice. The practical variants of IRM, e.g., IRMv1, have been shown to have significant gaps with IRM and thus could fail to capture the invariance even in simple problems. Moreover, the optimization procedure in IRMv1 involves two intrinsically conflicting objectives, and often requires careful tuning of the objective weights. To remedy the above issues, we reformulate IRM as a multi-objective optimization problem, and propose a new optimization scheme for IRM, called PAreto Invariant Risk Minimization (PAIR). PAIR can adaptively adjust the optimization direction under the objective conflicts. Furthermore, we show PAIR can empower the practical IRM variants to overcome the barriers with the original IRM when provided with proper guidance. We conduct experiments with ColoredMNIST to confirm our theory and the effectiveness of PAIR.
Yongqiang Chen · Kaiwen Zhou · Yatao Bian · Binghui Xie · Kaili MA · Yonggang Zhang · Han Yang · Bo Han · James Cheng


Out-of-Distribution Detection for Medical Applications: Guidelines for Practical Evaluation
(Poster)
Detection of Out-of-Distribution (OOD) samples in real time is a crucial safety check for the deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice. This gap impedes the implementation of OOD detection methods for real-world applications. Here, we propose a series of practical considerations and tests to choose the best OOD detector for a specific medical dataset. These guidelines are illustrated on a real-life use case of Electronic Health Records (EHR). Our results serve as a guide for the implementation of OOD detection methods in clinical practice, mitigating risks associated with the use of machine learning models in healthcare.
Karina Zadorozhny · Patrick Thoral · Paul Elbers · Giovanni Cinà


Distribution Shift nested in Web Scraping: Adapting MS COCO for Inclusive Data
(Poster)
Popular benchmarks in Computer Vision suffer from a Western-centric bias that leads to a distribution shift problem when trying to deploy Machine Learning systems in developing countries. Palliating this problem by using the same data generation methods in poorly represented countries will likely bring the same biases that were initially observed. In this paper, we propose an adaptation of the MS COCO data generation methodology that addresses this issue, and show how web scraping methods nest geographical distribution shifts.
Theophile Bayet · Christophe Denis · Jean-Daniel Zucker · Alassane BAH


Estimating Test Performance for AI Medical Devices under Distribution Shift with Conformal Prediction
(Poster)
Estimating test performance of software AI-based medical devices under distribution shifts is crucial for evaluating safety, efficiency, and usability prior to clinical deployment \cite{fda}. Due to the nature of regulated medical device software and the difficulty in acquiring large amounts of labeled medical datasets, we consider the task of predicting test accuracy of an arbitrary black-box model on an unlabeled target domain without modification to the original training process or any distributional assumptions of the original source data (i.e. we treat the model as a 
Charlie Lu · Syed Rakin Ahmed · Praveer Singh · Jayashree Kalpathy-Cramer


ALASCA: Rethinking Label Smoothing for Deep Learning Under Label Noise
(Poster)
As label noise, one of the most popular distribution shifts, severely degrades deep neural networks' generalization performance, robust training with noisy labels is becoming an important task in modern deep learning. In this paper, we propose our framework, coined Adaptive LAbel smoothing on Sub-ClAssifier (ALASCA), that provides a robust feature extractor with a theoretical guarantee and negligible additional computation. First, we derive that label smoothing (LS) incurs implicit Lipschitz regularization (LR). Furthermore, based on these derivations, we apply adaptive LS (ALS) on sub-classifier architectures for the practical application of adaptive LR on intermediate layers. We conduct extensive experiments for ALASCA and combine it with previous noise-robust methods on several datasets and show our framework consistently outperforms corresponding baselines.
Jongwoo Ko · Bongsoo Yi · Se-Young Yun


Diversify and Disambiguate: Learning from Underspecified Data
(Poster)
Many datasets are underspecified, meaning that there are several equally viable solutions to a given task. Underspecified datasets can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus have widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a collection of diverse hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization. We demonstrate the ability of DivDis to find robust hypotheses in image classification and natural language processing problems with underspecification.
Yoonho Lee · Huaxiu Yao · Chelsea Finn


Back to the Basics: Revisiting Out-of-Distribution Detection Baselines
(Poster)
We study simple methods for out-of-distribution (OOD) image detection that are compatible with any already trained classifier, relying on only its predictions or learned representations. Evaluating the OOD detection performance of various methods when utilized with ResNet-50 and Swin Transformer models, we find methods that solely consider the model's predictions can be easily outperformed by also considering the learned representations. Based on our analysis, we advocate for a dead-simple approach that has been neglected in other studies: simply flag as OOD images whose average distance to their K nearest neighbors is large (in the representation space of an image classifier trained on the in-distribution data).
Johnson Kuan · Jonas Mueller
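
The nearest-neighbor rule described in this abstract can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the embeddings are synthetic Gaussian vectors rather than classifier features, the distance is Euclidean, and the function name `knn_ood_scores` and the 95% quantile threshold are hypothetical choices.

```python
import numpy as np

def knn_ood_scores(train_feats, test_feats, k=5):
    """Average Euclidean distance from each test embedding to its k nearest training embeddings."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    d2 = (
        (test_feats ** 2).sum(axis=1)[:, None]
        + (train_feats ** 2).sum(axis=1)[None, :]
        - 2.0 * test_feats @ train_feats.T
    )
    d = np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off
    # np.partition places the k smallest distances (unordered) in the first k columns
    nearest = np.partition(d, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)

# Synthetic stand-ins for learned representations (assumption for illustration)
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 16))
in_dist = rng.normal(0.0, 1.0, size=(100, 16))
shifted = rng.normal(6.0, 1.0, size=(100, 16))

id_scores = knn_ood_scores(train, in_dist)
ood_scores = knn_ood_scores(train, shifted)

# Hypothetical threshold: flag anything above the 95th percentile of held-out in-distribution scores
threshold = np.quantile(id_scores, 0.95)
flagged = ood_scores > threshold
```

In practice the embeddings would come from the penultimate layer of the trained classifier, and both `k` and the threshold would need to be tuned on held-out data.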


Style Balancing and Test-Time Style Shifting for Domain Generalization
(Poster)
Recent works on domain generalization have shown great success by generating new feature statistics (or style statistics) during training, which exposes the model to diverse domains or styles. However, existing works suffer from the cross-domain class imbalance problem that naturally arises in domain generalization. The performance of previous works also degrades when the gap between the style statistics of source and target domains is large (i.e., when the distribution shift is large in the feature-level style space). In this paper, we propose new strategies to improve robustness against potential domain shift. We first propose style balancing, which strategically balances the number of samples for each class across all source domains, to improve domain diversity during training. Then we propose test-time style shifting, which shifts the style of the test sample (that has a large style gap with the source domains) to the nearest source domain to improve the prediction performance.
Jungwuk Park · Dong-Jun Han · Soyeong Kim · Jaekyun Moon


Models Out of Line: A Fourier Lens on Distribution Shift Robustness
(Poster)
Improving the accuracy of deep neural networks (DNNs) on out-of-distribution (OOD) data is critical to an acceptance of deep learning (DL) in real world applications. It has been observed that accuracies on in-distribution (ID) versus OOD data follow a linear trend and models that outperform this baseline are exceptionally rare (and referred to as 
Sara Fridovich-Keil · Brian Bartoldson · James Diffenderfer · Bhavya Kailkhura · Peer-Timo Bremer


Noisy Learning for Neural ODEs Acts as a Robustness Locus Widening
(Poster)
We investigate several problems and challenges of evaluating the robustness of Differential Equation-based (DE) networks against synthetic shifts. We propose a novel and simple accuracy metric that can be used to evaluate intrinsic robustness and validate dataset corruption simulators. We also propose methodology recommendations intended for evaluating the many facets of neural DEs' robustness and for rigorously comparing them with their discrete counterparts. We then use these criteria to evaluate a cheap data augmentation technique as a reliable way of demonstrating the natural robustness of neural ODEs against simulated image corruptions across multiple datasets.
Martin Gonzalez · Loic Cantat


The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift
(Poster)
We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted by SGD) for this problem. We establish sharp instance-dependent excess risk upper and lower bounds for this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data. In addition, we show that finetuning, even with only a small amount of target data, can drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitation of pretraining as well as the benefits of finetuning for tackling covariate shift problems.
Jingfeng Wu · Difan Zou · Vladimir Braverman · Quanquan Gu · Sham Kakade


A Bias-Variance Analysis of Weight Averaging for OOD Generalization
(Poster)
Standard neural networks struggle to generalize under distribution shifts. For out-of-distribution generalization in computer vision, the best current approach averages the weights along a training run. Previous papers argue that weight averaging (WA) succeeds because it flattens the loss landscape. Our paper highlights the limitations of this analysis and proposes a new one based on WA's similarities with functional ensembling. We provide a new bias-variance-covariance-locality decomposition of WA's expected error: it explains WA's success especially when the marginal distribution changes at test time. Our analysis deepens the understanding of WA and more generally of deep networks under distribution shifts.
Alexandre Ramé · Matthieu Kirchmeyer · Thibaud J Rahier · Alain Rakotomamonjy · Patrick Gallinari · Matthieu Cord
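
As a rough illustration of what "averaging the weights along a training run" means mechanically, the sketch below uniformly averages parameter dictionaries saved as checkpoints. The uniform weighting, the dictionary layout, and the helper name `average_weights` are assumptions made for this example; the abstract does not specify the averaging scheme.

```python
import numpy as np

def average_weights(checkpoints):
    """Uniformly average a list of parameter dicts (name -> ndarray of identical shape)."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0) for name in names}

# Two hypothetical checkpoints from the same training run
ckpts = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 4.0]), "b": np.array([2.0])},
]
avg = average_weights(ckpts)
```

The averaged parameters define a single model, which is what distinguishes weight averaging from functional ensembling, where the predictions of several models are averaged instead.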


Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization
(Poster)
Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship or more generally a monotonic one holds, it has important consequences. For example, it allows one to optimize performance on one distribution as a proxy for performance on the other. In this work, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error under a subspace shift model with feature scaling.
Daniel LeJeune · Jiayu Liu · Reinhard Heckel


What You See is What You Get: Distributional Generalization for Algorithm Design in Deep Learning
(Poster)
We investigate and leverage a connection between Differential Privacy (DP) and the recently proposed notion of Distributional Generalization (DG). Applying this connection, we introduce new conceptual tools for designing deep-learning methods that bypass "pathologies" of standard stochastic gradient descent (SGD). First, we prove that differentially private methods satisfy a "What You See Is What You Get (WYSIWYG)" generalization guarantee: whatever a model does on its train data is almost exactly what it will do at test time. This guarantee is formally captured by distributional generalization. WYSIWYG enables principled algorithm design in deep learning by reducing generalization concerns to optimization ones: in order to mitigate unwanted behavior at test time, it is provably sufficient to mitigate this behavior on the train data. This is notably false for standard (non-DP) methods, hence this observation has applications even when privacy is not required. For example, importance sampling is known to fail for standard ERM, but we show that it has exactly the intended effect for DP-trained models. We use these insights to construct simple algorithms which match or outperform SOTA in several distributional robustness applications, and to significantly improve the privacy vs. disparate impact tradeoff of DP-SGD. Finally, we also improve on known theoretical bounds relating DP, stability, and distributional generalization.
Bogdan Kulynych · Yao-Yuan Yang · Yaodong Yu · Jarosław Błasiok · Preetum Nakkiran


Time Series Prediction under Distribution Shift using Differentiable Forgetting
(Poster)
Time series prediction is often complicated by distribution shift which demands adaptive models to accommodate time-varying distributions. We frame time series prediction under distribution shift as a weighted empirical risk minimisation problem. The weighting of previous observations in the empirical risk is determined by a forgetting mechanism which controls the tradeoff between the relevancy and effective sample size that is used for the estimation of the predictive model. In contrast to previous work, we propose a gradient-based learning method for the parameters of the forgetting mechanism. This speeds up optimisation and therefore allows more expressive forgetting mechanisms.
Stefanos Bennett · Jason Clarkson
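
The weighted empirical risk framing in this abstract can be made concrete with a small sketch. It assumes an exponential forgetting mechanism with a fixed `decay` parameter and a linear model fitted by weighted least squares; both are illustrative choices, and the gradient-based learning of the forgetting parameters that the abstract proposes is not implemented here. All function names are hypothetical.

```python
import numpy as np

def forgetting_weights(n, decay):
    """Exponential forgetting (assumed form): the newest observation gets weight 1."""
    ages = np.arange(n - 1, -1, -1)  # age 0 corresponds to the most recent point
    return decay ** ages

def effective_sample_size(w):
    """Kish effective sample size of a weight vector."""
    return w.sum() ** 2 / (w ** 2).sum()

def weighted_least_squares(X, y, w):
    """Minimise sum_t w_t * (y_t - x_t @ beta)^2 by rescaling rows with sqrt(w_t)."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Synthetic series whose slope shifts from 1 to 3 halfway through
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
slope = np.where(np.arange(n) < n // 2, 1.0, 3.0)
y = slope * X[:, 0] + 0.1 * rng.normal(size=n)

w_uniform = forgetting_weights(n, decay=1.0)  # no forgetting
w_forget = forgetting_weights(n, decay=0.9)   # down-weights old observations

beta_uniform = weighted_least_squares(X, y, w_uniform)
beta_forget = weighted_least_squares(X, y, w_forget)
ess_uniform = effective_sample_size(w_uniform)
ess_forget = effective_sample_size(w_forget)
```

The tradeoff mentioned in the abstract shows up directly: a smaller `decay` tracks the current slope more closely but leaves fewer effective samples for estimation.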


On the nonlinear correlation of ML performance across data subpopulations
(Poster)
Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Recent empirical works find that there is a strong linear relationship between in-distribution (ID) and out-of-distribution (OOD) performance, but we show that this is not necessarily true if there are subpopulation shifts. In this paper, we empirically show that out-of-distribution performance often has nonlinear correlation with in-distribution performance under subpopulation shifts. To understand this phenomenon, we decompose the model's performance into performance on each subpopulation. We show that there is a "moon shape" correlation (parabolic uptrend curve) between the test performance on the majority subpopulation and the minority subpopulation. These nonlinear correlations hold across model architectures, training durations and hyperparameters, and the imbalance between subpopulations. Moreover, we show that the nonlinearity increases in the presence of spurious correlations in the training data. We provide complementary theoretical and experimental analyses for this interesting phenomenon of nonlinear performance correlation across subpopulations. Finally, we discuss the implications of our findings for ML reliability and fairness.
Weixin Liang · Yining Mao · Yongchan Kwon · Xinyu Yang · James Zou


Data Augmentation vs. Equivariant Networks: A Theoretical Study of Generalizability on Dynamics Forecasting
(Poster)
Exploiting symmetry in structured data is a powerful way to improve the learning and generalization ability of deep learning models. Data augmentation and equivariant neural nets are two of the main approaches for enabling neural nets to preserve symmetries. Since real-world data is rarely strictly symmetric, recently, several approximately equivariant networks have also been introduced. In this work, we theoretically compare the generalizability of data augmentation techniques, strictly equivariant networks, and approximately equivariant networks. Unlike most prior theoretical works on symmetry that are based on the i.i.d. assumption, we instead focus on generalizability of these three approaches on the task of non-stationary dynamics forecasting.
Rui Wang · Robin Walters · Rose Yu


Maximum Mean Discrepancy Distributionally Robust Nonlinear Chance-Constrained Optimization with Finite-Sample Guarantee
(Poster)
Distributionally robust chance-constrained programs (DRCCP) provide a powerful framework for chance constraint optimization in the presence of distributional uncertainty. However, such programs based on the popular Wasserstein ambiguity sets usually require restrictive assumptions on the constraint functions. To overcome these limitations, we propose a practical DRCCP algorithm using kernel maximum mean discrepancy (MMD) ambiguity sets, which we term MMD-DRCCP, to treat general nonlinear constraints without using ad hoc reformulation techniques. MMD-DRCCP can handle general nonlinear and non-convex constraints with a proven finite-sample constraint satisfaction guarantee of a dimension-independent $\mathcal{O}(\frac{1}{{N}})$ rate, achievable by a practical algorithm. We further propose an efficient bootstrap scheme for constructing sharp MMD ambiguity sets in practice without relying on computationally costly cross-validation procedures.
Yassine Nemmour · Heiner Kremer · Bernhard Schölkopf · Jia-Jie Zhu
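
For readers unfamiliar with the kernel maximum mean discrepancy on which these ambiguity sets are built, the following sketch computes a biased (V-statistic) estimate of the squared MMD between two samples. It assumes a Gaussian kernel with a fixed bandwidth and uses hypothetical function names; it illustrates only the distance itself, not the MMD-DRCCP algorithm or its bootstrap scheme.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    return (
        gaussian_kernel(x, x, bandwidth).mean()
        + gaussian_kernel(y, y, bandwidth).mean()
        - 2.0 * gaussian_kernel(x, y, bandwidth).mean()
    )

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y_same = rng.normal(0.0, 1.0, size=(200, 2))   # same distribution as x
y_shift = rng.normal(2.0, 1.0, size=(200, 2))  # mean-shifted distribution

mmd_same = mmd_squared(x, y_same)
mmd_shift = mmd_squared(x, y_shift)
```

An MMD ambiguity set would then collect all distributions whose MMD to the empirical distribution stays below some radius; how that radius is chosen is the subject of the paper and is not reproduced here.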


DAFT: Distilling Adversarially Finetuned teachers for OOD Robustness
(Poster)
We consider the problem of OOD generalization, where the goal is to train a model that performs well on test distributions that are different from the training distribution. Deep learning models are known to be fragile to such shifts and can suffer large accuracy drops even for slightly different test distributions (Hendrycks & Dietterich, 2019). We propose a new method, DAFT, based on the intuition that an adversarially robust combination of a large number of rich features should provide OOD robustness. Our method carefully distills the model from a powerful teacher that learns several discriminative features using standard training while combining them using adversarial training. The standard adversarial training procedure is modified to produce teachers which can guide the student better. We evaluate DAFT on standard benchmarks in the DomainBed framework, and find that DAFT consistently outperforms well-tuned ERM and distillation baselines by up to 6%, with more pronounced gains for smaller networks.
Anshul Nasery · Sravanti Addepalli · Praneeth Netrapalli · Prateek Jain


Evaluation of Generative Unsupervised Domain Adaptation in the Absence of Target Labels
(Poster)
Unsupervised domain adaptation is essential for generalization on unlabeled target domains. Generative domain adaptation methods achieve domain adaptation by synthesizing intermediate source-to-target images. Inspecting such images can assist in identifying successful sets of hyperparameters and methods; however, this is both time-consuming and frequently challenging. In practical applications, selecting an appropriate method and tuning its parameters is difficult when target labels are entirely absent. We develop a metric for automatically assessing unsupervised generative domain adaptation methods based on the generated source-to-target images. We show that this metric correlates well with the performance of the downstream machine learning task, which is, in this case, semantic segmentation.
Zeju Qiu · Grigorios Chrysos · Stratis Tzoumas


GraphTTA: Test Time Adaptation on Graph Neural Networks
(Poster)
Recently, test time adaptation (TTA) has attracted increasing attention due to its power of handling the distribution shift issue in the real world. Unlike what has been developed for convolutional neural networks (CNNs) for image data, TTA is less explored for Graph Neural Networks (GNNs). There is still a lack of efficient algorithms tailored for graphs with irregular structures. In this paper, we present a novel test time adaptation strategy for graph neural networks, named Graph Adversarial Pseudo Group Contrast (GAPGC), to better adapt to out-of-distribution (OOD) test data. Specifically, GAPGC employs a contrastive learning variant as a self-supervised task during TTA, equipped with an Adversarial Learnable Augmenter and Group Pseudo-Positive Samples to enhance the relevance between the self-supervised task and the main task, boosting the performance of the main task. Furthermore, we provide theoretical evidence that GAPGC can extract minimal sufficient information for the main task from an information-theoretic perspective. Extensive experiments on a molecular scaffold OOD dataset demonstrate that the proposed approach achieves state-of-the-art performance on GNNs.
Guanzi Chen · Jiying Zhang · Xi Xiao · Yang Li


Adversarial Cheap Talk
(Poster)
Adversarial attacks in reinforcement learning (RL) often assume highly privileged access to the learning agent’s parameters, environment or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary has a minimal range of influence over the Victim. Parameterised as a deterministic policy that only conditions on the current state, an Adversary can merely append information to a Victim’s observation. To motivate the minimum-viability, we prove that in this setting the Adversary cannot occlude the ground truth, influence the underlying dynamics of the environment, introduce non-stationarity, add stochasticity, see the Victim’s actions, or access their parameters. Additionally, we present a novel meta-learning algorithm to train the Adversary, called adversarial cheap talk (ACT). Using ACT, we demonstrate that the resulting Adversary still manages to influence the Victim’s training and test performance despite these restrictive assumptions. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation and helping the Victim’s performance by appending useful features. Finally, we demonstrate that an ACT Adversary can append information during train-time to directly and arbitrarily control the Victim at test-time in a zero-shot manner.
Christopher Lu · Timon Willi · Alistair Letcher · Jakob Foerster


Fairness and robustness in anticausal prediction
(Poster)
Robustness to distribution shift and fairness have independently emerged as two important desiderata required of modern machine learning models. Here, we discuss the connections between them through a causal lens, focusing on anticausal prediction tasks, where the input to a classifier (e.g., an image) is assumed to be generated as a function of the target label and the protected attribute. By taking this perspective, we draw explicit connections between a common fairness criterion (separation) and a common notion of robustness (risk invariance). These connections provide new motivation for applying the separation criterion in anticausal settings, and show that fairness can be motivated entirely on the basis of achieving better performance. In addition, our findings suggest that robustness-motivated approaches can be used to enforce separation, and that they often work better in practice than methods designed to directly enforce separation. Using a medical dataset, we empirically validate our findings on the task of detecting pneumonia from X-rays, in a setting where differences in prevalence across sex groups motivate a fairness mitigation. Our findings highlight the importance of considering causal structure when choosing and enforcing fairness criteria.
Maggie Makar · Alexander D'Amour


Plex: Towards Reliability using Pretrained Large Model Extensions
(Poster)
A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 10 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. We also demonstrate Plex’s capabilities on challenging tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding.
Dustin Tran · Andreas Kirsch · Balaji Lakshminarayanan · Huiyi Hu · Du Phan · D. Sculley · Jasper Snoek · Jeremiah Liu · Jie Ren · Joost van Amersfoort · Kehang Han · E. Kelly Buchanan · Kevin Murphy · Mark Collier · Mike Dusenberry · Neil Band · Nithum Thain · Rodolphe Jenatton · Tim G. J Rudner · Yarin Gal · Zachary Nado · Zelda Mariet · Zi Wang · Zoubin Ghahramani



Group Distributionally Robust Reinforcement Learning with Hierarchical Latent Variables
(Poster)
Reinforcement Learning (RL) agents may only have incomplete information about tasks to solve. Although inferring the latent task could improve the performance, blindly trusting the task estimates may cause significant performance drops due to inevitable inference errors. One dominant way to enhance robustness is to optimize over worst-possible tasks, which may generate overly conservative policies. Moreover, most sequential decision-making formulations assume tasks are sampled i.i.d. and overlook the existence of task subpopulations. To address both challenges under task estimate uncertainty, we propose the Group Distributionally Robust Markov Decision Process (GDR-MDP). GDR-MDP can flexibly encode prior task relationships via a latent mixture model, and leverages the prior by dynamically updating a belief distribution over mixtures. Its distributionally robust decision criterion is to find the optimal policy that maximizes the expected return under the worst-possible qualified belief within an ambiguity set. We show both theoretically and empirically that GDR-MDP's hierarchical structure further enhances the distributional robustness over belief inference errors.
Mengdi Xu · Peide Huang · Visak Kumar · Jielin Qiu · Chao Fang · Kuan-Hui Lee · Xuewei Qi · Henry Lam · Bo Li · Ding Zhao


Task Modeling: A Multitask Approach for Improving Robustness to Group Shifts
(Poster)
We study the problem of learning from multiple groups of heterogeneous data distributions. Previous work shows that machine learning models trained under group shifts can exhibit poor performance on groups whose training set size is usually small. In this work, we explore multitask learning approaches to augment the training set and optimize the worst-group performance of a target task. A critical challenge in multitask learning is how to identify beneficial source tasks for a target task. To address this challenge, we propose a task modeling framework that learns a mapping from any subset of source tasks to their transferability on the target task. Our key finding is that with outputs from training models on randomly subsampled source tasks, a linear task model can accurately predict the results of multitask training for a target task. This finding implies an algorithm that selects beneficial source tasks using the learned task model. We validate our approach on a tabular dataset with 50 tasks. Our experiments demonstrate that our task selection algorithm achieves an average improvement of 1.03% in the worst-group accuracy on six target tasks compared to prior methods. Meanwhile, our approach is applicable to other performance metrics, including average performance and fairness measures, and outperforms baselines by 0.57% and 2.09%, respectively.
Dongyue Li · Huy Nguyen · Hongyang Zhang


A Meta-Analysis of Distributionally Robust Models
(Poster)
State-of-the-art image classifiers trained on massive datasets (such as ImageNet) have been shown to be vulnerable to a range of both intentional and incidental distribution shifts. On the other hand, several recent classifiers with favorable out-of-distribution (OOD) robustness properties have emerged, achieving very high accuracy on their target tasks while maintaining their in-distribution accuracy on challenging benchmarks. We present a meta-analysis on a wide range of publicly released models, most of which have been published over the last twelve months. Through this meta-analysis, we empirically identify four main commonalities for all the best-performing OOD-robust models, all of which illuminate the considerable promise of vision-language pretraining.
Benjamin Feuer · Ameya Joshi · Chinmay Hegde


On Feature Learning in the Presence of Spurious Correlations
(Poster)
Deep learning classifiers are known to rely on spurious correlations: patterns which are semantically irrelevant but predictive of the target on the training data. In this paper we explore the quality of feature representations learned by standard empirical risk minimization (ERM) and specialized group robustness training, as well as the effect of key factors including architecture, pretraining strategy, regularization and others. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by retraining the last layer of the model on a held-out set where the spurious correlation is broken. Through this procedure, we reveal how much information about the core semantic features is contained in the learned representations. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is largely affected by the choice of data augmentation, model architecture and pretraining strategy. On the other hand, we find that strong regularization and long training are generally not helpful for improving the learned feature representations. Finally, using insights from our analysis, we significantly improve upon the best results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97%, 92% and 50% worst-group accuracies respectively.
Pavel Izmailov · Polina Kirichenko · Nate Gruver · Andrew Wilson
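Worst-group accuracy, the metric reported in the abstract above, can be sketched as follows. This is a generic illustration of the metric only, not the authors' evaluation code; the function name and the representation of groups as hashable labels are assumptions.

```python
from collections import defaultdict

def worst_group_accuracy(predictions, labels, groups):
    """Return (minimum per-group accuracy, dict of per-group accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group
```

A model can have high average accuracy while this minimum stays low, which is why the metric is used when a spurious correlation holds in most groups but is broken in a few.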


Deep ensemble diversity and robustness on classification tasks
(Poster)
Ensembles of neural networks have been shown to achieve state-of-the-art performance on a variety of ML benchmark tasks, and particularly on tasks evaluating robustness to dataset shift. Conventional wisdom attributes this success to the diversity of the neural networks within the ensemble: the more diverse the predictions, the more robust the aggregated output should be. Under the mean squared error loss, the influence of ensemble diversity is apparent from the bias-variance decomposition, which separates the ensemble loss into two terms: the first evaluates the individual model quality of ensemble members, and the second the overall ensemble diversity. Classification tasks, however, typically rely upon KL divergence-based losses with less tractable bias-variance decompositions, and thus several ad hoc metrics have been proposed as measures of classifier diversity. In this work, we a) show empirically that various metrics of ensemble diversity indeed correlate with improved performance on classification tasks, and b) leverage a generalization of the bias-variance decomposition to propose a theoretically motivated diversity metric with a strong correlation to ensemble loss. On out-of-distribution tasks, albeit to a lesser degree, diversity metrics also correlate with ensemble loss.
Zelda Mariet
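The two-term split under squared error mentioned in the abstract above can be checked numerically. The sketch below uses the standard per-example identity for an averaging ensemble (often called the ambiguity decomposition): the ensemble's squared error equals the average member squared error minus the average squared deviation of members from the ensemble mean. Whether this matches the exact formulation used in the paper is an assumption; the function name is illustrative.

```python
def ensemble_decomposition(member_preds, target):
    """Return (ensemble_loss, avg_member_loss, diversity) for one example under squared error."""
    m = len(member_preds)
    ensemble_pred = sum(member_preds) / m
    # First term: average quality of the individual members.
    avg_member_loss = sum((p - target) ** 2 for p in member_preds) / m
    # Second term: spread of the members around the ensemble prediction.
    diversity = sum((p - ensemble_pred) ** 2 for p in member_preds) / m
    ensemble_loss = (ensemble_pred - target) ** 2
    return ensemble_loss, avg_member_loss, diversity

loss, avg_loss, diversity = ensemble_decomposition([1.0, 2.0, 4.0], target=3.0)
# The identity: ensemble loss = average member loss - diversity.
assert abs(loss - (avg_loss - diversity)) < 1e-12
```

Because the diversity term is non-negative, the ensemble is never worse than its average member under this loss, and more diverse members widen the gap.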


Asymmetry Learning for Counterfactual-invariant Classification in OOD Tasks
(Poster)
Generalizing from observed to new related environments (out-of-distribution) is central to the reliability of classifiers. However, most classifiers fail to predict label $Y$ from input $X$ when the change in environment is due to a (stochastic) input transformation $T^\text{te} \circ X'$ not observed in training, as in training we observe $T^\text{tr} \circ X'$, where $X'$ is a hidden variable. This work argues that when these transformations are induced by a collection of known $m$ equivalence relations, the task of finding a robust OOD classifier can be defined as finding the simplest causal model that defines a causal connection between the transformations and the target labels. We then propose a new learning paradigm, asymmetry learning, that identifies which symmetries the classifier must break in order to correctly predict $Y$ in both train and test. Asymmetry learning performs a causal model search that, under certain identifiability conditions, finds classifiers that perform equally well in-distribution and out-of-distribution.
Chandra Mouli Sekar · Bruno Ribeiro


Robust Estimation of Laplacian Constrained Gaussian Graphical Models with Trimmed Nonconvex Regularization
(Poster)
The problem of discovering a structure that fits a collection of vector data is of crucial importance for a variety of applications. Such problems can be framed as Laplacian constrained Gaussian Graphical Model inference. Existing algorithms rely on the assumption that all the available observations are drawn from the same multivariate Gaussian distribution. However, in practice it is common to find scenarios where the datasets are contaminated with a certain number of outliers. The purpose of this work is to address that problem. We propose a robust method based on Trimmed Least Squares that copes with the presence of corrupted samples. We provide statistical guarantees on the estimation error and present results on simulated data.
Mariana Vargas Vieyra


Evaluating Robustness to Dataset Shift via Parametric Robustness Sets
(Poster)
We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. We construct a local approximation to the loss under shift, and show that the problem of finding worst-case shifts can be efficiently solved.
Nikolaj Thams · Michael Oberst · David Sontag


Improved Medical Out-of-Distribution Detectors For Modality and Semantic Shifts
(Poster)
Detecting out-of-distribution (OOD) data with varying levels of semantic and covariate shifts with respect to the in-distribution (ID) data is critical for the safe deployment of models. The goal is to design a detector that can accept meaningful variations of the ID data, while rejecting samples from OOD regimes. Such an objective can be realized by enforcing consistency with a scoring function (e.g., energy) and calibrating the detector to reject a curated set of OOD data (a.k.a. outlier exposure (OE)). However, OE methods require representative OOD datasets which are challenging to acquire in practice, hence the recent trend of designing OE-free detectors. In this paper, we find that controlled generalization to ID variations and exposure to diverse (synthetic) outliers are critical for improving OOD detection. Through empirical studies on the MedMNIST medical imaging benchmark, we demonstrate significant performance gains ($15\%$ to $35\%$ in AUROC) over existing OE-free OOD detection approaches under both semantic and modality shifts.
Vivek Narayanaswamy · Yamen Mubarka · Rushil Anirudh · Deepta Rajan · Andreas Spanias · Jayaraman J. Thiagarajan


AugLoss: A Robust, Reliable Methodology for Real-World Corruptions
(Poster)
Deep Learning (DL) models achieve great successes in many domains. However, DL models increasingly face safety and robustness concerns, including noisy labeling in the training stage and feature distribution shifts in the testing stage. Previous works made significant progress in addressing these problems, but the focus has largely been on developing solutions for only one problem at a time. For example, recent work has argued for the use of tunable robust loss functions to mitigate label noise, and data augmentation (e.g., AugMix) to combat distribution shifts. As a step towards addressing both problems simultaneously, we introduce AugLoss, a simple but effective methodology that achieves robustness against both train-time noisy labeling and test-time feature distribution shifts by unifying data augmentation and robust loss functions. We conduct comprehensive experiments in varied settings of real-world dataset corruption to showcase the gains achieved by AugLoss compared to previous state-of-the-art methods. Lastly, we hope this work will open new directions for designing more robust and reliable DL models under real-world corruptions.
Kyle Otstot · John Kevin Cava · Tyler Sypherd · Lalitha Sankar


Context Shift from Test Benchmarks to Real-World Production Performance
(Poster)
Across a wide variety of domains, there exists a performance gap between machine learning models' accuracy on dataset benchmarks and real-world production data. Despite the careful design of static dataset benchmarks to represent the real world, models often err when the data is out-of-distribution relative to the data the models have been trained on. We can directly measure and adjust for some aspects of distribution shift, but we cannot address sample selection bias, adversarial perturbations, and non-stationarity without knowing the data generation process. In this paper, we outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors: leveraging human intuition and expert knowledge to identify first-order contexts and developing dynamic benchmarks based on desiderata for the data generation process. Furthermore, we present two case studies to highlight the implicit assumptions underlying applied machine learning models that tend to lead to errors when attempting to generalize beyond test benchmark datasets. By paying close attention to the role of context in each prediction task, researchers can reduce context shift errors and increase generalization performance.
Matthew Groh


Exploring the Design of Adaptation Protocols for Improved Generalization and Machine Learning Safety
(Poster)
While directly fine-tuning large-scale, pretrained models on task-specific data is well known to induce strong in-distribution task performance, recent works have demonstrated that different adaptation protocols, such as linear probing before fine-tuning, can improve OOD generalization. However, the design space of such adaptation protocols remains under-explored and the evaluation of such protocols has primarily focused on distribution shifts. Therefore, in this work, we evaluate common adaptation protocols across distribution shifts and machine learning safety metrics (e.g., anomaly detection, calibration). We find that protocols induce disparate tradeoffs that were not apparent from prior evaluation. Finally, we demonstrate that appropriate pairing of data augmentation and protocol can substantially mitigate this tradeoff.
Puja Trivedi · Danai Koutra · Jayaraman J. Thiagarajan


CODiT: Conformal Out-of-Distribution Detection in Time-Series Data
(Poster)
Machine learning models are prone to make incorrect predictions on inputs that are far from the training distribution. This hinders their deployment in safety-critical domains such as autonomous vehicles and healthcare. A number of techniques have been proposed for out-of-distribution (OOD) detection on individual datapoints. But in many applications, the inputs to these models form a temporal sequence. Existing techniques for OOD detection in time-series either do not exploit temporal relationships in the sequence or do not provide any guarantees on detection. We develop a self-supervised learning approach, CODiT, for OOD detection in time-series data with guarantees on detection. We illustrate CODiT's efficacy on autonomous driving vision datasets and physiological GAIT data. Our code is available at shorturl.at/fzR02.
Ramneet Kaur · Kaustubh Sridhar · Sangdon Park · Susmit Jha · Anirban Roy · Oleg Sokolsky · Insup Lee
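Detection guarantees in conformal methods typically rest on a p-value computed from nonconformity scores on held-out calibration data. The following is a minimal sketch of the generic inductive conformal p-value and a thresholded OOD decision. It is not CODiT's scoring procedure, which the abstract does not describe; the function names and the `epsilon` threshold are assumptions.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration nonconformity scores at least as large as the test score."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    # The +1 terms count the test point itself, which keeps the p-value valid.
    return (n_at_least + 1) / (len(calibration_scores) + 1)

def is_ood(calibration_scores, test_score, epsilon=0.05):
    """Flag the input as OOD when its conformal p-value falls below epsilon."""
    return conformal_p_value(calibration_scores, test_score) < epsilon
```

When calibration and test inputs are exchangeable, an in-distribution input is flagged with probability at most `epsilon`, which is the kind of false-alarm guarantee conformal detectors provide.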


Diagnosing Model Performance Under Distribution Shift
(Poster)
Prediction models perform poorly when deployed to distributions different from those seen during training. To understand these operational failure modes of ML models, we develop methods to attribute the drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into 1) an increase in harder but frequently seen examples during training, 2) changes in the relationship between outcome and features, and 3) poor performance on examples infrequent or unseen during training. Our procedure is principled yet flexible enough to incorporate any feature mapping or metadata. We empirically demonstrate how our decomposition can inform different ways to improve model performance for different distribution shifts. 
Tianhui Cai · Hongseok Namkoong · Steve Yadlowsky


Distributionally Adaptive Meta Reinforcement Learning
(Poster)
Meta-reinforcement learning algorithms provide a data-driven way to acquire learning algorithms that quickly adapt to many tasks with varying rewards or dynamics functions. However, learned meta-policies are often effective only on the exact task distribution on which the policy was trained, and struggle in the presence of distribution shift of test-time rewards or transition dynamics. In this work, we develop a framework for meta-RL algorithms that are able to behave appropriately under test-time distribution shifts in the space of tasks. Our framework centers on an adaptive approach to distributional robustness, in which we train a population of meta-agents to be robust to varying levels of distribution shift, so that when evaluated on a (potentially shifted) test-time distribution of tasks, we can adaptively choose the most appropriate meta-agent to follow. We formally show how this framework allows for improved regret under distribution shift, and empirically show its efficacy on simulated robotics problems under a wide range of distribution shifts.
Anurag Ajay · Dibya Ghosh · Sergey Levine · Pulkit Agrawal · Abhishek Gupta


2 CENTs on continual adaptation: replay & parameter buffers stabilize entropy minimization
(Poster)
We propose continual entropy minimization (CENT) to adapt computer vision models to continual distribution shifts at ImageNet scale. CENT leverages a replay buffer with images from the source distribution along with rolling parameter buffers to stabilize the training dynamics of conventional test-time adaptation methods. Our work is the first to demonstrate stable, continual adaptation on ImageNet scale, and obtains state-of-the-art results in both static and continual variants of the ImageNet-C benchmark.
Ori Press · Steffen Schneider · Matthias Kuemmerer · Matthias Bethge
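The quantity that entropy-minimization approaches to test-time adaptation drive down is the entropy of the model's own predictions on unlabeled test inputs. The sketch below computes that objective for a batch of logits. It illustrates the loss only; the replay buffer, the rolling parameter buffers, and the gradient update described in the abstract are not shown, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution for one input."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)

def mean_entropy(batch_logits):
    """Entropy-minimization objective: average prediction entropy over a batch."""
    return sum(prediction_entropy(l) for l in batch_logits) / len(batch_logits)
```

Uniform logits give the maximum value, the logarithm of the number of classes, while a confident prediction gives a value near zero. Since this loss can be lowered by collapsing onto a single class, stabilizing mechanisms such as the buffers named in the abstract become relevant.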


Towards Practicable Sequential Shift Detectors
(Poster)
There is a growing awareness of the harmful effects of distribution shift on the performance of deployed machine learning models. Consequently, there is a growing interest in detecting these shifts before associated costs have time to accumulate. However, desiderata of crucial importance to the practicable deployment of sequential shift detectors are typically overlooked by existing works, precluding their widespread adoption. We identify three such desiderata, highlight existing works relevant to their satisfaction, and recommend impactful directions for future research. 
Oliver Cobb · Arnaud Van Looveren


Towards OOD Detection in Graph Classification from Uncertainty Estimation Perspective
(Poster)
The problem of out-of-distribution detection for graph classification is far from being solved. The existing models tend to be overconfident about OOD examples or completely ignore the detection task. In this work, we consider this problem from the uncertainty estimation perspective and compare several recently proposed methods. In our experiments, we find that there is no universal approach for OOD detection, and that it is important to consider both graph representations and the predictive categorical distribution.
Gleb Bazhenov · Sergey Ivanov · Maxim Panov · Alexey Zaytsev · Evgeny Burnaev


What can we do with just the model? A simple knowledge extraction framework
(Poster)
We consider the problem of adapting semantic segmentation models to new target domains, using only the trained source model, without the source data. Not only is this setting much harder than if one had access to the source data, it is also necessary in many practical situations where source data is not available due to privacy and storage reasons. Our algorithm has two parts: first, we update the normalization statistics, which helps to compensate for the distribution shift, and second, we transfer knowledge from the source models adhering to certain equivariant and invariant transforms. The transforms help to efficiently extract knowledge beyond vanilla self-training. Through extensive experiments on multiple semantic segmentation tasks, we show how such a simple framework can be effective in extracting knowledge from the source model, for a variety of problem settings, and performs much better than, or on par with, current state-of-the-art methods which are specifically tuned for the respective settings.
Sujoy Paul · Ansh Khurana · Gaurav Aggarwal


Are We Viewing the Problem of Robust Generalisation through the Appropriate Lens?
(Poster)
We discuss different approaches to the challenge of robust object recognition under distribution shifts. We advocate a view of this challenge that is more closely informed by the problem of visual recognition, and which emphasizes dynamic model behaviour as opposed to centering the statistical properties of training and test distributions. We introduce an experimental setting geared towards developing models that can exhibit robust behaviour in a reliable and scalable manner. We refer to this requirement as "systematic robustness"; the setting involves systematically excluding certain combinations of classes and image attributes during training. Unlike prior work which studies systematic generalisation in DNNs or their susceptibility to spurious correlations, we use synthetic operations and data sampling to scale such experiments up to large-scale naturalistic datasets.
Mohamed Omran · Bernt Schiele


Adapting to Shifts in Latent Confounders via Observed Concepts and Proxies
(Poster)
We address the problem of unsupervised domain adaptation when the source differs from the target because of a shift in the distribution of a latent confounder. In this case, neither covariate shift nor label shift assumptions apply. When all data is discrete, we show that the optimal target predictor can be nonparametrically identified with the help of concept and proxy variables, available only in the source, and unlabeled data from the target. 
Matt Kusner · Ibrahim Alabdulmohsin · Stephen Pfohl · Olawale Salaudeen · Arthur Gretton · Sanmi Koyejo · Jessica Schrouff · Alexander D'Amour


Positive Unlabeled Contrastive Representation Learning
(Poster)
Self-supervised pretraining on unlabeled data followed by supervised fine-tuning on labeled data is a popular paradigm for learning from limited labeled examples. In this paper, we investigate and extend this paradigm to the classical positive unlabeled (PU) setting: the weakly supervised task of learning a binary classifier using only a few labeled positive examples and a set of unlabeled samples. We propose a novel contrastive objective, positive unlabeled Noise Contrastive Estimation (puNCE), that leverages the available explicit (from labeled samples) and implicit (from unlabeled samples) supervision to learn useful representations from positive unlabeled input data. The underlying idea is to assign each training sample an individual weight: labeled positives are given unit weight, while unlabeled samples are duplicated, with one copy labeled positive and the other negative, with weights $\pi$ and $(1-\pi)$ respectively, where $\pi$ denotes the class prior. Extensive experiments across vision and natural language tasks reveal that puNCE consistently improves over existing unsupervised and supervised contrastive baselines under limited supervision.
Anish Acharya · Sujay Sanghavi · Li Jing · Bhargav Bhushanam · Michael Rabbat · Dhruv Choudhary · Inderjit Dhillon
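The sample-weighting scheme described in the abstract above can be sketched as a preprocessing step. The code below only builds the weighted, pseudo-labeled sample list; the contrastive objective that consumes these weights is not shown, and the function name and tuple layout are assumptions made for illustration.

```python
def pu_weighted_samples(labeled_positives, unlabeled, class_prior):
    """Return a list of (sample, pseudo_label, weight) triples for PU learning.

    Labeled positives get unit weight. Each unlabeled sample is duplicated:
    one copy is treated as positive with weight class_prior, the other as
    negative with weight 1 - class_prior.
    """
    if not 0.0 <= class_prior <= 1.0:
        raise ValueError("class_prior must lie in [0, 1]")
    weighted = [(x, 1, 1.0) for x in labeled_positives]
    for x in unlabeled:
        weighted.append((x, 1, class_prior))
        weighted.append((x, 0, 1.0 - class_prior))
    return weighted
```

Each unlabeled sample contributes a total weight of one, split between the two possible labels according to the class prior, so no sample is counted more heavily than a labeled positive.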


Towards Domain Adversarial Methods to Mitigate Texture Bias
(Poster)
Shape-texture conflict is key to our understanding of the behavior of Convolutional Neural Networks (CNNs) and their observably good performance. This work proposes a domain adversarial training-inspired technique as a novel approach to mitigate texture bias. In our work, instead of looking at the domains as the sources from which the images originate, we look at the domains as inherent features of the image. The model is trained with a method similar to Domain Adversarial training, where we define the source and target domains as the dataset and its augmented versions with minimal texture information (edge maps and stylized images), respectively. We show that using domain invariant learning to capture a prior based on the shape-texture information helps models learn robust representations. We perform extensive experiments on three subsets of ImageNet, namely ImageNet-20, ImageNet-200 and ImageNet-9. The results show that the proposed method outperforms standard Empirical Risk Minimization (ERM) in terms of test accuracy, as also evidenced by the high accuracy on the out-of-distribution (OOD) datasets ImageNet-R and NICO.
Dhruva Kashyap · Sumukh K Aithal · Rakshith C · Natarajan Subramanyam


Dynamics of Dataset Bias and Robustness
(Poster)
We aim to shine a light on how various techniques for improving robustness under distribution shift affect dataset bias (i.e. class imbalance). The relationship between data skewness and such performance-enhancing measures remains largely unexplored. Deep learning models are seeing real-world deployment, so it is crucial to gauge the reliability of such neural networks, since undetected (side-)effects of robustness enhancement on dataset bias could be catastrophic. We observe that robustness-enhancement techniques affect performance on underrepresented (yet critical) classes, which calls for investigation from a fairness perspective. We evaluate model robustness methods on distinct architectures by their effects on dataset bias, using a variety of specialized, imbalance-focused metrics (F1 score and balanced accuracy) on artificially imbalanced datasets.
Prabhu Pradhan · Ruchit Rawal 🔗 
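To illustrate why imbalance-focused metrics such as balanced accuracy and F1 score are used in this kind of evaluation, the following minimal sketch compares them with plain accuracy on a toy imbalanced label set. The helper names and the toy data are assumptions made for illustration and do not come from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        total = sum(t == cls for t in y_true)
        correct = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


def f1_score(y_true, y_pred, positive=1):
    """F1 score for one class; returns 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data: nine majority-class samples, one minority-class sample,
    # and a classifier that always predicts the majority class.
    y_true = [0] * 9 + [1]
    y_pred = [0] * 10
    print("accuracy:", accuracy(y_true, y_pred))
    print("balanced accuracy:", balanced_accuracy(y_true, y_pred))
    print("minority-class F1:", f1_score(y_true, y_pred, positive=1))
```

A classifier that ignores the minority class entirely still scores well on plain accuracy here, while balanced accuracy and minority-class F1 expose the failure.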


Bridging Distribution Shift in Imitation Learning via Taylor Expansions
(Poster)
We propose Taylor Series Imitation Learning (TaSIL), a simple augmentation to standard behavior cloning losses in the context of continuous control. TaSIL penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We show that experts satisfying a notion of incremental input-to-state stability are easy to learn, in the sense that a small TaSIL-augmented imitation loss over expert trajectories guarantees a small imitation loss over trajectories generated by the learned policy, and we provide sample-complexity bounds for TaSIL that scale as $\tilde{\mathcal{O}}(1/n)$ in the realizable setting, where $n$ is the number of expert demonstrations. Finally, we experimentally compare standard Behavior Cloning, DART, and DAgger with TaSIL-loss-augmented variants. In all cases, we show significant improvement over baselines across a variety of MuJoCo tasks.

Daniel Pfrommer · Thomas T. Zhang · Nikolai Matni · Stephen Tu 🔗 
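As a rough illustration of what penalizing Taylor series terms could look like, the toy sketch below augments a squared behavior cloning loss with a first-order term for scalar policies. Everything here is an assumption made for illustration: the scalar state, the finite-difference derivative, the squared penalty, and the names `taylor_augmented_loss` and `order_weight`. It is not the loss defined in the paper.

```python
def derivative(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x) for a scalar function."""
    return (f(x + h) - f(x - h)) / (2 * h)


def behavior_cloning_loss(policy, expert, states):
    """Mean squared difference between policy and expert actions."""
    return sum((policy(x) - expert(x)) ** 2 for x in states) / len(states)


def taylor_augmented_loss(policy, expert, states, order_weight=1.0):
    """Behavior cloning loss plus a penalty on first-derivative mismatch.

    The zeroth-order term matches actions; the first-order term matches
    how the action changes with the state (hypothetical toy version).
    """
    total = 0.0
    for x in states:
        zeroth = (policy(x) - expert(x)) ** 2
        first = (derivative(policy, x) - derivative(expert, x)) ** 2
        total += zeroth + order_weight * first
    return total / len(states)


if __name__ == "__main__":
    expert = lambda x: 2.0 * x
    learned = lambda x: 3.0 * x  # agrees with the expert only at x = 0
    states = [0.0]
    print("BC loss:", behavior_cloning_loss(learned, expert, states))
    print("augmented loss:", taylor_augmented_loss(learned, expert, states))
```

In this toy case the learned policy matches the expert exactly on the demonstrated state, so the plain behavior cloning loss is zero, yet the first-order term still flags that the two policies diverge as soon as the state moves away from it.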


Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift
(Poster)
Recently, Miller et al. showed that a model's in-distribution (ID) accuracy has a strong linear correlation with its out-of-distribution (OOD) accuracy on several OOD benchmarks, a phenomenon they dubbed "accuracy-on-the-line". While a useful tool for model selection, this fact does not help estimate the actual OOD performance of models without access to a labeled OOD validation set. In this paper, we show that a similar surprising phenomenon also holds for the agreement between pairs of neural network classifiers: whenever accuracy-on-the-line holds, the OOD agreement between the predictions of any pair of neural networks is linearly correlated with their ID agreement. Furthermore, we observe that the slope and bias of OOD vs. ID agreement closely match those of OOD vs. ID accuracy. This phenomenon, which we call agreement-on-the-line, has important practical applications: without any labeled data, we can predict the OOD accuracy of classifiers, since OOD agreement can be estimated with just unlabeled data. Our prediction algorithm outperforms previous methods both in shifts where agreement-on-the-line holds and, surprisingly, when accuracy is not on the line.
Christina Baek · Yiding Jiang · Aditi Raghunathan · Zico Kolter 🔗 
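The observation in the agreement-on-the-line abstract suggests one simple way to estimate OOD accuracy from unlabeled data: fit a line to OOD vs. ID agreement across model pairs, then apply the same slope and bias to a model's ID accuracy. The sketch below is a simplified illustration under that assumption, working on raw values with ordinary least squares. It is not the authors' prediction algorithm, and all function names are hypothetical.

```python
def agreement(preds_a, preds_b):
    """Fraction of inputs on which two classifiers predict the same label."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_ood_accuracy(id_agreements, ood_agreements, id_accuracy):
    """Estimate OOD accuracy by reusing the agreement line for accuracy.

    id_agreements / ood_agreements: one value per model pair, computed
    from predictions on ID and OOD inputs (no OOD labels needed).
    id_accuracy: accuracy of the model of interest on labeled ID data.
    """
    slope, intercept = fit_line(id_agreements, ood_agreements)
    return slope * id_accuracy + intercept


if __name__ == "__main__":
    # Made-up agreement values for three model pairs, for illustration only.
    id_agreements = [0.80, 0.90, 1.00]
    ood_agreements = [0.60, 0.75, 0.90]
    estimate = predict_ood_accuracy(id_agreements, ood_agreements, id_accuracy=0.90)
    print("estimated OOD accuracy:", round(estimate, 4))
```

Only the agreement values require OOD inputs, and those need no labels, which is what makes this kind of estimate possible without a labeled OOD validation set.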