Workshop
Principles of Distribution Shift (PODS)
Elan Rosenfeld · Saurabh Garg · Shibani Santurkar · Jamie Morgenstern · Hossein Mobahi · Zachary Lipton · Andrej Risteski

Sat Jul 23 06:00 AM -- 02:40 PM (PDT) @ Ballroom 3

The importance of robust predictions continues to grow as machine learning models are increasingly relied upon in high-stakes settings. Ensuring reliability in real-world applications remains an enormous challenge, particularly because data in the wild frequently differs substantially from the data on which models were trained. This phenomenon, broadly known as “distribution shift”, has become a major recent focus of the research community.

With the growing interest in addressing this problem has come growing awareness of the multitude of possible meanings of “distribution shift” and the importance of understanding the distinctions between them: which types of shift occur in the real world, and under which of these is generalization feasible? Negative results seem just as common as positive ones; where provable generalization is possible, it often depends on strong structural assumptions whose likelihood of holding in reality is questionable. Existing approaches often lack rigor and clarity with regard to the precise problem they are trying to solve. Some work has been done to precisely define distribution shift and to produce benchmarks which properly reflect real-world distribution shift, but overall there seems to be little communication between the communities tackling foundations and applications respectively. Recent strides have been made to move beyond tinkering, bringing much-needed rigor to the field, and we hope to encourage this effort by opening a dialogue to share ideas between these communities.

Sat 6:00 a.m. - 6:10 a.m. Introduction

Sat 6:10 a.m. - 6:50 a.m. Distribution Shifts in Healthcare—A Key Barrier to Safe Deployment of Machine Learning Algorithms in the Clinic (Invited Talk)
Deep learning approaches are increasingly used in healthcare due to their seemingly remarkable performance. However, they can be notoriously brittle, often with little ability to generalize outside their training data. Using real-life examples from ophthalmology, oncology and radiology, we will first discuss practical examples of distribution shifts. We will then highlight how even seemingly subtle distribution shifts can lead to catastrophic failures of models. We will highlight the need for constant vigilance of the input data and better metrics to quantify distribution shifts. We will conclude with a plea to the ICML/PODS community to work with the clinical community on this critically important topic.
Jayashree Kalpathy-Cramer

Sat 6:50 a.m. - 7:10 a.m. Extending the WILDS Benchmark for Unsupervised Adaptation (Invited Talk)
Machine learning models deployed in the real world constantly face distribution shifts, and these distribution shifts can significantly degrade model performance. In this talk, I will present the WILDS benchmark of real-world distribution shifts, focusing on the version 2.0 update that adds curated unlabeled data. Unlabeled data can be a powerful source of leverage for improving out-of-distribution performance, but existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. To this end, we provide unlabeled data to 8 out of 10 datasets in WILDS, spanning diverse applications and modalities. We observe that existing methods fail to improve out-of-distribution performance on WILDS, even though these methods have been successful on existing benchmarks with different types of distribution shifts.
This underscores the importance of developing and evaluating methods on diverse types of distribution shifts, including directly on shifts that arise in practice.
Shiori Sagawa

Sat 7:10 a.m. - 7:30 a.m. Coffee Break (Break)

Sat 7:30 a.m. - 8:10 a.m. Distribution Shift Through the Lens of Explanations (Invited Talk)
Machine learning models often perform poorly under distribution shift. But can we understand how a particular distribution shift will affect a model? We approach this in two parts: (1) explaining the shift itself, and (2) explaining the model's behavior. First, we train a language model to describe the difference between two distributions. The model produces natural language explanations that allow humans to distinguish random draws from the two distributions. This helps reveal subtle but important shifts that may not be apparent from manual inspection, and can also be used to uncover spurious cues. We use this to identify "shortcuts" that models rely on, and construct a distribution shift that breaks the shortcut and decreases model performance. Having built tools to understand how the data is shifted, we next investigate whether model explanations (such as Grad-CAM) can be used to predict the behavior of models under distribution shift. Here, the results are largely negative. We construct models with specific defects (such as backdoors or spurious cues) that affect out-of-distribution performance, and measure whether model explanations can distinguish these from regular, non-defective models. Detection rates are typically low and in some cases trivial. This underscores the need to improve model explanations if they are to be used as a reliable tool for model debugging.
Jacob Steinhardt

Sat 8:10 a.m. - 8:50 a.m. TBD (Invited Talk)
Shai Ben-David

Sat 8:50 a.m. - 9:30 a.m. Poster Session 1 (Poster Session)

Sat 9:30 a.m. - 10:45 a.m. Lunch Break (Break)

Sat 10:45 a.m. - 12:00 p.m. Discussion Panel

Sat 12:00 p.m. - 12:15 p.m.
Coffee Break (Break)

Sat 12:15 p.m. - 12:55 p.m. TBD (Invited Talk)
Caroline Uhler

Sat 12:55 p.m. - 1:35 p.m. Algorithmic Robust Statistics (Invited Talk)
Over the past few years, there has been exciting progress on algorithmic robust statistics in unsupervised, supervised and online learning. Much of this progress has been fueled by new algorithmic tools for detecting portions of the samples that have different distributional profiles. We will survey some of these tools as well as discuss prospects for building theories of coping with distribution shift from them.
Ankur Moitra

Sat 1:35 p.m. - 1:55 p.m. TBD (Invited Talk)
Adarsh Subbaswamy

Sat 1:55 p.m. - 2:40 p.m. Poster Session 2 and Coffee Break (Break)

- Simple and near-optimal algorithms for hidden stratification and multi-group learning (Poster)
Multi-group agnostic learning is a formal learning criterion that is concerned with the conditional risks of predictors within subgroups of a population. The criterion addresses recent practical concerns such as subgroup fairness and hidden stratification. This paper studies the structure of solutions to the multi-group learning problem, and provides simple and near-optimal algorithms for the learning problem.
Christopher Tosh · Daniel Hsu

- GAPX: Generalized Autoregressive Paraphrase-Identification X (Poster)
Paraphrase Identification is a fundamental task in Natural Language Processing. While much progress has been made in the field, the performance of many state-of-the-art models often suffers from distribution shift during inference time. We verify that a major source of this performance drop comes from biases introduced by negative examples. To overcome these biases, we propose in this paper to train two separate models, one that only utilizes the positive pairs and the other the negative pairs.
This gives us the option of deciding how much to rely on the negative model, for which we introduce a perplexity-based out-of-distribution metric that we show can effectively and automatically determine how much weight it should be given during inference. We support our findings with strong empirical results.
Yifei Zhou · Renyu Li · Hayden Housen · Ser-Nam Lim

- Generative Gradual Domain Adaptation with Optimal Transport (Poster)
Existing unsupervised domain adaptation (UDA) algorithms adapt a model from a labeled source domain to an unlabeled target domain in a one-off way. While these algorithms have been applied widely, they face a great challenge whenever the distribution distance between the source and the target is large. One natural idea to overcome this issue is to divide the original problem into smaller pieces so that each sub-problem only deals with a small shift. Following this idea and inspired by existing theory on gradual domain adaptation (GDA), we propose Generative Gradual Domain Adaptation with Optimal Transport (GOAT), a novel divide-and-conquer framework for UDA that automatically generates the intermediate domains connecting the source and the target in order to reduce the original UDA problem to GDA. Concretely, we first determine a Wasserstein geodesic under the Euclidean metric between the source and target in an embedding space, and then generate embeddings of intermediate domains along the geodesic by solving an optimal transport problem. Given the sequence of generated intermediate domains, we then apply gradual self-training, a standard GDA algorithm, to adapt the source-learned classifier sequentially to the target. Empirically, by using embeddings from modern generative models, we show that our algorithmic framework can utilize the power of existing generative models for UDA, which we believe makes the proposed algorithm widely applicable in many settings.
We also conduct experiments on modern UDA datasets such as Rotated CIFAR-10, Office-31, and Office-Home. The results show superior performance of GOAT over conventional UDA approaches, which further demonstrates the effectiveness of GOAT in addressing the large distribution shifts present in many UDA problems.
Yifei He · Haoxiang Wang · Han Zhao

- Pareto Invariant Risk Minimization (Poster)
Despite the success of invariant risk minimization (IRM) in tackling the Out-of-Distribution generalization problem, IRM can compromise the optimality when applied in practice. The practical variants of IRM, e.g., IRMv1, have been shown to have significant gaps with IRM and thus could fail to capture the invariance even in simple problems. Moreover, the optimization procedure in IRMv1 involves two intrinsically conflicting objectives, and often requires careful tuning for the objective weights. To remedy the above issues, we reformulate IRM as a multi-objective optimization problem, and propose a new optimization scheme for IRM, called PAreto Invariant Risk Minimization (PAIR). PAIR can adaptively adjust the optimization direction under the objective conflicts. Furthermore, we show PAIR can empower the practical IRM variants to overcome the barriers with the original IRM when provided with proper guidance. We conduct experiments with ColoredMNIST to confirm our theory and the effectiveness of PAIR.
Yongqiang Chen · Kaiwen Zhou · Yatao Bian · Binghui Xie · Kaili MA · Yonggang Zhang · Han Yang · Bo Han · James Cheng

- Out-of-Distribution Detection for Medical Applications: Guidelines for Practical Evaluation (Poster)
Detection of Out-of-Distribution (OOD) samples in real-time is a crucial safety check for the deployment of machine learning models in the medical field. Despite a growing number of uncertainty quantification techniques, there is a lack of evaluation guidelines on how to select OOD detection methods in practice.
This gap impedes the implementation of OOD detection methods for real-world applications. Here, we propose a series of practical considerations and tests to choose the best OOD detector for a specific medical dataset. These guidelines are illustrated on a real-life use case of Electronic Health Records (EHR). Our results serve as a guide for the implementation of OOD detection methods in clinical practice, mitigating risks associated with the use of machine learning models in healthcare.
Karina Zadorozhny · Patrick Thoral · Paul Elbers · Giovanni Cinà

- Distribution Shift nested in Web Scraping : Adapting MS COCO for Inclusive Data (Poster)
Popular benchmarks in Computer Vision suffer from a Western-centric bias that leads to a distribution shift problem when trying to deploy Machine Learning systems in developing countries. Palliating this problem using the same data generation methods in poorly represented countries will likely bring the same biases that were initially observed. In this paper, we propose an adaptation of the MS COCO data generation methodology that addresses this issue, and show how web scraping methods nest geographical distribution shifts.
Theophile Bayet · Christophe Denis · Jean-Daniel Zucker · Alassane BAH

- Towards Backwards-Compatible Data with Confounded Domain Adaptation (Poster)
Most current domain adaptation methods address either covariate shift or label shift, but are not applicable where they occur simultaneously and are confounded with each other. Domain adaptation approaches which do account for such confounding are designed to adapt covariates to optimally predict a particular label whose shift is confounded with covariate shift. Here, we instead seek to achieve general-purpose data backwards compatibility. This would allow the adapted covariates to be used for a variety of downstream problems, including on pre-existing prediction models and on data analytics tasks.
To do this we consider a modification of generalized label shift (GLS), which we call confounded shift. We present a novel framework for this problem, based on minimizing the expected divergence between the source and target conditional feature distributions, conditioning on possible confounders. Within this framework, we propose using the Gaussian reverse Kullback-Leibler divergence, as well as the Maximum Mean Discrepancy.
Calvin McCarter

- Estimating Test Performance for AI Medical Devices under Distribution Shift with Conformal Prediction (Poster)
Estimating test performance of software AI-based medical devices under distribution shifts is crucial for evaluating safety, efficiency, and usability prior to clinical deployment. Due to the nature of regulated medical device software and the difficulty in acquiring large amounts of labeled medical datasets, we consider the task of predicting test accuracy of an arbitrary black-box model on an unlabeled target domain without modification to the original training process or any distributional assumptions of the original source data (i.e., we treat the model as a "black-box" and only use the predicted output responses). We propose a "black-box" test estimation technique based on conformal prediction and evaluate against other methods on three medical imaging datasets (mammography, dermatology, and histopathology) under several clinically relevant types of distribution shift (institution, hardware scanner, atlas, hospital). We hope that by promoting practical and effective estimation techniques for black-box models, manufacturers of medical devices will develop more standardized and realistic evaluation procedures to improve robustness and trustworthiness of clinical AI tools.
charlie lu · Syed Rakin Ahmed · Praveer Singh · Jayashree Kalpathy-Cramer

- ALASCA: Rethinking Label Smoothing for Deep Learning Under Label Noise (Poster)
As label noise, one of the most popular distribution shifts, severely degrades deep neural networks' generalization performance, robust training with noisy labels is becoming an important task in modern deep learning. In this paper, we propose our framework, coined as Adaptive LAbel smoothing on Sub-ClAssifier (ALASCA), that provides a robust feature extractor with theoretical guarantee and negligible additional computation. First, we derive that label smoothing (LS) incurs implicit Lipschitz regularization (LR). Furthermore, based on these derivations, we apply the adaptive LS (ALS) on sub-classifier architectures for the practical application of adaptive LR on intermediate layers. We conduct extensive experiments for ALASCA and combine it with previous noise-robust methods on several datasets and show our framework consistently outperforms corresponding baselines.
Jongwoo Ko · Bongsoo Yi · Se-Young Yun

- Diversify and Disambiguate: Learning from Underspecified Data (Poster)
Many datasets are underspecified, meaning that there are several equally viable solutions to a given task. Underspecified datasets can be problematic for methods that learn a single hypothesis because different functions that achieve low training loss can focus on different predictive features and thus have widely varying predictions on out-of-distribution data. We propose DivDis, a simple two-stage framework that first learns a collection of diverse hypotheses for a task by leveraging unlabeled data from the test distribution. We then disambiguate by selecting one of the discovered hypotheses using minimal additional supervision, in the form of additional labels or inspection of function visualization.
We demonstrate the ability of DivDis to find robust hypotheses in image classification and natural language processing problems with underspecification.
Yoonho Lee · Huaxiu Yao · Chelsea Finn

- Back to the Basics: Revisiting Out-of-Distribution Detection Baselines (Poster)
We study simple methods for out-of-distribution (OOD) image detection that are compatible with any already trained classifier, relying on only its predictions or learned representations. Evaluating the OOD detection performance of various methods when utilized with ResNet-50 and Swin Transformer models, we find methods that solely consider the model's predictions can be easily outperformed by also considering the learned representations. Based on our analysis, we advocate for a dead-simple approach that has been neglected in other studies: simply flag as OOD images whose average distance to their K nearest neighbors is large (in the representation space of an image classifier trained on the in-distribution data).
Johnson Kuan · Jonas Mueller

- Style Balancing and Test-Time Style Shifting for Domain Generalization (Poster)
Recent works on domain generalization have shown great success by generating new feature statistics (or style statistics) during training, which enables the model to get exposed to diverse domains or styles. However, existing works suffer from the cross-domain class imbalance problem that naturally arises in domain generalization problems. The performance of previous works is also degraded when the gap between the style statistics of source and target domains is large (i.e., when the distribution shift is large in the feature-level style space). In this paper, we propose new strategies to improve robustness against potential domain shift. We first propose style balancing, which strategically balances the number of samples for each class across all source domains, to improve domain diversity during training.
Then we propose test-time style shifting, which shifts the style of the test sample (that has a large style gap with the source domains) to the nearest source domain to improve the prediction performance.
Jungwuk Park · Dong-Jun Han · Soyeong Kim · Jaekyun Moon

- Models Out of Line: A Fourier Lens on Distribution Shift Robustness (Poster)
Improving the accuracy of deep neural networks (DNNs) on out-of-distribution (OOD) data is critical to the acceptance of deep learning (DL) in real-world applications. It has been observed that accuracies on in-distribution (ID) versus OOD data follow a linear trend and models that outperform this baseline are exceptionally rare (and referred to as "effectively robust"). Recently, some promising approaches have been developed to improve OOD robustness, in particular ensembling large pretrained models like CLIP. However, there is still no clear understanding of which model properties are required to produce effective robustness. We approach this issue by conducting an empirical study of robust models on a broad range of natural and synthetic distribution shifts of ImageNet. In particular, we view the "effective robustness puzzle" through a Fourier lens and ask how spectral properties of models influence the corresponding effective robustness. We find this Fourier lens offers some insight into why certain robust models, particularly those from the CLIP family, achieve OOD robustness.
Sara Fridovich-Keil · Brian Bartoldson · James Diffenderfer · Bhavya Kailkhura · Peer-Timo Bremer

- Noisy Learning for Neural ODEs Acts as a Robustness Locus Widening (Poster)
We investigate several problems and challenges of evaluating the robustness of Differential Equation-based (DE) networks against synthetic shifts. We propose a novel and simple accuracy metric that can be used to evaluate intrinsic robustness and validate dataset corruption simulators.
We also propose methodology recommendations for evaluating the many faces of neural DEs' robustness and for rigorously comparing them with their discrete counterparts. We then use these criteria to evaluate a cheap data augmentation technique as a reliable way of demonstrating the natural robustness of neural ODEs against simulated image corruptions across multiple datasets.
Martin Gonzalez · Loic Cantat

- The Power and Limitation of Pretraining-Finetuning for Linear Regression under Covariate Shift (Poster)
We study linear regression under covariate shift, where the marginal distribution over the input covariates differs in the source and the target domains, while the conditional distribution of the output given the input covariates is similar across the two domains. We investigate a transfer learning approach with pretraining on the source data and finetuning based on the target data (both conducted by SGD) for this problem. We establish sharp instance-dependent excess risk upper and lower bounds for this approach. Our bounds suggest that for a large class of linear regression instances, transfer learning with $O(N^2)$ source data (and scarce or no target data) is as effective as supervised learning with $N$ target data. In addition, we show that finetuning, even with only a small amount of target data, can drastically reduce the amount of source data required by pretraining. Our theory sheds light on the effectiveness and limitation of pretraining as well as the benefits of finetuning for tackling covariate shift problems.
Jingfeng Wu · Difan Zou · Vladimir Braverman · Quanquan Gu · Sham Kakade

- A Bias-Variance Analysis of Weight Averaging for OOD Generalization (Poster)
Standard neural networks struggle to generalize under distribution shifts. For out-of-distribution generalization in computer vision, the best current approach averages the weights along a training run.
Previous papers argue that weight averaging (WA) succeeds because it flattens the loss landscape. Our paper highlights the limitations of this analysis and proposes a new one based on WA's similarities with functional ensembling. We provide a new bias-variance-covariance-locality decomposition of WA's expected error: it explains WA's success especially when the marginal distribution changes at test time. Our analysis deepens the understanding of WA and more generally of deep networks under distribution shifts.
Alexandre Ramé · Matthieu Kirchmeyer · Thibaud J Rahier · Alain Rakotomamonjy · Patrick Gallinari · Matthieu Cord

- Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization (Poster)
Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship or more generally a monotonic one holds, it has important consequences. For example, it allows one to optimize performance on one distribution as a proxy for performance on the other. In this work, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error under a subspace shift model with feature scaling.
Daniel LeJeune · Jiayu Liu · Reinhard Heckel

- What You See is What You Get: Distributional Generalization for Algorithm Design in Deep Learning (Poster)
We investigate and leverage a connection between Differential Privacy (DP) and the recently proposed notion of Distributional Generalization (DG).
Applying this connection, we introduce new conceptual tools for designing deep-learning methods that bypass "pathologies" of standard stochastic gradient descent (SGD). First, we prove that differentially private methods satisfy a "What You See Is What You Get (WYSIWYG)" generalization guarantee: whatever a model does on its train data is almost exactly what it will do at test time. This guarantee is formally captured by distributional generalization. WYSIWYG enables principled algorithm design in deep learning by reducing generalization concerns to optimization ones: in order to mitigate unwanted behavior at test time, it is provably sufficient to mitigate this behavior on the train data. This is notably false for standard (non-DP) methods, hence this observation has applications even when privacy is not required. For example, importance sampling is known to fail for standard ERM, but we show that it has exactly the intended effect for DP-trained models. We use these insights to construct simple algorithms which match or outperform SOTA in several distributional robustness applications, and to significantly improve the privacy vs. disparate impact tradeoff of DP-SGD. Finally, we also improve on known theoretical bounds relating DP, stability, and distributional generalization.
Bogdan Kulynych · Yao-Yuan Yang · Yaodong Yu · Jarosław Błasiok · Preetum Nakkiran

- Time Series Prediction under Distribution Shift using Differentiable Forgetting (Poster)
Time series prediction is often complicated by distribution shift which demands adaptive models to accommodate time-varying distributions. We frame time series prediction under distribution shift as a weighted empirical risk minimisation problem. The weighting of previous observations in the empirical risk is determined by a forgetting mechanism which controls the trade-off between the relevancy and effective sample size that is used for the estimation of the predictive model.
In contrast to previous work, we propose a gradient-based learning method for the parameters of the forgetting mechanism. This speeds up optimisation and therefore allows more expressive forgetting mechanisms.
Stefanos Bennett · Jason Clarkson

- On the nonlinear correlation of ML performance across data subpopulations (Poster)
Understanding the performance of machine learning models across diverse data distributions is critically important for reliable applications. Recent empirical works find that there is a strong linear relationship between in-distribution (ID) and out-of-distribution (OOD) performance, but we show that this is not necessarily true if there are subpopulation shifts. In this paper, we empirically show that out-of-distribution performance often has nonlinear correlation with in-distribution performance under subpopulation shifts. To understand this phenomenon, we decompose the model's performance into performance on each subpopulation. We show that there is a "moon shape" correlation (parabolic uptrend curve) between the test performance on the majority subpopulation and the minority subpopulation. These nonlinear correlations hold across model architectures, training durations and hyperparameters, and the imbalance between subpopulations. Moreover, we show that the nonlinearity increases in the presence of spurious correlations in the training data. We provide complementary theoretical and experimental analyses for this interesting phenomenon of nonlinear performance correlation across subpopulations. Finally, we discuss the implications of our findings for ML reliability and fairness.
Weixin Liang · Yining Mao · Yongchan Kwon · Xinyu Yang · James Zou

- Data Augmentation vs. Equivariant Networks: A Theoretical Study of Generalizability on Dynamics Forecasting (Poster)
Exploiting symmetry in structured data is a powerful way to improve the learning and generalization ability of deep learning models.
Data augmentation and equivariant neural nets are two of the main approaches for enabling neural nets to preserve symmetries. Since real-world data is rarely strictly symmetric, recently, several approximately equivariant networks have also been introduced. In this work, we theoretically compare the generalizability of data augmentation techniques, strictly equivariant networks, and approximately equivariant networks. Unlike most prior theoretical works on symmetry that are based on the i.i.d. assumption, we instead focus on generalizability of these three approaches on the task of non-stationary dynamics forecasting.
Rui Wang · Robin Walters · Rose Yu

- Maximum Mean Discrepancy Distributionally Robust Nonlinear Chance-Constrained Optimization with Finite-Sample Guarantee (Poster)
Distributionally robust chance-constrained programs (DRCCP) provide a powerful framework for chance constraint optimization in the presence of distributional uncertainty. However, such programs based on the popular Wasserstein ambiguity sets usually require restrictive assumptions on the constraint functions. To overcome these limitations, we propose a practical DRCCP algorithm using kernel maximum mean discrepancy (MMD) ambiguity sets, which we term MMD-DRCCP, to treat general nonlinear constraints without using ad-hoc reformulation techniques. MMD-DRCCP can handle general nonlinear and non-convex constraints with a proven finite-sample constraint satisfaction guarantee of a dimension-independent $\mathcal{O}(\frac{1}{N})$ rate, achievable by a practical algorithm. We further propose an efficient bootstrap scheme for constructing sharp MMD ambiguity sets in practice without relying on computationally costly cross validation procedures.
Yassine Nemmour · Heiner Kremer · Bernhard Schölkopf · Jia-Jie Zhu

- DAFT: Distilling Adversarially Fine-tuned teachers for OOD Robustness (Poster)
We consider the problem of OOD generalization, where the goal is to train a model that performs well on test distributions that are different from the training distribution. Deep learning models are known to be fragile to such shifts and can suffer large accuracy drops even for slightly different test distributions (Hendrycks & Dietterich, 2019). We propose a new method, DAFT, based on the intuition that an adversarially robust combination of a large number of rich features should provide OOD robustness. Our method carefully distills the model from a powerful teacher that learns several discriminative features using standard training while combining them using adversarial training. The standard adversarial training procedure is modified to produce teachers which can guide the student better. We evaluate DAFT on standard benchmarks in the DomainBed framework, and find that DAFT consistently outperforms well-tuned ERM and distillation baselines by up to 6%, with more pronounced gains for smaller networks.
Anshul Nasery · Sravanti Addepalli · Praneeth Netrapalli · Prateek Jain

- Evaluation of Generative Unsupervised Domain Adaptation in the Absence of Target Labels (Poster)
Unsupervised domain adaptation is essential for generalization on unlabeled target domains. Generative domain adaptation methods achieve domain adaptation by synthesizing intermediate source-to-target images. The inspection of such images can assist in identifying successful sets of hyperparameters and methods; however, this is both time-consuming and frequently challenging. In practical applications, selecting an appropriate method and tuning its parameters is difficult when target labels are entirely absent. We develop a metric for automatically assessing unsupervised generative domain adaptation methods based on the generated source-to-target images.
We show that this metric correlates well with the performance of the downstream machine learning task, which is, in this case, semantic segmentation. Zeju Qiu · Grigorios Chrysos · Stratis Tzoumas 🔗 - GraphTTA: Test Time Adaptation on Graph Neural Networks (Poster) Recently, test time adaptation (TTA) has attracted increasing attention due to its power of handling the distribution shift issue in the real world. Unlike what has been developed for convolutional neural networks (CNNs) for image data, TTA is less explored for Graph Neural Networks (GNNs). There is still a lack of efficient algorithms tailored for graphs with irregular structures. In this paper, we present a novel test time adaptation strategy named Graph Adversarial Pseudo Group Contrast (GAPGC), for graph neural networks TTA, to better adapt to the Out Of Distribution (OOD) test data. Specifically, GAPGC employs a contrastive learning variant as a self-supervised task during TTA, equipped with Adversarial Learnable Augmenter and Group Pseudo-Positive Samples to enhance the relevance between the self-supervised task and the main task, boosting the performance of the main task. Furthermore, we provide theoretical evidence that GAPGC can extract minimal sufficient information for the main task from an information theory perspective. Extensive experiments on molecular scaffold OOD dataset demonstrated that the proposed approach achieves state-of-the-art performance on GNNs. Guanzi Chen · Jiying Zhang · Xi Xiao · Yang Li 🔗 - Adversarial Cheap Talk (Poster) Adversarial attacks in reinforcement learning (RL) often assume highly-privileged access to the learning agent’s parameters, environment or data. Instead, this paper proposes a novel adversarial setting called a Cheap Talk MDP in which an Adversary has a minimal range of influence over the Victim. Parameterised as a deterministic policy that only conditions on the current state, an Adversary can merely append information to a Victim’s observation. 
To motivate the minimum-viability, we prove that in this setting the Adversary cannot occlude the ground truth, influence the underlying dynamics of the environment, introduce non-stationarity, add stochasticity, see the Victim’s actions, or access their parameters. Additionally, we present a novel meta-learning algorithm to train the Adversary, called adversarial cheap talk (ACT). Using ACT, we demonstrate that the resulting Adversary still manages to influence the Victim’s training and test performance despite these restrictive assumptions. Affecting train-time performance reveals a new attack vector and provides insight into the success and failure modes of existing RL algorithms. More specifically, we show that an ACT Adversary is capable of harming performance by interfering with the learner’s function approximation and helping the Victim’s performance by appending useful features. Finally, we demonstrate that an ACT Adversary can append information during train-time to directly and arbitrarily control the Victim at test-time in a zero-shot manner. Christopher Lu · Timon Willi · Alistair Letcher · Jakob Foerster 🔗 - Fairness and robustness in anti-causal prediction (Poster) Robustness to distribution shift and fairness have independently emerged as two important desiderata required of modern machine learning models. Here, we discuss these connections through a causal lens, focusing on anti-causal prediction tasks, where the input to a classifier (e.g., an image) is assumed to be generated as a function of the target label and the protected attribute. By taking this perspective, we draw explicit connections between a common fairness criterion (separation) and a common notion of robustness (risk invariance). These connections provide new motivation for applying the separation criterion in anti-causal settings, and show that fairness can be motivated entirely on the basis of achieving better performance. 
In addition, our findings suggest that robustness-motivated approaches can be used to enforce separation, and that they often work better in practice than methods designed to directly enforce separation. Using a medical dataset, we empirically validate our findings on the task of detecting pneumonia from X-rays, in a setting where differences in prevalence across sex groups motivate a fairness mitigation. Our findings highlight the importance of considering causal structure when choosing and enforcing fairness criteria. Maggie Makar · Alexander D'Amour 🔗 - Plex: Towards Reliability using Pretrained Large Model Extensions (Poster) A recent trend in artificial intelligence (AI) is the use of pretrained models for language and vision tasks, which has achieved extraordinary performance but also puzzling failures. Examining tasks that probe the model’s abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks such as uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in- and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot learning). We devise 10 types of tasks over 36 datasets in order to evaluate different aspects of reliability on both vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions (plex) for vision and language modalities, respectively. Plex greatly improves the state-of-the-art across tasks, and simplifies the traditional protocol as it does not require designing scores or tuning the model for each individual task. We demonstrate scaling effects over model sizes and pretraining dataset sizes up to 4 billion examples. 
We also demonstrate Plex’s capabilities on challenging tasks including zero-shot open set recognition, few-shot uncertainty, and uncertainty in conversational language understanding. Dustin Tran · Andreas Kirsch · Balaji Lakshminarayanan · Huiyi Hu · Du Phan · D. Sculley · Jasper Snoek · Jeremiah Liu · Jie Ren · Joost van Amersfoort · Kehang Han · E. Kelly Buchanan · Kevin Murphy · Mark Collier · Mike Dusenberry · Neil Band · Nithum Thain · Rodolphe Jenatton · Tim Rudner · Yarin Gal · Zachary Nado · Zelda Mariet · Zi Wang · Zoubin Ghahramani 🔗 - Group Distributionally Robust Reinforcement Learning with Hierarchical Latent Variables (Poster) Reinforcement Learning (RL) agents may only have incomplete information about tasks to solve. Although inferring the latent task could improve the performance, blindly trusting the task estimates may cause significant performance drops due to inevitable inference errors. One dominant way to enhance robustness is to optimize over worst-possible tasks, which may generate overly conservative policies. Moreover, most sequential decision-making formulations assume tasks are i.i.d. sampled and overlook the existence of task subpopulations. To address both challenges under task estimate uncertainty, we propose Group Distributionally Robust Markov Decision Process (GDR-MDP). GDR-MDP is flexible to encode prior task relationships via a latent mixture model, and leverage the prior by dynamically updating a belief distribution over mixtures. GDR-MDP has a distributionally robust decision criterion as finding the optimal policy that maximizes the expected return under the worst-possible qualified belief within an ambiguity set. We show both theoretically and empirically that GDR-MDP's hierarchical structure further enhances the distributional robustness over belief inference errors. 
Mengdi Xu · Peide Huang · Visak Kumar · Jielin Qiu · Chao Fang · Kuan-Hui Lee · Xuewei Qi · Henry Lam · Bo Li · Ding Zhao 🔗 - Task Modeling: A Multitask Approach for Improving Robustness to Group Shifts (Poster) We study the problem of learning from multiple groups of heterogeneous data distributions. Previous work shows that machine learning models trained under group shifts can exhibit poor performance on groups whose training set size is usually small. In this work, we explore multitask learning approaches to augment the training set and optimize the worst-group performance of a target task. A critical challenge in multitask learning is how to identify beneficial source tasks for a target task. To address this challenge, we propose a task modeling framework that learns a mapping from any subset of source tasks to their transferability on the target task. Our key finding is that with outputs from training models on randomly subsampled source tasks, a linear task model can accurately predict the results of multitask training for a target task. This finding implies an algorithm that selects beneficial source tasks using the learned task model. We validate our approach on a tabular dataset with 50 tasks. Our experiments demonstrate that our task selection algorithm achieves an average improvement of 1.03% in the worst-group accuracy on six target tasks compared to prior methods. Meanwhile, our approach is applicable to other performance metrics, including average performance and fairness measures, and outperforms baselines by 0.57% and 2.09%, respectively. Dongyue Li · Huy Nguyen · Hongyang Zhang 🔗 - A Meta-Analysis of Distributionally Robust Models (Poster) State-of-the-art image classifiers trained on massive datasets (such as ImageNet) have been shown to be vulnerable to a range of both intentional and incidental distribution shifts. 
On the other hand, several recent classifiers with favorable out-of-distribution (OOD) robustness properties have emerged, achieving very high accuracy on their target tasks while maintaining their in-distribution accuracy on challenging benchmarks. We present a meta-analysis on a wide range of publicly released models, most of which have been published over the last twelve months. Through this meta-analysis, we empirically identify four main commonalities for all the best-performing OOD-robust models, all of which illuminate the considerable promise of vision-language pre-training. Benjamin Feuer · Ameya Joshi · Chinmay Hegde 🔗 - On Feature Learning in the Presence of Spurious Correlations (Poster) Deep learning classifiers are known to rely on spurious correlations: patterns which are semantically irrelevant but predictive of the target on the training data. In this paper we explore the quality of feature representations learned by standard empirical risk minimization (ERM) and specialized group robustness training, as well as the effect of key factors including architecture, pre-training strategy, regularization and others. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by re-training the last layer of the model on a held-out set where the spurious correlation is broken. Through this procedure, we reveal how much information about the core semantic features is contained in the learned representations. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is largely affected by the choice of data augmentation, model architecture and pre-training strategy. 
On the other hand, we find that strong regularization, and long training are generally not helpful for improving the learned feature representations. Finally, using insights from our analysis, we significantly improve upon the best results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97%, 92% and 50% worst-group accuracies respectively. Pavel Izmailov · Polina Kirichenko · Nate Gruver · Andrew Wilson 🔗 - Deep ensemble diversity and robustness on classification tasks (Poster) Ensembles of neural networks have been shown to achieve state-of-the-art performance on a variety of ML benchmark tasks, and particularly on tasks evaluating robustness to dataset shift. Conventional wisdom attributes this success to the diversity of the neural networks within the ensemble: the more diverse the predictions, the more robust the aggregated output should be. Under the mean squared error loss, the influence of ensemble diversity is apparent from the bias-variance decomposition, which separates the ensemble loss into two terms: the first evaluates the individual model quality of ensemble members, and the second the overall ensemble diversity. Classification tasks, however, typically rely upon KL divergence-based losses with less tractable bias-variance decompositions, and thus several ad hoc metrics have been proposed as measures of classifier diversity. In this work, we a) show empirically that various metrics of ensemble diversity indeed correlate with improved performance on classification tasks, and b) leverage a generalization of the bias-variance decomposition to propose a theoretically-motivated diversity metric with a strong correlation to ensemble loss. On out-of-distribution tasks, albeit to a lesser degree, diversity metrics also correlate with ensemble loss. 
Zelda Mariet 🔗 - Asymmetry Learning for Counterfactual-invariant Classification in OOD Tasks (Poster) Generalizing from observed to new related environments (out-of-distribution) is central to the reliability of classifiers. However, most classifiers fail to predict label $Y$ from input $X$ when the change in environment is due to a (stochastic) input transformation $T^\text{te} \circ X'$ not observed in training, as in training we observe $T^\text{tr} \circ X'$, where $X'$ is a hidden variable. This work argues that when these transformations are induced by a collection of known $m$ equivalence relations, the task of finding a robust OOD classifier can be defined as finding the simplest causal model that defines a causal connection between the transformations and the target labels. We then propose a new learning paradigm, asymmetry learning, that identifies which symmetries the classifier must break in order to correctly predict $Y$ in both train and test. Asymmetry learning performs a causal model search that, under certain identifiability conditions, finds classifiers that perform equally well in-distribution and out-of-distribution. Chandra Mouli Sekar · Bruno Ribeiro 🔗 - Robust Estimation of Laplacian Constrained Gaussian Graphical Models with Trimmed Non-convex Regularization (Poster) The problem of discovering a structure that fits a collection of vector data is of crucial importance for a variety of applications. Such problems can be framed as Laplacian constrained Gaussian Graphical Model inference. Existing algorithms rely on the assumption that all the available observations are drawn from the same Multivariate Gaussian distribution. However, in practice it is common to find scenarios where the datasets are contaminated with a certain number of outliers. The purpose of this work is to address that problem. We propose a robust method based on Trimmed Least Squares that copes with the presence of corrupted samples. 
We provide statistical guarantees on the estimation error and present results on simulated data. Mariana Vargas Vieyra 🔗 - Evaluating Robustness to Dataset Shift via Parametric Robustness Sets (Poster) We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. To ensure that these shifts are plausible, we parameterize them in terms of interpretable changes in causal mechanisms of observed variables. This defines a parametric robustness set of plausible distributions and a corresponding worst-case loss. We construct a local approximation to the loss under shift, and show that the problem of finding worst-case shifts can be efficiently solved. Nikolaj Thams · Michael Oberst · David Sontag 🔗 - Improved Medical Out-of-Distribution Detectors For Modality and Semantic Shifts (Poster) Detecting out-of-distribution (OOD) data with varying levels of semantic and covariate shifts with respect to the in-distribution (ID) is critical for the safe deployment of models. The goal is to design a detector that can accept meaningful variations of the ID data, while rejecting samples from OOD regimes. Such an objective can be realized by enforcing consistency with a scoring function (e.g., energy) and calibrating the detector to reject a curated set of OOD data (a.k.a. outlier exposure (OE)). However, OE methods require representative OOD datasets which are challenging to acquire in practice, hence the recent trend of designing OE-free detectors. In this paper, we find that controlled generalization to ID variations and exposure to diverse (synthetic) outliers are critical for improving OOD detection. Through empirical studies on the MedMNIST medical imaging benchmark, we demonstrate significant performance gains ($15\% - 35\%$ in AUROC) over existing OE-free, OOD detection approaches under both semantic and modality shifts. 
Vivek Narayanaswamy · Yamen Mubarka · Rushil Anirudh · Deepta Rajan · Andreas Spanias · Jayaraman J. Thiagarajan 🔗 - AugLoss: A Robust, Reliable Methodology for Real-World Corruptions (Poster) Deep Learning (DL) models achieve great successes in many domains. However, DL models increasingly face safety and robustness concerns, including noisy labeling in the training stage and feature distribution shifts in the testing stage. Previous works made significant progress in addressing these problems, but the focus has largely been on developing solutions for only one problem at a time. For example, recent work has argued for the use of tunable robust loss functions to mitigate label noise, and data augmentation (e.g., AugMix) to combat distribution shifts. As a step towards addressing both problems simultaneously, we introduce AugLoss, a simple but effective methodology that achieves robustness against both train-time noisy labeling and test-time feature distribution shifts by unifying data augmentation and robust loss functions. We conduct comprehensive experiments in varied settings of real-world dataset corruption to showcase the gains achieved by AugLoss compared to previous state-of-the-art methods. Lastly, we hope this work will open new directions for designing more robust and reliable DL models under real-world corruptions. Kyle Otstot · John Kevin Cava · Tyler Sypherd · Lalitha Sankar 🔗 - Context Shift from Test Benchmarks to Real-World Production Performance (Poster) Across a wide variety of domains, there exists a performance gap between machine learning models' accuracy on dataset benchmarks and real-world production data. Despite the careful design of static dataset benchmarks to represent the real-world, models often err when the data is out-of-distribution relative to the data the models have been trained on. 
We can directly measure and adjust for some aspects of distribution shift, but we cannot address sample selection bias, adversarial perturbations, and non-stationarity without knowing the data generation process. In this paper, we outline two methods for identifying changes in context that lead to distribution shifts and model prediction errors: leveraging human intuition and expert knowledge to identify first-order contexts and developing dynamic benchmarks based on desiderata for the data generation process. Furthermore, we present two case studies to highlight the implicit assumptions underlying applied machine learning models that tend to lead to errors when attempting to generalize beyond test benchmark datasets. By paying close attention to the role of context in each prediction task, researchers can reduce context shift errors and increase generalization performance. Matthew Groh 🔗 - Exploring the Design of Adaptation Protocols for Improved Generalization and Machine Learning Safety (Poster) While directly fine-tuning large-scale, pretrained models on task-specific data is well-known to induce strong in-distribution task performance, recent works have demonstrated that different adaptation protocols, such as linear probing before fine-tuning, can improve OOD generalization. However, the design space of such adaptation protocols remains under-explored and the evaluation of such protocols has primarily focused on distribution shifts. Therefore, in this work, we evaluate common adaptation protocols across distribution shifts and machine learning safety metrics (e.g., anomaly detection, calibration). We find that protocols induce disparate trade-offs that were not apparent from prior evaluation. Finally, we demonstrate that appropriate pairing of data augmentation and protocol can substantially mitigate this trade-off. Puja Trivedi · Danai Koutra · Jayaraman J. 
Thiagarajan 🔗 - CODiT: Conformal Out-of-Distribution Detection in Time-Series Data (Poster) Machine learning models are prone to make incorrect predictions on inputs that are far from the training distribution. This hinders their deployment in safety-critical domains such as autonomous vehicles and healthcare. A number of techniques have been proposed for out-of-distribution (OOD) detection on individual datapoints. But in many applications, the inputs to these models form a temporal sequence. Existing techniques for OOD detection in time-series either do not exploit temporal relationships in the sequence or do not provide any guarantees on detection. We develop a self-supervised learning approach, CODiT for OOD detection in time-series data with guarantees on detection. We illustrate CODiT's efficacy on autonomous driving vision datasets and physiological GAIT data. Our code is available at shorturl.at/fzR02. Ramneet Kaur · Kaustubh Sridhar · Sangdon Park · Susmit Jha · Anirban Roy · Oleg Sokolsky · Insup Lee 🔗 - Diagnosing Model Performance Under Distribution Shift (Poster) Prediction models perform poorly when deployed to distributions different from those seen during training. To understand these operational failure modes of ML models, we develop methods to attribute the drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into 1) an increase in harder but frequently seen examples during training, 2) changes in the relationship between outcome and features, and 3) poor performance on examples infrequent or unseen during training. Our procedure is principled yet flexible enough to incorporate any feature mapping or metadata. We empirically demonstrate how our decomposition can inform different ways to improve model performance for different distribution shifts. 
Tianhui Cai · Hongseok Namkoong · Steve Yadlowsky 🔗 - Distributionally Adaptive Meta Reinforcement Learning (Poster) Meta-reinforcement learning algorithms provide a data-driven way to acquire learning algorithms that quickly adapt to many tasks with varying rewards or dynamics functions. However, learned meta-policies are often effective only on the exact task distribution on which the policy was trained, and struggle in the presence of distribution shift of test-time rewards or transition dynamics. In this work, we develop a framework for meta-RL algorithms that are able to behave appropriately under test-time distribution shifts in the space of tasks. Our framework centers on an adaptive approach to distributional robustness, in which we train a population of meta-agents to be robust to varying levels of distribution shift, so that when evaluated on a (potentially shifted) test-time distribution of tasks, we can adaptively choose the most appropriate meta-agent to follow. We formally show how this framework allows for improved regret under distribution shift, and empirically show its efficacy on simulated robotics problems under a wide range of distribution shifts. Anurag Ajay · Dibya Ghosh · Sergey Levine · Pulkit Agrawal · Abhishek Gupta 🔗 - 2 CENTs on continual adaptation: replay & parameter buffers stabilize entropy minimization (Poster) We propose continual entropy minimization (CENT) to adapt computer vision models to continual distribution shifts at ImageNet scale. CENT leverages a replay buffer with images from the source distribution along with rolling parameter buffers to stabilize the training dynamics of conventional test-time adaptation methods. Our work is the first to demonstrate stable, continual adaptation on ImageNet scale, and obtains state-of-the-art results in both static and continual variants of the ImageNet-C benchmark. 
Ori Press · Steffen Schneider · Matthias Kuemmerer · Matthias Bethge 🔗 - Towards Practicable Sequential Shift Detectors (Poster) There is a growing awareness of the harmful effects of distribution shift on the performance of deployed machine learning models. Consequently, there is a growing interest in detecting these shifts before associated costs have time to accumulate. However, desiderata of crucial importance to the practicable deployment of sequential shift detectors are typically overlooked by existing works, precluding their widespread adoption. We identify three such desiderata, highlight existing works relevant to their satisfaction, and recommend impactful directions for future research. Oliver Cobb · Arnaud Van Looveren 🔗 - Towards OOD Detection in Graph Classification from Uncertainty Estimation Perspective (Poster) The problem of out-of-distribution detection for graph classification is far from being solved. The existing models tend to be overconfident about OOD examples or completely ignore the detection task. In this work, we consider this problem from the uncertainty estimation perspective and perform the comparison of several recently proposed methods. In our experiments, we find that there is no universal approach for OOD detection, and it is important to consider both graph representations and predictive categorical distribution. Gleb Bazhenov · Sergey Ivanov · Maxim Panov · Alexey Zaytsev · Evgeny Burnaev 🔗 - What can we do with just the model? A simple knowledge extraction framework (Poster) We consider the problem of adapting semantic segmentation models to new target domains, only from the trained source model, without the source data. Not only is this setting much harder than if one had access to the source data, this is necessary in many practical situations where source data is not available due to privacy and storage reasons. 
Our algorithm has two parts: first, we update the normalization statistics, which helps to compensate for the distribution shift, and second, we transfer knowledge from the source models adhering to certain equivariant and invariant transforms. The transforms help to efficiently extract the knowledge beyond vanilla self-training. Through extensive experiments on multiple semantic segmentation tasks, we show how such a simple framework can be effective in extracting knowledge from the source model, for a variety of problem settings, and performs much better than or on par with current state-of-the-art methods which are specifically tuned for the respective settings. Sujoy Paul · Ansh Khurana · Gaurav Aggarwal 🔗 - Are We Viewing the Problem of Robust Generalisation through the Appropriate Lens? (Poster) We discuss different approaches to the challenge of robust object recognition under distribution shifts. We advocate a view of this challenge that is more closely informed by the problem of visual recognition, and which emphasizes dynamic model behaviour as opposed to centering the statistical properties of training and test distributions. We introduce an experimental setting geared towards developing models that can exhibit robust behaviour in a reliable and scalable manner. We refer to this requirement as "systematic robustness", which involves systematically excluding certain combinations of classes and image attributes during training. Unlike prior work which studies systematic generalisation in DNNs or their susceptibility to spurious correlations, we use synthetic operations and data sampling to scale such experiments up to large-scale naturalistic datasets. Mohamed Omran · Bernt Schiele 🔗 - Adapting to Shifts in Latent Confounders via Observed Concepts and Proxies (Poster) We address the problem of unsupervised domain adaptation when the source differs from the target because of a shift in the distribution of a latent confounder. 
In this case, neither covariate shift nor label shift assumptions apply. When all data is discrete, we show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables, available only in the source, and unlabeled data from the target. Matt Kusner · Ibrahim Alabdulmohsin · Stephen Pfohl · Olawale Salaudeen · Arthur Gretton · Sanmi Koyejo · Jessica Schrouff · Alexander D'Amour 🔗 - Positive Unlabeled Contrastive Representation Learning (Poster) Self-supervised pretraining on unlabeled data followed by supervised finetuning on labeled data is a popular paradigm for learning from limited labeled examples. In this paper, we investigate and extend this paradigm to the classical positive unlabeled (PU) setting: the weakly supervised task of learning a binary classifier only using a few labeled positive examples and a set of unlabeled samples. We propose a novel contrastive objective, positive unlabeled Noise Contrastive Estimation (puNCE), that leverages the available explicit (from labeled samples) and implicit (from unlabeled samples) supervision to learn useful representations from positive unlabeled input data. The underlying idea is to assign each training sample an individual weight; labeled positives are given unit weight; unlabeled samples are duplicated, with one copy labeled positive and the other negative, with weights $\pi$ and $(1-\pi)$ respectively, where $\pi$ denotes the class prior. Extensive experiments across vision and natural language tasks reveal that puNCE consistently improves over existing unsupervised and supervised contrastive baselines under limited supervision. Anish Acharya · Sujay Sanghavi · Li Jing · Bhargav Bhushanam · Michael Rabbat · Dhruv Choudhary · Inderjit Dhillon 🔗 - Towards Domain Adversarial Methods to Mitigate Texture Bias (Poster) Shape-texture conflict is key to our understanding of the behavior of Convolutional Neural Networks (CNNs) and their observably good performance. 
This work proposes a domain adversarial training-inspired technique as a novel approach to mitigate texture bias. In our work, instead of looking at the domains as the source from which the images are from, we look at the domains as inherent features of the image. The model is trained in a method similar to Domain Adversarial training, where we define the source and target domains as the dataset and its augmented versions with minimal texture information (edge maps and stylized images), respectively. We show that using domain invariant learning to capture a prior based on the shape-texture information helps models learn robust representations. We perform extensive experiments on three subsets of ImageNet, namely, ImageNet-20, ImageNet-200, ImageNet-9. The results show that the proposed method outperforms standard Empirical Risk Minimization (ERM) in terms of test accuracy and also as evidenced by the high accuracy on the Out-Of-Distribution (OOD) datasets ImageNet-R and NICO. Dhruva Kashyap · Sumukh K Aithal · Rakshith C · Natarajan Subramanyam 🔗 - Dynamics of Dataset Bias and Robustness (Poster) We aim to shine a light on the effects of various techniques for improving robustness under distribution-shift on the dataset-bias (i.e. class imbalance). This relationship between data-skewness and such performance-enhancing measures remains largely unexplored. Deep learning models are seeing real-world deployment, hence it's crucial to gauge the reliability of such neural networks since undetected (side)effects of robustness enhancement on dataset bias could be catastrophic. We observe that robustness-enhancement techniques affect performance on under-represented (yet critical) classes, thus requiring investigation from a fairness perspective. We evaluate methods for model robustness on distinct architectures by their effects on dataset bias through a variety of specialized metrics (imbalance-focused; F-1 score/balanced accuracy) on artificially imbalanced datasets. 
Prabhu Pradhan · Ruchit Rawal 🔗 - Bridging Distribution Shift in Imitation Learning via Taylor Expansions (Poster) We propose Taylor Series Imitation Learning (TaSIL), a simple augmentation to standard behavior cloning losses in the context of continuous control. TaSIL penalizes deviations in the higher-order Taylor series terms between the learned and expert policies. We show that experts satisfying a notion of incremental input-to-state stability are easy to learn, in the sense that a small TaSIL-augmented imitation loss over expert trajectories guarantees a small imitation loss over trajectories generated by the learned policy, and provide sample-complexity bounds for TaSIL that scale as $\tilde{\mathcal{O}}(1/n)$ in the realizable setting, for $n$ the number of expert demonstrations. Finally, we experimentally compare standard Behavior Cloning, DART, and DAgger with TaSIL-loss-augmented variants. In all cases, we show significant improvement over baselines across a variety of MuJoCo tasks. Daniel Pfrommer · Thomas T. Zhang · Nikolai Matni · Stephen Tu 🔗 - Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift (Poster) Recently, Miller et al. showed that a model's in-distribution (ID) accuracy has a strong linear correlation with its out-of-distribution (OOD) accuracy on several OOD benchmarks, a phenomenon they dubbed “accuracy-on-the-line”. While a useful tool for model selection, this fact does not help to estimate the actual OOD performance of models without access to a labeled OOD validation set. In this paper, we show that a similar surprising phenomenon also holds for the agreement between pairs of neural network classifiers: whenever accuracy-on-the-line holds, the OOD agreement between the predictions of any two pairs of neural networks is linearly correlated with ID agreement. Furthermore, we observe that the slope and bias of OOD vs ID agreement closely match those of OOD vs ID accuracy. 
This phenomenon, which we call agreement-on-the-line, has important practical applications: without any labeled data, we can predict the OOD accuracy of classifiers, since OOD agreement can be estimated with just unlabeled data. Our prediction algorithm outperforms previous methods both in shifts where agreement-on-the-line holds and, surprisingly, when accuracy is not on the line. Christina Baek · Yiding Jiang · Aditi Raghunathan · Zico Kolter 🔗
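The idea in the Agreement-on-the-Line abstract can be illustrated with a minimal sketch. It assumes a plain least-squares line fitted to raw (untransformed) agreement rates, and it is not the authors' prediction algorithm, which may apply additional transformations or estimation steps. The function names `agreement`, `fit_line`, and `predict_ood_accuracy` are hypothetical and chosen for illustration only.

```python
from itertools import combinations
from statistics import mean


def agreement(preds_a, preds_b):
    """Fraction of inputs on which two classifiers predict the same label."""
    return mean(1.0 if a == b else 0.0 for a, b in zip(preds_a, preds_b))


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + bias."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_ood_accuracy(id_preds, ood_preds, id_accuracies):
    """Estimate OOD accuracy per model without any OOD labels.

    id_preds / ood_preds: one list of predicted labels per model.
    id_accuracies: ID accuracy per model (requires labeled ID data only).
    Assumption: the agreement line and the accuracy line share slope and bias.
    """
    pairs = list(combinations(range(len(id_preds)), 2))
    id_agree = [agreement(id_preds[i], id_preds[j]) for i, j in pairs]
    ood_agree = [agreement(ood_preds[i], ood_preds[j]) for i, j in pairs]
    # Fit OOD agreement against ID agreement using unlabeled predictions.
    slope, bias = fit_line(id_agree, ood_agree)
    # Reuse the same line to map each ID accuracy to an OOD estimate.
    return [slope * acc + bias for acc in id_accuracies]
```

In use, one would pass the predictions of several trained models on ID and OOD inputs together with their ID accuracies; at least three models with differing pairwise ID agreement are needed for the line fit to be defined.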