As machine learning models are introduced into every aspect of our lives, and their potential benefits multiply, so do the opportunities for catastrophic failure. One of the most common failure modes when deploying models in the wild, and one that can have dire consequences in extreme cases, is a model's reliance on apparently unnatural or irrelevant features.
The issue arises in a wide range of applications: X-ray diagnosis models that rely on scanner type or marks left by hospital technicians, visual question answering models that are sensitive to superficial linguistic variations in the question, and a growing list of similar undesirable behaviors. In examples like these, the failure stems from the model exploiting a spurious correlation.
Following last year's workshop on Spurious Correlations, Invariance and Stability (SCIS), it is apparent that work on spurious correlations is a long-term effort spanning communities such as fairness and causality-inspired ML, and domains such as NLP, healthcare, and many others. We hope that this year's workshop, the second edition of SCIS, will help facilitate this long-term effort across communities. The workshop features talks by leading experts on methodology for dealing with spurious correlations, and an extended poster session to allow extensive discussion of the work submitted to the workshop.
Sat 11:50 a.m. - 12:00 p.m. | Opening Remarks
Sat 12:00 p.m. - 12:45 p.m. | Distribution Shifts in Generalist and Causal Models (Talk) | Francesco Locatello
Sat 12:45 p.m. - 1:15 p.m. | Paper Spotlights (Spotlight) | Andrew Ilyas · Alizée Pace · Ji Won Park · Adam Breitholtz · Nari Johnson
1) Where Does My Model Underperform?: A Human Evaluation of Slice Discovery Algorithms. Nari Johnson, Angel Cabrera, Gregory Plumb, Ameet Talwalkar
2) Antibody DomainBed: Towards robust predictions using invariant representations of biological sequences carrying complex distribution shifts. Natasa Tagasovska, Ji Won Park, Stephen Ra, Kyunghyun Cho
3) Provable domain adaptation using privileged information. Adam Breitholtz, Anton Matsson, Fredrik D. Johansson
4) Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding. Alizée Pace, Hugo Yèche, Bernhard Schölkopf, Gunnar Ratsch, Guy Tennenholtz
5) ModelDiff: A Framework for Comparing Learning Algorithms. Harshay Shah, Sung Min Park, Andrew Ilyas, Aleksander Madry
Sat 1:15 p.m. - 1:30 p.m. | Break
Sat 1:30 p.m. - 2:15 p.m. | On learning domain general predictors (Talk) | Sanmi Koyejo
Sat 2:15 p.m. - 3:00 p.m. | Using Causality to Improve Safety Throughout the AI Lifecycle (Talk) | Suchi Saria · Adarsh Subbaswamy
Sat 3:00 p.m. - 4:00 p.m. | Lunch Break
Sat 4:00 p.m. - 4:45 p.m. | A data-centric view on reliable generalization: From ImageNet to LAION-5B (Talk) | Ludwig Schmidt
Sat 4:45 p.m. - 5:30 p.m. | Causal vs Causality-inspired representation learning (Talk) | Sara Magliacane
Sat 5:30 p.m. - 6:30 p.m. | Poster Session 1 (in-person only)
Sat 6:30 p.m. - 7:15 p.m. | SCIS 2023 Panel, The Future of Generalization: Scale, Safety and Beyond (Panel Discussion) | Moderator: Adam Gleave; Panelists: Samuel Bowman, Maggie Makar, Zachary Lipton
Sat 7:15 p.m. - 8:00 p.m. | Causal Conversation + Poster Session 2
Fairness-Preserving Regularizer: Balancing Core and Spurious Features (Poster)
Real-world visual data contain multiple attributes, e.g., color, foreground, and background. To solve a specific learning task, machine learning models should use a specific set of attributes. In principle, which set of attributes counts as the core feature is defined by the task, regardless of how heavily other attributes are (spuriously) correlated with the label. Without prior knowledge identifying the core or spurious features, we can hardly tell whether a learned correlation is spurious in real-world scenarios. In this work, we dive into this realistic setting: since there is no prior knowledge to determine which feature is core or spurious, we aim to learn a regularized predictor that fairly balances both core and spurious features. To achieve this, we start by formalizing the fairness of learned features in a linear predictor under the multiview data distribution assumption (Allen-Zhu & Li, 2023). We prove that achieving this fairness can be bounded by a simple regularization term and design a fairness-preserving regularizer accordingly. Experiments on the Waterbirds, CelebA, and Wilds-FMOW datasets validate the effectiveness of our method.
Jiawei Feng · Ancong Wu · YuHan Yao · Wei-Shi Zheng
Identifying and Disentangling Spurious Features in Pretrained Image Representations (Poster)
Neural networks employ spurious correlations in their predictions, resulting in decreased performance when these correlations do not hold. Recent works suggest fixing pretrained representations and training a classification head that does not use spurious features. We investigate how spurious features are represented in pretrained representations and explore strategies for removing information about them. Considering the Waterbirds dataset and a few pretrained representations, we find that even with full knowledge of the spurious features, their removal is not straightforward due to entangled representations. To address this, we propose a linear autoencoder training method that separates the representation into core, spurious, and other features. We propose two effective spurious-feature removal approaches that are applied to the encoding and significantly improve classification performance as measured by worst-group accuracy.
Rafayel Darbinyan · Hrayr Harutyunyan · Aram Markosyan · Hrant Khachatrian
Pruning for Better Domain Generalizability (Poster)
In this paper, we investigate whether pruning can serve as a reliable method to boost the generalization ability of a model. We find that existing pruning methods such as L2 can already offer a small improvement in target-domain performance. We further propose a novel pruning scoring method, called DSS, designed not to maintain source accuracy as in typical pruning work, but to directly enhance the robustness of the model. We conduct empirical experiments to validate our method and demonstrate that it can be combined with state-of-the-art generalization methods such as MIRO (Cha et al., 2022) to further boost performance. On MNIST to MNIST-M, we improve the baseline performance by over 5 points by introducing 60% channel sparsity into the model. On the DomainBed benchmark with state-of-the-art MIRO, we can further boost its performance by 1 point by introducing only 10% sparsity into the model.
Xinglong Sun
Temporal Consistency based Test Time Adaptation: Towards Fair and Personalized AI (Poster)
We introduce a novel unsupervised method for improving the fairness of computer vision algorithms through test-time adaptation on videos. The method is model-agnostic and uses the temporal coherence of predictions across sequential frames as a self-supervision signal to update a subset of model parameters. We assess performance on the Casual Conversations dataset for gender prediction on face images and show that our approach can significantly reduce predictive disparity across skin tones, ages, and lighting conditions.
Mohammadmahdi Honarmand · Onur Cezmi Mutlu · Saimourya Surabhi · Dennis Wall
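The self-supervision signal described in this abstract is simple enough to sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes, for concreteness, that only normalization-layer parameters are adapted and that temporal coherence is enforced with a symmetric KL divergence between predictions on consecutive frames; the function name adapt_on_video and these choices are illustrative.

```python
# Minimal sketch of temporal-consistency test-time adaptation (not the authors'
# exact method): adapt only normalization-layer parameters by penalizing
# disagreement between predictions on consecutive video frames.
import torch
import torch.nn.functional as F

def adapt_on_video(model, frames, lr=1e-4, steps=1):
    """frames: tensor of shape (T, C, H, W) from a single video clip."""
    model.train()  # keep normalization layers adaptive
    # Illustrative choice: restrict updates to norm-layer parameters.
    params = [p for m in model.modules()
              if isinstance(m, (torch.nn.BatchNorm2d, torch.nn.LayerNorm))
              for p in m.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        logits = model(frames)                       # (T, num_classes)
        logp = F.log_softmax(logits, dim=-1)
        # Symmetric KL between predictions on neighboring frames.
        loss = 0.5 * (F.kl_div(logp[:-1], logp[1:].exp(), reduction="batchmean")
                      + F.kl_div(logp[1:], logp[:-1].exp(), reduction="batchmean"))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```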
Regularizing Model Gradients with Concepts to Improve Robustness to Spurious Correlations (Poster)
Deep neural networks are prone to capturing correlations between spurious attributes and class labels, leading to low accuracy on some groups of the data. Existing methods rely on group labels either during training or validation to improve the model's robustness to spurious correlation. We observe that if a model correlates a spurious attribute with the target class, then the model is sensitive to the spurious attribute. In a pure vision setting, attribute labels representing bias may not be available. We propose Concept Regularization (CReg), a method that penalizes a model's sensitivity to a concept represented as a set of curated images drawn from any external source: image generation models or web search. Our method does not require group labels at the dataset level, instead relying on a small amount of auxiliary data, potentially irrelevant to the classification task, to represent the protected attribute. We show across datasets that CReg outperforms standard empirical risk minimization (ERM).
Yiwei Yang · Anthony Liu · Robert Wolfe · Aylin Caliskan · Bill Howe
Regularizing Adversarial Imitation Learning Using Causal Invariance (Poster)
Ivan Ovinnikov · Joachim Buhmann
Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features (Poster)
To avoid failures on out-of-distribution data, recent works have sought to use only features with an invariant or stable relationship with the label across domains, discarding "spurious" or unstable features whose relationship with the label changes across domains. However, unstable features often carry complementary information about the label that could boost performance if used correctly in the test domain. Our main contribution is to show that it is possible to learn how to use these unstable features in the test domain without labels. We prove that pseudo-labels based on stable features provide sufficient guidance for doing so, provided that stable and unstable features are conditionally independent given the label. Based on this insight, we propose Stable Feature Boosting (SFB), an algorithm for: (i) learning stable and conditionally-independent unstable features; and (ii) using the stable-feature predictions to adapt the unstable-feature predictions to the test domain. Theoretically, we prove that SFB can learn an asymptotically-optimal predictor without test-domain labels. Empirically, we demonstrate the effectiveness of SFB on real and synthetic data. |
Cian Eastwood · Shashank Singh · Andrei Nicolicioiu · Marin Vlastelica · Julius von Kügelgen · Bernhard Schölkopf
Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift (Poster)
Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains surprisingly unexplored. In this paper, we first undertake a systematic empirical investigation of this combination, finding (i) that in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) that in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. Finally, we theoretically analyze these techniques in a simplified model of distribution shift demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail. |
Saurabh Garg · Amrith Setlur · Zachary Lipton · Sivaraman Balakrishnan · Virginia Smith · Aditi Raghunathan
Spurious Correlations and Where to Find Them (Poster)
Despite being a well-known drawback of data-driven learning and having several algorithms proposed to mitigate it, we are yet to jointly derive the indicators of spurious correlations. As a result, the solutions built upon standalone hypotheses fail to beat simple ERM baselines. We collect some of the commonly studied hypotheses behind the occurrence of spurious correlations and study their influence on preliminary ERM baselines using synthetic datasets generated using causal graphs. Subsequently, we observe patterns connecting these hypotheses and model design choices. |
Gautam Sreekumar · Vishnu Boddeti
Why is SAM Robust to Label Noise? (Poster)
Sharpness-Aware Minimization (SAM) has recently achieved state-of-the-art generalization performance in both natural image and language tasks. Previous work has largely tried to understand this by characterizing SAM's solutions as lying in "flat" (low-curvature) regions of the loss landscape. However, other works have shown that the correlation between various notions of flatness and generalization is weak, raising doubts about this justification. In this paper, we focus on understanding SAM in the presence of label noise, where the performance gains of SAM are especially pronounced. We first show that SAM's improved generalization can already be observed in linear logistic regression, where 1-SAM reduces to simply up-weighting the gradients from correctly labeled points during the early epochs of the training trajectory. Next, we empirically investigate how SAM's learning dynamics change for neural networks, showing similar behavior in how it handles noisy versus clean samples. We conclude that SAM's gains in the label-noise setting can largely be explained by how it regularizes the speed at which different examples are learned during training.
Christina Baek · Zico Kolter · Aditi Raghunathan
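For context, the generic two-step SAM update analyzed in work like this can be sketched compactly. The following is a standard textbook-style implementation of SAM (an ascent step of radius rho, then a descent step using the gradient at the perturbed weights), not code from the paper.

```python
# Generic sketch of one Sharpness-Aware Minimization (SAM) update for a model
# trained with cross-entropy; rho is the neighborhood radius.
import torch
import torch.nn.functional as F

def sam_step(model, x, y, base_opt, rho=0.05):
    params = [p for p in model.parameters() if p.requires_grad]
    # 1) Ascent step: perturb weights along the normalized gradient direction.
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)
    # 2) Descent step: gradient at the perturbed point, applied to the original weights.
    base_opt.zero_grad()
    F.cross_entropy(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)            # undo the perturbation before the optimizer update
    base_opt.step()
```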
Confident feature ranking (Poster)
Interpretation of feature importance values often relies on the relative order of the features rather than on the values themselves, referred to as the ranking. However, the order may be unstable due to the small sample sizes used in calculating the importance values. We propose that post-hoc importance methods produce a ranking together with simultaneous confidence sets for that ranking. Based on pairwise comparisons of the feature importance values, our method is guaranteed to include the "true" (infinite-sample) ranking with high probability and allows for the selection of top-k sets.
Bitya Neuhof · Yuval Benjamini
Calibrated Propensities for Causal Effect Estimation (Poster)
Propensity scores are commonly used to balance observed confounders while estimating treatment effects. When the confounders are high-dimensional or unstructured, the learned propensity scores can be miscalibrated and ineffective at correcting for confounding. We argue that the probabilistic output of a learned propensity score model should be calibrated: a predicted treatment probability of 90% should correspond to 90% of individuals being assigned to the treatment group. We investigate the theoretical properties of a calibrated propensity score model and its role in unbiased treatment effect estimation. We demonstrate improved causal effect estimation with calibrated propensity scores in several tasks, including high-dimensional genome-wide association studies.
Shachi Deshpande · Volodymyr Kuleshov
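A minimal sketch of the general recipe, calibrating a learned propensity model and plugging it into an inverse-propensity-weighted estimate, is shown below; the isotonic calibration and the simple IPW estimator are illustrative choices, not necessarily the paper's.

```python
# Illustrative pipeline (not the paper's estimator): calibrate a learned propensity
# model, then use inverse-propensity weighting to estimate the average treatment effect.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

def ipw_ate(X, t, y):
    """X: covariates, t: binary treatment indicator, y: outcome."""
    base = LogisticRegression(max_iter=1000)
    # Isotonic calibration of the propensity scores via cross-validation.
    calibrated = CalibratedClassifierCV(base, method="isotonic", cv=5).fit(X, t)
    e = np.clip(calibrated.predict_proba(X)[:, 1], 0.01, 0.99)  # propensity scores
    # Horvitz-Thompson style IPW estimate of E[Y(1)] - E[Y(0)].
    return np.mean(t * y / e) - np.mean((1 - t) * y / (1 - e))
```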
Understanding the Detrimental Class-level Effects of Data Augmentation (Poster)
Data augmentation (DA) encodes invariance and provides implicit regularization critical to a model's performance in image classification tasks. However, while DA improves average accuracy, recent studies have shown that its impact can be highly class dependent: achieving optimal average accuracy comes at the cost of significantly hurting individual class accuracy by as much as $20\%$ on ImageNet. In this work, we present a framework for understanding how DA interacts with class-level learning dynamics. Using higher-quality multi-label annotations on ImageNet, we systematically categorize the affected classes and find that the majority are inherently ambiguous, spuriously correlated, or involve fine-grained distinctions, while DA controls the model's bias towards one of the closely related classes. While many of the previously reported performance drops are explained by multi-label annotations, our analysis of class confusions reveals other sources of accuracy degradation. We show that simple class-conditional augmentation strategies informed by our framework improve performance on the negatively affected classes.
Polina Kirichenko · Mark Ibrahim · Randall Balestriero · Diane Bouchacourt · Ramakrishna Vedantam · Hamed Firooz · Andrew Wilson
Transportable Representations for Out-of-distribution Generalization (Poster)
Building on the theory of causal transportability (Bareinboim & Pearl), we define the notion of "transportable representations" and show that the out-of-distribution generalization risk of classifiers built on these representations can be bounded, provided that graphical assumptions about the underlying system are available.
Amirkasra Jalaldoust · Elias Bareinboim
Feature Selection in the Presence of Monotone Batch Effects (Poster)
We study the problem of feature selection in the presence of monotone batch effects, where merging datasets from disparate technologies and different environments affects the underlying causal dependence of the data features. We propose two novel algorithms for this task: 1) joint feature selection and batch-effect correction by transforming the data batches using deep neural networks; 2) transforming the data using a batch-invariant characteristic (i.e., feature rank) before appending the datasets. We assess the performance of the feature selection methods in the presence of a monotone batch effect by $F_1$ score. Our experiments on synthetic data show that the former method, combined with the Lasso, improves the $F_1$ score significantly, even with few samples per dataset. This method outperforms popular batch-effect removal algorithms, including Combat-Seq, Limma, and PCA. Comparatively, while the ranking method is computationally more efficient, its performance is worse due to the information lost by ignoring the magnitude of the data.
Peng Dai · Sina Baharlouei · Meisam Razaviyayn · Sze-Chuan Suen
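The second, rank-based approach lends itself to a compact sketch. The code below is a simplified illustration under the stated monotone-batch-effect assumption (within-batch ranks are invariant to monotone transformations); it is not the authors' implementation, and the Lasso-based selection step is one reasonable choice.

```python
# Sketch of rank-based, batch-invariant feature selection: replace each feature by
# its within-batch rank, then run Lasso feature selection on the pooled data.
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LassoCV

def rank_transform_per_batch(X, batch_ids):
    Xr = np.empty_like(X, dtype=float)
    for b in np.unique(batch_ids):
        idx = batch_ids == b
        # Rank each feature column within the batch (invariant to monotone effects).
        Xr[idx] = rankdata(X[idx], axis=0)
    return Xr

def select_features(X, y, batch_ids):
    Xr = rank_transform_per_batch(X, batch_ids)
    lasso = LassoCV(cv=5).fit(Xr, y)
    return np.flatnonzero(np.abs(lasso.coef_) > 1e-8)  # indices of selected features
```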
Exploring new ways: Enforcing representational dissimilarity to learn new features and reduce error consistency (Poster)
Independently trained machine learning models tend to learn similar features. Given an ensemble of independently trained models, this results in correlated predictions and common failure modes. Previous attempts focusing on decorrelation of output predictions or logits yielded mixed results, particularly due to the reduction in model accuracy caused by conflicting optimization objectives. In this paper we propose the novel idea of utilizing methods from the representational similarity field to promote dissimilarity during training, instead of merely measuring the similarity of trained models. To this end we encourage intermediate representations to be dissimilar at different depths between architectures, with the goal of learning robust ensembles with disjoint failure modes. We show that highly dissimilar intermediate representations result in less correlated output predictions and slightly lower error consistency, yielding higher ensemble accuracy. With this we shed first light on the connection between intermediate representations and their impact on the output representations.
Tassilo Wald · Constantin Ulrich · Fabian Isensee · David Zimmerer · Gregor Koehler · Michael Baumgartner · Klaus Maier-Hein
Where Does My Model Underperform?: A Human Evaluation of Slice Discovery Algorithms (Oral)
A growing number of works propose tools to help stakeholders form hypotheses about the behavior of machine learning models. We focus our study on slice discovery algorithms: automated methods that aim to group together coherent and high-error "slices" (i.e. subsets) of data. While these tools purport to help users identify where (on which subgroups) their model underperforms, there has been little evaluation of whether they help users achieve their proposed goals. We run a controlled user study $(N = 15)$ to evaluate if the slices output by two existing slice discovery algorithms help users form correct hypotheses about an image classification model. Our results provide positive evidence that existing tools provide benefit relative to a naive baseline, and challenge dominant assumptions shared by past work.
Nari Johnson · Ángel Alexander Cabrera · Gregory Plumb · Ameet Talwalkar
Complementing a Policy with a Different Observation Space (Poster)
We consider the problem of improving upon a black-box policy which operates on a different observation space than the learner. Such problems occur when augmenting an existing hand-engineered system with a new machine learning model or in a shared autonomy / human-AI complementarity context. We prove that following the naive policy gradient can lead to a decrease in performance because of incorrect grounding in a different observation space. Then, if we have access to both sets of observation at train time, we derive a method for correctly estimating a policy gradient via an application of the backdoor criterion. If we don't, we prove that under certain assumptions, we can use the proxy correction to correctly estimate a direction of improvement. |
Gokul Swamy · Sanjiban Choudhury · J. Bagnell · Steven Wu
Last-Layer Fairness Fine-tuning is Simple and Effective for Neural Networks (Poster)
As machine learning is deployed ubiquitously across applications in modern data science, algorithmic fairness has become a major concern. Among the available approaches, imposing fairness constraints during learning, i.e., in-processing fair training, has been popular because it does not require access to sensitive attributes at test time, in contrast to post-processing methods. While this has been extensively studied for classical machine learning models, its impact on deep neural networks remains unclear. Recent research has shown that adding fairness constraints to the objective function leads to severe over-fitting to the fairness criteria in large models, and how to solve this challenge is an important open question. To tackle this, we leverage the power of pre-training and fine-tuning and develop a simple but novel framework to train fair neural networks in an efficient and inexpensive way: last-layer fine-tuning alone can effectively promote fairness in deep neural networks. This framework offers valuable insights into representation learning for training fair neural networks.
Yuzhen Mao · Zhun Deng · Huaxiu Yao · Ting Ye · Kenji Kawaguchi · James Zou
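A hedged sketch of the last-layer recipe is given below: the backbone is frozen and only the final linear head is fine-tuned with a fairness penalty. The demographic-parity gap used as the penalty, and the helper names backbone and head, are assumptions for illustration rather than the paper's exact objective.

```python
# Sketch of last-layer fairness fine-tuning: freeze the backbone and retrain only the
# final linear layer with a fairness penalty (here a demographic-parity gap).
# Assumes binary classification and that each batch contains both sensitive groups.
import torch
import torch.nn.functional as F

def finetune_last_layer(backbone, head, loader, lam=1.0, epochs=5, lr=1e-3):
    for p in backbone.parameters():
        p.requires_grad_(False)           # only the head is updated
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y, a in loader:            # a: binary sensitive attribute
            with torch.no_grad():
                z = backbone(x)           # frozen features
            logits = head(z)
            task_loss = F.cross_entropy(logits, y)
            p1 = torch.sigmoid(logits[:, 1] - logits[:, 0])   # P(yhat = 1)
            gap = (p1[a == 1].mean() - p1[a == 0].mean()).abs()
            loss = task_loss + lam * gap
            opt.zero_grad(); loss.backward(); opt.step()
    return head
```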
Separating multimodal modeling from multidimensional modeling for multimodal learning (Poster)
Multimodal learning is defined as learning to map a set of separate modalities to a target. Despite its intuitive definition, it is unclear whether one should model this problem using a multidimensional model, where the features from each modality are concatenated and treated as multidimensional features from a single modality or a multimodal model, where we use the information about the modality boundaries. In this first-of-its-kind work we formalize the framework for multimodal learning and identify the conditions that favor multimodal modeling over multidimensional modeling. Through a series of synthetic experiments, where we fully control the data generation process, we demonstrate the necessity of multimodal modeling for solving a multimodal learning problem for the first time. Our proposed framework, which is agnostic to any assumptions pertaining to model architectures, can have a widespread impact by informing modeling choices when dealing with data from different modalities. |
Divyam Madaan · Taro Makino · Sumit Chopra · Kyunghyun Cho
Do as your neighbors: Invariant learning through non-parametric neighbourhood matching (Poster)
Invariant learning methods aim to obtain robust features that can be used in the same way in multiple environments and can generalize out-of-distribution. This paper introduces a novel method to achieve this, called Invariant KNN. We are guided by the idea that robust features should elicit an invariant non-parametric predictor across domains. For this, we create a K-nearest neighbors predictor from each training environment and constrain them to be the same. We experimentally prove that this approach leads to invariant predictors which learn to use the robust features in the data and generalize out-of-distribution. We test our algorithm on a simple but popular benchmark and demonstrate that it is both competitive with other popular algorithms as well as less sensitive to hyperparameter selection. |
Andrei Nicolicioiu · Jerry Huang · Dhanya Sridhar · Aaron Courville
Leveraging sparse and shared feature activations for disentangled representation learning (Poster)
Recovering the latent factors of variation of high dimensional data has so far focused on simple synthetic settings. Mostly building on unsupervised and weakly-supervised objectives, prior work missed out on the positive implications for representation learning on real world data. In this work, we propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation. Assuming each supervised task only depends on an unknown subset of the factors of variation, we disentangle the feature space of a supervised multi-task model, with features activating sparsely across different tasks and information being shared as appropriate. Importantly, we never directly observe the factors of variations but establish that access to multiple tasks is sufficient for identifiability under sufficiency and minimality assumptions. We validate our approach on six real world distribution shift benchmarks, and different data modalities (images, text), demonstrating how disentangled representations can be transferred to real settings. |
Marco Fumero · Florian Wenzel · Luca Zancato · Alessandro Achille · Emanuele Rodola · Stefano Soatto · Bernhard Schölkopf · Francesco Locatello
Implications of Gaussian process kernel mismatch for out-of-distribution data (Poster)
Gaussian processes provide reliable uncertainty estimates in nonlinear modeling, but a poor choice of the kernel can lead to slow learning. Although learning the hyperparameters of the kernel typically leads to optimal generalization on in-distribution test data, we show that the generalization can be poor on out-of-distribution test data. We then investigate three solutions --- learning the smoothness using a discrete cosine transform, assuming fatter tails in function-space using a Student-$t$ process, and learning a more flexible kernel using deep kernel learning --- finding some evidence in favor of the first two.
Beau Coker · Finale Doshi-Velez
Which Features are Learned by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression (Poster)
Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. However, supervised CL is prone to collapsing representations of subclasses within a class by not capturing all their features, and unsupervised CL may suppress harder class-relevant features by focusing on learning easy class-irrelevant features; both significantly compromise representation quality. Yet, there is no theoretical understanding of class collapse or feature suppression at test time. We provide the first unified, theoretically rigorous framework to determine which features are learned by CL. Our analysis indicates that, perhaps surprisingly, the bias of (stochastic) gradient descent towards finding simpler solutions is a key factor in collapsing subclass representations and suppressing harder class-relevant features. We also provide the first theoretical explanation for why employing supervised and unsupervised CL together yields higher-quality representations, even when using commonly used stochastic gradient methods.
Yihao Xue · Siddharth Joshi · Eric Gan · Pin-Yu Chen · Baharan Mirzasoleiman
Cross-Risk Minimization: Inferring Groups Information for Improved Generalization (Poster)
Learning shortcuts, such as relying on spurious correlations or memorizing specific examples, makes achieving robust machine learning difficult. Invariant learning methods such as GroupDRO, capable of learning from various training groups, have been shown to be effective for obtaining more robust models. However, the high cost of annotating data with environment labels limits the practicality of these algorithms. This work introduces a framework called cross-risk minimization (CRM), which automatically groups examples based on their level of difficulty. As an extension of the widely used cross-validation routine, CRM uses the mistakes made by a model on held-out data as a signal to identify challenging examples. By leveraging these mistakes, CRM can effectively label both training and validation examples into groups with different levels of difficulty. We provide experiments on the Waterbirds dataset, a well-known out-of-distribution (OOD) benchmark, to demonstrate the effectiveness of CRM in inferring reliable group labels. These group labels are then used by other invariant learning methods to improve worst-group accuracy.
Mohammad Pezeshki · Diane Bouchacourt · Mark Ibrahim · Nicolas Ballas · Pascal Vincent · David Lopez-Paz
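A rough sketch of the grouping step, as described above, might look like the following; the use of scikit-learn's cross_val_predict and a plain logistic regression probe is an illustrative simplification, not the paper's algorithm.

```python
# Rough sketch of cross-validated group inference: mark examples that a held-out
# model misclassifies as the "hard" group. A method like GroupDRO can then consume
# these labels as group annotations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def infer_groups(X, y, n_splits=5):
    clf = LogisticRegression(max_iter=1000)
    # Each example is predicted by a model that never saw it during training.
    yhat = cross_val_predict(clf, X, y, cv=n_splits)
    groups = (yhat != y).astype(int)  # 1 = mistaken on held-out data ("hard"), 0 = "easy"
    return groups
```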
Robustness through Loss Consistency Regularization (Poster)
While deep learning through empirical risk minimization (ERM) has succeeded at achieving human-level performance at a variety of complex tasks, ERM is not robust to distribution shifts or adversarial attacks. Data augmentation followed by empirical risk minimization (DA-ERM) is used to improve robustness in ERM. In addition, consistency regularization can be applied to further improve the robustness of the model by forcing the representation of the original sample and the augmented one to be similar. However, existing consistency regularization methods are not applicable to covariant data augmentation, where the label in the augmented sample is dependent on the augmentation function. In this paper, we propose data augmented loss invariant regularization (DAIR), a simple form of consistency regularization that is applied directly at the loss level rather than intermediate features, making it widely applicable to both invariant and covariant data augmentation regardless of network architecture, problem setup, and task. We apply DAIR to real-world learning problems involving covariant data augmentation: robust neural task-oriented dialog state tracking and robust visual question answering. We also apply DAIR to tasks involving invariant data augmentation: robust regression, robust classification against adversarial attacks, and robust ImageNet classification under distribution shift. Our experiments show that DAIR consistently outperforms ERM and DA-ERM with little marginal computational cost and sets new state-of-the-art results in several benchmarks. |
Tianjian Huang · Shaunak Halbe · Chinnadhurai Sankar · Pooyan Amini · Satwik Kottur · Alborz Geramifard · Meisam Razaviyayn · Ahmad Beirami
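Because the regularizer acts directly on per-example losses, the idea can be sketched in a few lines. The snippet below illustrates loss-level consistency regularization for a classification loss; the exact penalty form used here (squared difference of the square-rooted losses) is one plausible instantiation and may differ from the paper's.

```python
# Sketch of loss-level consistency regularization in the spirit of DAIR: penalize the
# discrepancy between per-example losses on an original sample and its augmentation.
# Note y_aug may differ from y for covariant augmentations.
import torch
import torch.nn.functional as F

def loss_consistency_objective(model, x, y, x_aug, y_aug, lam=1.0):
    l_orig = F.cross_entropy(model(x), y, reduction="none")
    l_aug = F.cross_entropy(model(x_aug), y_aug, reduction="none")
    consistency = ((l_orig.sqrt() - l_aug.sqrt()) ** 2).mean()
    return 0.5 * (l_orig.mean() + l_aug.mean()) + lam * consistency
```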
Learning Diverse Features in Vision Transformers for Improved Generalization (Poster)
Deep learning models often learn and rely only on a small set of features, even when there is a richer set of predictive signals in the training data. This makes models brittle and sensitive to distribution shifts. In this work, we show how to diversify the features learned by vision transformers (ViTs). We find that their attention heads inherently induce some modularity in their internal representations. We propose a new regularizer that acts on their input gradients and further enhances the diversity and complementarity of the learned features. We observe improved out-of-distribution (OOD) robustness on standard diagnostic benchmarks (MNIST-CIFAR and Waterbirds). We also show that a much higher performance can be achieved by identifying and pruning the attention heads that extract spurious features. |
Armand Nicolicioiu · Andrei Nicolicioiu · Bogdan Alexe · Damien Teney
Saving a Split for Last-layer Retraining can Improve Group Robustness without Group Annotations (Poster)
Empirical risk minimization (ERM) of neural networks is prone to over-reliance on spurious correlations and poor generalization on minority groups. The recent deep feature reweighting technique achieves state-of-the-art group robustness via simple last-layer retraining, but it requires held-out group annotations to construct a group-balanced reweighting dataset. We examine this impractical requirement and find that last-layer retraining can be surprisingly effective without group annotations; in some cases, a significant gain is solely due to class balancing. Moreover, we show that instead of using the entire training dataset for ERM, dependence on spurious correlations can be reduced by holding out a small split of the training dataset for class-balanced last-layer retraining. Our experiments on four benchmarks across vision and language tasks indicate that this method improves worst-group accuracy by up to 17% over class-balanced ERM on the original dataset despite using no additional data or annotations – a surprising and unexplained result given that the two splits have equally drastic group imbalance. |
Tyler LaBonte · Vidya Muthukumar · Abhishek Kumar
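The procedure described above reduces to a short recipe; the sketch below illustrates it with scikit-learn, where train_erm and featurize are hypothetical user-supplied helpers (train the network, extract frozen penultimate features) and class balancing is done with inverse-frequency sample weights.

```python
# Simplified sketch of held-out-split last-layer retraining: train ERM on one split,
# then retrain only a linear head on a class-balanced version of the held-out split
# using frozen features. Helper names are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def split_and_retrain(X, y, train_erm, featurize, holdout_frac=0.2, seed=0):
    X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=holdout_frac,
                                          stratify=y, random_state=seed)
    model = train_erm(X_a, y_a)              # user-supplied ERM training routine
    Z_b = featurize(model, X_b)              # frozen penultimate-layer features
    # Class-balanced last-layer retraining via inverse-frequency sample weights.
    counts = np.bincount(y_b)
    weights = (len(y_b) / (len(counts) * counts))[y_b]
    head = LogisticRegression(max_iter=1000).fit(Z_b, y_b, sample_weight=weights)
    return model, head
```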
Sharpness-Aware Minimization Enhances Feature Diversity (Poster)
Sharpness-Aware Minimization (SAM) has emerged as a promising alternative to stochastic gradient descent (SGD) for minimizing the loss objective in neural network training. The motivation behind SAM is to bias models towards flatter minima that are believed to generalize better. However, recent studies have shown conflicting evidence on the relationship between flatness and generalization, leaving the mechanism behind SAM's performance improvement unclear. In this paper, we present theoretical and empirical evidence that SAM can enhance feature diversity compared to SGD in vision datasets containing redundant or spurious features. We further provide insights into this behavior of SAM by investigating a controlled setting, demonstrating how SAM can induce feature diversity. Our results imply that one mechanism by which SAM improves downstream generalization is by learning representations that rely on more diverse features. |
Jacob Mitchell Springer · Vaishnavh Nagarajan · Aditi Raghunathan
ERM++: An Improved Baseline for Domain Generalization (Poster)
Multi-source Domain Generalization (DG) measures a classifier's ability to generalize to new distributions of data it was not trained on, given several training domains. While several multi-source DG methods have been proposed, they incur additional complexity during training by using domain labels. Recent work has shown that a well-tuned Empirical Risk Minimization (ERM) training procedure, that is simply minimizing the empirical risk on the source domains, can outperform most existing DG methods. We identify several key candidate techniques to further improve ERM performance, such as better utilization of training data, model parameter selection, and weight-space regularization. We call the resulting method ERM++, and show it significantly improves the performance of DG on five multi-source datasets by over 5% compared to standard ERM, and beats the current state-of-the-art despite being less computationally expensive. We hope that ERM++ becomes a strong baseline for future DG research. |
Piotr Teterwak · Kuniaki Saito · Theodoros Tsiligkaridis · Kate Saenko · Bryan Plummer
Front-door Adjustment Beyond Markov Equivalence with Limited Graph Knowledge (Poster)
Causal effect estimation from data typically requires assumptions about the cause-effect relations either explicitly in the form of a causal graph structure within the Pearlian framework, or implicitly in terms of (conditional) independence statements between counterfactual variables within the potential outcomes framework. When the treatment variable and the outcome variable are confounded, front-door adjustment is an important special case where, given the graph, causal effect of the treatment on the target can be estimated using post-treatment variables. However, the exact formula for front-door adjustment depends on the structure of the graph, which is difficult to learn in practice. In this work, we provide testable conditional independence statements to compute the causal effect using front-door-like adjustment without knowing the graph under limited structural side information. We show that our method is applicable in scenarios where knowing the Markov equivalence class is not sufficient for causal effect estimation. We demonstrate the effectiveness of our method on a class of random graphs as well as real causal fairness benchmarks. |
Abhin Shah · Karthikeyan Shanmugam · Murat Kocaoglu
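For context, the classical front-door adjustment that this work generalizes identifies the effect of a treatment $X$ on an outcome $Y$ through a mediator $M$ even under unobserved treatment-outcome confounding; it is quoted here only as background.

```latex
% Classical front-door adjustment (Pearl), given for context.
\[
  P(y \mid \mathrm{do}(x)) \;=\; \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
\]
```

The paper's contribution is a set of testable conditional independence statements under which a front-door-like formula of this type can be applied with only limited structural side information, rather than full knowledge of the graph.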
Group Fairness with Uncertainty in Sensitive Attributes (Poster)
Learning a fair predictive model is crucial to mitigate biased decisions against minority groups in high-stakes applications. A common approach to learn such a model involves solving an optimization problem that maximizes the predictive power of the model under an appropriate group fairness constraint. However, in practice, sensitive attributes are often missing or noisy resulting in uncertainty. We demonstrate that solely enforcing fairness constraints on uncertain sensitive attributes can fall significantly short in achieving the level of fairness of models trained without uncertainty. To overcome this limitation, we propose a bootstrap-based algorithm that achieves the target level of fairness despite the uncertainty in sensitive attributes. The algorithm is guided by a Gaussian analysis for the independence notion of fairness where we propose a robust quadratically constrained quadratic problem to ensure a strict fairness guarantee with uncertain sensitive attributes. Our algorithm is applicable to both discrete and continuous sensitive attributes and is effective in real-world classification and regression tasks for various group fairness notions, e.g., independence and separation. |
Abhin Shah · Maohao Shen · Jongha Ryu · Subhro Das · Prasanna Sattigeri · Yuheng Bu · Gregory Wornell
Data Models for Dataset Drift Controls in Machine Learning With Optical Images (Poster)
This study addresses robustness concerns in machine learning due to dataset drift by integrating physical optics with machine learning to create explicit, differentiable data models. These models illuminate the impact of data generation on model performance and facilitate drift synthesis, precise tolerancing of model sensitivity (drift forensics), and beneficial drift creation (drift optimization). Accompanying the study are two datasets, Raw-Microscopy and Raw-Drone, available at ANONYMIZED. |
Luis Oala · Marco Aversa · Gabriel Nobis · Kurt Willis · Yoan Neuenschwander · Michèle Buck · Christian Matek · Jerome Extermann · Enrico Pomarico · Wojciech Samek · Roderick Murray-Smith · Christoph Clausen · Bruno Sanguinetti
Arbitrary Decisions are a Hidden Cost of Differentially Private Training (Poster)
Mechanisms used in privacy-preserving machine learning often aim to guarantee differential privacy (DP) during model training. Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data (e.g., adding Gaussian noise to clipped gradients). We demonstrate that such randomization incurs predictive multiplicity: for a given input example, the output predicted by equally-private models depends on the randomness used in training. Thus, for a given input, the predicted output can vary drastically if a model is re-trained, even if the same training dataset is used. The predictive-multiplicity cost of DP training has not been studied, and is currently neither audited for nor communicated to model designers and stakeholders. We derive a bound on the number of re-trainings required to estimate predictive multiplicity reliably. We analyze--both theoretically and through extensive experiments--the predictive-multiplicity cost of three DP-ensuring algorithms: output perturbation, objective perturbation, and DP-SGD. We demonstrate that the degree of predictive multiplicity rises as the level of privacy increases, and is unevenly distributed across individuals and demographic groups in the data. Because randomness used to ensure DP during training explains predictions for some examples, our results highlight a fundamental challenge to the justifiability of decisions supported by differentially private models in high-stakes settings. We conclude that practitioners should audit the predictive multiplicity of their DP-ensuring algorithms before deploying them in applications of individual-level consequence. |
Bogdan Kulynych · Hsiang Hsu · Carmela Troncoso · Flavio Calmon
Out of the Ordinary: Spectrally Adapting Regression for Covariate Shift (Poster)
Designing deep neural network classifiers that perform robustly on distributions differing from the available training data is an active area of machine learning research. However, out-of-distribution generalization for regression---the analogous problem for modeling continuous targets---remains relatively unexplored. To tackle this problem, we return to first principles and analyze how the closed-form solution for ordinary least squares (OLS) regression is sensitive to covariate shift. We characterize the out-of-distribution risk of the OLS model in terms of the eigenspectrum decomposition of the source and target data. We then use this insight to propose a method for adapting the weights of the last layer of a pre-trained neural regression model to perform better on input data originating from a different distribution. We demonstrate how this lightweight spectral adaptation procedure can improve out-of-distribution performance in a suite of both synthetic and real-world experiments. |
Benjamin Eyre · Elliot Creager · David Madras · Vardan Papyan · Richard Zemel
Prediction without Preclusion: Recourse Verification with Reachable Sets (Poster)
Machine learning models are now used to decide who will receive a loan, a job interview, or a public service. Standard techniques to build these models use features that characterize people but overlook their \emph{actionability}. In domains like lending and hiring, models can assign predictions that are fixed – meaning that consumers denied loans and interviews are precluded from access to credit and employment. In this work, we introduce a formal testing procedure to flag models that assign fixed predictions called recourse verification. We develop machinery to reliably test the feasibility of recourse for any model under user-specified actionability constraints. We demonstrate how these tools can ensure recourse and adversarial robustness and use them to study the infeasibility of recourse in real-world lending datasets. Our results highlight how models can inadvertently assign fixed predictions that preclude access and motivate the need to design algorithms that account for actionability when developing models and providing recourse. |
Avni Kothari · Bogdan Kulynych · Lily Weng · Berk Ustun
Removing Multiple Biases through the Lens of Multi-task Learning (Poster)
We consider the problem of training an unbiased and accurate model using a biased dataset with multiple biases. One of the major challenges is to balance improving overall accuracy and ignoring all the biases. To address this, we provide a novel framework connecting the problem to multi-task learning (MTL). Specifically, our framework divides the training data into several groups according to their effects on model bias, and defines each task of MTL as solving the target problem for each group. It then trains a single model for all the tasks with a weighted sum of task-wise losses as the training objective, optimizing the weights as well as the model parameters. At the heart of our method lies the weight adjustment algorithm, which is rooted in the theory of multi-objective optimization and guarantees a Pareto-stationary solution. Our algorithm achieves the state of the art on two datasets with multiple biases and demonstrates superior performance on conventional single-bias datasets.
Nayeong Kim · Juwon Kang · Sungsoo Ahn · Jungseul Ok · Suha Kwak
Antibody DomainBed: Towards robust predictions using invariant representations of biological sequences carrying complex distribution shifts (Oral)
Recently, there has been an increased interest in accelerating drug design with machine learning (ML). Active ML-guided design of biological sequences with favorable properties involves multiple design cycles, in which (1) candidate sequences are proposed, (2) a subset of the candidates is selected using ML surrogate models trained to predict target properties of interest, and (3) a wet lab experimentally validates the selected sequences. The returned experimental results from one cycle provide valuable feedback for the next one, but the modifications they inspire in the candidate proposals or experimental protocol can lead to distribution shifts that impair the performance of surrogate models in the upcoming cycle. For the surrogate models to achieve consistent performance across cycles, we must explicitly account for the distribution shifts in their training. We turn to the notion of invariance and causal representation learning to achieve robustness across cycles. In particular, we apply domain generalization (DG) methods to develop invariant classifiers for predicting properties of therapeutic antibodies. We adapt a recent benchmark of DG algorithms, ``DomainBed,'' to deploy 23 algorithms across 5 domains, or cycle numbers. Our results confirm that invariant features lead to better predictive performance for out-of-distribution domains. |
Natasa Tagasovska · Ji Won Park · Stephen Ra · Kyunghyun Cho
Learning Independent Causal Mechanisms (Poster)
In many real-world applications, we consider a system not in isolation but in multiple contexts with distribution shifts. This results in non-i.i.d. data which may contain spurious correlations that can heavily bias learning. Here, we are interested in modeling such data using a mixture of causal mechanisms. To this end, we consider the principle that causal mechanisms either remain invariant under distribution shift or change independently. While existing work formulates this idea using statistical independence, it is limited to discovering an equivalence class of the causal model unless additional assumptions are imposed. We propose using the algorithmic notion of independence, and introduce a nonparametric approach for discovering independent mechanisms using Gaussian processes. In empirical evaluations, we show that this approach allows to discover causal models beyond partially directed graphs while being robust to different data-generating processes. |
Sarah Mameche · David Kaltenpoth · Jilles Vreeken
Towards Fair Knowledge Distillation using Student Feedback (Poster)
With the advent of large-scale models and their success in diverse fields, Knowledge Distillation (KD) techniques are increasingly used to deploy them to edge devices with limited memory and computation constraints. However, most distillation works focus on improving the prediction performance of the student model, with little to no work studying the effect of distillation on key fairness properties, which is needed to ensure trustworthy distillation. In this work, we propose a fairness-driven distillation framework, BIRD (BIas-awaRe Distillation), which introduces a FAIRDISTILL operator to collect feedback from the student through a meta-learning-based approach and selectively distill teacher knowledge. We demonstrate that BIRD can be augmented with different KD methods to increase the performance of foundation models and convolutional neural networks. Extensive experiments across three fairness datasets show the efficacy of our framework over existing state-of-the-art KD methods, opening up new directions for developing trustworthy distillation techniques.
Abhinav Java · Surgan Jandial · Chirag Agarwal
Provable domain adaptation using privileged information (Oral)
Successful unsupervised domain adaptation is guaranteed only under strong assumptions such as covariate shift and overlap between input domains. The latter is often violated in high-dimensional applications such as image classification which, despite this challenge, continues to serve as inspiration and benchmark for algorithm development. In this work, we show that access to side information about examples from the source and target domains can help relax sufficient assumptions on input variables and increase sample efficiency at the cost of collecting a richer variable set. We call this unsupervised domain adaptation by learning using privileged information (DALUPI). Tailored for this task, we propose algorithms for both multi-class and multi-label classification tasks. In our experiments we demonstrate that incorporating privileged information in learning can reduce errors in domain transfer and increase sample efficiency compared to classical learning. |
Adam Breitholtz · Anton Matsson · Fredrik Johansson
A Cosine Similarity-based Method for Out-of-Distribution Detection (Poster)
The ability to detect OOD data is a crucial aspect of practical machine learning applications. In this work, we show that cosine similarity between the test feature and the typical ID feature is a good indicator of OOD data. We propose Class Typical Matching (CTM), a post hoc OOD detection algorithm that uses a cosine similarity scoring function. Extensive experiments on multiple benchmarks show that CTM outperforms existing post hoc OOD detection methods. |
Ngoc Hieu Nguyen · Nguyen Hung-Quang · The-Anh Ta · Thanh Nguyen-Tang · Khoa Doan · Hoang Thanh-Tung
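The scoring rule admits a compact sketch: take per-class mean ("typical") features from in-distribution training data and score a test feature by its maximum cosine similarity to any class mean, flagging low scores as OOD. The code below is one such reading of the description above; implementation details of CTM may differ.

```python
# Compact sketch of a cosine-similarity OOD score in the spirit of Class Typical
# Matching: compare a test feature to per-class mean in-distribution features and
# threshold on the maximum cosine similarity.
import numpy as np

def class_means(feats, labels):
    return np.stack([feats[labels == c].mean(axis=0) for c in np.unique(labels)])

def ctm_score(test_feats, means):
    a = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    b = means / np.linalg.norm(means, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)          # higher = more in-distribution

# Usage: flag x as OOD if ctm_score(phi(x), means) < tau, for a validation-chosen tau.
```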
Reviving Shift Equivariance in Vision Transformers (Poster)
Shift equivariance, integral to object recognition, is often disrupted in Vision Transformers (ViTs) by components such as patch embedding, subsampled attention, and positional encoding. Attempts to combine convolutional neural networks with ViTs have not fully addressed this issue. We propose an input-adaptive polyphase anchoring algorithm that integrates seamlessly into ViT models to ensure shift equivariance, and we employ depth-wise convolution to encode positional information. Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shifts and to demonstrate robustness to cropping, flipping, and affine transformations, whereas the original models lose 20 percentage points on average when shifted by just a few pixels, with Twins' accuracy dropping from 80.57% to 62.40%.
Peijian Ding · Davit Soselia · Thomas Armstrong · Jiahao Su · Furong Huang
Identifiability Guarantees for Causal Disentanglement from Soft Interventions (Poster)
Causal disentanglement aims to uncover a representation of data using latent variables that are interrelated via a causal model. Such a representation is identifiable if the latent model that explains the data is unique. In this work, we focus on the scenario where observational and interventional data are available, with each intervention changing the mechanism of a latent variable. When the causal variables are fully observed, statistically consistent algorithms have been developed to identify the causal model under faithfulness assumptions. We here show that identifiability can still be achieved with unobserved causal variables, given a generalized notion of faithfulness. Our results guarantee that we can recover the latent causal model up to an equivalence class and predict the effect of unseen combinations of interventions, in the limit of infinite data. |
Jiaqi Zhang · Chandler Squires · Kristjan Greenewald · Akash Srivastava · Karthikeyan Shanmugam · Caroline Uhler
Towards A Scalable Solution for Compositional Multi-Group Fair Classification (Poster)
Despite rich literature on fairness, relatively little attention has been paid to remediating complex compositional systems built on multi-label classifiers, with respect to many groups, to achieve equality of opportunity. In this paper, we first show that baseline approaches scale linearly with the product of number of remediated groups and the number of prediction labels, making them intractable in practice. We introduce two simple techniques to achieve a constant scaling in this multi-group multi-label setup. We report experimental results in academic and real-world environments to empirically demonstrate the effectiveness of our proposal at mitigation in this setup. |
James Atwood · Tina Tian · Ben Packer · Meghana Deodhar · Jilin Chen · Alex Beutel · Flavien Prost · Ahmad Beirami
Towards Modular Learning of Deep Causal Generative Models (Poster)
Shpitser & Pearl (2008) proposed sound and complete algorithms to compute identifiable observational, interventional, and counterfactual queries for certain causal graph structures. However, these algorithms assume that we can correctly estimate the joint distributions, which is impractical for high-dimensional datasets. With the current rise of foundation models, we have access to large pre-trained models that generate realistic high-dimensional samples. To address the causal inference problem with high-dimensional data, we propose a sequential adversarial training algorithm for learning deep causal generative models that divides the training problem into independent sub-parts, thereby enabling the use of such pre-trained models. Our proposed algorithm, WhatIfGAN, arranges generative models according to a causal graph and trains them to imitate the underlying causal model even with unobserved confounders. Finally, on a semi-synthetic Colored MNIST dataset, we show that WhatIfGAN can sample from identifiable causal queries involving high-dimensional variables.
Md Musfiqur Rahman · Murat Kocaoglu
Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding (Oral)
A prominent challenge of offline reinforcement learning (RL) is the issue of hidden confounding: unobserved variables may influence both the actions taken by the agent and the observed outcomes. Hidden confounding can compromise the validity of any causal conclusion drawn from data and presents a major obstacle to effective offline RL. In the present paper, we tackle the problem of hidden confounding in the nonidentifiable setting. We propose a definition of uncertainty due to hidden confounding bias, termed delphic uncertainty, which uses variation over world models compatible with the observations, and differentiate it from the well-known epistemic and aleatoric uncertainties. We derive a practical method for estimating the three types of uncertainties, and construct a pessimistic offline RL algorithm to account for them. Our method does not assume identifiability of the unobserved confounders, and attempts to reduce the amount of confounding bias. We demonstrate through extensive experiments and ablations the efficacy of our approach on a sepsis management benchmark, as well as on electronic health records. Our results suggest that nonidentifiable hidden confounding bias can be mitigated to improve offline RL solutions in practice. |
Alizée Pace · Hugo Yèche · Bernhard Schölkopf · Gunnar Ratsch · Guy Tennenholtz
C-Disentanglement: Discovering Causally-Independent Generative Factors under an Inductive Bias of Confounder (Poster)
Representation learning assumes that real-world data is generated by a few causally disentangled generative factors (i.e., sources of variation). However, most existing works assume unconfoundedness (i.e., there are no common causes of the generative factors) in the discovery process, and thus obtain only statistical independence. In this paper, we recognize the importance of modeling confounders in discovering causal generative factors. Unfortunately, such factors are not identifiable without a proper inductive bias. We fill the gap by introducing a framework named Confounded-Disentanglement (C-Disentanglement), the first framework that explicitly introduces an inductive bias on the confounder via labels/knowledge from domain expertise. We further propose an approach for sufficient identification under the VAE framework.
Xiaoyu Liu · Jiaxin Yuan · Bang An · Yuancheng Xu · Yifan Yang · Furong Huang
-
|
Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning
(
Poster
)
link »
Mitigating spurious correlations during pre-training for large-scale multi-modal models can be costly and impractical. This paper proposes a novel approach to address spurious correlations during fine-tuning for a given domain of interest. With a focus on multi-modal models (e.g., CLIP), the proposed method leverages the different modalities in these models to detect and explicitly set apart spurious attributes from the affected class, achieved through a multi-modal contrastive loss function that expresses spurious relationships through language. Our experimental results and in-depth visualizations on CLIP show that such an intervention can effectively i) improve the model's accuracy when spurious attributes are not present, and ii) direct the model's activation maps towards the actual class rather than the spurious attribute when present. In particular, on the Waterbirds dataset, our algorithm achieves a worst-group accuracy 23% higher than ERM on CLIP with a ResNet-50 backbone, and 32% higher on CLIP with a ViT backbone, while maintaining the same average accuracy as ERM. |
Yu Yang · Besmira Nushi · Hamid Palangi · Baharan Mirzasoleiman 🔗 |
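The language-guided separation idea above can be illustrated with a minimal sketch (not the authors' loss or code): fine-tune an image projection so that image features align with a class text prompt and are pushed away from a spurious-attribute text prompt. The tiny random encoder and prompt embeddings below are stand-ins for CLIP's image and text towers.

```python
# Minimal sketch: push image features toward a class-text embedding and away from a
# spurious-attribute-text embedding via a contrastive-style loss.
# The random encoder and prompt embeddings are stand-ins for CLIP's towers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
dim, n = 64, 32
image_encoder = torch.nn.Linear(128, dim)                  # stand-in for the CLIP image tower
class_text_emb = F.normalize(torch.randn(dim), dim=0)      # e.g. text prompt for the true class
spurious_text_emb = F.normalize(torch.randn(dim), dim=0)   # e.g. text prompt for the spurious attribute

images = torch.randn(n, 128)                               # stand-in image inputs / features
opt = torch.optim.Adam(image_encoder.parameters(), lr=1e-3)

for step in range(100):
    z = F.normalize(image_encoder(images), dim=1)
    sim_class = z @ class_text_emb                         # alignment with the class prompt
    sim_spur = z @ spurious_text_emb                       # alignment with the spurious prompt
    # Encourage class alignment and discourage spurious alignment (temperature 0.1).
    loss = -F.log_softmax(torch.stack([sim_class, sim_spur], dim=1) / 0.1, dim=1)[:, 0].mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("final contrastive loss:", loss.item())
```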
-
|
Adversarial Data Augmentations for Out-of-Distribution Generalization
(
Poster
)
link »
Out-of-distribution (OoD) generalization is needed when representation learning encounters a distribution shift, which frequently happens in practice when training and testing data come from different environments. Covariate shift is a type of distribution shift that occurs only in the input data while keeping the concept distribution invariant. We propose RIA (Regularization for Invariance with Adversarial training), a new method for OoD generalization that performs an adversarial search for training data environments. These adversarial data augmentations prevent a collapse to an in-distribution trained learner. The method is compatible with many existing OoD generalization methods for covariate shift that can be formulated as constrained optimization problems. We perform extensive experiments on OoD graph classification for various kinds of synthetic and natural distribution shifts and demonstrate that our method achieves high accuracy compared with OoD baselines. |
Simon Zhang · Ryan DeMilt · Kun Jin · Cathy Honghui Xia 🔗 |
-
|
Identifying Causal Mechanism Shifts among Nonlinear Additive Noise Models
(
Poster
)
link »
Structural causal models (SCMs) are widely used in various disciplines to represent causal relationships among variables in complex systems. Unfortunately, the true underlying directed acyclic graph (DAG) structure is often unknown, and determining it from observational or interventional data remains a challenging task. However, in many situations, the end goal is to identify changes (shifts) in causal mechanisms between related SCMs rather than recovering the entire underlying DAG structure. This paper focuses on identifying mechanism shifts in two or more related SCMs over the same set of variables, without estimating the entire DAG structure of each SCM. We assume that each SCM belongs to the class of nonlinear additive noise models. We prove a surprising result: the Jacobian of the score function for the mixture distribution reveals information about shifts in general non-parametric functional mechanisms. Once the shifted variables are identified, we leverage recent work to estimate the structural differences (if any) for the shifted variables. Experiments on synthetic and real-world data are provided. |
Tianyu Chen · Kevin Bello · Bryon Aragam · Pradeep Ravikumar 🔗 |
-
|
Mitigating Simplicity Bias in Deep Learning for Improved OOD Generalization and Robustness
(
Poster
)
link »
Neural networks are known to exhibit simplicity bias (SB) where they tend to prefer learning 'simple' features over more 'complex' ones, even when the latter may be more informative. SB can lead to the model making biased predictions which have poor out-of-distribution (OOD) generalization and robustness. To address this, we propose a framework that encourages the model to use a more diverse set of features to make predictions. We first train a simple model, and then regularize the conditional mutual information with respect to it to obtain the final model. We demonstrate the effectiveness of this framework in various problem settings and real-world applications, showing that it effectively addresses SB, and enhances OOD generalization, sub-group robustness and fairness. We complement these results with theoretical analyses of the effect of the regularization and its OOD generalization properties. |
Bhavya Vasudeva · Kameron Shahabi · Vatsal Sharan 🔗 |
-
|
Look Beneath the Surface: Exploiting Fundamental Symmetry for Sample-Efficient Offline Reinforcement Learning
(
Poster
)
link »
Offline reinforcement learning (RL) offers an appealing approach to real-world tasks by learning policies from pre-collected datasets. However, the performance of existing offline RL algorithms heavily depends on the scale and state-action space coverage of datasets. Real-world data collection is often expensive and uncontrollable, leading to small and narrowly covered datasets and posing significant challenges for practical deployments of offline RL. In this paper, we propose a Time-reversal symmetry (T-symmetry) enforced Dynamics Model (TDM), which establishes consistency between a pair of forward and reverse latent dynamics. TDM provides both well-behaved representations for small datasets and a new reliability measure for OOD samples based on compliance with the T-symmetry. These can be readily used to construct a new offline RL algorithm (TSRL) with less conservative policy constraints and a reliable latent space data augmentation procedure. We find TSRL achieves great performance on small benchmark datasets with as few as 1% of the original samples. |
PENG CHENG · Xianyuan Zhan · Zhihao Wu · Wenjia Zhang · Youfang Lin · Shou cheng Song · Han Wang 🔗 |
-
|
Learning Linear Causal Representations from Interventions under General Nonlinear Mixing
(
Poster
)
link »
We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of causal identifiability from non-paired interventions for deep neural network embeddings. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks. |
Simon Buchholz · Goutham Rajendran · Elan Rosenfeld · Bryon Aragam · Bernhard Schölkopf · Pradeep Ravikumar 🔗 |
-
|
Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection
(
Poster
)
link »
Disentanglement aims to recover meaningful latent ground-truth factors from only the observed distribution. Identifiability provides the theoretical grounding for disentanglement to be well-founded. Unfortunately, unsupervised identifiability of independent latent factors is a theoretically proven impossibility in the i.i.d. setting under a general nonlinear smooth map from factors to observations. In this work, we show that, remarkably, it is possible to recover discretized latent coordinates under the most general smooth mapping (a diffeomorphism) without any additional inductive bias on the mapping. This holds provided the latent density has axis-aligned discontinuity landmarks, and it does not require the unrealistic assumption of statistical independence of the factors. We introduce this novel form of identifiability and provide a comprehensive proof of the recovery of discretized coordinates. |
Vitória Barin-Pacela · Kartik Ahuja · Simon Lacoste-Julien · Pascal Vincent 🔗 |
-
|
Neuro-Causal Factor Analysis
(
Poster
)
link »
We revisit nonlinear factor analysis from a comparatively new perspective given by advancements in causal discovery and deep learning, introducing a framework for \emph{Neuro-Causal Factor Analysis (NCFA)}. Our approach is fully nonparametric: It identifies factors via latent causal discovery methods and then uses a variational autoencoder (VAE) that is constrained to abide by the Markov factorization of the distribution with respect to the learned graph. We evaluate NCFA on real and synthetic data sets, finding that it performs comparably to standard VAEs on data reconstruction tasks but with the advantages of sparser architecture, lower model complexity, and causal interpretability. Unlike traditional factor analysis methods, our NCFA method allows learning and reasoning about the latent factors underlying observed data from a justifiably causal perspective, even when the relations between factors and measurements are highly nonlinear. |
Alex Markham · Mingyu Liu · Bryon Aragam · Liam Solus 🔗 |
-
|
Deep Neural Networks Extrapolate Cautiously (Most of the Time)
(
Poster
)
link »
Conventional wisdom suggests that neural network predictions tend to be unpredictable and overconfident when faced with out-of-distribution (OOD) inputs. Our work reassesses this assumption, particularly for neural networks with high-dimensional inputs. We find that as input data becomes increasingly OOD, neural network predictions actually tend to converge towards a constant value, rather than extrapolating in arbitrary ways. Furthermore, this value often closely approximates the optimal input-independent solution that minimizes training loss, which corresponds to a more cautious prediction for many common machine learning losses. Our empirical investigation suggests that this phenomenon exists across a broad array of datasets, distributional shifts, and loss functions. Finally, we study the mechanism responsible for this observed behavior, providing both an empirical and theoretical analysis. |
Katie Kang · Amrith Setlur · Claire Tomlin · Sergey Levine 🔗 |
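As a quick illustration of the "optimal input-independent solution" mentioned above: for cross-entropy, that constant prediction is simply the training label marginal, and its loss is the empirical label entropy. The toy labels below are made up; whether a given network actually reverts to this value under shift is the empirical question the paper studies, which this sketch does not test.

```python
# For cross-entropy, the optimal input-independent ("constant") prediction is the
# marginal label distribution of the training set; its loss is the empirical label entropy.
import numpy as np

rng = np.random.default_rng(0)
train_labels = rng.choice(3, size=1000, p=[0.5, 0.3, 0.2])   # toy 3-class training labels

marginal = np.bincount(train_labels, minlength=3) / len(train_labels)
ocs_loss = -np.mean(np.log(marginal[train_labels]))          # loss of the constant predictor

print("optimal constant prediction:", marginal)
print("its training cross-entropy :", ocs_loss)
```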
-
|
Approximate Causal Effect Identification under Weak Confounding
(
Poster
)
link »
In this paper, we analyze the effect of “weak confounding” on causal estimands. More specifically, under the assumption that the unobserved confounders that render a query non-identifiable have small entropy, we propose an efficient linear program to derive the upper and lower bounds of the causal effect. We show that our bounds are consistent in the sense that as the entropy of unobserved confounders goes to zero, the gap between the upper and lower bound vanishes. Finally, we conduct synthetic and real data simulations to compare our bounds with the bounds obtained by the existing work that cannot incorporate such entropy constraints and show that our bounds are tighter for the setting with weak confounders. |
Ziwei Jiang · Lai Wei · Murat Kocaoglu 🔗 |
-
|
Large Dimensional Change Point Detection with FWER Control as Automatic Stopping
(
Poster
)
link »
We propose a statistical inference method for detecting change points in time-series of large panel data. The change points can have a general impact on different subsets of the panel. Our novel statistical perspective for high-dimensional change point detection combines selective inference and multiple testing. Our easy-to-use and computationally efficient procedure has two stages: first, LASSO regressions for each time-series screen a candidate set of change points; second, we apply post-selection inference with a novel multiple testing adjustment to select the change points. Our method controls the panel family-wise error rate with theoretical guarantees, hence guarding against p-hacking without the need for tuning parameters. In extensive simulations, our method outperforms leading benchmarks in terms of correct selections and false discoveries: it has higher detection power and makes fewer Type I errors, leading to over 20% higher F1 classification scores. |
Jiacheng Zou · Yang Fan · Markus Pelger 🔗 |
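A minimal sketch of the first (screening) stage for a single series, under the assumption that candidate change points are encoded as step regressors and screened by a LASSO fit; the post-selection multiple-testing stage with FWER control is not shown, and the penalty value is arbitrary.

```python
# Screening stage sketch: regress one series on step indicators (one per candidate change
# point) with a LASSO penalty; nonzero coefficients screen a candidate set of change points.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
T = 200
y = rng.normal(size=T)
y[80:] += 1.5             # a mean shift at t = 80
y[150:] -= 1.0            # another shift at t = 150

X = np.tril(np.ones((T, T)), k=-1)            # column c is a step that turns on at time c + 1
model = Lasso(alpha=0.1, max_iter=10000).fit(X, y)
candidates = np.flatnonzero(model.coef_) + 1  # screened candidate change points
print("screened candidate change points:", candidates)
```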
-
|
Robust Learning with Progressive Data Expansion Against Spurious Correlation
(
Poster
)
link »
While deep learning models have shown remarkable performance in various tasks, they are susceptible to learning non-generalizable spurious features rather than the core features that are genuinely correlated to the true label. In this paper, beyond existing analyses of linear models, we theoretically examine the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features. In light of our theory, we propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance. PDE begins with a group-balanced subset of training data and progressively expands it to facilitate the learning of the core features. Experiments on synthetic and real-world benchmark datasets confirm the superior performance of our method on models such as ResNets and Transformers. |
Yihe Deng · Yu Yang · Baharan Mirzasoleiman · Quanquan Gu 🔗 |
-
|
Towards Understanding Feature Learning in Out-of-Distribution Generalization
(
Poster
)
link »
A common explanation for the failure of out-of-distribution (OOD) generalization is that the model trained with empirical risk minimization (ERM) learns spurious features instead of invariant features. However, several recent studies challenged this explanation and found that deep networks may have already learned sufficiently good features for OOD generalization. Despite the apparent contradiction, we theoretically show that ERM essentially learns both spurious and invariant features, while ERM tends to learn spurious features faster when the spurious correlation is stronger. Moreover, when the ERM-learned features are fed to OOD objectives, the quality of the learned invariant features significantly affects the final OOD performance, as OOD objectives rarely learn new features. Therefore, ERM feature learning can be a bottleneck to OOD generalization. To alleviate this reliance, we propose Feature Augmented Training (FAT), which enforces the model to learn richer features ready for OOD generalization. FAT iteratively augments the model to learn new features while retaining the already learned features. In each round, the retention and augmentation operations are performed on different subsets of the training data that capture distinct features. Extensive experiments show that FAT effectively learns richer features, thus boosting the performance of various OOD objectives. |
Yongqiang Chen · Wei Huang · Kaiwen Zhou · Yatao Bian · Bo Han · James Cheng 🔗 |
-
|
Spuriosity Rankings for Free: A Simple Framework for Last Layer Retraining Based on Object Detection
(
Poster
)
link »
Deep neural networks have exhibited remarkable performance in various domains. However, the reliance of these models on spurious features has raised concerns about their reliability. A promising solution to this problem is last-layer retraining, which involves retraining the linear classifier head on a small subset of data without spurious cues. Nevertheless, selecting this subset requires human supervision, which reduces its scalability. Moreover, spurious cues may still exist in the selected subset. To address this problem, we propose a novel ranking framework that leverages an open-vocabulary object detection technique to identify images without spurious cues. More specifically, we use the object detector to score the presence of the target object in each image. The images are then sorted based on this score, and the last layer of the model is retrained on the subset of data with the highest scores. Our experiments on the ImageNet-1k dataset demonstrate the effectiveness of this ranking framework in sorting images based on spuriousness and using them for last-layer retraining. |
Mohammad Azizmalayeri · reza abbasi · Amir Hosein Haji Mohammad rezaie · Reihaneh Zohrabi · Mahdi Amiri · Mohammad Manzuri · Mohammad H Rohban 🔗 |
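A minimal sketch of the ranking-and-retraining recipe, assuming detector confidences and frozen backbone features are already computed; `detector_scores`, `features`, and the keep fraction are placeholders rather than the authors' pipeline.

```python
# Rank images by an (assumed precomputed) object-detector confidence for the labeled object,
# keep the top-scoring subset per class, and retrain only a linear head on frozen features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, n_classes, keep_frac = 2000, 128, 10, 0.2

features = rng.normal(size=(n, d))            # frozen backbone features (placeholder)
labels = rng.integers(0, n_classes, size=n)
detector_scores = rng.random(n)               # confidence that the labeled object is present

keep = []
for c in range(n_classes):
    idx = np.where(labels == c)[0]
    k = max(1, int(keep_frac * len(idx)))
    keep.extend(idx[np.argsort(detector_scores[idx])[-k:]])   # highest-scoring images per class
keep = np.array(keep)

head = LogisticRegression(max_iter=1000).fit(features[keep], labels[keep])
print("retrained head on", len(keep), "low-spuriosity images; train acc:",
      head.score(features[keep], labels[keep]))
```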
-
|
Uncertainty-Guided Online Test-Time Adaptation via Meta-Learning
(
Poster
)
link »
In real-world scenarios, machine learning systems may continually experience distributional shifts due to many different factors in the test environment, which makes the predictions unreliable. For this reason, it is important to learn a model that can robustly adapt to the environment in an online manner. In this work, we propose to meta-learn how to guide unsupervised online adaptation by taking into account the uncertainty in the predictions. Generally, all unlabeled test samples are equally incorporated for online test-time adaptation. However, uncertain samples can negatively affect the adaptation performance. Thus, we enable the model to adaptively learn test samples by quantifying the uncertainty during test-time online adaptation. We experimentally show that our uncertainty-guided online adaptation provides improved robustness and adaptation performance during test-time on image classification tasks with some distributional shift. |
kyubyung chae · Taesup Kim 🔗 |
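A minimal sketch of uncertainty-weighted test-time adaptation: perform entropy minimization on unlabeled test batches while down-weighting uncertain samples so they contribute less to the online update. The fixed exponential weighting below is an assumption standing in for the paper's meta-learned guidance.

```python
# Uncertainty-weighted entropy minimization on a stream of unlabeled test batches.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(32, 5)                   # stand-in classifier head
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

for step in range(50):                           # online stream of test batches
    x = torch.randn(16, 32)                      # unlabeled test batch
    probs = F.softmax(model(x), dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)   # per-sample uncertainty
    weights = torch.exp(-entropy).detach()       # uncertain samples get small weight
    loss = (weights * entropy).sum() / weights.sum()
    opt.zero_grad(); loss.backward(); opt.step()

print("final weighted entropy:", loss.item())
```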
-
|
Stabilizing GNN for Fairness via Lipschitz Bounds
(
Poster
)
link »
The Lipschitz bound, a technique from robust statistics, limits how much the output can change with respect to the input, including changes driven by associated irrelevant biased factors. It provides an efficient and provable method for examining the output stability of machine learning models without incurring additional computation costs. However, there has been no previous research investigating Lipschitz bounds for Graph Neural Networks (GNNs), especially in the context of non-Euclidean data with inherent biases. This poses a challenge for constraining GNN output perturbations induced by input biases and for ensuring fairness during training. This paper addresses this gap by formulating a Lipschitz bound for GNNs operating on attributed graphs, and by analyzing how the Lipschitz constant can constrain output perturbations induced by biases for fairness training. The effectiveness of the Lipschitz bound is experimentally validated in limiting model output biases. Additionally, from a training-dynamics perspective, we demonstrate how the theoretical Lipschitz bound can effectively guide GNN training to balance accuracy and fairness. |
Yaning Jia · Chunhui Zhang 🔗 |
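To make the Lipschitz idea concrete, here is a crude upper bound for a GCN-style stack with 1-Lipschitz activations: the product of the spectral norms of the normalized adjacency and the layer weights bounds how much an input perturbation can be amplified in the output. This generic product bound is only an illustration, not the tighter bound derived in the paper.

```python
# Crude Lipschitz-style upper bound for a GCN stack with 1-Lipschitz activations:
# product of spectral norms of the propagation matrix and the layer weight matrices.
import numpy as np

rng = np.random.default_rng(0)
n_nodes, dims = 20, [16, 32, 8]

A = (rng.random((n_nodes, n_nodes)) < 0.2).astype(float)
A = np.maximum(A, A.T); np.fill_diagonal(A, 1.0)                 # undirected graph + self-loops
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]            # symmetric normalization

weights = [rng.normal(scale=0.3, size=(dims[i], dims[i + 1])) for i in range(len(dims) - 1)]

lipschitz_bound = 1.0
for W in weights:
    lipschitz_bound *= np.linalg.norm(A_hat, 2) * np.linalg.norm(W, 2)   # per-layer amplification

print("upper bound on output change per unit input change:", lipschitz_bound)
```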
-
|
SAFE: Stable Feature Extraction without Environment Labels
(
Poster
)
link »
We study the problem of stable (or invariant) feature extraction for OOD generalization when environment labels are unknown. Prior works extract stable features by first inferring pseudo environment labels, before applying invariant learning methods like invariant risk minimization (IRM). These methods are highly sensitive to hyper-parameters and model selection strategies. Moreover, recent work shows that it is not always possible to identify stable features. In this paper, we present sufficient conditions under which stable features can be directly identified without environment labels. Using these conditions, we provide a practical algorithm called StAble Feature Extraction ($\textbf{SAFE}$), which selects stable features without inferring environment labels. We show that SAFE accurately removes spurious features and selects stable features on synthetic as well as a diverse range of real-world datasets, improving the OOD performance and calibration of ERM as well as prior invariant learning algorithms. Our work highlights the inefficacy of current invariant learning methods, and calls for more attention to the identifiability problem of stable features.
|
Aayush Mishra · Anqi Liu 🔗 |
-
|
Leveraging Task Structures for Improved Identifiability in Neural Network Representations
(
Poster
)
link »
This work extends the theory of identifiability in supervised learning by considering the consequences of having access to a distribution of tasks. In such cases, we show that identifiability is achievable even in the case of regression, extending prior work restricted to the single-task classification case. Furthermore, we show that the existence of a task distribution which defines a conditional prior over latent variables reduces the equivalence class for identifiability to permutations and scaling, a much stronger and more useful result. When we further assume a causal structure over these tasks, our approach enables simple maximum marginal likelihood optimization together with downstream applicability to causal representation learning. Empirically, we validate that our model outperforms more general unsupervised models in recovering canonical representations for arbitrary non-linear data arising from randomly initialized neural networks. |
Wenlin Chen · Julien Horwood · Juyeon Heo · Jose Miguel Hernandez-Lobato 🔗 |
-
|
Contextual Vision Transformers for Robust Representation Learning
(
Poster
)
link »
We present Contextual Vision Transformers (ContextViT), a method for producing robust feature representations for images exhibiting grouped structure such as covariates. ContextViT introduces an extra context token to encode group-specific information, allowing the model to explain away group-specific covariate structures while keeping core visual features shared across groups. Specifically, given an input image, ContextViT appends to the input image tokens a context token shared by all images with the same covariate, capturing the effect of conditioning the model on group membership. We furthermore introduce a context inference network to predict such tokens on the fly given a few samples from a group distribution, enabling ContextViT to generalize to new testing distributions at inference time. We demonstrate the performance of ContextViT through a diverse range of applications. |
Yujia Bao · Theofanis Karaletsos 🔗 |
-
|
Learning Counterfactually Invariant Predictors
(
Poster
)
link »
Notions of counterfactual invariance have proven essential for predictors that are fair, robust, and generalizable in the real world. We propose simple graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of (conditional independence in) the observational distribution. Any predictor that satisfies our criterion is provably counterfactually invariant. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactual Invariance Prediction (CIP), building on a kernel-based conditional dependence measure called Hilbert-Schmidt Conditional Independence Criterion (HSCIC). Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various simulated and real-world datasets including scalar and multi-variate settings. |
Francesco Quinzan · Cecilia Casolo · Krikamol Muandet · Yucen Luo · Niki Kilbertus 🔗 |
-
|
Concept Algebra for Score-based Conditional Model
(
Poster
)
link »
This paper concerns the structure of learned representations in text-guided generative models, focusing on score-based models. A key property of such models is that they can compose disparate concepts in a 'disentangled' manner. This suggests these models have internal representations that encode concepts in a 'disentangled' manner. Here, we focus on the idea that concepts are encoded as subspaces of some representation space. We formalize what this means, show there's a natural choice for the representation, and develop a simple method for identifying the part of the representation corresponding to a given concept. In particular, this allows us to manipulate the concepts expressed by the model through algebraic manipulation of the representation. We demonstrate the idea with examples using Stable Diffusion. |
Zihao Wang · Lin Gui · Jeffrey Negrea · Victor Veitch 🔗 |
-
|
Tackling Shortcut Learning in Deep Neural Networks: An Iterative Approach with Interpretable Models
(
Poster
)
link »
We use concept-based interpretable models to mitigate shortcut learning; existing methods lack interpretability. Beginning with a blackbox (BB) model, we iteratively carve out a mixture of interpretable experts (MoIE) and a residual network. Each expert explains a subset of data using First Order Logic (FOL). While explaining a sample, the FOL from the biased BB-derived MoIE detects the shortcut effectively. Fine-tuning the BB with Metadata Normalization (MDN) eliminates the shortcut, and the FOLs from the fine-tuned-BB-derived MoIE verify its elimination. Our experiments show that MoIE does not hurt the accuracy of the original BB and eliminates shortcuts effectively. |
Shantanu Ghosh · Ke Yu · Forough Arabshahi · Kayhan Batmanghelich 🔗 |
-
|
Invariant Causal Set Covering Machines
(
Poster
)
link »
Rule-based models, such as decision trees, appeal to practitioners due to their interpretable nature. However, the learning algorithms that produce such models are often vulnerable to spurious associations and thus, they are not guaranteed to extract causally-relevant insights. In this work, we build on ideas from the invariant causal prediction literature to propose Invariant Causal Set Covering Machines, an extension of the classical Set Covering Machine algorithm for conjunctions/disjunctions of binary-valued rules that provably avoids spurious associations. We demonstrate both theoretically and empirically that our method can identify the causal parents of a variable of interest in polynomial time. |
Thibaud Godon · Baptiste Bauvin · Pascal Germain · Jacques Corbeil · Alexandre Drouin 🔗 |
-
|
Replicable Reinforcement Learning
(
Poster
)
link »
The replicability crisis in the social, behavioral, and data sciences has led to the formulation of algorithm frameworks for replicability --- i.e., a requirement that an algorithm produce identical outputs (with high probability) when run on two different samples from the same underlying distribution. While still in its infancy, provably replicable algorithms have been developed for many fundamental tasks in machine learning and statistics, including statistical query learning, the heavy hitters problem, and distribution testing. In this work we initiate the study of replicable reinforcement learning, providing a provably replicable algorithm for parallel value iteration, and a provably replicable version of R-Max in the episodic setting. These are the first formal replicability results for control problems, which present different challenges for replication than batch learning settings. |
ERIC EATON · Marcel Hussing · Michael Kearns · Jessica Sorrell 🔗 |
-
|
(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy
(
Poster
)
link »
We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give \emph{estimates} that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. Our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100% of the time. The bound is inspired by $\mathcal{H}\Delta\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a "disagreement loss" which is theoretically justified and performs better in practice. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines.
|
Elan Rosenfeld · Saurabh Garg 🔗 |
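A minimal sketch of the "train a critic to disagree" step: fit a second head that agrees with the reference model on source features and pushes probability mass away from its predictions on target features, then read off the empirical disagreement gap. The bounded log-loss used for disagreement here is a simple proxy, not the paper's disagreement loss.

```python
# Train a critic to agree with a reference model on source data and disagree on target data,
# then measure the empirical disagreement gap between the two distributions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_classes = 64, 4
reference = torch.nn.Linear(d, n_classes)     # stand-in for the trained model h
critic = torch.nn.Linear(d, n_classes)        # critic h' trained to maximize disagreement
opt = torch.optim.Adam(critic.parameters(), lr=1e-2)

src, tgt = torch.randn(256, d), torch.randn(256, d) + 1.0     # source / shifted target features
with torch.no_grad():
    src_pred = reference(src).argmax(dim=1)
    tgt_pred = reference(tgt).argmax(dim=1)

for step in range(200):
    agree = F.cross_entropy(critic(src), src_pred)            # agree with h on source
    p_ref = F.softmax(critic(tgt), dim=1).gather(1, tgt_pred[:, None]).squeeze(1)
    disagree = -torch.log((1 - p_ref).clamp_min(1e-6)).mean() # push mass off h's class on target
    loss = agree + disagree
    opt.zero_grad(); loss.backward(); opt.step()

disc = ((critic(tgt).argmax(1) != tgt_pred).float().mean()
        - (critic(src).argmax(1) != src_pred).float().mean())
print("empirical disagreement discrepancy:", disc.item())
```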
-
|
Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation
(
Poster
)
link »
In this work, we consider one critical type of robustness against spurious correlation, where different portions of the state do not have a causal relationship but are correlated through unobserved confounders. These spurious correlations are ubiquitous in real-world tasks; for instance, a self-driving car usually observes heavy traffic in the daytime and light traffic at night due to unobservable human activity. A model that learns such useless or even harmful correlations could fail catastrophically when the confounder at test time deviates from the one at training time. Although well motivated, enabling robustness against spurious correlation poses significant challenges, since the uncertainty set, shaped by the unobserved confounder and the sequential structure of RL, is difficult to characterize and identify. To solve this issue, we propose Robust State-Confounded Markov Decision Processes (RSC-MDPs) and theoretically demonstrate their superiority in breaking spurious correlations compared with other robust RL formulations. We also design an empirical algorithm to learn the robust optimal policy for RSC-MDPs, which outperforms all baselines in eight realistic self-driving and manipulation tasks. |
Wenhao Ding · Laixi Shi · Yuejie Chi · Ding Zhao 🔗 |
-
|
Shortcut Detection with Variational Autoencoders
(
Poster
)
link »
For real-world applications of machine learning (ML), it is essential that models make predictions based on well-generalizing features rather than spurious correlations in the data. The identification of such spurious correlations, also known as shortcuts, is a challenging problem and has so far been scarcely addressed. In this work, we present a novel approach to detect shortcuts in image and audio datasets by leveraging variational autoencoders (VAEs). The disentanglement of features in the latent space of VAEs allows us to discover correlations in datasets and semi-automatically evaluate them for ML shortcuts. We demonstrate the applicability of our method on several real-world datasets and identify shortcuts that have not been discovered before. |
Nicolas Müller · Simon Roschmann · Shahbaz Khan · Philip Sperl · Konstantin Böttinger 🔗 |
-
|
Results on Counterfactual Invariance
(
Poster
)
link »
In this paper we provide a theoretical analysis of counterfactual invariance. We present a variety of existing definitions, study how they relate to each other, and examine their graphical implications. We then turn to the major open question surrounding counterfactual invariance: how does it relate to conditional independence? We show that while counterfactual invariance implies conditional independence, conditional independence does not give any implications about the degree or likelihood of satisfying counterfactual invariance. Furthermore, we show that for discrete causal models counterfactually invariant functions are often constrained to be functions of particular variables, or even constant. |
Jake Fawkes · Robin Evans 🔗 |
-
|
The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-language Models
(
Poster
)
link »
Compositionality is a common property in many modalities including natural languages and images, but the compositional generalization of multi-modal models is not well-understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image. We also propose a new metric for compositionality without such linguistic priors. |
Chenwei Wu · Li Li · Stefano Ermon · Patrick Haffner · Rong Ge · Zaiwei Zhang 🔗 |
-
|
Bridging the Domain Gap by Clustering-based Image-Text Graph Matching
(
Poster
)
link »
Learning domain-invariant representations is important to train a model that can generalize well to unseen domains. To this end, we propose a novel approach that leverages the semantic structures inherent in text descriptions as effective pivot embeddings for domain generalization. Specifically, we utilize graph representations of images and their associated textual descriptions to obtain domain-invariant pivot embeddings that capture the underlying semantic relationships between local image regions and text descriptors. Our approach involves a clustering-based graph-matching algorithm that matches graph-based image node features to textual graphs. Experimental results show the efficacy of our proposed method in enhancing the generalization ability of the model. |
Nokyung Park · Daewon Chae · Jeong Yong Shim · Sangpil Kim · Eun-Sol Kim · Jinkyu Kim 🔗 |
-
|
Group Robustness via Adaptive Class-Specific Scaling
(
Poster
)
link »
Group distributionally robust optimization, which aims to improve robust accuracies such as worst-group or unbiased accuracy, is one of the mainstream algorithmic approaches to mitigating spurious correlation and handling dataset bias. Existing approaches have apparently improved robust accuracy, but in fact these performance gains mainly come from trade-offs at the expense of average accuracy. To address this, we first propose a simple class-specific scaling strategy to control the trade-off between robust and average accuracies flexibly and efficiently, which is directly applicable to existing debiasing algorithms without additional training; it reveals that a naive ERM baseline, once equipped with class-specific scaling, matches or even outperforms recent debiasing approaches. We then employ this technique to evaluate the performance of existing algorithms comprehensively by introducing a novel unified metric that summarizes the trade-off between the two accuracies as a scalar value. We also develop an instance-wise adaptive scaling technique for overcoming the trade-off and improving performance even further in terms of both accuracies. We perform experiments on datasets in the computer vision and natural language processing domains and verify the effectiveness of the proposed frameworks. By considering the inherent trade-off, our frameworks provide meaningful insights into existing robust approaches beyond comparing only robust accuracy. |
Seonguk Seo · Bohyung Han 🔗 |
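The post-hoc nature of class-specific scaling can be illustrated in a few lines: scale the logits of one class by a factor s and sweep s to trace out the average versus worst-group accuracy trade-off, with no retraining. The toy logits and group structure below are fabricated for illustration.

```python
# Post-hoc class-specific scaling: sweep a scale factor on one class's logits and observe
# the trade-off between average and worst-group accuracy, without any retraining.
import numpy as np

rng = np.random.default_rng(0)
n = 4000
groups = rng.integers(0, 4, size=n)                  # (class, spurious attribute) groups 0..3
labels = (groups >= 2).astype(int)                   # class is determined by the group id
# Toy logits: the model is biased against the minority groups 1 and 2.
logits = np.stack([1.0 - labels + 0.3 * rng.normal(size=n),
                   labels + 0.3 * rng.normal(size=n)], axis=1)
logits[np.isin(groups, [1, 2]), 1] -= 0.8

def accuracies(scale):
    pred = (scale * logits[:, 1] > logits[:, 0]).astype(int)
    per_group = [np.mean(pred[groups == g] == labels[groups == g]) for g in range(4)]
    return np.mean(pred == labels), min(per_group)

for s in [1.0, 1.5, 2.0, 3.0]:
    avg, worst = accuracies(s)
    print(f"scale={s:.1f}  average={avg:.3f}  worst-group={worst:.3f}")
```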
-
|
ModelDiff: A Framework for Comparing Learning Algorithms
(
Oral
)
link »
We study the problem of (learning) algorithm comparison, where the goal is to find differences between models trained with two different learning algorithms. We begin by formalizing this goal as one of finding distinguishing feature transformations, i.e., input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present ModelDiff, a method that leverages the datamodels framework (Ilyas et al., 2022) to compare learning algorithms based on how they use their training data. Finally, we use ModelDiff to demonstrate how training image classifiers with standard data augmentation can amplify reliance on specific instances of co-occurence and texture biases. |
Harshay Shah · Sung Min (Sam) Park · Andrew Ilyas · Aleksander Madry 🔗 |
-
|
Improve Identity-Robustness for Face Models
(
Poster
)
link »
Despite the success of deep-learning models in many tasks, there have been concerns about such models learning shortcuts and about their lack of robustness to irrelevant confounders. For models trained directly on human faces, a sensitive confounder is human identity. Due to the privacy concerns and cost of identity annotations, improving identity-related robustness without such annotations is of great importance. Here, we explore using off-the-shelf face-recognition embedding vectors as proxies for identities to enforce such robustness. Given an identity-independent classification task and a face dataset, we propose to use the structure of the face-recognition embedding space to implicitly emphasize rare samples within each class. We do so by weighting samples according to their conditional inverse density (CID) in the proxy embedding space. Our experiments suggest that such a simple sample weighting scheme not only improves training robustness but also often improves overall performance as a result of that robustness. We also show that employing such constraints during training results in models that are significantly less sensitive to different levels of bias in the dataset. |
Qi Qi · Shervin Ardeshir 🔗 |
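A minimal sketch of density-based sample weighting in a proxy embedding space: estimate each sample's density within its class with a k-NN heuristic and weight samples by the inverse. The random embeddings stand in for off-the-shelf face-recognition embeddings, and this k-NN estimate is only one plausible instantiation of the CID weighting described above.

```python
# Weight samples by an inverse k-NN density estimate computed within each class
# of a proxy (face-recognition) embedding space; rare samples get larger weights.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, d, k = 500, 128, 10
embeddings = rng.normal(size=(n, d))               # stand-in for face-recognition embeddings
labels = rng.integers(0, 2, size=n)

weights = np.zeros(n)
for c in np.unique(labels):
    idx = np.where(labels == c)[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings[idx])
    dists, _ = nn.kneighbors(embeddings[idx])       # first neighbor is the point itself
    density = 1.0 / (dists[:, 1:].mean(axis=1) + 1e-8)   # crude kNN density proxy
    weights[idx] = 1.0 / density                    # inverse density: rare samples weigh more
weights /= weights.mean()                           # normalize to mean 1 for a weighted loss

print("weight range:", weights.min(), "-", weights.max())
```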
-
|
Impact of Noise on Calibration and Generalisation of Neural Networks
(
Poster
)
link »
Noise injection and data augmentation strategies have been effective for enhancing the generalisation and robustness of neural networks (NNs). Certain types of noise, such as label smoothing and MixUp, have also been shown to improve calibration. Since noise can be added at various stages of the NN's training, this motivates the question of when and where noise is most effective. We study a variety of noise types to determine how much they improve calibration and generalisation, and under what conditions. More specifically, we evaluate various noise-injection strategies in both in-distribution (ID) and out-of-distribution (OOD) scenarios. The findings highlight that activation noise was the most transferable and effective in improving generalisation, while input augmentation noise was prominent in improving calibration on OOD but not necessarily ID data. |
Martin Ferianc · Ondrej Bohdal · Timothy Hospedales · Miguel Rodrigues 🔗 |
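Activation-noise injection of the kind compared above can be sketched as a small wrapper module that perturbs a layer's activations during training only; where to place it and the noise scale are exactly the design choices the study varies, and the values below are arbitrary.

```python
# A wrapper module that adds Gaussian noise to activations during training only.
import torch
import torch.nn as nn

class GaussianActivationNoise(nn.Module):
    def __init__(self, std=0.1):
        super().__init__()
        self.std = std

    def forward(self, x):
        if self.training and self.std > 0:
            x = x + self.std * torch.randn_like(x)   # perturb activations in train mode
        return x

model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(), GaussianActivationNoise(std=0.1),
    nn.Linear(64, 10),
)
model.train()
print(model(torch.randn(4, 32)).shape)   # noise active in train mode, disabled after model.eval()
```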
-
|
Bias-to-Text: Debiasing Unknown Visual Biases by Language Interpretation
(
Poster
)
link »
Biases in models pose a critical issue when deploying machine learning systems, but diagnosing them in an explainable manner can be challenging. To address this, we introduce the bias-to-text (B2T) framework, which uses language interpretation to identify and mitigate biases in vision models, such as image classifiers and text-to-image generative models. Our language descriptions of visual biases provide explainable forms that enable the discovery of novel biases and effective model debiasing. To achieve this, we analyze common keywords in the captions of mispredicted or generated images. Here, we propose novel score functions to avoid biases in captions by comparing the similarities between bias keywords and those images. Additionally, we present strategies to debias zero-shot classifiers and text-to-image diffusion models using the bias keywords from the B2T framework. We demonstrate the effectiveness of our framework on various image classification and generation tasks. For classifiers, we discover a new spurious correlation between the keywords "(sports) player" and "female" in Kaggle Face and improve the worst-group accuracy on Waterbirds by 11% through debiasing, compared to the baseline. For generative models, we detect and effectively prevent unfair (e.g., gender-biased) and unsafe (e.g., "naked") image generation. |
Younghyun Kim · Sangwoo Mo · Minkyu Kim · Kyungmin Lee · Jaeho Lee · Jinwoo Shin 🔗 |
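A minimal sketch of the keyword analysis behind the B2T idea, under the assumption that captions for all images of a class and for the mispredicted ones are available: keywords that are much more frequent among error captions are flagged as candidate bias keywords. The toy captions and the frequency-difference score are placeholders for the paper's captioning model and score functions.

```python
# Compare keyword frequencies in captions of mispredicted images versus all images of a
# class; keywords over-represented among errors are candidate bias keywords.
from collections import Counter

captions_all = [
    "a bird standing on a branch in a forest",
    "a bird flying over the ocean",
    "a small bird near a lake with water",
    "a bird perched on a tree",
]
captions_wrong = [
    "a bird flying over the ocean",
    "a small bird near a lake with water",
]

def keyword_freqs(captions):
    counts = Counter(word for caption in captions for word in caption.split())
    return {word: counts[word] / len(captions) for word in counts}

freq_all, freq_wrong = keyword_freqs(captions_all), keyword_freqs(captions_wrong)
scores = {word: freq_wrong[word] - freq_all.get(word, 0.0) for word in freq_wrong}
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{word:>10s}  bias score {score:+.2f}")
```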
-
|
Studying Generalization on Memory-Based Methods in Continual Learning
(
Poster
)
link »
One of the objectives of Continual Learning is to learn new concepts continually over a stream of experiences while avoiding catastrophic forgetting. To mitigate complete knowledge overwriting, memory-based methods store a percentage of previous data distributions to be used during training. Although these methods produce good results, few studies have tested their out-of-distribution generalization properties or whether these methods overfit the replay memory. In this work, we show that although these methods can help with traditional in-distribution generalization, they can strongly impair out-of-distribution generalization by learning spurious features and correlations. In a controlled environment built with the Synbol benchmark generator (Lacoste et al., 2020), we demonstrate that this lack of out-of-distribution generalization mainly occurs in the linear classifier. |
Felipe del Rio · Julio Hurtado · Cristian Calderon · Alvaro Soto · Vincenzo Lomonaco 🔗 |
-
|
Causal Dynamics Learning with Quantized Local Independence Discovery
(
Poster
)
link »
Incorporating causal relationships between the variables into dynamics learning has emerged as a promising approach to enhance robustness and generalization in reinforcement learning (RL). Recent studies have focused on inferring the causal structure of the transition dynamics and leveraging only relevant subsets of the state and action variables for prediction or counterfactual reasoning. However, such approaches tend to overlook the fine-grained local independence relationships that exist among variables. In this work, we propose a novel approach to dynamics learning which infers event-specific causal relationships that hold under certain circumstances referred to as events. Our main idea is to learn a discrete latent variable that represents both the events and corresponding local causal structures via vector quantization. Compared to the prior models using the global causal structure, our approach provides a more detailed understanding of the dynamics by capturing event-specific causal relationships and locally invariant causal mechanisms. Experimental results demonstrate that our method successfully discovers event-specific causal structures, is robust to locally spurious correlations, and generalizes well to downstream tasks compared to previous approaches. |
Inwoo Hwang · Yunhyeok Kwak · Suhyung Choi · Byoung-Tak Zhang · Sanghack Lee 🔗 |
-
|
Shortcut Learning Through the Lens of Training Dynamics
(
Poster
)
link »
Deep Neural Networks (DNNs) are prone to learning *shortcut* patterns that damage the generalization of the DNN during deployment. This paper aims to better understand shortcut learning through the lens of the learning dynamics of the internal neurons during the training process. We make the following observations: (1) While previous works treat shortcuts as synonymous with spurious correlations, we emphasize that not all spurious correlations are shortcuts. We show that shortcuts are only those spurious features that are "easier" than the core features. (2) We build upon this premise and use *instance difficulty* methods (like Prediction Depth) to quantify "easy" and to identify this behavior during the training phase. (3) We empirically show that shortcut learning can be detected by observing the learning dynamics of the DNN's *early layers*. In other words, easy features learned by the initial layers of a DNN early during the training are potential shortcuts. We verify our claims on medical and vision datasets, both simulated and real, and justify the empirical success of our hypothesis by showing the theoretical connections between Prediction Depth and information-theoretic concepts like $\mathcal{V}$-usable information. Lastly, our experiments show the insufficiency of monitoring only accuracy plots during training (as is common in machine learning pipelines). We highlight the need for monitoring early training dynamics using example difficulty metrics.
|
Nihal Murali · Aahlad Puli · Ke Yu · Rajesh Ranganath · Kayhan Batmanghelich 🔗 |
-
|
Optimization or Architecture: What Matters in Non-Linear Filtering?
(
Poster
)
link »
In non-linear filtering, it is traditional to compare non-linear architectures such as neural networks to the standard linear Kalman Filter (KF). We observe that this methodology mixes the evaluation of two separate components: the non-linear architecture, and the numeric optimization method. In particular, the non-linear model is often optimized, whereas the reference KF model is not. We argue that both should be optimized similarly. We suggest the Optimized KF (OKF), which adjusts numeric optimization to the positive-definite KF parameters. We demonstrate how a significant advantage of a neural network over the KF may entirely vanish once the KF is optimized using OKF. This implies that experimental conclusions of certain previous studies were derived from a flawed process. The benefits of OKF over the non-optimized KF are further studied theoretically and empirically, where OKF demonstrates consistently improved accuracy in a variety of problems. |
Ido Greenberg · Netanel Yannay · Shie Mannor 🔗 |
-
|
Breaking the Spurious Causality of Conditional Generation via Fairness Intervention with Corrective Sampling
(
Poster
)
link »
To capture the relationship between samples and labels, conditional generative models often inherit spurious correlations from the training dataset. This can result in label-conditional distributions that are imbalanced with respect to another latent attribute. To mitigate this issue, which we call spurious causality, we propose a general two-step strategy. (a) Fairness Intervention (FI): emphasize the minority samples that are hard to generate due to the spurious correlation in the training dataset. (b) Corrective Sampling (CS): explicitly filter the generated samples and ensure that they follow the desired latent attribute distribution. We have designed the fairness intervention to work for various degrees of supervision on the spurious attribute, including unsupervised, weakly-supervised, and semi-supervised scenarios. Our experimental results demonstrate that the proposed FICS approach can effectively resolve spurious causality across various datasets. |
Jun Hyun Nam · Sangwoo Mo · Jaeho Lee · Jinwoo Shin 🔗 |
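The corrective-sampling step can be sketched as rejection sampling on generated examples so that a latent attribute matches a target distribution, assuming an attribute classifier has already labeled the samples; the toy attribute labels below are fabricated.

```python
# Rejection-sample generated examples so the latent attribute distribution matches a target
# (e.g., uniform), using attribute labels from an (assumed) attribute classifier.
import numpy as np

rng = np.random.default_rng(0)
generated_attrs = rng.choice([0, 1], size=10000, p=[0.9, 0.1])   # biased generator output
target = np.array([0.5, 0.5])                                    # desired attribute distribution

current = np.bincount(generated_attrs, minlength=2) / len(generated_attrs)
accept_prob = target / current
accept_prob = accept_prob / accept_prob.max()                    # keep probabilities in [0, 1]

keep = rng.random(len(generated_attrs)) < accept_prob[generated_attrs]
corrected = generated_attrs[keep]
print("attribute balance before:", current)
print("attribute balance after :", np.bincount(corrected, minlength=2) / len(corrected))
```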
-
|
Causal-structure Driven Augmentations for Text OOD Generalization
(
Poster
)
link »
In this work, we propose counterfactual data augmentation methods, guided by knowledge of the causal structure of the data, to simulate interventions on spurious features. Our main motivation is classifying medical notes, and we use these methods to learn more robust text classifiers. In prediction problems where the label is spuriously correlated with an attribute, and under certain assumptions, we show that this strategy is appropriate and can enjoy improved sample complexity compared to importance re-weighting. Pragmatically, we match examples using auxiliary data, based on diff-in-diff methodology, and use a large language model (LLM) to represent a conditional probability of text. Experiments on learning caregiver-invariant predictors of clinical diagnoses from medical narratives and on semi-synthetic data, demonstrate that our method improves out-of-distribution (OOD) accuracy. |
Amir Feder · Yoav Wald · Claudia Shi · Suchi Saria · David Blei 🔗 |
-
|
Weighted Risk Invariance for Density-Aware Domain Generalization
(
Poster
)
link »
Learning how to generalize training performance to unseen test distributions is essential to building robust, practically useful models. To this end, many recent studies focus on learning invariant (causal) features from multiple domains. However, the problem of distribution shift in the invariant features is not well studied, and existing invariant learning methods that ignore this possibility can struggle to generalize. In this work, we focus on finding invariant predictors from multiple, potentially shifted invariant feature distributions. We propose a novel optimization problem, Weighted Risk Invariance (WRI), and we show that the solution to this problem provably achieves out-of-distribution generalization. We also introduce an algorithm to practically solve the WRI problem that learns the density of invariant features and model parameters simultaneously, and we demonstrate our approach outperforms previous invariant learning methods under covariate shift in the invariant features. Finally, we show that the learned density over invariant features effectively detects when the features are out-of-distribution. To the best of our knowledge, ours is the first invariant learning method to provide informative density estimates on invariant features for the domain generalization problem. |
Gina Wong · Joshua Gleason · Rama Chellappa · Yoav Wald · Anqi Liu 🔗 |
Author Information
Yoav Wald (Johns Hopkins University)
Claudia Shi (Columbia University)
Aahlad Puli (NYU)
Amir Feder (Columbia University, Google)
Limor Gultchin (University of Oxford)
Mark Goldstein (New York University)
Maggie Makar (University of Michigan)
Victor Veitch (Google; University of Chicago)
Uri Shalit (Technion)
More from the Same Authors
-
2021 : Causal-BALD: Deep Bayesian Active Learning of Outcomes to Infer Treatment-Effects from Observational Data »
Andrew Jesson · Panagiotis Tigas · Joost van Amersfoort · Andreas Kirsch · Uri Shalit · Yarin Gal -
2022 : In the Eye of the Beholder: Robust Prediction with Causal User Modeling »
Amir Feder · Guy Horowitz · Yoav Wald · Roi Reichart · Nir Rosenfeld -
2022 : Invariant and Transportable Representations for Anti-Causal Domain Shifts »
Yibo Jiang · Victor Veitch -
2022 : A Unified Causal View of Domain Invariant Representation Learning »
Zihao Wang · Victor Veitch -
2022 : Causally motivated multi-shortcut identification and removal »
Jiayun Zheng · Maggie Makar -
2022 : Fairness and robustness in anti-causal prediction »
Maggie Makar · Alexander D'Amour -
2022 : Leveraging Factored Action Spaces for Efficient Offline Reinforcement Learning in Healthcare »
Shengpu Tang · Maggie Makar · Michael Sjoding · Finale Doshi-Velez · Jenna Wiens -
2022 : Fairness and robustness in anti-causal prediction »
Maggie Makar · Alexander D'Amour -
2023 : Towards Modular Machine Learning Pipelines »
Aditya Modi · JIVAT NEET KAUR · Maggie Makar · Pavan Mallapragada · Amit Sharma · Emre Kiciman · Adith Swaminathan -
2023 : Concept Algebra for Score-based Conditional Model »
Zihao Wang · Lin Gui · Jeffrey Negrea · Victor Veitch -
2023 : Shortcut Learning Through the Lens of Training Dynamics »
Nihal Murali · Aahlad Puli · Ke Yu · Rajesh Ranganath · Kayhan Batmanghelich -
2023 : Causal-structure Driven Augmentations for Text OOD Generalization »
Amir Feder · Yoav Wald · Claudia Shi · Suchi Saria · David Blei -
2023 : Weighted Risk Invariance for Density-Aware Domain Generalization »
Gina Wong · Joshua Gleason · Rama Chellappa · Yoav Wald · Anqi Liu -
2023 : In the Eye of the Beholder: Robust Prediction with Causal User Modeling »
Amir Feder · Nir Rosenfeld -
2023 : Birds of an Odd Feather: Guaranteed Out-of-Distribution (OOD) Novel Category Detection »
Yoav Wald · Suchi Saria -
2023 : Concept Algebra for Score-based Conditional Model »
Zihao Wang · Lin Gui · Jeffrey Negrea · Victor Veitch -
2023 : SCIS 2023 Panel, The Future of Generalization: Scale, Safety and Beyond »
Maggie Makar · Samuel Bowman · Zachary Lipton · Adam Gleave -
2023 Poster: B-Learner: Quasi-Oracle Bounds on Heterogeneous Causal Effects Under Hidden Confounding »
Miruna Oprescu · Jacob Dorn · Marah Ghoummaid · Andrew Jesson · Nathan Kallus · Uri Shalit -
2022 Workshop: Spurious correlations, Invariance, and Stability (SCIS) »
Aahlad Puli · Maggie Makar · Victor Veitch · Yoav Wald · Mark Goldstein · Limor Gultchin · Angela Zhou · Uri Shalit · Suchi Saria -
2021 : Live Panel Discussion »
Thomas Dietterich · Chelsea Finn · Kamalika Chaudhuri · Yarin Gal · Uri Shalit -
2021 Workshop: The Neglected Assumptions In Causal Inference »
Niki Kilbertus · Lily Hu · Laura Balzer · Uri Shalit · Alexander D'Amour · Razieh Nabi -
2021 Poster: Operationalizing Complex Causes: A Pragmatic View of Mediation »
Limor Gultchin · David Watson · Matt J. Kusner · Ricardo Silva -
2021 Poster: Valid Causal Inference with (Some) Invalid Instruments »
Jason Hartford · Victor Veitch · Dhanya Sridhar · Kevin Leyton-Brown -
2021 Spotlight: Operationalizing Complex Causes: A Pragmatic View of Mediation »
Limor Gultchin · David Watson · Matt J. Kusner · Ricardo Silva -
2021 Spotlight: Valid Causal Inference with (Some) Invalid Instruments »
Jason Hartford · Victor Veitch · Dhanya Sridhar · Kevin Leyton-Brown -
2021 Poster: Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding »
Andrew Jesson · Sören Mindermann · Yarin Gal · Uri Shalit -
2021 Spotlight: Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding »
Andrew Jesson · Sören Mindermann · Yarin Gal · Uri Shalit -
2021 Poster: Conditional Distributional Treatment Effect with Kernel Conditional Mean Embeddings and U-Statistic Regression »
Junhyung Park · Uri Shalit · Bernhard Schölkopf · Krikamol Muandet -
2021 Poster: Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction »
Afsaneh Mastouri · Yuchen Zhu · Limor Gultchin · Anna Korba · Ricardo Silva · Matt J. Kusner · Arthur Gretton · Krikamol Muandet -
2021 Spotlight: Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction »
Afsaneh Mastouri · Yuchen Zhu · Limor Gultchin · Anna Korba · Ricardo Silva · Matt J. Kusner · Arthur Gretton · Krikamol Muandet -
2021 Spotlight: Conditional Distributional Treatment Effect with Kernel Conditional Mean Embeddings and U-Statistic Regression »
Junhyung Park · Uri Shalit · Bernhard Schölkopf · Krikamol Muandet -
2021 Poster: Understanding Failures in Out-of-Distribution Detection with Deep Generative Models »
Lily Zhang · Mark Goldstein · Rajesh Ranganath -
2021 Spotlight: Understanding Failures in Out-of-Distribution Detection with Deep Generative Models »
Lily Zhang · Mark Goldstein · Rajesh Ranganath -
2020 Poster: Robust Learning with the Hilbert-Schmidt Independence Criterion »
Daniel Greenfeld · Uri Shalit -
2019 Poster: Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops »
Limor Gultchin · Genevieve Patterson · Nancy Baym · Nathaniel Swinger · Adam Kalai -
2019 Oral: Humor in Word Embeddings: Cockamamie Gobbledegook for Nincompoops »
Limor Gultchin · Genevieve Patterson · Nancy Baym · Nathaniel Swinger · Adam Kalai