Timezone: »
When facing data with imbalanced classes or groups, practitioners follow an intriguing strategy to achieve best results. They throw away examples until the classes or groups are balanced in size, and then perform empirical risk minimization on the reduced training set. This opposes common wisdom in learning theory, where the expected error is supposed to decrease as the dataset grows in size. In this work, we leverage extreme value theory to address this apparent contradiction. Our results show that the tails of the data distribution play an important role in determining the worst-group-accuracy of linear classifiers. When learning on data with heavy tails, throwing away data restores the geometric symmetry of the resulting classifier, and therefore improves its worst-group generalization.
Author Information
Kamalika Chaudhuri (UCSD, Meta AI Research, and FAIR)
Kartik Ahuja (FAIR (Meta AI))
Martin Arjovsky (New York University)
David Lopez-Paz (Facebook AI Research)
Related Events (a corresponding poster, oral, or spotlight)
-
2023 Poster: Why does Throwing Away Data Improve Worst-Group Error? »
Tue. Jul 25th 09:00 -- 11:30 PM Room Exhibit Hall 1 #123
More from the Same Authors
-
2020 : On the Equivalence of Bi-Level Optimization and Game-Theoretic Formulations of Invariant Risk Minimization »
Kartik Ahuja -
2021 : Understanding Instance-based Interpretability of Variational Auto-Encoders »
· Zhifeng Kong · Kamalika Chaudhuri -
2021 : Privacy Amplification by Bernoulli Sampling »
Jacob Imola · Kamalika Chaudhuri -
2021 : A Shuffling Framework For Local Differential Privacy »
Casey M Meehan · Amrita Roy Chowdhury · Kamalika Chaudhuri · Somesh Jha -
2021 : Privacy Amplification by Subsampling in Time Domain »
Tatsuki Koga · Casey M Meehan · Kamalika Chaudhuri -
2022 : Understanding Rare Spurious Correlations in Neural Networks »
Yao-Yuan Yang · Chi-Ning Chou · Kamalika Chaudhuri -
2023 : Cross-Risk Minimization: Inferring Groups Information for Improved Generalization »
Mohammad Pezeshki · Diane Bouchacourt · Mark Ibrahim · Nicolas Ballas · Pascal Vincent · David Lopez-Paz -
2023 : Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection »
Vitória Barin-Pacela · Kartik Ahuja · Simon Lacoste-Julien · Pascal Vincent -
2023 : Machine Learning with Feature Differential Privacy »
Saeed Mahloujifar · Chuan Guo · G. Edward Suh · Kamalika Chaudhuri -
2023 : Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection »
Vitória Barin-Pacela · Kartik Ahuja · Simon Lacoste-Julien · Pascal Vincent -
2023 : A Closer Look at In-Context Learning under Distribution Shifts »
Kartik Ahuja · David Lopez-Paz -
2023 : Panel Discussion »
Peter Kairouz · Song Han · Kamalika Chaudhuri · Florian Tramer -
2023 : Kamalika Chaudhuri »
Kamalika Chaudhuri -
2023 Poster: Privacy-Aware Compression for Federated Learning Through Numerical Mechanism Design »
Chuan Guo · Kamalika Chaudhuri · Pierre Stock · Michael Rabbat -
2023 Poster: Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization »
Alexandre Rame · Kartik Ahuja · Jianyu Zhang · Matthieu Cord · Leon Bottou · David Lopez-Paz -
2023 Oral: Interventional Causal Representation Learning »
Kartik Ahuja · Divyat Mahajan · Yixin Wang · Yoshua Bengio -
2023 Poster: Data-Copying in Generative Models: A Formal Framework »
Robi Bhattacharjee · Sanjoy Dasgupta · Kamalika Chaudhuri -
2023 Poster: A Two-Stage Active Learning Algorithm for k-Nearest Neighbors »
Nicholas Rittler · Kamalika Chaudhuri -
2023 Poster: Interventional Causal Representation Learning »
Kartik Ahuja · Divyat Mahajan · Yixin Wang · Yoshua Bengio -
2023 : Identifiability of Discretized Latent Coordinate Systems via Density Landmarks Detection »
Vitória Barin-Pacela · Kartik Ahuja · Simon Lacoste-Julien · Pascal Vincent -
2022 : Invited talks I, Q/A »
Bernhard Schölkopf · David Lopez-Paz -
2022 : Invited Talks 1, Bernhard Schölkopf and David Lopez-Paz »
Bernhard Schölkopf · David Lopez-Paz -
2022 Poster: Rich Feature Construction for the Optimization-Generalization Dilemma »
Jianyu Zhang · David Lopez-Paz · Léon Bottou -
2022 Spotlight: Rich Feature Construction for the Optimization-Generalization Dilemma »
Jianyu Zhang · David Lopez-Paz · Léon Bottou -
2022 Poster: Thompson Sampling for Robust Transfer in Multi-Task Bandits »
Zhi Wang · Chicheng Zhang · Kamalika Chaudhuri -
2022 Spotlight: Thompson Sampling for Robust Transfer in Multi-Task Bandits »
Zhi Wang · Chicheng Zhang · Kamalika Chaudhuri -
2022 Poster: Bounding Training Data Reconstruction in Private (Deep) Learning »
Chuan Guo · Brian Karrer · Kamalika Chaudhuri · Laurens van der Maaten -
2022 Oral: Bounding Training Data Reconstruction in Private (Deep) Learning »
Chuan Guo · Brian Karrer · Kamalika Chaudhuri · Laurens van der Maaten -
2021 : Discussion Panel #2 »
Bo Li · Nicholas Carlini · Andrzej Banburski · Kamalika Chaudhuri · Will Xiao · Cihang Xie -
2021 : Invited Talk #9 »
Kamalika Chaudhuri -
2021 : Invited Talk: Kamalika Chaudhuri »
Kamalika Chaudhuri -
2021 : Invited Talk: Kamalika Chaudhuri »
Kamalika Chaudhuri -
2021 : Live Panel Discussion »
Thomas Dietterich · Chelsea Finn · Kamalika Chaudhuri · Yarin Gal · Uri Shalit -
2021 Poster: Sample Complexity of Robust Linear Classification on Separated Data »
Robi Bhattacharjee · Somesh Jha · Kamalika Chaudhuri -
2021 Poster: Can Subnetwork Structure Be the Key to Out-of-Distribution Generalization? »
Dinghuai Zhang · Kartik Ahuja · Yilun Xu · Yisen Wang · Aaron Courville -
2021 Oral: Can Subnetwork Structure Be the Key to Out-of-Distribution Generalization? »
Dinghuai Zhang · Kartik Ahuja · Yilun Xu · Yisen Wang · Aaron Courville -
2021 Spotlight: Sample Complexity of Robust Linear Classification on Separated Data »
Robi Bhattacharjee · Somesh Jha · Kamalika Chaudhuri -
2021 Poster: Connecting Interpretability and Robustness in Decision Trees through Separation »
Michal Moshkovitz · Yao-Yuan Yang · Kamalika Chaudhuri -
2021 Spotlight: Connecting Interpretability and Robustness in Decision Trees through Separation »
Michal Moshkovitz · Yao-Yuan Yang · Kamalika Chaudhuri -
2020 Workshop: Workshop on Continual Learning »
Haytham Fayek · Arslan Chaudhry · David Lopez-Paz · Eugene Belilovsky · Jonathan Richard Schwarz · Marc Pickett · Rahaf Aljundi · Sayna Ebrahimi · Razvan Pascanu · Puneet Dokania -
2020 Poster: When are Non-Parametric Methods Robust? »
Robi Bhattacharjee · Kamalika Chaudhuri -
2020 Poster: Invariant Risk Minimization Games »
Kartik Ahuja · Karthikeyan Shanmugam · Kush Varshney · Amit Dhurandhar -
2019 Poster: Manifold Mixup: Better Representations by Interpolating Hidden States »
Vikas Verma · Alex Lamb · Christopher Beckham · Amir Najafi · Ioannis Mitliagkas · David Lopez-Paz · Yoshua Bengio -
2019 Poster: First-Order Adversarial Vulnerability of Neural Networks and Input Dimension »
Carl-Johann Simon-Gabriel · Yann Ollivier · Leon Bottou · Bernhard Schölkopf · David Lopez-Paz -
2019 Oral: Manifold Mixup: Better Representations by Interpolating Hidden States »
Vikas Verma · Alex Lamb · Christopher Beckham · Amir Najafi · Ioannis Mitliagkas · David Lopez-Paz · Yoshua Bengio -
2019 Oral: First-Order Adversarial Vulnerability of Neural Networks and Input Dimension »
Carl-Johann Simon-Gabriel · Yann Ollivier · Leon Bottou · Bernhard Schölkopf · David Lopez-Paz -
2019 Talk: Opening Remarks »
Kamalika Chaudhuri · Ruslan Salakhutdinov -
2018 Poster: Active Learning with Logged Data »
Songbai Yan · Kamalika Chaudhuri · Tara Javidi -
2018 Poster: Analyzing the Robustness of Nearest Neighbors to Adversarial Examples »
Yizhen Wang · Somesh Jha · Kamalika Chaudhuri -
2018 Oral: Active Learning with Logged Data »
Songbai Yan · Kamalika Chaudhuri · Tara Javidi -
2018 Oral: Analyzing the Robustness of Nearest Neighbors to Adversarial Examples »
Yizhen Wang · Somesh Jha · Kamalika Chaudhuri -
2018 Poster: Optimizing the Latent Space of Generative Networks »
Piotr Bojanowski · Armand Joulin · David Lopez-Paz · Arthur Szlam -
2018 Oral: Optimizing the Latent Space of Generative Networks »
Piotr Bojanowski · Armand Joulin · David Lopez-Paz · Arthur Szlam -
2017 Workshop: Picky Learners: Choosing Alternative Ways to Process Data. »
Corinna Cortes · Kamalika Chaudhuri · Giulia DeSalvo · Ningshan Zhang · Chicheng Zhang -
2017 Poster: Active Heteroscedastic Regression »
Kamalika Chaudhuri · Prateek Jain · Nagarajan Natarajan -
2017 Poster: Wasserstein Generative Adversarial Networks »
Martin Arjovsky · Soumith Chintala · Léon Bottou -
2017 Talk: Active Heteroscedastic Regression »
Kamalika Chaudhuri · Prateek Jain · Nagarajan Natarajan -
2017 Talk: Wasserstein Generative Adversarial Networks »
Martin Arjovsky · Soumith Chintala · Léon Bottou