Methods for carefully selecting or generating a small set of training data to learn from, i.e., data pruning, coreset selection, and data distillation, have been shown to be effective in reducing the ever-increasing cost of training neural networks. Behind this success are rigorously designed strategies for identifying informative training examples out of large datasets. However, these strategies come with additional computational costs associated with subset selection or data distillation before training begins, and furthermore, many are shown to under-perform random sampling in high data compression regimes. As such, many data pruning, coreset selection, or distillation methods may not reduce 'time-to-accuracy', which has become a critical efficiency measure for training deep neural networks over large datasets. In this work, we revisit a powerful yet overlooked random sampling strategy to address these challenges and introduce an approach called Repeated Sampling of Random Subsets (RSRS, or RS2), where we randomly sample a subset of the training data for each epoch of model training. We test RS2 against thirty state-of-the-art data pruning and data distillation methods across four datasets, including ImageNet. Our results demonstrate that RS2 significantly reduces time-to-accuracy compared to existing techniques. For example, when training on ImageNet in the high-compression regime (less than 10% of the data each epoch), RS2 yields accuracy improvements of up to 29% compared to competing pruning methods while offering a 7x runtime reduction. Beyond the above meta-study, we provide a convergence analysis for RS2 and discuss its generalization capability. The primary goal of our work is to establish RS2 as a competitive baseline for future data selection or distillation techniques aimed at efficient training.
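To make the sampling scheme concrete, below is a minimal PyTorch-style sketch of the per-epoch subset sampling that RS2 describes. This is an illustrative sketch, not the authors' released implementation: the function name train_rs2, its arguments, and the choice of sampling without replacement within each epoch are assumptions made for the example.

    import torch
    from torch.utils.data import DataLoader, Subset

    def train_rs2(model, dataset, optimizer, loss_fn,
                  epochs=100, subset_fraction=0.1, batch_size=256):
        """Sketch of RS2: draw a fresh random subset of the training data each epoch."""
        n = len(dataset)
        k = max(1, int(subset_fraction * n))  # number of examples used per epoch
        for epoch in range(epochs):
            # Re-sample a new random subset (here, without replacement) at the start
            # of every epoch, rather than fixing one subset before training begins.
            indices = torch.randperm(n)[:k].tolist()
            loader = DataLoader(Subset(dataset, indices),
                                batch_size=batch_size, shuffle=True)
            for x, y in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                optimizer.step()

Because a new subset is drawn every epoch, the model eventually sees most of the dataset over the course of training while each epoch costs only a fraction of a full pass, which is why no up-front selection or distillation step is needed.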
Author Information
Patrik Okanovic (ETH Zurich)
Roger Waleffe (University of Wisconsin-Madison)
Vasileios Mageirakos (ETH Zurich)
Konstantinos Nikolakakis (Yale University)
Amin Karbasi (Yale & Google)

Amin Karbasi is currently an assistant professor of Electrical Engineering, Computer Science, and Statistics at Yale University. He has been the recipient of the National Science Foundation (NSF) CAREER Award 2019, Office of Naval Research (ONR) Young Investigator Award 2019, Air Force Office of Scientific Research (AFOSR) Young Investigator Award 2018, DARPA Young Faculty Award 2016, National Academy of Engineering Grainger Award 2017, Amazon Research Award 2018, Google Faculty Research Award 2016, Microsoft Azure Research Award 2016, Simons Research Fellowship 2017, and ETH Research Fellowship 2013. His work has also been recognized with a number of paper awards, including Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2017, International Conference on Artificial Intelligence and Statistics (AISTATS) 2015, IEEE ComSoc Data Storage 2013, International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2011, ACM SIGMETRICS 2010, and IEEE International Symposium on Information Theory (ISIT) 2010 (runner-up). His Ph.D. thesis received the Patrick Denantes Memorial Prize 2013 from the School of Computer and Communication Sciences at EPFL, Switzerland.
Dionysios Kalogerias (Yale University)

I am an assistant professor in the Department of Electrical Engineering (EE) at Yale. My research is in machine learning, reinforcement learning, optimization, signal processing, sequential decision making, and risk. Before joining Yale, I spent one year as an assistant professor in the Department of Electrical and Computer Engineering (ECE) at Michigan State University. Prior to that, I was a postdoctoral researcher in the Department of Electrical and Systems Engineering at the University of Pennsylvania, and before that a postdoctoral research associate in the Department of Operations Research and Financial Engineering (ORFE) at Princeton University. I received my PhD in ECE from Rutgers University.
Nezihe Merve Gürel (TU Delft)
Theodoros Rekatsinas (ETH Zurich)
More from the Same Authors
- 2023: Reward-Based Reinforcement Learning with Risk Constraints
  Jane Lee · Konstantinos Nikolakakis · Dionysios Kalogerias · Amin Karbasi
- 2023 Workshop: DMLR Workshop: Data-centric Machine Learning Research
  Ce Zhang · Praveen Paritosh · Newsha Ardalani · Nezihe Merve Gürel · William Gaviria Rojas · Yang Liu · Rotem Dror · Manil Maskey · Lilith Bat-Leah · Tzu-Sheng Kuo · Luis Oala · Max Bartolo · Ludwig Schmidt · Alicia Parrish · Daniel Kondermann · Najoung Kim
- 2023 Workshop: Knowledge and Logical Reasoning in the Era of Data-driven Learning
  Nezihe Merve Gürel · Bo Li · Theodoros Rekatsinas · Beliz Gunel · Alberto Sangiovanni Vincentelli · Paroma Varma
- 2023: Opening Remarks
  Nezihe Merve Gürel
- 2022: Principal Component Networks: Parameter Reduction Early in Training
  Roger Waleffe · Theodoros Rekatsinas
- 2022 Poster: Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes
  Insu Han · Mike Gartrell · Elvis Dohmatob · Amin Karbasi
- 2022 Oral: Scalable MCMC Sampling for Nonsymmetric Determinantal Point Processes
  Insu Han · Mike Gartrell · Elvis Dohmatob · Amin Karbasi
- 2021 Workshop: Over-parameterization: Pitfalls and Opportunities
  Yasaman Bahri · Quanquan Gu · Amin Karbasi · Hanie Sedghi
- 2021: Greedy and Its Friends
  Amin Karbasi
- 2021 Poster: Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks
  Nezihe Merve Gürel · Xiangyu Qi · Luka Rimanic · Ce Zhang · Bo Li
- 2021 Spotlight: Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks
  Nezihe Merve Gürel · Xiangyu Qi · Luka Rimanic · Ce Zhang · Bo Li
- 2021 Poster: Regularized Submodular Maximization at Scale
  Ehsan Kazemi · Shervin Minaee · Moran Feldman · Amin Karbasi
- 2021 Spotlight: Regularized Submodular Maximization at Scale
  Ehsan Kazemi · Shervin Minaee · Moran Feldman · Amin Karbasi
- 2020: Mode Finding for SLC Distributions via Regularized Submodular Maximization
  Ehsan Kazemi · Amin Karbasi · Moran Feldman
- 2020 Poster: More Data Can Expand The Generalization Gap Between Adversarially Robust and Standard Models
  Lin Chen · Yifei Min · Mingrui Zhang · Amin Karbasi
- 2020 Poster: Streaming Submodular Maximization under a k-Set System Constraint
  Ran Haba · Ehsan Kazemi · Moran Feldman · Amin Karbasi
- 2020 Tutorial: Submodular Optimization: From Discrete to Continuous and Back
  Hamed Hassani · Amin Karbasi
- 2019 Poster: Submodular Maximization beyond Non-negativity: Guarantees, Fast Algorithms, and Applications
  Christopher Harshaw · Moran Feldman · Justin Ward · Amin Karbasi
- 2019 Oral: Submodular Maximization beyond Non-negativity: Guarantees, Fast Algorithms, and Applications
  Christopher Harshaw · Moran Feldman · Justin Ward · Amin Karbasi
- 2019 Poster: Submodular Streaming in All Its Glory: Tight Approximation, Minimum Memory and Low Adaptive Complexity
  Ehsan Kazemi · Marko Mitrovic · Morteza Zadimoghaddam · Silvio Lattanzi · Amin Karbasi
- 2019 Oral: Submodular Streaming in All Its Glory: Tight Approximation, Minimum Memory and Low Adaptive Complexity
  Ehsan Kazemi · Marko Mitrovic · Morteza Zadimoghaddam · Silvio Lattanzi · Amin Karbasi
- 2018 Poster: Decentralized Submodular Maximization: Bridging Discrete and Continuous Settings
  Aryan Mokhtari · Hamed Hassani · Amin Karbasi
- 2018 Poster: Projection-Free Online Optimization with Stochastic Gradient: From Convexity to Submodularity
  Lin Chen · Christopher Harshaw · Hamed Hassani · Amin Karbasi
- 2018 Oral: Projection-Free Online Optimization with Stochastic Gradient: From Convexity to Submodularity
  Lin Chen · Christopher Harshaw · Hamed Hassani · Amin Karbasi
- 2018 Oral: Decentralized Submodular Maximization: Bridging Discrete and Continuous Settings
  Aryan Mokhtari · Hamed Hassani · Amin Karbasi
- 2018 Poster: Scalable Deletion-Robust Submodular Maximization: Data Summarization with Privacy and Fairness Constraints
  Ehsan Kazemi · Morteza Zadimoghaddam · Amin Karbasi
- 2018 Poster: Weakly Submodular Maximization Beyond Cardinality Constraints: Does Randomization Help Greedy?
  Lin Chen · Moran Feldman · Amin Karbasi
- 2018 Poster: Data Summarization at Scale: A Two-Stage Submodular Approach
  Marko Mitrovic · Ehsan Kazemi · Morteza Zadimoghaddam · Amin Karbasi
- 2018 Oral: Data Summarization at Scale: A Two-Stage Submodular Approach
  Marko Mitrovic · Ehsan Kazemi · Morteza Zadimoghaddam · Amin Karbasi
- 2018 Oral: Scalable Deletion-Robust Submodular Maximization: Data Summarization with Privacy and Fairness Constraints
  Ehsan Kazemi · Morteza Zadimoghaddam · Amin Karbasi
- 2018 Oral: Weakly Submodular Maximization Beyond Cardinality Constraints: Does Randomization Help Greedy?
  Lin Chen · Moran Feldman · Amin Karbasi
- 2017 Poster: Differentially Private Submodular Maximization: Data Summarization in Disguise
  Marko Mitrovic · Mark Bun · Andreas Krause · Amin Karbasi
- 2017 Poster: Deletion-Robust Submodular Maximization: Data Summarization with "the Right to be Forgotten"
  Baharan Mirzasoleiman · Amin Karbasi · Andreas Krause
- 2017 Poster: Probabilistic Submodular Maximization in Sub-Linear Time
  Serban A Stan · Morteza Zadimoghaddam · Andreas Krause · Amin Karbasi
- 2017 Talk: Deletion-Robust Submodular Maximization: Data Summarization with "the Right to be Forgotten"
  Baharan Mirzasoleiman · Amin Karbasi · Andreas Krause
- 2017 Talk: Probabilistic Submodular Maximization in Sub-Linear Time
  Serban A Stan · Morteza Zadimoghaddam · Andreas Krause · Amin Karbasi
- 2017 Talk: Differentially Private Submodular Maximization: Data Summarization in Disguise
  Marko Mitrovic · Mark Bun · Andreas Krause · Amin Karbasi