Tutorial

Foundations of Data-efficient Machine Learning

Siddharth Joshi · Baharan Mirzasoleiman

Hall A8
Mon 22 Jul 4 a.m. PDT — 6 a.m. PDT

Abstract:

Over the last decade, machine learning models have achieved remarkable success by learning from large amounts of data. This is best exemplified by the recent rise of foundation models that are trained on billions of examples. Training on massive data, however, depends on exceptionally large and expensive computational resources, and incurs substantial financial and environmental costs due to its significant energy consumption. To reduce these costs, there has been a recent surge of interest in data-efficient learning techniques that train machine learning models on smaller subsets of carefully chosen training examples. The field, however, is filled with many heuristics that at times seem contradictory, and it has become increasingly diverse and difficult for newcomers to grasp. The goal of this tutorial is to provide a unifying perspective by discussing recent theoretically rigorous approaches to data-efficient machine learning. We will discuss rigorous techniques for data-efficient supervised learning and for self-supervised contrastive pre-training. Then, we will focus on foundation models and discuss data selection for (pre-)training large vision-language models, such as CLIP. We will conclude by discussing challenges and providing guidelines for data-efficient training of large language models (LLMs).
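To make the core idea concrete, below is a minimal, hypothetical sketch of one subset-selection scheme in the spirit of coreset methods covered by this line of work: greedy facility-location (medoid) selection over per-example features, with per-example weights so a weighted loss over the subset approximates the full-data loss. The function name, the use of plain cosine similarity, and the toy data are illustrative assumptions, not the presenters' exact algorithm.

```python
import numpy as np

def facility_location_coreset(features, k):
    """Greedily pick k examples whose similarity best 'covers' the full
    dataset (facility-location objective), a common proxy in coreset-based
    data-efficient training. Returns indices and per-example weights."""
    # Normalize rows, then map cosine similarity into [0, 1] so the
    # facility-location objective is monotone and nonnegative.
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    S = (X @ X.T + 1.0) / 2.0
    n = S.shape[0]
    cover = np.zeros(n)            # best similarity of each example to the subset so far
    selected = []
    for _ in range(k):
        # Marginal gain of candidate j: sum_i max(cover[i], S[j, i]) - sum_i cover[i]
        gains = np.maximum(S, cover[None, :]).sum(axis=1) - cover.sum()
        gains[selected] = -np.inf  # never re-select an element
        j = int(np.argmax(gains))
        selected.append(j)
        cover = np.maximum(cover, S[j])
    # Weight each selected example by how many points it covers best, so a
    # weighted sum of subset gradients approximates the full-data gradient.
    assign = S[np.array(selected)].argmax(axis=0)
    weights = np.bincount(assign, minlength=k)
    return np.array(selected), weights

# Toy usage: pick a weighted coreset of 10 from 200 random feature vectors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 32))
idx, w = facility_location_coreset(feats, k=10)
print(idx, w)
```

In practice, methods of this family apply such selection to per-example gradient (or embedding) features rather than raw inputs; the sketch above only illustrates the greedy submodular-maximization mechanics.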