

Spotlight

Datamodels: Understanding Predictions with Data and Data with Predictions

Andrew Ilyas · Sung Min (Sam) Park · Logan Engstrom · Guillaume Leclerc · Aleksander Madry

Room 309

Abstract: We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$, predicts the outcome of training a model on $S'$ and evaluating on $x$, using only information about which examples of $S$ are contained in $S'$. Despite the complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
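
To make the idea of a linear datamodel concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: each subset $S'$ is encoded as a 0/1 indicator vector over $S$, the "model outputs" are synthetic numbers standing in for the result of actually training on $S'$ and evaluating on $x$, and the fit uses plain ridge-regularized least squares as a stand-in estimator. The names `sample_masks`, `fit_linear_datamodel`, and `predict` are hypothetical.

```python
import numpy as np


def sample_masks(n_train, n_subsets, alpha, rng):
    """Draw random subsets S' of S as 0/1 indicator vectors.

    Each training example is kept independently with probability alpha.
    """
    return (rng.random((n_subsets, n_train)) < alpha).astype(np.float64)


def fit_linear_datamodel(masks, outputs, ridge=1e-3):
    """Fit weights w and bias b so that masks @ w + b approximates outputs.

    Assumption: ridge-regularized least squares is used here for simplicity;
    it is a stand-in, not necessarily the estimator used in the paper.
    """
    n_subsets = masks.shape[0]
    X = np.hstack([masks, np.ones((n_subsets, 1))])  # append a bias column
    reg = ridge * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    theta = np.linalg.solve(X.T @ X + reg, X.T @ outputs)
    return theta[:-1], theta[-1]


def predict(w, b, mask):
    """Predicted output on the target example x for the subset encoded by mask."""
    return float(mask @ w + b)


# Synthetic stand-in for "train on S', evaluate on x": a noisy linear response.
rng = np.random.default_rng(0)
n_train, n_subsets = 20, 2000
true_w = rng.normal(size=n_train)
masks = sample_masks(n_train, n_subsets, alpha=0.5, rng=rng)
outputs = masks @ true_w + 0.1 + 0.01 * rng.normal(size=n_subsets)

w, b = fit_linear_datamodel(masks, outputs)

# Counterfactual: predicted change in output when example 0 is dropped from S.
full = np.ones(n_train)
without_0 = full.copy()
without_0[0] = 0.0
effect_of_removing_0 = predict(w, b, without_0) - predict(w, b, full)
```

Because the datamodel is linear in the indicator vector, the predicted effect of removing a single training example is just the negative of that example's weight, which is what `effect_of_removing_0` computes. In a real setting, `outputs` would come from repeatedly training models on sampled subsets rather than from a synthetic formula.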
