## Datamodels: Understanding Predictions with Data and Data with Predictions

### Andrew Ilyas · Sung Min (Sam) Park · Logan Engstrom · Guillaume Leclerc · Aleksander Madry

##### Hall E #526

Keywords: [ MISC: General Machine Learning Techniques ] [ DL: Robustness ] [ DL: Everything Else ]

Wed 20 Jul 3:30 p.m. PDT — 5:30 p.m. PDT

Spotlight presentation: Deep Learning/Optimization
Wed 20 Jul 1:30 p.m. PDT — 3 p.m. PDT

Abstract: We present a conceptual framework, *datamodeling*, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a *datamodel* is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$, predicts the outcome of training a model on $S'$ and evaluating on $x$, using only information about which examples of $S$ are contained in $S'$. Despite the complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple *linear* datamodels successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space.
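To make the idea of a linear datamodel concrete, the following is a minimal sketch rather than the authors' implementation. It assumes subsets are drawn by including each training example independently, encodes each subset as a 0/1 inclusion vector, and fits the datamodel by ordinary least squares. Because actually training a network on each subset is expensive, `synthetic_outcome` is a hypothetical stand-in for "train on $S'$ and evaluate on $x$"; all function names and parameter values here are illustrative assumptions.

```python
import numpy as np

def sample_subset_masks(n_train, n_subsets, frac, rng):
    """Return an (n_subsets, n_train) 0/1 matrix.

    Row i marks which training examples belong to subset i; each example
    is included independently with probability `frac` (an assumption).
    """
    return (rng.random((n_subsets, n_train)) < frac).astype(float)

def synthetic_outcome(mask, true_weights, rng, noise=0.1):
    """Hypothetical stand-in for training on a subset and evaluating on x.

    A real pipeline would train a model on the selected examples and return,
    for instance, its output on the target example.
    """
    return float(mask @ true_weights + noise * rng.standard_normal())

def fit_linear_datamodel(masks, outcomes):
    """Least-squares fit of outcome ~ masks @ theta + bias."""
    design = np.hstack([masks, np.ones((masks.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, outcomes, rcond=None)
    return coef[:-1], coef[-1]

def predict(theta, bias, mask):
    """Predict the outcome for a subset given only its inclusion vector."""
    return float(mask @ theta + bias)

rng = np.random.default_rng(0)
n_train = 20
true_weights = rng.normal(size=n_train)  # synthetic ground truth

masks = sample_subset_masks(n_train, n_subsets=500, frac=0.5, rng=rng)
outcomes = np.array([synthetic_outcome(m, true_weights, rng) for m in masks])
theta, bias = fit_linear_datamodel(masks, outcomes)

# Counterfactual query: predicted change when training example 3 is removed
# from the full training set.
full = np.ones(n_train)
without_3 = full.copy()
without_3[3] = 0.0
effect_of_removing_3 = predict(theta, bias, without_3) - predict(theta, bias, full)
```

In this sketch, each coefficient `theta[j]` is the datamodel's estimate of how much including training example `j` shifts the outcome on the target, so the counterfactual query at the end reduces to `-theta[3]`.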
