arXiv:2202.00622 [stat.ML]

Datamodels: Predicting Predictions from Training Data

Andrew Ilyas, Sung Min Park, Logan Engstrom, Guillaume Leclerc, Aleksander Madry

Published 2022-02-01, Version 1

We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data. For any fixed "target" example $x$, training set $S$, and learning algorithm, a datamodel is a parameterized function $2^S \to \mathbb{R}$ that, for any subset $S' \subset S$ -- using only information about which examples of $S$ are contained in $S'$ -- predicts the outcome of training a model on $S'$ and evaluating on $x$. Despite the potential complexity of the underlying process being approximated (e.g., end-to-end training and evaluation of deep neural networks), we show that even simple linear datamodels can successfully predict model outputs. We then demonstrate that datamodels give rise to a variety of applications, such as: accurately predicting the effect of dataset counterfactuals; identifying brittle predictions; finding semantically similar examples; quantifying train-test leakage; and embedding data into a well-behaved and feature-rich representation space. Data for this paper (including pre-computed datamodels as well as raw predictions from four million trained deep neural networks) is available at https://github.com/MadryLab/datamodels-data.
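
To make the idea of a linear datamodel concrete, the following is a minimal sketch, not the authors' implementation. It assumes each subset $S'$ is encoded as a 0/1 membership vector over $S$ and fits a weight vector plus a bias by ordinary least squares. The function names (`fit_linear_datamodel`, `predict`), the synthetic "training outcomes", and the choice of plain least squares are all illustrative assumptions; in practice the outputs would come from actually training models on each subset.

```python
import numpy as np

def fit_linear_datamodel(masks, outputs):
    """Fit theta and bias so that outputs ~ masks @ theta + bias.

    masks:   (n_subsets, n_train) 0/1 array; row i marks which examples are in S'_i.
    outputs: (n_subsets,) array; outcome of training on S'_i and evaluating on x.
    Plain least squares is an assumption made for this sketch.
    """
    masks = np.asarray(masks, dtype=float)
    outputs = np.asarray(outputs, dtype=float)
    # Append a constant column so the bias is estimated jointly with theta.
    design = np.hstack([masks, np.ones((masks.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coef[:-1], coef[-1]

def predict(theta, bias, mask):
    """Predict the outcome for a subset given only its membership vector."""
    return float(np.asarray(mask, dtype=float) @ theta + bias)

# Synthetic stand-in for "train on S', evaluate on x" (hypothetical data).
rng = np.random.default_rng(0)
n_train, n_subsets = 20, 500
true_theta = rng.normal(size=n_train)
masks = rng.random((n_subsets, n_train)) < 0.5          # random subsets of S
outputs = masks @ true_theta + 0.3 + 0.01 * rng.normal(size=n_subsets)

theta, bias = fit_linear_datamodel(masks, outputs)

# Dataset counterfactual: predicted effect of dropping example 0 from the full set.
full = np.ones(n_train)
without_0 = full.copy()
without_0[0] = 0.0
effect = predict(theta, bias, full) - predict(theta, bias, without_0)
```

Under this linear form, the predicted effect of removing a single training example equals that example's weight, which is what makes such a model convenient for counterfactual questions.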

Categories: stat.ML, cs.CV, cs.LG

Keywords: training data, predicting predictions, million trained deep neural networks, simple linear datamodels, feature-rich representation space

Related articles:

arXiv:2005.07939 [stat.ML] (Published 2020-05-16)

Predicting into unknown space? Estimating the area of applicability of spatial prediction models

Hanna Meyer, Edzer Pebesma

arXiv:1901.08552 [stat.ML] (Published 2019-01-24)

General Supervision via Probabilistic Transformations

Santiago Mazuelas, Aritz Perez

arXiv:2006.09796 [stat.ML] (Published 2020-06-17)

Kernel Alignment Risk Estimator: Risk Prediction from Training Data

Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, Franck Gabriel