
A Standardized Data Collection Toolkit for Model Benchmarking
Avanika Narayan · Piero Molino · Karan Goel · Christopher Re

Data is central to the machine learning (ML) pipeline. While most existing work focuses on challenges with the data used as input for model training, this work emphasizes the data generated during model training and evaluation. We refer to this type of data, which is useful for robust evaluation and model benchmarking, as “benchmarking metadata”. As ML has become ubiquitous across domains and deployment settings, various communities (e.g., industry practitioners) are interested in benchmarking models across the tasks and objectives they personally value. However, this personalized benchmarking requires a framework that enables multi-objective evaluation (by collecting benchmarking metadata such as performance metrics and training statistics) and ensures fair model comparisons (by controlling for confounding variables). To address these needs, we introduce the open-source Ludwig Benchmarking Toolkit (LBT), a system that enables the standardized and personalized collection of benchmarking metadata, with automated methods to remove confounding factors. We demonstrate how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis of text classification across 7 models and 9 datasets. Using the benchmarking metadata generated by LBT, we explore trade-offs between inference latency and performance, relationships between dataset attributes and performance, and the effects of pretraining on convergence and robustness.

Author Information

Avanika Narayan (Stanford University)
Piero Molino (Uber AI)
Karan Goel (Stanford)
Christopher Re (Stanford)
