A Theory of Data Acquisition and Pricing at Scale
Abstract
Data plays an invaluable role in large-scale ML training pipelines. Multiple factors, including the need to incentivize the creation of high-quality data and efforts to compensate creative data work, have led to increased interest in data {\em pricing}. Data pricing mechanisms seek to establish a market in which data providers are compensated based (in part) on the value of their data to the data buyer (e.g., a frontier AI lab). However, assessing the exact value that each provider's data adds to the data buyer's objective requires repeated re-training, which is infeasible in practice. Our work studies {\em data pricing under compute constraints}, a setting in which data buyers cannot make optimal data acquisition decisions because their compute is limited. Inspired by existing practice in the field of data selection, we propose a model for this problem called ``pricing with an attribution oracle'' and provide a theoretical and empirical analysis of compute-efficient acquisition and pricing.