Poster in Workshop: Data-centric Machine Learning Research (DMLR): Datasets for Foundation Models

Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

Rylan Schaeffer · Hailey Schoelkopf · Brando Miranda · Gabriel Mukobi · Varun Madan · Adam Ibrahim · Herbie Bradley · Stella Biderman · Sanmi Koyejo

Abstract

Predictable behavior from scaling advanced AI systems is an extremely desirable property. While a well-established literature exists on how pretraining performance scales, the literature on how particular downstream capabilities change with scale is significantly muddier. For instance, previous papers debated the origins of emergent abilities, and more recent work claimed that specific downstream capabilities become predictable only beyond a specific pretraining loss or if aggregated across dozens of benchmarks. In this work, we take a step back and ask: what makes predicting specific downstream capabilities with scale difficult? We identify a critical factor contributing to this difficulty on multiple-choice benchmarks. Using five model families and twelve widely used benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively deteriorates the statistical relationship between performance and scale. We demonstrate that this deterioration is caused by metrics that require comparing the correct answer against a small number of specific incorrect answers. Accurately predicting downstream capabilities therefore requires predicting not just how probability mass concentrates on the correct behavior with scale, but also how probability mass changes on specific incorrect behaviors with scale. We empirically study how probability mass on the correct choice covaries with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for incorrect choices might be achievable. Our work explains why pretraining scaling laws are regarded as more predictable and contributes toward establishing scaling-predictable evaluations of frontier AI models.
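The dependence on incorrect answers described above can be illustrated with a minimal sketch. The abstract does not spell out the individual transformations, so the three steps below (exponentiating the negative log likelihood, renormalizing over the answer choices, and taking an argmax for accuracy) are an assumed, typical multiple-choice scoring pipeline, and the function name `metrics_from_nll` and the example numbers are hypothetical.

```python
import math


def metrics_from_nll(nll_per_choice, correct_index):
    """Turn per-choice negative log likelihoods into progressively coarser metrics.

    Assumed pipeline (not taken from the abstract): NLL -> probability ->
    probability renormalized over the choices -> 0/1 accuracy.
    """
    # Step 1: probability mass each choice receives, from its NLL.
    p_vocab = [math.exp(-nll) for nll in nll_per_choice]

    # Step 2: renormalize over the listed choices only. From here on, the
    # score for the correct answer depends on the incorrect answers too.
    total = sum(p_vocab)
    p_choices = [p / total for p in p_vocab]

    # Step 3: accuracy is 1 only if the correct choice outranks every
    # specific incorrect choice.
    best = max(range(len(p_choices)), key=p_choices.__getitem__)
    accuracy = 1.0 if best == correct_index else 0.0

    return {
        "p_vocab_correct": p_vocab[correct_index],
        "p_choices_correct": p_choices[correct_index],
        "accuracy": accuracy,
    }


# Two hypothetical questions with the SAME NLL on the correct answer
# (index 0) but different NLLs on one incorrect answer.
easy_distractors = metrics_from_nll([2.0, 3.0, 4.0, 5.0], correct_index=0)
hard_distractor = metrics_from_nll([2.0, 1.5, 4.0, 5.0], correct_index=0)
```

In this toy case the probability mass on the correct answer is identical for both questions, yet the renormalized probability and the accuracy differ, because one incorrect choice received more mass. Under these assumptions, predicting accuracy from scale would require predicting the mass on that incorrect choice as well.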
