On the origin of neural scaling laws: from random graphs to natural language
Abstract
Scaling laws have played a major role in modern AI, providing predictive power over how model performance will improve with increasing resources. This has spurred intense interest in their origin, with a common suggestion being that they arise from power laws already present in the data. Here we study scaling laws for transformers trained to predict random walks on graphs with tunable complexity. We show that this simplified setting already yields scaling laws, even in the absence of power laws in the data correlations. We further dial down the complexity of language by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to bigram language models, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, we revisit scaling laws for language modeling, showing that several essential results can be reproduced using 2-layer transformers with a context length of 100, demonstrating an alternative method for obtaining compute-optimal curves, and providing preliminary evidence that maximal update parameterization may be more parameter-efficient than standard parameterization.
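To make the setup concrete, below is a minimal sketch of the kind of data generation the abstract describes: uniform random walks on Erdős-Rényi and Barabási-Albert random graphs, with node ids serving as tokens. The use of networkx, the graph sizes, edge parameters, and the walk length of 100 are illustrative assumptions, not the paper's actual settings.

```python
# Illustrative sketch (not the paper's code): sample uniform random walks on
# random graphs, producing token sequences a transformer could be trained on.
import random
import networkx as nx

def sample_walk(graph, length, rng):
    """Uniform random walk of `length` steps; node ids serve as tokens."""
    node = rng.choice(list(graph.nodes))
    walk = [node]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(node))
        if not neighbors:                      # isolated node: restart elsewhere
            node = rng.choice(list(graph.nodes))
        else:
            node = rng.choice(neighbors)       # uniform step to a neighbor
        walk.append(node)
    return walk

rng = random.Random(0)
er_graph = nx.erdos_renyi_graph(n=1000, p=0.01, seed=0)    # Erdős-Rényi ensemble
ba_graph = nx.barabasi_albert_graph(n=1000, m=3, seed=0)   # scale-free Barabási-Albert

er_sequences = [sample_walk(er_graph, length=100, rng=rng) for _ in range(4)]
ba_sequences = [sample_walk(ba_graph, length=100, rng=rng) for _ in range(4)]
print(er_sequences[0][:10], ba_sequences[0][:10])
```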