Pretraining Numerical Frequency and Number-Line in Language Models
Mohammed Ibrahim Awad ⋅ Ahmed Elshehaby ⋅ Hilal AlQuabeh ⋅ Alejandro Solozabal
Abstract
Large language models exhibit compressed, non-uniform internal representations of numerical magnitude, but the pretraining factors associated with this geometry remain unclear. We study whether corpus-level integer statistics are related to the learned number-line geometry of pretrained language models. For four documented pretraining corpora, we count integers in $[0,10{,}000]$ and fit a magnitude-frequency power law, $\mathrm{count}(N) \propto N^{\alpha}$, where more negative $\alpha$ indicates steeper decay and less exposure to large magnitudes. For nine corresponding base models, we extract hidden states for numerical prompts, project them onto a one-dimensional number line with PCA, and estimate a scaling factor $\beta$, where smaller $\beta$ indicates stronger compression. We first show that $\beta$ is behaviorally meaningful: models with less compressed number-line geometry achieve higher likelihood-based number-comparison accuracy. We then find that flatter integer-frequency distributions, corresponding to less negative $\alpha$, are associated with larger $\beta$. These results provide correlational evidence that pretraining integer statistics are reflected in the geometry of LLM number representations.
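As an illustration of the magnitude-frequency fit $\mathrm{count}(N) \propto N^{\alpha}$ described above, the following minimal sketch estimates $\alpha$ by ordinary least squares in log-log space. It is not the authors' implementation: the function name `fit_power_law_exponent`, the choice of least-squares regression, the exclusion of $N = 0$ and zero-count integers (whose logarithms are undefined), and the synthetic counts are all assumptions made for the example.

```python
import numpy as np


def fit_power_law_exponent(counts, min_n=1):
    """Estimate alpha in count(N) ~ C * N**alpha by least squares on log-log axes.

    counts[N] is the number of occurrences of integer N in a corpus.
    Integers below min_n and integers with zero count are skipped
    (an assumption of this sketch, since log(0) is undefined).
    Returns (alpha, log_C).
    """
    counts = np.asarray(counts, dtype=float)
    n = np.arange(len(counts), dtype=float)
    mask = (n >= min_n) & (counts > 0)
    # Slope of the fitted line in log-log space is the exponent alpha.
    alpha, log_c = np.polyfit(np.log(n[mask]), np.log(counts[mask]), 1)
    return alpha, log_c


# Synthetic counts over [0, 10000] that follow an exact power law
# with exponent -1.2 (hypothetical data, not taken from any corpus).
n = np.arange(0, 10001, dtype=float)
synthetic_counts = np.zeros_like(n)
synthetic_counts[1:] = 1e6 * n[1:] ** -1.2

alpha, log_c = fit_power_law_exponent(synthetic_counts)
print(f"estimated alpha = {alpha:.3f}")
```

On real corpus counts the fit will be noisy, and a more negative estimate of $\alpha$ would correspond to the steeper decay, and lower exposure to large magnitudes, that the abstract describes.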