Inner-layer Token Self-Modulation as Another Scaling Axis for LLMs
Abstract
LLMs have traditionally scaled along dense dimensions, where performance gains are coupled to near-linear increases in computational cost. While MoE decouples capacity from compute, it introduces large memory overhead and hardware efficiency challenges. To overcome these limitations, we propose token-indexed parameters as a novel, orthogonal scaling axis that decouples model capacity from FLOPs. Specifically, we introduce ReToken and MoRT, which augment Transformer layers with modulation vectors retrieved from auxiliary embedding tables. These vectors modulate the backbone via lightweight, element-wise operations, incurring negligible FLOPs overhead. Extensive experiments on both dense and MoE backbones, spanning 190M to 9.8B parameters, demonstrate that our approach consistently reduces validation loss and significantly improves downstream task performance (e.g., +7.3 on ARC-C, +6.3 on GSM8K). Rigorous isoFLOPs analysis further confirms that MoRT fundamentally shifts the quality–compute Pareto frontier, achieving comparable model quality with 35\% less compute relative to vanilla MoE architectures, and we validate that token-indexed parameters exhibit predictable power-law scaling behavior. Moreover, our efficient implementation ensures that the overhead introduced by ReToken and MoRT remains marginal.
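The core mechanism described in the abstract — per-token modulation vectors retrieved from an auxiliary embedding table and applied element-wise — can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' implementation; the class name `TokenModulation` and the near-identity initialization are hypothetical choices for exposition.

```python
import numpy as np

class TokenModulation:
    """Token-indexed modulation: each token id retrieves a vector from an
    auxiliary embedding table, which gates hidden states element-wise.
    The table adds vocab_size * hidden_dim parameters (capacity), but
    retrieval is a lookup and gating is element-wise, so the added FLOPs
    are negligible relative to the backbone's matrix multiplications."""

    def __init__(self, vocab_size, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical initialization near 1.0 so modulation starts
        # close to the identity and does not disturb the backbone.
        self.table = 1.0 + 0.01 * rng.standard_normal((vocab_size, hidden_dim))

    def __call__(self, hidden, token_ids):
        # hidden:    (seq_len, hidden_dim) activations from a Transformer layer
        # token_ids: (seq_len,) integer ids indexing the auxiliary table
        return hidden * self.table[token_ids]  # element-wise modulation

mod = TokenModulation(vocab_size=50_000, hidden_dim=8)
h = np.ones((4, 8))
out = mod(h, np.array([1, 2, 2, 3]))  # same token id -> same modulation vector
```

Because the retrieved vectors depend only on token identity, capacity grows with the table size while per-token compute stays essentially flat — the property the abstract frames as an orthogonal scaling axis.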