Beyond Sunk Costs: Boosting LLM Pre-training Efficiency via Orthogonal Growth of Mixture-of-Experts
Abstract
As the computational demands of pre-training Large Language Models (LLMs) continue to surge, efficient training paradigms become critical. Despite the vast resources already invested in existing pre-trained checkpoints, these assets often remain under-leveraged due to architectural limitations. We introduce an "orthogonal growth" strategy that recycles such checkpoints by strategically expanding their parameters before continued training. Our method grows converged Mixture-of-Experts (MoE) models along two orthogonal dimensions: interpositional layer copying to increase depth, and noisy expert duplication to expand width. Through extensive scaling-law analysis, we demonstrate a strong positive correlation between the "sunk cost" (the compute already invested in a checkpoint) and the final accuracy of the grown model. Empirical results on models with up to 70B parameters trained on up to 1T tokens show that our recycling approach yields a 10.6\% accuracy improvement over training from scratch under an identical additional compute budget. This work provides a cost-effective blueprint for sustainable large-scale LLM development.
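To make the two growth operations concrete, below is a minimal PyTorch sketch of depth growth via interpositional layer copying and width growth via noisy expert duplication. The function names, the noise scale, and the choice to duplicate every layer and expert are illustrative assumptions rather than details taken from the paper; in practice the MoE router's output dimension would also need to be expanded to address the new experts.

```python
import copy
import torch
import torch.nn as nn

def grow_depth(layers: nn.ModuleList, positions=None) -> nn.ModuleList:
    """Interpositional layer copying: insert a copy of each selected layer
    directly after the original, increasing depth at the chosen positions."""
    grown = []
    for i, layer in enumerate(layers):
        grown.append(layer)
        if positions is None or i in positions:
            grown.append(copy.deepcopy(layer))  # identical copy; diverges during continued training
    return nn.ModuleList(grown)

def grow_width(experts: nn.ModuleList, noise_std: float = 1e-2) -> nn.ModuleList:
    """Noisy expert duplication: clone each expert and perturb the clone's
    weights so the two copies can specialize under further training."""
    grown = []
    for expert in experts:
        grown.append(expert)
        clone = copy.deepcopy(expert)
        with torch.no_grad():
            for p in clone.parameters():
                p.add_(noise_std * torch.randn_like(p))  # small Gaussian perturbation
        grown.append(clone)
    return nn.ModuleList(grown)

# Example: double a 4-layer stack and an 8-expert MoE block.
layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(4)])
experts = nn.ModuleList([nn.Linear(64, 64) for _ in range(8)])
deeper = grow_depth(layers)   # 8 layers
wider = grow_width(experts)   # 16 experts
```

The noise on duplicated experts breaks the symmetry between a clone and its source; duplicated layers need no such perturbation, since their distinct positions in the stack already give them distinct gradients.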