AC-ODM: Actor–Critic Online Data Mixing for Sample-Efficient LLM Pretraining
Jing Ma ⋅ Chenhao Dang ⋅ Mingjie Liao
Abstract
Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility across diverse pipelines. We introduce \textbf{Actor--Critic Online Data Mixing (AC-ODM)}, which casts data mixing as a reinforcement learning problem with a parameterized policy that we theoretically prove acts as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a \textbf{proxy mode} for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target; and (ii) a \textbf{non-proxy mode} for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66\% fewer training steps than competitive baselines, delivering a 27.5\% relative improvement in MMLU accuracy and a 2.23$\times$ higher pass@1 on HumanEval, while incurring a negligible ($\sim$0.4\%) per-step wall-clock increase and only 2\% additional memory overhead.
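To make the actor--critic framing concrete, the following is a minimal sketch of how a parameterized mixing policy of this kind could be trained, under assumptions not taken from the paper: the state is a vector of per-domain validation losses, the actor emits a categorical distribution over $K$ data domains, the critic estimates the expected return of the current state, and the reward is a stand-in signal such as the validation-loss reduction after one LLM update. All names (\texttt{Actor}, \texttt{Critic}, \texttt{reward\_fn}) are hypothetical illustrations, not the authors' released code.

\begin{verbatim}
# Hypothetical sketch of an actor-critic data-mixing policy (not the
# authors' implementation). State: per-domain validation losses.
# Reward: supplied by the caller, e.g. validation-loss reduction.
import torch
import torch.nn as nn

K = 8  # number of pretraining domains (assumption)

class Actor(nn.Module):
    def __init__(self, state_dim, k):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, k))
    def forward(self, state):
        # Logits over domains define the mixing distribution.
        return torch.distributions.Categorical(logits=self.net(state))

class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
    def forward(self, state):
        return self.net(state).squeeze(-1)

state_dim = K  # state = per-domain validation losses (assumption)
actor, critic = Actor(state_dim, K), Critic(state_dim)
opt = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def mixing_step(state, reward_fn):
    dist = actor(state)
    domain = dist.sample()       # domain to draw the next batch from
    reward = reward_fn(domain)   # e.g., validation-loss drop (scalar)
    value = critic(state)
    advantage = (reward - value).detach()
    # Policy-gradient term for the actor, squared error for the critic.
    loss = -dist.log_prob(domain) * advantage + (reward - value).pow(2)
    opt.zero_grad(); loss.backward(); opt.step()
    return domain
\end{verbatim}

In this sketch, a single optimizer updates both networks each step; the advantage is detached so that the critic's value estimate is trained only by the squared-error term. In the proxy mode described above, such a policy would be trained alongside a small model and then reused to schedule domain sampling for the larger target.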