Provable Sample Efficiency of Curriculum Post-Training for Transformer Reasoning
Abstract
Curriculum techniques in the post-training stage of large language models (LLMs) have recently been empirically observed to outperform non-curriculum approaches in improving reasoning performance, yet a principled understanding of their effectiveness and limitations remains incomplete. To bridge this gap, we develop an abstract theoretical framework and identify sufficient conditions under which curriculum post-training yields exponential improvements in sample complexity. To substantiate this framework, we model the base model’s Chain-of-Thought generation as a state-conditioned autoregressive reasoning tree, and formalize curriculum subtasks as either depth-increasing curricula that progressively extend reasoning horizons or hint-decreasing curricula that gradually remove partial hints. Our analysis shows that, under outcome-only reward signals, reinforcement learning fine-tuning with either curriculum strategy achieves high accuracy with polynomial sample complexity, whereas its non-curriculum counterpart encounters an exponential complexity bottleneck. We further establish analogous guarantees for test-time scaling, demonstrating that curriculum-aware querying strategies reduce both reward oracle complexity and sampling cost from exponential to polynomial order. Empirical simulations support our theoretical findings.