Tailoring the Training: Difficulty-Aware Learning Strategy Allocation for Large Language Models
Abstract
Although reinforcement learning (RL) enhances the reasoning capabilities of large language models (LLMs), it learns primarily from the model's self-generated distribution, limiting its ability to acquire reasoning skills beyond the model's initial knowledge. To overcome this, we propose a Difficulty-Aware Learning Strategy Allocation (DALSA) framework, which adaptively assigns an appropriate learning strategy to each sample based on its difficulty signals. DALSA is built on the key insight that samples beyond the model's knowledge scope are better addressed through supervised fine-tuning (SFT), samples within the knowledge boundary but insufficiently mastered benefit more from RL, and well-learned samples should be discarded to avoid redundant updates. To realize this principle, we extract a set of difficulty-aware training characteristics and employ a learnable strategy allocator that dynamically determines the optimal learning strategy for each sample from its training dynamics. The allocator and the LLM are optimized alternately, enabling adaptive strategy allocation. Furthermore, two regularization techniques, anti-curriculum weighting and adversarial label smoothing, are integrated to alleviate the respective limitations of RL and SFT, supported by comprehensive theoretical analyses. Extensive experiments on ten LLMs ranging from 1.5B to 70B parameters across diverse tasks show that DALSA consistently outperforms baselines under both full and parameter-efficient fine-tuning settings.
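The allocation principle stated above can be illustrated with a minimal sketch. Note that DALSA employs a *learnable* allocator over difficulty-aware training characteristics; this sketch instead assumes a single hypothetical difficulty signal (a per-sample pass rate) with fixed, illustrative thresholds, purely to make the three-way decision concrete:

```python
def allocate_strategy(pass_rate: float, low: float = 0.05, high: float = 0.9) -> str:
    """Map a sample's difficulty signal to a learning strategy.

    `pass_rate`, `low`, and `high` are hypothetical: the actual framework
    learns this decision from richer training dynamics rather than
    thresholding a single signal.
    """
    if pass_rate < low:
        # Sample lies beyond the model's knowledge scope: supply the
        # reference solution via supervised fine-tuning.
        return "SFT"
    if pass_rate < high:
        # Within the knowledge boundary but insufficiently mastered:
        # reinforce the model's own successful rollouts via RL.
        return "RL"
    # Well-learned sample: skip it to avoid redundant updates.
    return "discard"
```

In this toy form, the three branches correspond exactly to the three cases named in the abstract; the learnable allocator replaces the fixed thresholds with a decision conditioned on each sample's evolving training dynamics.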