D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning
白 寅岐 ⋅ Tong Xialiang ⋅ Jie Wang ⋅ Hongyu Liu ⋅ Longdi Pan ⋅ Jiashuo Li ⋅ Zehao Wang ⋅ Jianye Hao ⋅ Mingxuan Yuan ⋅ Feng Wu
Abstract
Asynchronous reinforcement learning (RL) has shown notable success in accelerating the post-training of large language models (LLMs). However, its decoupled data generation and training paradigm introduces a fundamental distributional mismatch between data generated by stale behavior policies and the current policy, leading to unstable training and degraded performance. To address this challenge, we propose D-ARL, a **D**istribution-matched **A**synchronous **R**einforcement **L**earning framework that selects high-quality asynchronous samples whose distributions are well aligned with the current policy for policy optimization. Specifically, D-ARL maintains a replay buffer that collects samples from the most recent $K$ behavior policies and proposes a variance-guided metric to select distribution-matched data. During training, D-ARL introduces a multi-behavior policy optimization algorithm that leverages the multi-source nature of the selected samples for policy updates. Experiments on six widely used reasoning benchmarks show that D-ARL outperforms state-of-the-art asynchronous methods, achieving an average improvement of 6.4% in reasoning performance and 34.7% in sample efficiency.
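To make the sample-selection idea concrete, below is a minimal sketch (not the authors' implementation) of a replay buffer over the $K$ most recent behavior policies with a hypothetical variance-guided filter. The abstract does not specify the exact metric; this sketch assumes the variance of importance ratios between the current policy and the behavior policy as the selection criterion, and the names `AsyncReplayBuffer`, `select_matched`, and `max_ratio_var` are illustrative assumptions.

```python
# Sketch only: replay buffer over the K newest behavior policies, plus a
# hypothetical variance-guided filter that keeps samples whose importance
# ratios w.r.t. the current policy have low variance (a proxy for
# distribution match). Metric and API names are assumptions, not from the paper.
from collections import deque
import numpy as np


class AsyncReplayBuffer:
    def __init__(self, k_policies: int):
        # One bucket of samples per behavior-policy version; only the K newest are kept.
        self.buckets = deque(maxlen=k_policies)

    def add_generation(self, policy_version: int, samples: list[dict]):
        # Each sample stores its token log-probs under the behavior policy that produced it.
        self.buckets.append({"version": policy_version, "samples": samples})

    def select_matched(self, current_logprob_fn, max_ratio_var: float = 0.1):
        # Keep samples whose per-token ratios pi_current / pi_behavior have low variance;
        # high variance is treated here as a sign of distributional mismatch.
        selected = []
        for bucket in self.buckets:
            for s in bucket["samples"]:
                ratios = np.exp(
                    current_logprob_fn(s) - np.asarray(s["behavior_logprobs"])
                )
                if np.var(ratios) <= max_ratio_var:
                    selected.append((bucket["version"], s))
        return selected
```

The selected samples retain their behavior-policy version, so a downstream multi-behavior policy optimization step could weight or correct each sample according to the policy that generated it.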