Native Parallel Reasoner: Reasoning in Parallelism via Self-Distilled Reinforcement Learning
Tong Wu ⋅ Michael Liu ⋅ Jun Bai ⋅ Zixia Jia ⋅ Shuyi Zhang ⋅ Ziyong Lin ⋅ Yanting Wang ⋅ Song-Chun Zhu ⋅ Zilong Zheng
Abstract
We introduce **Native Parallel Reasoner (NPR)**, a teacher-free framework that enables Large Language Models (LLMs) to self-evolve genuine parallel reasoning capabilities. NPR transforms the model from sequential emulation to native parallel cognition through three key innovations: 1) a **self-distilled** progressive training paradigm that transitions from "cold-start" format discovery to strict topological constraints without external supervision; 2) a novel **Parallel-Aware Policy Optimization (PAPO)** algorithm that optimizes branching policies directly within the execution graph, allowing the model to learn adaptive decomposition via trial and error; and 3) a robust **NPR Engine** that refactors memory management and flow control of SGLang to enable stable, large-scale parallel RL training. Across eight reasoning benchmarks, NPR trained on Qwen3-4B achieves performance gains of up to 24.5% and inference speedups up to 4.6×. Unlike prior baselines that often fall back to autoregressive decoding, NPR demonstrates 100% genuine parallel execution, establishing a new standard for self-evolving, efficient, and scalable agentic reasoning.