Escaping the Verifier: Learning to Reason via Demonstrations
Locke Cai ⋅ Max Ryabinin ⋅ Ivan Provilkov
Abstract
Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, yet offer abundant expert demonstrations that remain underutilized for reasoning-focused training. We introduce **RARO** (Relativistic Adversarial Reasoning Optimization), which learns strong reasoning capabilities from expert demonstrations alone via **Inverse Reinforcement Learning**. Our method sets up an adversarial game between a **policy** and a **relativistic critic**: the policy learns to mimic expert answers, while the critic learns to identify which answer in each (expert, policy) pair came from the expert. Both the policy and the critic are trained jointly and continuously via RL, and we identify key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all evaluation tasks, achieving $+13.7\%$ accuracy on Countdown ($1.5\text{B}$), $+8.2\%$ on DeepMath ($7\text{B}$), and a $+19.1\%$ win rate against expert poems on Poetry Writing ($7\text{B}$). RARO also exhibits the same robust scaling trends as RL with verifiers. These results demonstrate that our method elicits strong reasoning performance from expert demonstrations alone, enabling robust learning to reason even when task-specific verifiers are unavailable.
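To make the adversarial setup concrete, below is a minimal PyTorch sketch of the two objectives. It is an illustration under stated assumptions, not the paper's implementation: `critic` is a hypothetical callable returning a logit for an ordered answer pair (higher meaning the first answer looks more expert-like), the critic update is shown as a simple pairwise classification loss rather than the joint RL training the abstract describes, and the stabilization techniques are omitted.

```python
import torch
import torch.nn.functional as F


def critic_loss(critic, prompt, expert_answer, policy_answer):
    """Pairwise loss for a relativistic critic (illustrative only).

    Assumes a hypothetical `critic(prompt, a, b)` that returns a logit:
    higher means "answer `a` is more likely the expert's than `b`".
    """
    logit_expert_first = critic(prompt, expert_answer, policy_answer)
    logit_policy_first = critic(prompt, policy_answer, expert_answer)
    # Train the critic to pick out the expert answer under both
    # orderings of the (expert, policy) pair.
    return (
        F.binary_cross_entropy_with_logits(
            logit_expert_first, torch.ones_like(logit_expert_first))
        + F.binary_cross_entropy_with_logits(
            logit_policy_first, torch.zeros_like(logit_policy_first))
    )


def policy_reward(critic, prompt, expert_answer, policy_answer):
    """Scalar reward for the policy's RL step (illustrative only).

    The policy is rewarded for fooling the critic: the reward is the
    critic's probability that the policy answer is the expert one.
    """
    with torch.no_grad():
        logit = critic(prompt, policy_answer, expert_answer)
    return torch.sigmoid(logit)
```

In a full loop, `policy_reward` would supply the scalar reward for the policy's RL update while `critic_loss`-style updates run on fresh (expert, policy) pairs; how the two players are kept in balance is exactly where the paper's stabilization techniques come in.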