Natural Language Actor–Critic Is Bilevel: Learning to Reason with Textual Feedback
Abstract
Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning is sample-inefficient under sparse terminal rewards. Prior work mitigates this by adding natural language critiques, yet it typically treats critique generation as fixed or auxiliary, so correct-sounding feedback may not translate into higher verified reward. We argue that natural language actor-critic training for reasoning is inherently bilevel: the usefulness of a critique is defined by its downstream effect on the actor after adaptation. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Experiments on MATH-500, MBPP, and GPQA show that Bi-NAC improves sample and parameter efficiency over RL baselines and fixed-critic feedback methods, enabling smaller models to outperform larger baselines. Specifically, our 2B model consistently outperforms the larger 3B GRPO baseline across all tasks (e.g., 46.6% vs. 41.4% on MATH-500), and our 6B model surpasses the 7B GRPO baseline (e.g., 49.3% vs. 43.6% on GPQA). These results show that aligning the actor and critic via a bilevel formulation provides a robust and efficient alternative for solving complex reasoning tasks.
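As a sketch of the formulation (notation is ours and assumed for illustration, not taken from the paper: a critic q_\phi maps a problem x and an actor draft y to a critique c, the actor \pi_\theta produces a revision y', and R is the verifiable reward), the Stackelberg program takes the generic bilevel form

\[
\max_{\phi} \; \mathbb{E}_{\,c \sim q_\phi(\cdot \mid x, y),\; y' \sim \pi_{\theta^\star(\phi)}(\cdot \mid x, y, c)} \big[ R(x, y') \big]
\quad \text{s.t.} \quad
\theta^\star(\phi) \in \arg\max_{\theta} \; \mathbb{E}_{\,c \sim q_\phi(\cdot \mid x, y),\; y' \sim \pi_{\theta}(\cdot \mid x, y, c)} \big[ R(x, y') \big],
\]

so the critic (leader) is credited only through the verified reward the actor (follower) attains after adapting to its feedback, matching the claim that a critique's usefulness is defined by its downstream effect on the actor.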