On Learning to Think with Action Process Reward Models
Abstract
Large language models (LLMs) trained with reinforcement learning (RL) on verifiable task completion show exciting promise for tool-calling tasks. However, extending this paradigm to multi-turn, long-horizon workflows results in sparse rewards and credit-assignment problems. Meanwhile, providing a more frequent learning signal, either via step-level process feedback or via behavioral cloning / supervised finetuning (SFT) on full reasoning traces, requires expensive human annotation that limits scalability. We study an alternative paradigm of learning from easy-to-collect, action-only demonstrations: successful trajectories that contain only actions, which can be recorded without additional effort while a human completes a task. We first find that this action-only setting results in significant performance gaps across text-based games and function-calling benchmarks compared to training on full reasoning traces. To address this, we introduce Action Process Reward Models (Act-PRMs). We hypothesize that action-only traces provide only part of the context necessary to generate high-return future steps. This motivates Act-PRM as an expectation-maximization (EM) approach: we treat actions as observed data behind “latent” thoughts, and train LLMs to generate thoughts that maximize the likelihood of the observed next-step actions. We further show that this EM objective can be implemented as policy gradient with dense rewards given by the action likelihoods, enabling LLMs to iteratively generate high-likelihood reasoning traces without human annotation. Finally, we empirically show that Act-PRMs consistently turn action-only logs into an effective training signal. Across multi-turn text-based games and tool-calling benchmarks, Act-PRMs match or exceed SFT on fully annotated reasoning traces on end-to-end task completion. They also deliver substantial improvements over action-only behavioral cloning.
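The idea of rewarding a latent thought by the likelihood it assigns to the observed action can be illustrated with a toy sketch. This is not the paper's implementation: the three discrete "thoughts", the fixed `ACTION_LIKELIHOOD` table, the softmax policy over thoughts, and all function names are assumptions made purely for illustration, and the gradient is computed exactly rather than sampled from an LLM.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: 3 candidate thoughts, 2 possible actions.
# ACTION_LIKELIHOOD[z][a] = p(action a | thought z); a fixed table assumed
# here only to keep the example self-contained.
ACTION_LIKELIHOOD = [
    [0.2, 0.8],
    [0.9, 0.1],
    [0.5, 0.5],
]

def dense_reward(thought, observed_action):
    """Per-step reward: log-likelihood of the observed action given the thought."""
    return math.log(ACTION_LIKELIHOOD[thought][observed_action])

def expected_reward(logits, observed_action):
    """Expected dense reward under the current softmax policy over thoughts."""
    probs = softmax(logits)
    return sum(p * dense_reward(z, observed_action) for z, p in enumerate(probs))

def policy_gradient_step(logits, observed_action, lr=0.5):
    """One exact policy-gradient ascent step on the thought policy."""
    probs = softmax(logits)
    rewards = [dense_reward(z, observed_action) for z in range(len(logits))]
    baseline = sum(p * r for p, r in zip(probs, rewards))
    # Gradient of E_z[r(z)] w.r.t. softmax logit k is pi_k * (r_k - baseline).
    grads = [p * (r - baseline) for p, r in zip(probs, rewards)]
    return [l + lr * g for l, g in zip(logits, grads)]

def train(observed_action, steps=200):
    """Start from a uniform thought policy and run repeated gradient steps."""
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        logits = policy_gradient_step(logits, observed_action)
    return logits
```

In this sketch, calling `train(0)` shifts probability mass toward the thought that makes action 0 most likely, with no annotation of the thoughts themselves. In the setting the abstract describes, the thought policy would be an LLM and the reward would come from an action likelihood computed at each step of a trajectory.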