Poster in Workshop: RLxF: RL from World Feedback
Fri, Jul 10, 2026 • 12:00 AM – 1:00 AM PDT

On Learning to Think with Action Process Reward Models

Michael Zhang ⋅ Madison Ho

Abstract

Large language models (LLMs) and reinforcement learning (RL) with verifiable task completion show exciting promise for tool-calling tasks. However, extending this paradigm to multi-turn, long-horizon workflows results in sparse rewards and credit-assignment problems. Meanwhile, providing a more frequent learning signal, either via step-level process feedback or via behavioral cloning / supervised finetuning (SFT) on full reasoning traces, requires expensive human annotation that limits scalability. We study an alternative paradigm of learning from easy-to-collect, action-only demonstrations: successful trajectories that contain only actions, which can be recorded without additional effort while a human completes a task. We first find that this action-only setting results in significant performance gaps across text-based games and function-calling benchmarks compared to training on full reasoning traces. To address this, we introduce Action Process Reward Models (Act-PRMs). We hypothesize that action-only traces provide only part of the context necessary to generate high-return future steps. This motivates Act-PRM as an expectation-maximization (EM) approach: we treat actions as observed data behind “latent” thoughts, and train LLMs to generate thoughts that maximize the likelihood of next-step observed actions. We further show that this EM objective can be implemented as policy gradient with dense rewards given by the action likelihoods, enabling LLMs to iteratively generate high-likelihood reasoning traces without human annotation. Finally, we empirically show that Act-PRMs consistently turn action-only logs into an effective training signal. Across multi-turn text-based games and tool-calling benchmarks, Act-PRMs match or exceed SFT on fully annotated reasoning traces in end-to-end task completion. They also deliver substantial improvements over action-only behavioral cloning.
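
To make the idea of "policy gradient with dense rewards via the action likelihoods" concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes a tabular setting with three candidate thoughts and two actions, and a fixed, hypothetical table `ACTION_PROBS` standing in for the model's action likelihood given a thought; in the setting the abstract describes, an LLM would play both roles. All names (`ACTION_PROBS`, `action_log_likelihood_reward`, `train_thought_policy`) are illustrative assumptions. The thought policy is updated with REINFORCE, using the log-likelihood of the observed action as the per-step reward.

```python
import math
import random

# Hypothetical toy model: ACTION_PROBS[z][a] = p(action a | thought z).
# Held fixed here for simplicity; this is an assumption of the sketch.
ACTION_PROBS = [
    [0.9, 0.1],  # thought 0 mostly leads to action 0
    [0.5, 0.5],  # thought 1 is uninformative
    [0.2, 0.8],  # thought 2 mostly leads to action 1
]
NUM_THOUGHTS = len(ACTION_PROBS)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_log_likelihood_reward(thought, observed_action):
    """Dense reward: log p(observed action | sampled thought)."""
    return math.log(ACTION_PROBS[thought][observed_action])


def train_thought_policy(observed_action, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a softmax thought policy, rewarded by action likelihood."""
    rng = random.Random(seed)
    logits = [0.0] * NUM_THOUGHTS
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a latent thought from the current policy.
        z = rng.choices(range(NUM_THOUGHTS), weights=probs)[0]
        reward = action_log_likelihood_reward(z, observed_action)
        # Baseline: expected reward under the current policy (variance reduction).
        baseline = sum(
            p * action_log_likelihood_reward(k, observed_action)
            for k, p in enumerate(probs)
        )
        advantage = reward - baseline
        # Gradient of log softmax(z) w.r.t. logit k is 1[k == z] - probs[k].
        for k in range(NUM_THOUGHTS):
            grad = (1.0 if k == z else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
    return softmax(logits)


if __name__ == "__main__":
    # The demonstration contains only the action (here, action 1), no thought.
    print(train_thought_policy(observed_action=1))
```

In this toy, the policy should shift probability toward the thought that best explains the observed action, with no thought labels supplied. Because every step has an observed action, every sampled thought receives a reward, which is what makes the signal dense rather than tied to end-of-task success.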