PRM-PBE: Process Reward Model for Reinforcement Learning in Programming-by-Example
Abstract
Programming-by-Example (PBE), a typical few-shot inductive reasoning paradigm, aims to synthesize programs from a small set of input-output examples. Although Large Language Models (LLMs) have demonstrated strong potential for program synthesis, they remain ineffective on complex PBE tasks. In particular, LLMs often fail to grasp the underlying intent of the examples, producing programs that satisfy the examples only partially or deviate from the target entirely. To address these limitations, we introduce a process-supervised reinforcement learning method that provides fine-grained feedback during synthesis, improving the ability of LLMs to capture the intended behavior of the provided examples. First, we develop a reasoning-tree construction method to build a PBE process-supervision dataset. We then train a process reward model through preference learning to evaluate the quality of intermediate reasoning steps. Finally, we introduce a curriculum learning strategy based on PBE task difficulty and optimize the model with Proximal Policy Optimization (PPO). Experimental results on representative PBE benchmarks show that our approach achieves an average pass rate of 56.61\%, significantly outperforming the state-of-the-art baseline by 8.73\%.
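As a minimal sketch of what training a process reward model through preference learning can look like, assuming a standard Bradley-Terry pairwise formulation over preferred and rejected reasoning steps (the abstract does not specify the exact objective):
\[
\mathcal{L}_{\mathrm{PRM}}(\theta) = -\,\mathbb{E}_{(s^{+},\,s^{-})\sim\mathcal{D}}\left[\log \sigma\!\big(r_{\theta}(s^{+}) - r_{\theta}(s^{-})\big)\right]
\]
where $r_{\theta}$ is the process reward model's score for a reasoning step, $(s^{+}, s^{-})$ is a pair of preferred and rejected steps drawn from the process-supervision dataset $\mathcal{D}$, and $\sigma$ is the logistic function. Under this formulation, the trained $r_{\theta}$ would supply the fine-grained, step-level feedback consumed by PPO during policy optimization.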