ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards
Abstract
Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a self-correcting framework that enables search agents to recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can evaluate the retrieved information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Additionally, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a contamination-free benchmark requiring complex reasoning. Experiments show that ReSeek significantly outperforms state-of-the-art baselines in both task success rate and path faithfulness.
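To make the decomposed process reward and the JUDGE action concrete, the sketch below illustrates one plausible reading of the abstract in Python. Every name, weight, and threshold here (`Evidence`, `process_reward`, `judge_and_replan`, `w_correct`, `w_utility`) is a hypothetical assumption for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of a decomposed process reward and a JUDGE step.
# All names, weights, and signatures below are illustrative assumptions,
# not ReSeek's actual implementation.

from dataclasses import dataclass


@dataclass
class Evidence:
    text: str
    is_factual: bool      # assumed: verified against the retrieved source
    supports_query: bool  # assumed: judged genuinely useful for the query


def process_reward(evidence: Evidence,
                   w_correct: float = 0.5,
                   w_utility: float = 0.5) -> float:
    """Dense process reward = correctness term + utility term (assumed weights)."""
    r_correct = 1.0 if evidence.is_factual else 0.0
    r_utility = 1.0 if evidence.supports_query else 0.0
    return w_correct * r_correct + w_utility * r_utility


def judge_and_replan(evidence: Evidence, threshold: float = 0.75) -> str:
    """Hypothetical JUDGE action: keep the current path if the evidence
    scores well; otherwise discard it and re-plan the search."""
    if process_reward(evidence) >= threshold:
        return "CONTINUE"  # evidence accepted; proceed along the current path
    return "REPLAN"        # evidence rejected; backtrack and issue a new query


# Example: factually correct but query-irrelevant evidence scores 0.5,
# which falls below the assumed threshold and triggers a re-plan.
print(judge_and_replan(Evidence("...", is_factual=True, supports_query=False)))
```

Under these assumptions, the utility term is what lets the agent reject evidence that is true but unhelpful, which is the failure mode the self-correcting JUDGE action is meant to catch.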