Poster Tue, Jul 7, 2026 • 6:30 PM – 8:15 PM PDT HALL A #1901

Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Jianxiang Zang ⋅ Yongda Wei ⋅ Ruxue Bai ⋅ Shiyu Jiang ⋅ Nijia Mo ⋅ Binhong Li ⋅ Qiang Sun ⋅ Hui Liu

Abstract

Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current RM evaluation methods focus solely on preference perception accuracy in given, specific scenarios, obscuring critical vulnerabilities of RMs in real-world scenarios. We identify that the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering “How accurate is the RM's preference perception for given samples?”, it employs scientific auditing to answer: “Can we infer that RMs exhibit systematic vulnerabilities in specific real-world scenarios?” Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing the distributional degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios, thereby laying a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
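To make the auditing idea concrete, the following is a minimal sketch of how significance and effect size could be computed from paired confidences on clean and perturbed inputs. It is not the authors' implementation: the Bradley-Terry style confidence (a sigmoid of the reward margin), the one-sided sign-flip permutation test, Cohen's d on paired drops, and all function names (`preference_confidence`, `audit`) are assumptions chosen for illustration, and the abstract does not specify which tests Reward Auditor uses.

```python
import math
import random
from statistics import mean, stdev


def preference_confidence(chosen_reward, rejected_reward):
    """Assumed confidence measure: sigmoid of the reward margin (Bradley-Terry style)."""
    return 1.0 / (1.0 + math.exp(-(chosen_reward - rejected_reward)))


def audit(clean_conf, perturbed_conf, n_resamples=10_000, seed=0):
    """One-sided paired sign-flip permutation test plus Cohen's d on paired drops.

    H0: the perturbation does not lower confidence (mean paired drop <= 0).
    Returns (p_value, effect_size).
    """
    drops = [c - p for c, p in zip(clean_conf, perturbed_conf)]
    observed = mean(drops)

    # Under H0 the sign of each paired drop is exchangeable, so flip signs at random
    # and count how often the resampled mean is at least as large as the observed one.
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = mean(d if rng.random() < 0.5 else -d for d in drops)
        if flipped >= observed:
            hits += 1
    p_value = (hits + 1) / (n_resamples + 1)  # add-one correction avoids p = 0

    # Cohen's d for paired data: mean drop divided by the standard deviation of drops.
    sd = stdev(drops)
    if sd > 0:
        effect_size = observed / sd
    else:
        effect_size = 0.0 if observed == 0 else math.copysign(math.inf, observed)
    return p_value, effect_size


# Toy confidences (made up for illustration, not results from the paper).
clean = [0.90, 0.85, 0.80, 0.95, 0.88, 0.92, 0.87, 0.90]
perturbed = [0.60, 0.70, 0.65, 0.70, 0.60, 0.75, 0.62, 0.68]
p_value, effect_size = audit(clean, perturbed)
print(f"p-value: {p_value:.4f}, effect size: {effect_size:.2f}")
```

In this sketch the p-value plays the role of the certainty of a vulnerability and the effect size plays the role of its severity; a real audit would also need to correct for testing many perturbation types at once.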

Lay Summary

(1) Problem: To ensure artificial intelligence (AI) behaves safely, developers use "reward models" to judge AI responses. However, these judges are typically evaluated in perfect, textbook conditions. In the real world, human inputs are messy: filled with typos, changing formats, and diverse phrasing. Current evaluation methods fail to show whether an AI truly understands human values or is merely memorizing superficial patterns that break down under real-world noise. (2) Solution: To address this, we developed Reward Auditor, a new diagnostic framework. Instead of simply asking whether an AI gets the right answer on a standard test, our tool scientifically stress-tests it. We apply realistic variations to the text, such as swapping synonyms or changing the format, and rigorously check whether the AI's judgment systematically collapses. (3) Impact: Using this tool, we discovered that many popular AI judges are surprisingly fragile when facing these everyday variations. By identifying these hidden vulnerabilities, Reward Auditor helps developers predict real-world failures and build next-generation AI systems that are verifiably safe, robust, and trustworthy in unpredictable environments.
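The kinds of variation mentioned above (typos, synonym swaps, format changes) can be sketched as simple text transformations. This is an illustrative toy only: the synonym table and the function names `swap_synonyms`, `inject_typo`, and `reformat_as_bullets` are hypothetical and are not taken from Reward Auditor.

```python
import random

# Hypothetical, tiny synonym table for illustration only.
SYNONYMS = {"big": "large", "quick": "fast", "help": "assist"}


def swap_synonyms(text):
    """Replace whole lowercase words that appear in the toy synonym table."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split(" "))


def inject_typo(text, seed=0):
    """Swap two adjacent characters at a seeded random position."""
    if len(text) < 2:
        return text
    i = random.Random(seed).randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def reformat_as_bullets(text):
    """Present each sentence as its own bullet line, changing layout but not wording."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return "\n".join(f"- {s}." for s in sentences)


prompt = "Please help me write a quick summary. Keep it short."
print(swap_synonyms(prompt))
print(inject_typo(prompt, seed=1))
print(reformat_as_bullets(prompt))
```

Each transformation keeps the meaning of the prompt intact, so a reward model that truly tracks human preferences would be expected to judge the original and the varied text similarly.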