RubricRobustness: A Simple Framework for Evaluating the Robustness of Rubrics-Based Benchmarks
Manasi Sharma
Abstract
The advancement of Large Language Models (LLMs) into higher-level reasoning domains has rendered traditional heuristic evaluators insufficient for long-form, open-ended responses, precipitating the widespread adoption of rubric-based benchmarks. While these frameworks use expert-curated criteria and an LLM-as-a-judge to assess open-ended generation, whether the evaluation harnesses themselves withstand basic validity checks remains critically under-investigated. To bridge this gap, we introduce RubricRobustness, a systematic sensitivity-analysis framework that subjects these benchmarks to three commonsense perturbations: semantic negation, stochastic deletion, and irrelevant addition. Applying the framework to two of the most popular rubric-based benchmarks, HealthBench and WildBench, we measure the extent to which manipulating the semantic veracity of a model's response impacts its resulting score. Our findings reveal systematic vulnerabilities: while both benchmarks respond sharply to semantic negation (e.g., degradation slopes of approximately $-0.38$ on HealthBench and $-0.55$ on WildBench), they are substantially less responsive to irrelevant addition, often requiring over 35% of sentences to be perturbed before inducing even a 25% score drop. We argue that perturbation-based sensitivity analyses of this form are a necessary prerequisite for validating rubric coverage, ensuring that automated evaluation frameworks reliably penalize basic semantic failures. We plan to release our framework as an open-source tool to facilitate the development of more resilient benchmarks.
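To make the setup concrete, the following is a minimal sketch of sentence-level versions of the three perturbations and of an ordinary least-squares fit for the degradation slope. The function names, the stub negation rewrite, and the filler sentence are illustrative assumptions, not the paper's implementation:

```python
import random

def negate(sentence: str) -> str:
    # Stub negation: in practice semantic negation would likely be
    # produced by an LLM or a rule-based rewriter (assumption).
    return "It is not the case that " + sentence[0].lower() + sentence[1:]

def perturb(sentences: list[str], kind: str, fraction: float,
            rng: random.Random) -> list[str]:
    """Apply one perturbation type to a sampled fraction of sentences."""
    k = max(1, round(fraction * len(sentences)))
    idx = set(rng.sample(range(len(sentences)), k))
    if kind == "negation":
        return [negate(s) if i in idx else s for i, s in enumerate(sentences)]
    if kind == "deletion":
        return [s for i, s in enumerate(sentences) if i not in idx]
    if kind == "addition":
        # Insert an off-topic filler sentence after each sampled position.
        filler = "The weather in Paris is often mild in spring."
        out: list[str] = []
        for i, s in enumerate(sentences):
            out.append(s)
            if i in idx:
                out.append(filler)
        return out
    raise ValueError(f"unknown perturbation kind: {kind}")

def degradation_slope(fractions: list[float], scores: list[float]) -> float:
    # OLS slope of judge score vs. perturbation fraction, i.e. how fast
    # the benchmark score falls as more of the response is corrupted.
    n = len(fractions)
    mx, my = sum(fractions) / n, sum(scores) / n
    num = sum((x - mx) * (y - my) for x, y in zip(fractions, scores))
    den = sum((x - mx) ** 2 for x in fractions)
    return num / den
```

Under this reading, a benchmark would be scored on perturbed responses at several fractions, and a strongly negative slope (as reported for semantic negation) indicates the judge reliably penalizes the corruption, while a near-zero slope (as with irrelevant addition) signals a coverage gap.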