PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception
Abstract
We introduce the Perception Rubric Benchmark (PRB), a rubric-based evaluation framework for Multimodal Large Language Models (MLLMs) that addresses the growing gap between benchmark scores and human-perceived quality. As standard perception metrics approach saturation, they produce compressed rankings that obscure meaningful performance differences, largely due to their linear and lenient reward designs. PRB reframes evaluation from holistic scoring to rubric-based verification. It is built through a scalable hybrid automation pipeline over a stratified collection of complex, multi-domain visual inputs. Using pairwise contrastive generation, PRB distills over 15,000 diagnostic rubric criteria that function as explicit unit tests for perception and are evaluated via a ternary protocol distinguishing benign approximations from perceptually critical errors. Experiments show that PRB separates models whose rankings are compressed on existing leaderboards, reveals perceptual blind spots in top-performing models, and aligns more closely with human preference than conventional metrics. Beyond evaluation, the generated rubrics can be reused as inference-time verifiers, yielding consistent gains on multiple perception benchmarks. PRB provides a principled foundation for reliable and discriminative multimodal evaluation.
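To make the rubric-as-unit-test framing and the ternary protocol concrete, the following is a minimal sketch; all names here (`Verdict`, `RubricCriterion`, `score_response`, the `judge` callable, and the partial-credit weighting) are illustrative assumptions, not PRB's actual interface or scoring rule.

```python
# Illustrative sketch of rubric-based ternary verification.
# Names, structure, and credit weights are assumptions for exposition,
# not PRB's actual implementation.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    """Ternary outcome for a single rubric criterion."""
    SATISFIED = "satisfied"          # output fully meets the criterion
    BENIGN = "benign"                # approximation, not perceptually critical
    CRITICAL = "critical"            # perceptually critical error


@dataclass
class RubricCriterion:
    """One diagnostic criterion acting as a 'unit test' for perception."""
    description: str  # e.g. "The clock in the image reads 3:15."


def score_response(
    response: str,
    rubric: List[RubricCriterion],
    judge: Callable[[str, RubricCriterion], Verdict],
) -> float:
    """Aggregate ternary verdicts over a rubric into a single score:
    full credit for satisfied criteria, partial credit for benign
    approximations, none for critical errors (weights are assumed)."""
    credit = {Verdict.SATISFIED: 1.0,
              Verdict.BENIGN: 0.5,
              Verdict.CRITICAL: 0.0}
    verdicts = [judge(response, criterion) for criterion in rubric]
    return sum(credit[v] for v in verdicts) / len(verdicts)
```

The same `judge` callable could, in principle, serve as the inference-time verifier mentioned above, checking candidate responses against the distilled criteria before selection.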