SemanticSRJudge: Spatially-Grounded VLM Evaluation for Super-Resolution Quality Assessment
Abstract
Single-image super-resolution (SR) has advanced rapidly, but its evaluation still relies heavily on scalar metrics such as PSNR, SSIM, and LPIPS. These metrics give useful aggregate signals, but they do not explain why one model is better than another, nor do they reveal the localized failure modes that distinguish modern SR architectures. Different SR architectures also fail in qualitatively different ways: some sharpen aggressively at the cost of edge ringing, others preserve content faithfully but flatten fine texture, and diffusion-based models can hallucinate detail that has no support in the reference. Full-image vision-language model (VLM) judges provide richer feedback, but their attention is diluted over the entire image, while many SR errors are small, spatially concentrated, and content-dependent. We introduce SemanticSRJudge, a training-free framework that makes these tradeoffs visible. A frozen DINOv2 backbone identifies regions where an SR output semantically drifts from its reference, and a VLM judge evaluates those regions alongside the full image. This turns a single global judgment into a structured diagnostic across seven perceptual dimensions, revealing where each architecture succeeds, where it fails, and what kind of failure it commits. We also introduce Semantic-SR Bench, a content-stratified benchmark covering seven semantic categories, designed to expose model preferences that are hidden at the dataset level. Across 5,072 matched judge calls spanning four SR architectures, four datasets, and both 2× and 4× scales, SemanticSRJudge consistently corrects the optimistic bias of full-image VLM scoring and recovers content-specific model tradeoffs. In a controlled human study on RealSR Canon 4×, DINOv2-guided routing raises mean Win% from 41.9% to 48.6% (+6.7pp) and improves mean Spearman correlation with human ratings from +0.21 to +0.31 (+0.10).
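To make the region-selection idea concrete, the following is a minimal sketch of how patches with large semantic drift might be ranked. It rests on assumptions not stated in the abstract: drift is measured here as cosine distance between per-patch feature vectors, the top-k patches are kept, and the names `cosine_distance` and `top_drift_patches` are hypothetical. The feature vectors stand in for embeddings from a frozen backbone; no actual DINOv2 call is made.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def top_drift_patches(sr_feats, ref_feats, k):
    """Rank patch positions by feature drift and return the k largest.

    sr_feats, ref_feats: dicts mapping (row, col) -> feature vector,
    standing in for per-patch embeddings of the SR output and reference.
    Returns a list of ((row, col), drift) pairs, largest drift first.
    """
    scores = {
        pos: cosine_distance(sr_feats[pos], ref_feats[pos])
        for pos in ref_feats
    }
    ranked = sorted(scores, key=lambda pos: scores[pos], reverse=True)
    return [(pos, scores[pos]) for pos in ranked[:k]]


# Toy example: three patches with 2-D features (illustrative values only).
ref = {(0, 0): [1.0, 0.0], (0, 1): [1.0, 0.0], (1, 0): [1.0, 0.0]}
sr = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0], (1, 0): [1.0, 1.0]}
selected = top_drift_patches(sr, ref, k=2)
```

In this toy input, patch `(0, 1)` is orthogonal to its reference and ranks first, followed by the partially rotated patch `(1, 0)`. The selected positions would then be cropped and passed to the VLM judge alongside the full image; the selection rule actually used by the framework may differ from this sketch.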