Position: Peer Review Should Be Calibrated via LLM Scoring
Abstract
As submission volumes grow, peer review at AI conferences increasingly suffers from scale drift and non-comparable scoring: similar rationales can yield markedly different numeric ratings due to subjective calibration and occasionally incoherent or strategic scoring, even though scores often strongly influence outcomes. This position paper argues that AI conference workflows should incorporate an LLM-driven calibration layer that maps reviewer rationales (e.g., stated strengths and weaknesses) to consistent, auditable anchor scores. The residual between a reviewer’s reported score and the anchor score turns rationale--score misalignment into a measurable signal for targeted escalation. We instantiate an end-to-end pipeline and apply it to OpenReview data from ICLR 2023--2025 to quantify severity and leniency patterns and to locate where misalignment concentrates. We further propose a lightweight post-check that requests additional justification or a score revision when residuals are large, and we estimate its impact via an offline counterfactual simulation. Finally, we outline an adoption playbook and governance boundaries, emphasizing that the LLM audits scoring coherence rather than replacing human judgment or making accept/reject decisions.
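As an illustrative sketch of the calibration signal (the notation below is ours and is not fixed by the abstract): let $s_i$ denote reviewer $i$'s reported score and $\hat{s}_i$ the LLM anchor score derived from the same rationale. The residual and the post-check trigger can then be written as
\[
  r_i = s_i - \hat{s}_i, \qquad \text{escalate review } i \ \text{if } |r_i| > \tau,
\]
where $\tau$ is a hypothetical escalation threshold chosen by the venue; under this sign convention, large positive residuals suggest leniency relative to the stated rationale and large negative residuals suggest severity.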