Beyond Drift: Stabilizing Subjective LLM Evaluation with Information-Theoretic Rubrics
Abstract
Despite the growing use of large language models (LLMs) in subjective tasks such as role-playing, humor, emotional intelligence, and dialogue quality, their evaluation faces a pressing reproducibility crisis: even the same evaluator may contradict itself when re-judging the same sample. We attribute this instability to dimension drift, whereby free-form evaluation protocols (e.g., Chain-of-Thought reasoning) unpredictably shift their implicit criteria across runs, undermining reliability. To address this fundamental challenge, we reformulate subjective evaluation as an information-theoretic optimization problem. Specifically, we propose an Expected Information Gain (EIG)-based framework that constructs a personalized rubric that is stable yet adaptive, eliminating dimension drift. Our two-stage “generate-then-score” design first produces a diverse pool of candidate evaluation questions and then selects the most informative subset via EIG, yielding explicit and repeatable criteria. Experiments on six benchmarks, including CharacterEval, rJokes, and MT-Bench, demonstrate that our approach substantially improves both evaluation consistency and alignment with human judgments, outperforming CoT-based and fixed-questionnaire baselines. These results highlight that information-theoretic questionnaire construction offers a principled and reliable path toward reproducible evaluation of subjective tasks.
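For reference, one standard formulation of Expected Information Gain from Bayesian experimental design is sketched below; the notation is illustrative rather than taken from the paper, with $\theta$ denoting the latent quality of a response, $q$ a candidate rubric question, and $a$ the evaluator's answer to $q$:
\[
\mathrm{EIG}(q) \;=\; \mathbb{E}_{a \sim p(a \mid q)}\!\left[ D_{\mathrm{KL}}\!\big( p(\theta \mid a, q) \,\|\, p(\theta) \big) \right] \;=\; I(\theta; a \mid q).
\]
Under this reading, selecting the most informative subset of questions amounts to preferring questions whose answers are expected to most reduce uncertainty about the quality being assessed.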