Evaluating Language Models in Realistic Conversational Contexts
Abstract
As Large Language Models (LLMs) are increasingly deployed to serve open-ended, multi-turn interactions, evaluating conversational quality at human scale has become a central challenge. Existing evaluation frameworks built for summarization, translation, or short-form QA tasks fall short of adequately measuring the consistency of human-scale dialogue, especially because the derivation and validation of these metrics often rely on synthetic rather than human sources. We fill this gap by introducing UPHELD (Utility & Planning Human-Scale Evaluated Long Dialogues), a large, reference-full benchmark for evaluating human-scale conversational ability beyond factual correctness. UPHELD consists of hundreds of complete human-to-human dialogues authored by professional scriptwriters, with realistic turn densities and 36,000+ per-turn human annotations across 10,000+ expert-generated dialogue turns. Using UPHELD, we systematically evaluate classical automatic metrics and reference-free LLM-as-a-judge approaches, and find that they correlate poorly with expert human judgment. Building on this analysis, we use UPHELD to develop a Mixture-of-Judges framework that combines multiple evaluative signals and improves correlation with human assessments by approximately 30%. Overall, UPHELD provides a robust, human-grounded foundation for evaluating long, human-scale conversational intelligence, filling a crucial gap in the existing LLM dataset landscape.