The Consistency Trap in LLMs: Generator-Evaluator Agreement and Vulnerability to Mistakes
Abstract
Large language models are often evaluated for correctness on isolated questions. But modern deployments also rely on a different property: whether the model stays consistent as it generates, critiques, and revises over multiple steps that draw on the same underlying concepts. In these settings, self-consistency seems desirable, since it limits drift as models reuse and assess their own outputs. Yet we lack standard, deployment-relevant ways of quantifying it. How do we measure self-consistency, and what can it signal for reliability in deployment? In this paper we propose a new measure: generator–evaluator self-consistency, which assesses whether a model applies the same underlying concept consistently when that concept is invoked across related prompts. We find that models exhibit substantial variation in self-consistency that is independent of their accuracy on benchmark questions involving those same concepts. Examining this variation in a clinical setting with physician-validated mistakes, we find that higher self-consistency is linked to greater vulnerability to those mistakes. Rather than signaling robust understanding, consistency can reflect stable commitment to incomplete or brittle conceptualizations. We interpret this pattern as a consistency trap: self-consistency is operationally useful, but it can also be evidence of systematized errors.
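As an illustrative sketch only (the paper's exact formulation may differ), generator–evaluator self-consistency for a concept c could be operationalized as the rate at which the model, acting as evaluator, endorses its own generated answers over a set of related prompts that invoke c:

SC_M(c) = \frac{1}{|P_c|} \sum_{p \in P_c} \mathbb{1}\left[ E_M\bigl(p,\, G_M(p)\bigr) = \text{accept} \right]

where P_c is a set of prompts invoking concept c, G_M(p) is the model's generated answer to prompt p, and E_M(p, a) is its evaluator-mode judgment of answer a; the symbols P_c, G_M, E_M, and the accept/reject judgment are assumptions introduced here for illustration, not definitions from the paper.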