Generalized Correctness Models: Learning Calibrated and Cross-Model Correctness Predictors from Historical Patterns
Abstract
Generating accurate and calibrated confidence estimates is critical for deploying LLMs in high-stakes or user-facing applications, and remains an open challenge. Prior research has often framed confidence as a problem of eliciting a model’s “self-knowledge”, i.e., the ability of an LLM to judge whether its own answers are correct; this approach implicitly assumes that there is some privileged information about the answer’s correctness that is accessible to the model itself. However, we find that, whether trained or training-free, an LLM attempting to predict the correctness of its own outputs generally performs no better than an unrelated LLM attempting the same task. Moreover, we hypothesize that a key factor in predicting model correctness, i.e., building a “Correctness Model” (CM), is exposure to a target model’s historical predictions. We propose multiple methods to inject this historical correctness information, including training an LLM to predict the correctness of many other LLMs, i.e., creating a Generalized Correctness Model (GCM). We use GCMs and CMs as a lens for studying the source of correctness-prediction ability and its generalization, examining the importance of answer phrasing, world knowledge, performance history, in-context examples, and post-hoc calibration for correctness prediction. We evaluate GCMs based on Qwen3-8B across 5 model families and the MMLU and TriviaQA datasets, as well as on a downstream selective prediction task, finding that reliable LLM confidence estimation is a generalizable, cross-model skill learned by systematically encoding correctness history rather than a model-specific skill reliant on introspection.
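As a rough illustration of the setup summarized above (not the paper's actual pipeline), the sketch below shows one way historical (question, answer, correctness) records from multiple target models could be turned into binary training examples for a GCM, and how thresholding the resulting confidences yields a selective-prediction metric; all helper names (Record, make_gcm_example, selective_accuracy) are hypothetical.

```python
# Minimal sketch, assuming a binary "is this answer correct?" formulation.
# Names and prompt format are illustrative, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Record:
    model: str      # which target LLM produced the answer
    question: str
    answer: str     # the target model's phrased answer
    correct: bool   # graded against the gold label (historical correctness)

def make_gcm_example(r: Record) -> dict:
    """Format one historical record as a correctness-prediction training example."""
    prompt = (
        f"Model: {r.model}\n"
        f"Question: {r.question}\n"
        f"Proposed answer: {r.answer}\n"
        "Is the proposed answer correct? Answer yes or no."
    )
    return {"input": prompt, "label": "yes" if r.correct else "no"}

def selective_accuracy(confidences, correct, threshold=0.8):
    """Accuracy and coverage when abstaining on low-confidence predictions."""
    kept = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(correct)
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return accuracy, coverage

if __name__ == "__main__":
    history = [
        Record("model-A", "Capital of France?", "Paris", True),
        Record("model-B", "Capital of France?", "Lyon", False),
    ]
    train_set = [make_gcm_example(r) for r in history]
    print(train_set[0]["input"])
    print(selective_accuracy([0.9, 0.4], [True, False]))
```

Including the target model's identity in the input is one plausible way a single predictor trained on many models' histories could condition on model-specific correctness patterns; the exact conditioning used in the paper may differ.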