Automatic Layer Selection for Hallucination Detection
Abstract
Recent studies on hallucination detection have shown that hallucination-related signals are encoded more strongly in the intermediate layers of large language models (LLMs) than in the final layer. While a growing body of work seeks to exploit this property for hallucination detection, the problem of automatically selecting high-performing layers remains underexplored, and principled methods for doing so are still an open challenge. To address this gap, we first propose several hypotheses for why such signals emerge in intermediate layers and test the corresponding criteria for automatic layer selection. Evaluating these criteria across two LLM architectures and five datasets, we find that none delivers satisfactory performance. We therefore propose a new selection criterion, First Effective Peak of Intrinsic Dimension (FEPoID), which consistently identifies optimal or near-optimal layers and outperforms both the aforementioned criteria and existing hallucination detection baselines. FEPoID is training-free and incurs negligible computational overhead. Additionally, we study the generation behaviors of LLMs and introduce a simple yet effective truncation strategy that further amplifies hallucination-related signals and yields substantial improvements in overall detection performance.
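To make the FEPoID idea concrete, here is a minimal sketch of one plausible instantiation: estimate the intrinsic dimension (ID) of each layer's hidden states with a TwoNN-style maximum-likelihood estimator, then pick the first local maximum of the per-layer ID curve whose rise over the preceding valley exceeds a threshold. The estimator, the `min_rise` "effectiveness" test, and its default value are illustrative assumptions, not the paper's actual definition.

```python
import numpy as np

def twonn_id(X):
    """TwoNN-style MLE of intrinsic dimension from an (n, d) point cloud.
    Uses the ratio of each point's 2nd- to 1st-nearest-neighbor distance."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-distances
    r = np.sort(d, axis=1)[:, :2]        # 1st and 2nd NN distances per point
    mu = r[:, 1] / r[:, 0]
    return len(X) / np.sum(np.log(mu))   # MLE for the dimension

def first_effective_peak(ids, min_rise=0.05):
    """Index of the first local maximum of the per-layer ID curve whose
    relative rise above the preceding valley exceeds `min_rise`.
    The 'effective' test here is a hypothetical placeholder criterion."""
    valley = ids[0]
    for i in range(1, len(ids) - 1):
        valley = min(valley, ids[i - 1])
        if ids[i] >= ids[i - 1] and ids[i] > ids[i + 1] \
                and (ids[i] - valley) / valley > min_rise:
            return i
    return int(np.argmax(ids))           # fallback: global peak
```

In use, one would call `twonn_id` on the hidden states of each layer for a batch of prompts, collect the resulting curve, and read off `first_effective_peak` as the layer whose representations feed the hallucination detector.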