REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
Abstract
Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, motivating the search for adversarial prompts that realistically elicit such failures. We formulate hallucination elicitation as a constrained optimization problem whose goal is to find coherent adversarial prompts that are semantically equivalent to benign user prompts. Existing approaches struggle to solve this problem. On the one hand, attacks that optimize directly over the discrete prompt space can enforce both semantic equivalence and coherence, but are limited to a finite set of prompt variations; this constraint reduces attack diversity and often yields suboptimal solutions. On the other hand, attacks that optimize over the continuous LLM latent space admit powerful continuous optimization methods, but typically fail to produce prompts that are both semantically equivalent and coherent. To address these limitations, we propose REALISTA, an adversarial attack framework that combines the diversity of continuous attacks with the semantic realism of discrete attacks. REALISTA operates in the LLM latent space, expressing adversarial perturbations as continuous combinations of editing directions. By construction, solutions to the optimization problem correspond to valid rephrasings, naturally encouraging semantic equivalence and coherence. Experiments demonstrate that REALISTA achieves performance superior or comparable to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail.
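To make the mechanism described in the abstract concrete, the sketch below illustrates one way "adversarial perturbations as continuous combinations of editing directions" could look in code: continuous mixing coefficients over a fixed set of latent editing directions are optimized by gradient descent so that the perturbed latent maximizes a hallucination objective. This is a minimal sketch under stated assumptions, not the paper's implementation; attack_latent, hallucination_loss, and the toy cosine-similarity objective are hypothetical stand-ins for the actual method.

import torch

def attack_latent(h_prompt, directions, hallucination_loss, steps=200, lr=0.05):
    """Optimize mixing coefficients alpha so that the perturbed latent
    h_prompt + alpha @ directions minimizes a hallucination objective.

    h_prompt: (d,) latent embedding of the benign prompt
    directions: (k, d) matrix of precomputed semantic editing directions
    hallucination_loss: differentiable callable mapping a latent to a
        scalar to minimize (lower = more hallucination-inducing)
    """
    alpha = torch.zeros(directions.shape[0], requires_grad=True)
    optimizer = torch.optim.Adam([alpha], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # The perturbation lies in the span of the editing directions,
        # mirroring the abstract's constraint that solutions correspond
        # to valid rephrasings of the original prompt.
        h_adv = h_prompt + alpha @ directions
        loss = hallucination_loss(h_adv)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return h_prompt + alpha @ directions

# Toy usage with random tensors; a real attack would use model latents
# and a learned hallucination score in place of this cosine objective.
d, k = 4096, 16
h = torch.randn(d)
D = torch.randn(k, d)
D = D / D.norm(dim=1, keepdim=True)  # unit-norm editing directions
target = torch.randn(d)
h_adv = attack_latent(h, D, lambda z: -torch.cosine_similarity(z, target, dim=0))

Constraining the perturbation to a learned span of editing directions, rather than allowing an arbitrary latent offset, is what distinguishes this setup from unconstrained latent attacks: the optimization remains continuous, but every reachable point corresponds to a plausible edit of the benign prompt.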