Security--Fidelity Tradeoffs: No Universal Defense Against Prompt Injection
Abstract
We identify a fundamental tension in securing LLMs: the \textbf{security--fidelity tradeoff}. While defenses against indirect prompt injection are becoming more robust, we show that they inevitably impair the model's ability to process benign, instruction-like text. Current evaluations miss this cost because they conflate general utility with fidelity. We address this gap with \textsc{SecFid}, a benchmark that uses behaviorally separable probes to distinguish unambiguously between resisting an attack, succumbing to it, and faithfully processing benign instruction-like text as data. Our evaluation reveals this tradeoff across a diverse set of models and shows that the strongest defenses often achieve security by aggressively suppressing valid content, producing fidelity failure rates of up to 50\% on translation tasks. We ground these results in a decision-theoretic framework, proving that when the distributions of benign and adversarial inputs overlap, no universal defense exists. Optimal robustness is therefore strictly task-dependent, determined by an application's tolerance for fidelity errors relative to security failures.
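As a minimal sketch of the overlap argument (the notation $P$, $Q$, $D$, and the use of total variation distance here are illustrative shorthand, not the paper's formal development), the impossibility claim follows the standard hypothesis-testing lower bound: if benign inputs are drawn from a distribution $P$ and injected inputs from a distribution $Q$, then any defense $D$ that decides whether to suppress or process an input satisfies
\[
\underbrace{\Pr_{x \sim P}\bigl[D(x)=\textsf{suppress}\bigr]}_{\text{fidelity failure}}
\;+\;
\underbrace{\Pr_{x \sim Q}\bigl[D(x)=\textsf{process}\bigr]}_{\text{security failure}}
\;\ge\; 1 - \mathrm{TV}(P, Q).
\]
Whenever $\mathrm{TV}(P, Q) < 1$, i.e., the two distributions overlap, the sum of the failure rates is bounded away from zero, so no choice of $D$ eliminates both; the best achievable operating point depends on how a given application weighs the two errors.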