Causal Detection of Multi-Step LLM Agent Attacks
Abstract
Multi-step prompt injection attacks on LLM agents present a fundamental detection challenge: malicious intent emerges only after a workflow completes, while each individual action remains legitimate in isolation. Existing defenses, including input sanitization, output validation, and instruction hierarchies, operate on individual actions or content patterns and cannot capture this sequential structure. We present \texttt{CausalTrace}, a detection system that reframes prompt injection defense as causal inference. It constructs Structural Causal Models from agent trajectories, with typed edges capturing data dependency, trust transfer, and state enablement, and then applies Pearl's do-calculus to answer a counterfactual question: would the harmful outcome have occurred if the injection had been blocked? This formalization enables a principled distinction between attacks that depend on injections and benign workflows that share surface-level features. Evaluation on a dataset spanning crowdsourced traces, LLM agent benchmarks, and semi-real and real-world scenarios demonstrates strong detection performance, outperforming content-based baselines while incurring minimal LLM inference cost. Bidirectional slicing recovers complete attack chains, yielding interpretable explanations that trace exploitation back to its causal origins.
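The counterfactual test at the heart of this approach can be illustrated with a toy sketch. This is not the paper's implementation: the trajectory, node names, and boolean structural equations below are hypothetical stand-ins for the typed-edge SCM described in the abstract, and the intervention is simulated by re-evaluating the equations with the injection node forced off, a simplification of do-calculus.

```python
def evaluate(injection_present: bool) -> bool:
    """Toy boolean SCM over a hypothetical agent trajectory.

    Returns whether the harmful outcome (exfiltration) occurs.
    """
    # Structural equations for a made-up workflow:
    read_email = True                                 # agent reads an email (exogenous)
    injected_instruction = read_email and injection_present
    fetch_url = injected_instruction                  # data dependency: injection triggers fetch
    exfiltrate = fetch_url and injected_instruction   # trust transfer enables exfiltration
    return exfiltrate


factual = evaluate(injection_present=True)           # observed trajectory
counterfactual = evaluate(injection_present=False)   # intervention: do(injection := blocked)

# Flag an attack iff the harmful outcome depends causally on the injection:
# it occurs in the factual trajectory but vanishes under the intervention.
is_attack = factual and not counterfactual
print(is_attack)  # True
```

A benign workflow that merely shares surface features with the attack (e.g., it fetches a URL for a legitimate reason) would yield the same outcome under both evaluations, so `is_attack` would be `False`.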