Detecting Fluent Optimization Based Adversarial Prompts via Sequential Entropy Changes
Mohammed Alshaalan ⋅ Miguel Rodrigues
Abstract
Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening detectors based on static global or windowed perplexity statistics. We cast adversarial suffix detection as an \emph{online change-point detection} problem over the token-level next-token entropy stream. Using the fixed system prompt to estimate a robust baseline via the median and median absolute deviation, we standardize user-token entropies and monitor them with a one-sided CUSUM statistic. The resulting detector is model-agnostic, training-free, operates online, and localizes the onset of adversarial suffixes. On a benchmark of $724$ optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter) and $765$ benign prompts from a TyDiQA+OpenOrca mixture with controlled post-prefix perplexity, CPD consistently outperforms perplexity baselines; on LLaMA-2-7B it reaches AUROC $0.90$ and F1 $0.82$. At an operating point with $\approx 10\%$ benign false-positive rate, CPD detects $74\%$ of suffix attacks and concentrates $87\%$ of its triggers inside the adversarial suffix. By comparison, windowed perplexity detects $35$--$43\%$ and frequently fires on boundary-straddling windows. Finally, we show CPD Online can act as a lightweight gate for LLaMA Guard, reducing guard invocations by $17$--$22\%$ on a high-volume stream dominated by benign prompts while preserving guard-level detection quality.
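The detector described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: the drift allowance `k`, the alarm threshold `h`, and the assumption that an adversarial suffix shifts entropy upward (so a one-sided CUSUM on positive deviations suffices) are choices made here for illustration.

```python
import statistics

def cusum_detect(system_entropies, user_entropies, k=0.5, h=5.0):
    """One-sided CUSUM over standardized next-token entropies.

    The baseline location and scale come from the fixed system prompt
    via the median and median absolute deviation (robust to outliers).
    k (drift allowance) and h (alarm threshold) are illustrative values,
    not taken from the paper. Returns the 0-based index of the first
    user token that raises the alarm, or None if none does.
    """
    med = statistics.median(system_entropies)
    mad = statistics.median(abs(e - med) for e in system_entropies)
    scale = 1.4826 * mad or 1.0  # Gaussian consistency factor; guard zero MAD
    s = 0.0
    for i, e in enumerate(user_entropies):
        z = (e - med) / scale          # standardized entropy
        s = max(0.0, s + z - k)        # one-sided: accumulate upward shifts only
        if s > h:
            return i                   # localizes the suffix onset
    return None
```

Because the statistic resets toward zero on benign tokens, the first alarm index localizes the onset of the anomalous region rather than merely flagging the whole prompt.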