Adversarial Reasoning Manipulation in Small Language Models: Attack Taxonomy, Empirical Evaluation, and Detection Framework
Abstract
Agentic AI systems built on Small Language Models (SLMs) are increasingly deployed in sensitive enterprise settings, yet their susceptibility to adversarial manipulation of internal reasoning remains poorly understood. We introduce Adversarial Reasoning Manipulation (ARM), a threat class encompassing direct prompt injection, memory poisoning, and multi-turn stealth attacks that covertly redirect an agent's chain of thought without exploiting any software vulnerability. Using a 300-sample benchmark evaluated across four representative open-source SLMs in a standardised ReAct framework, we find a mean attack success rate of 52.1%, with a 23.5 percentage-point spread across models, which implicates instruction-tuning quality as the primary susceptibility factor. We further compare four detection mechanisms, finding that a LoRA fine-tuned judge achieves the highest F1 score (0.872) at only 7.2% latency overhead, while classical ML classifiers provide competitive, fully interpretable detection at near-zero cost.