When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Abstract
Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface through which malicious agents can propagate misinformation and steer group decisions, undermining MAS safety. Existing embedding-based defenses aim to detect and prune suspicious agents, but their effectiveness depends on a clear separation between the text embeddings of malicious and benign messages. Stronger attackers can break such defenses by crafting messages whose embeddings lie close to benign ones. We analyze this failure mode theoretically and validate it empirically with three stronger attacks: Slow Drift, Benign Wrapper, and Chaos Seeding. Our findings also expose a key limitation of embedding-based defenses: they operate only on text embeddings and ignore token-level confidence signals such as logits, which can remain informative even when embeddings are indistinguishable under attack. In this paper, we use confidence scores to prune or down-weight messages during MAS communication. Experiments show improved robustness across models, datasets, and communication topologies. Moreover, we find that the effectiveness of confidence signals decays over communication rounds, highlighting the importance of early intervention. These insights can inform and inspire future work on MAS attacks and defenses.
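To make the confidence-based defense concrete, the following is a minimal sketch of how token-level confidence scoring and message pruning might be implemented, assuming the defender can query a language model's logits. The checkpoint ("gpt2"), the threshold value, and the helper names score_message and filter_messages are illustrative assumptions, not the paper's exact implementation.

    # Minimal sketch: score each message by the mean probability a language
    # model assigns to its tokens, then prune low-confidence messages.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # assumed scoring model
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def score_message(text: str) -> float:
        """Mean token-level confidence: average probability the model
        assigns to each token of the message under teacher forcing."""
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits  # (1, seq_len, vocab)
        # Probability of each actual next token given its prefix.
        probs = torch.softmax(logits[0, :-1], dim=-1)
        token_probs = probs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
        return token_probs.mean().item()

    def filter_messages(messages: list[str], threshold: float = 0.5):
        """Keep messages whose confidence meets the threshold; returned
        scores could also serve as weights for down-weighted aggregation."""
        scored = [(m, score_message(m)) for m in messages]
        return [(m, s) for m, s in scored if s >= threshold]

In a MAS communication round, each agent would apply filter_messages to its incoming messages before aggregation; the retained scores can additionally serve as soft weights rather than a hard prune.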