Speech-Audio Compositional Attacks on Multimodal LLMs and Their Defense with SALMONN-Guard
Abstract
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but has also exposed new safety risks arising from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech–Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech–audio composition to enable effective black-box attacks. SACRED-Bench adopts three composition mechanisms: (a) speech overlap, (b) multi-speaker dialogue, and (c) mixtures of speech and non-speech audio. These mechanisms focus on evaluating safety in settings where benign and harmful intents co-occur within a single auditory scene. Moreover, questions in SACRED-Bench are designed to implicitly refer to content in the audio, such that no explicit harmful information appears in the text prompt alone. Experiments demonstrate that even Gemini 2.5 Pro, a state-of-the-art proprietary LLM with safety guardrails fully enabled, still exhibits a 66% attack success rate. To bridge this gap, we propose SALMONN-Guard, the first guard model that jointly inspects speech, audio, and text for safety judgments, reducing the attack success rate to 20%. Our results highlight the need for audio-aware defenses to ensure the safety of multimodal LLMs.