MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities

Sahil Verma ⋅ Keegan Hines ⋅ Jeff Bilmes ⋅ Charlotte Siska ⋅ Luke Zettlemoyer ⋅ Hila Gonen ⋅ Chandan Singh

2025 Poster in Workshop: Actionable Interpretability

Abstract

The emerging capabilities of large language models (LLMs) have sparked concerns about their potential for harmful use. The core approach to mitigating these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts in other modalities such as image and audio), which can often bypass standard safety mechanisms. To tackle this challenge, we propose MultiGuard, an effective and efficient approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to train a language-agnostic or modality-agnostic classifier for detecting harmful prompts. In a multilingual setting, MultiGuard detects harmful prompts with an accuracy of 85.54%, an improvement of 11.57% over the best-performing baseline. For image-based prompts, MultiGuard detects harmful prompts with an accuracy of 88.31%, an improvement of 20.44% over the best-performing baseline. Finally, MultiGuard is the first approach to detect harmful audio prompts, with an accuracy of 93.09%. MultiGuard is also about 96× faster than the next fastest baseline, as it repurposes the embeddings computed during generation for safety moderation.
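
The two steps in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how alignment is measured or which classifier is used, so the sketch assumes mean cosine similarity between paired prompts as the alignment score and a small logistic-regression probe as the classifier. The "hidden states" are synthetic vectors produced by a hypothetical helper, `make_synthetic`, and every function name here is illustrative.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_aligned_layer(reps_a, reps_b):
    """Step (i), assumed form: pick the layer whose paired vectors agree most.

    reps_a[l][i] and reps_b[l][i] are layer-l vectors for the same prompt i
    expressed in two languages (or modalities).
    """
    scores = [
        sum(cosine(u, v) for u, v in zip(layer_a, layer_b)) / len(layer_a)
        for layer_a, layer_b in zip(reps_a, reps_b)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.1, epochs=200):
    """Step (ii), assumed form: logistic-regression probe via batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Return 1 (harmful) or 0 (benign) for one representation."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def make_synthetic(n=40, dim=4, seed=0):
    """Hypothetical stand-in for hidden states: only layer 1 is shared across 'languages'."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    reps_a, reps_b = [[], [], []], [[], [], []]
    for label in labels:
        # Shared "meaning" vector: harmfulness lives in the first coordinate.
        meaning = [2.0 if label else -2.0] + [rng.gauss(0, 0.1) for _ in range(dim - 1)]
        for layer in range(3):
            # Layers 0 and 2 add a large language-specific offset; layer 1 does not.
            offset = 0.0 if layer == 1 else 5.0
            reps_a[layer].append(
                [m + rng.gauss(0, 0.05) + (offset if j == 1 else 0.0) for j, m in enumerate(meaning)]
            )
            reps_b[layer].append(
                [m + rng.gauss(0, 0.05) - (offset if j == 1 else 0.0) for j, m in enumerate(meaning)]
            )
    return reps_a, reps_b, labels


reps_a, reps_b, labels = make_synthetic()
layer = select_aligned_layer(reps_a, reps_b)
w, b = train_logistic(reps_a[layer], labels)        # train on "language A" only
preds = [predict(w, b, x) for x in reps_b[layer]]   # evaluate on "language B"
accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
print(f"selected layer: {layer}, cross-language accuracy: {accuracy:.2f}")
```

In this toy setup the probe is trained on one "language" and evaluated on the other, which is the sense in which a classifier built on aligned representations can be language-agnostic. With a real model, the vectors would come from hidden states already computed while processing the prompt, which is the source of the speed advantage described in the abstract.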
