Meerkat-VL: Implicit Risk Safety Alignment in Multimodal LLMs via Perceptual Reasoning and Self-Verification
Abstract
Multimodal LLMs (MLLMs) are increasingly deployed across diverse applications, yet cross-modal interactions introduce significant safety risks. Existing methods improve model safety awareness through explicit-risk preference datasets and reinforcement learning guided by safety rewards. While effective, these methods still suffer from data scarcity and reward hacking in implicit-risk scenarios, leading to insufficient risk perception and harmful responses. To address these challenges, we propose Meerkat-VL, a framework that enables models to perceive and verify implicit risks while generating safe responses. First, we introduce Meerkat-Safe, the first training dataset with detailed labels for implicit risks. Second, we develop Normative Perceptual Self-Verification, which enables models to verify both their perceptual reasoning and their responses, providing denser and more reliable rewards for perception accuracy and answer safety and thereby mitigating reward hacking. Finally, we propose Dual-Objective Perceptual Consistency Alignment, which encourages safe responses by penalizing answers that follow safe templates without accurate risk perception. Extensive experiments show that Meerkat-VL consistently outperforms baselines on multimodal safety benchmarks, improving safety and helpfulness by 16\% and 13\%, respectively, and achieving a 32\% safety gain on implicit-risk tasks. Our code is available here.