ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models
Abstract
Multimodal large language models (MLLMs) inevitably memorize sensitive cross-modal information during pretraining, making post-deployment unlearning crucial for safety. Existing methods often evaluate unlearning by output deviation alone, neglecting generation quality; this can lead to hallucinations or rigid refusals that compromise both usability and safety. We propose Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models (ASRU), a controllable multimodal unlearning framework that treats generation quality as a first-class evaluation criterion. ASRU first constructs an initial model that produces refusal-style responses via activation redirection, then reinforces the refusal boundary with a customized reward function, unlearning target knowledge while retaining non-target knowledge. Experimental results show that ASRU preserves model utility while markedly improving unlearning (+24.61%) and generation quality (5.8×), achieving strong generalization and an efficient forget–retain trade-off with minimal supervision from retained data.
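As a rough, hypothetical sketch of the activation-redirection stage (the names, scales, and toy block below are our illustrative assumptions, not the paper's implementation), a refusal-direction vector can be added to a layer's hidden states via a forward hook:

```python
# Minimal sketch of activation redirection, assuming a steering vector
# (e.g., the mean difference between refusal and compliant activations)
# is available. All identifiers here are hypothetical.
import torch
import torch.nn as nn

hidden_dim = 64
block = nn.Linear(hidden_dim, hidden_dim)  # stand-in for one transformer block

# Assumed unit vector pointing toward refusal-style activations.
refusal_direction = torch.randn(hidden_dim)
refusal_direction = refusal_direction / refusal_direction.norm()
steer_strength = 4.0  # hypothetical scale; would be tuned per layer in practice

def redirect_hook(module, inputs, output):
    # Shift the block's output along the refusal direction so downstream
    # layers decode a refusal instead of the memorized target content.
    return output + steer_strength * refusal_direction

handle = block.register_forward_hook(redirect_hook)

x = torch.randn(2, hidden_dim)            # toy "forget-set" activations
steered = block(x)                        # redirected activations
handle.remove()
plain = block(x)                          # original activations for comparison
print((steered - plain).norm(dim=-1))     # ≈ steer_strength per example
```

The steered model of Stage 1 would then serve as the starting point for the reward-driven reinforcement stage described in the abstract.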