Adversarial Reinforcement Learning for Robust Diffusion Large Language Model Unlearning
Abstract
Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, enabling parallel sequence generation and flexible token-generation orders. Machine unlearning plays a critical role in mitigating legal and ethical risks by removing the influence of specific training examples from trained models. While unlearning has been extensively studied for autoregressive language models, its applicability to DLMs remains unexplored. The architectural differences between DLMs and autoregressive models raise new challenges for effective and robust unlearning that existing methods do not address. In this paper, we present the first comprehensive study of unlearning for DLMs. Through systematic empirical analysis, we show that unlearning performance in DLMs is highly sensitive to generation hyperparameters, highlighting the need for evaluation across diverse generation settings. We further observe that, because DLMs can condition on both prefix and suffix context, they tend to reproduce unlearned information when target inputs are embedded within informative contexts; this increases their vulnerability to elicitation attacks and weakens the robustness of existing unlearning methods. To achieve robust unlearning, we propose an adversarial reinforcement learning framework in which a context generator adversarially produces informative contexts to elicit unlearned knowledge, while the DLM is optimized to suppress the undesired recall. We further introduce novel components that address the credit-assignment and stability issues arising in this adversarial setup. Extensive experiments demonstrate that our method significantly improves unlearning effectiveness while preserving model utility. Our code is available at: https://anonymous.4open.science/r/dllm_unlearning-771D/
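For readers who want the adversarial loop in concrete form, the minimal Python/PyTorch sketch below illustrates the alternating optimization the abstract describes. It is an illustration under strong simplifying assumptions, not the paper's implementation: ToyContextGenerator, ToyDLM, the scalar recall score, and the REINFORCE-style generator update are all hypothetical stand-ins, and the retain-set utility term and the paper's credit-assignment and stability components are omitted.

    import torch
    import torch.nn as nn

    class ToyContextGenerator(nn.Module):
        # Hypothetical adversary: keeps a categorical distribution over a
        # small pool of candidate contexts and samples one to wrap around
        # the forget-set target.
        def __init__(self, num_contexts=8):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(num_contexts))

        def sample(self):
            dist = torch.distributions.Categorical(logits=self.logits)
            idx = dist.sample()
            return idx, dist.log_prob(idx)

    class ToyDLM(nn.Module):
        # Stand-in for the diffusion LM: one scalar per context encoding how
        # strongly the model recalls the forgotten target in that context.
        def __init__(self, num_contexts=8):
            super().__init__()
            self.recall_table = nn.Parameter(torch.randn(num_contexts))

        def recall(self, ctx_idx):
            return self.recall_table[ctx_idx]

    gen, dlm = ToyContextGenerator(), ToyDLM()
    gen_opt = torch.optim.Adam(gen.parameters(), lr=1e-2)
    dlm_opt = torch.optim.Adam(dlm.parameters(), lr=1e-2)

    for step in range(200):
        ctx_idx, log_prob = gen.sample()

        # Adversary (REINFORCE-style): the reward is the recall it elicits,
        # so the generator learns to find contexts where the target leaks.
        reward = dlm.recall(ctx_idx).detach()
        gen_loss = -(reward * log_prob)
        gen_opt.zero_grad(); gen_loss.backward(); gen_opt.step()

        # Unlearner: gradient step that suppresses recall under the
        # adversarially chosen context (a retain-set utility term that
        # preserves general ability is omitted in this toy).
        dlm_loss = dlm.recall(ctx_idx)
        dlm_opt.zero_grad(); dlm_loss.backward(); dlm_opt.step()

The essential structure is the minimax alternation: the generator is rewarded for exactly the recall the unlearner is simultaneously trained to suppress, so unlearning is stress-tested against increasingly informative contexts rather than against fixed prompts.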