Divergence Decoding: Targeted Unlearning via Auxiliary Models
Abstract
Large Language Models (LLMs) frequently memorize sensitive training data, creating significant privacy and copyright risks. We present a novel unlearning framework rooted in the principle that learning is easier than forgetting. We first introduce \textbf{Divergence Decoding (DD)}, a mechanism that uses small, efficiently trained auxiliary models to steer the logits of an LLM away from targeted data at inference time. We then show that this steered distribution can be distilled back into the base model at low cost. Our method outperforms \textbf{state-of-the-art (SOTA)} baselines on the TOFU and MUSE benchmarks, and we find evidence that the approach generalizes to the image domain. \href{https://anonymous.4open.science/r/targetedunlearningicml2026/}{Code is available at this anonymous link.}