Poster in Workshop: Next Generation of AI Safety
Robust Knowledge Unlearning via Mechanistic Localizations
Phillip Guo · Aaquib Syed · Abhay Sheshadri · Aidan Ewart · Gintare Karolina Dziugaite
Keywords: [ factual recall ] [ mechanistic interpretability ] [ machine unlearning ]
Methods for machine unlearning in large language models seek to remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates the use of mechanistic interpretability to improve the precision and effectiveness of unlearning. We demonstrate that localizing unlearning to components implicated in particular mechanisms of factual recall leads to unlearning that is more robust across different input/output formats, under relearning, and with respect to latent knowledge, and that it reduces unintended side effects compared to non-localized unlearning. Additionally, we analyze the strengths and weaknesses of different automated (rather than manual) interpretability methods for guiding unlearning, finding that the resulting unlearned models require smaller edit sizes to achieve unlearning but are much less robust.
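As a rough illustration of the localized-unlearning idea described in the abstract, the sketch below restricts a gradient-ascent-style unlearning update to a single hand-picked component while also training on a retain set to limit side effects. Everything here is a placeholder, not the authors' method: the TinyLM module, the choice of mlp_early as the "localized" component, and the specific forget/retain loss are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of unlearning restricted to a
# localized subset of model components; all names below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in for a transformer LM: two MLP 'components' plus an unembedding."""
    def __init__(self, vocab=100, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.mlp_early = nn.Linear(d, d)  # hypothetical component implicated in factual recall
        self.mlp_late = nn.Linear(d, d)
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens):
        h = self.emb(tokens)
        h = h + F.relu(self.mlp_early(h))
        h = h + F.relu(self.mlp_late(h))
        return self.unembed(h)

model = TinyLM()

# Localization step (assumed to come from an interpretability analysis):
# only parameters of the selected component receive unlearning updates.
localized = {"mlp_early.weight", "mlp_early.bias"}
for name, p in model.named_parameters():
    p.requires_grad_(name in localized)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)

# Toy batches: a "forget" set whose completions should be unlearned and a
# "retain" set to preserve general behavior (random placeholders here).
forget_tokens = torch.randint(0, 100, (4, 8))
forget_targets = torch.randint(0, 100, (4, 8))
retain_tokens = torch.randint(0, 100, (4, 8))
retain_targets = torch.randint(0, 100, (4, 8))

for step in range(10):
    optimizer.zero_grad()
    # Gradient ascent on the forget set (negated cross-entropy) ...
    forget_loss = -F.cross_entropy(
        model(forget_tokens).flatten(0, 1), forget_targets.flatten()
    )
    # ... plus a standard language-modeling loss on the retain set.
    retain_loss = F.cross_entropy(
        model(retain_tokens).flatten(0, 1), retain_targets.flatten()
    )
    (forget_loss + retain_loss).backward()
    optimizer.step()  # only the localized component is updated
```

In this toy setup, "localization" simply amounts to freezing all parameters outside the chosen component; the paper's contribution concerns how such components are identified (manually versus with automated interpretability methods) and how robust the resulting unlearning is.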