Poster in Workshop: Next Generation of AI Safety
Robust Knowledge Unlearning via Mechanistic Localizations
Phillip Guo · Aaquib Syed · Abhay Sheshadri · Aidan Ewart · Gintare Karolina Dziugaite
Keywords: [ factual recall ] [ mechanistic interpretability ] [ machine unlearning ]
Methods for machine unlearning in large language models seek to remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates the use of mechanistic interpretability to improve the precision and effectiveness of unlearning. We demonstrate that localizing unlearning to components implicated in particular mechanisms of factual recall leads to unlearning that is more robust across different input/output formats, under relearning, and with respect to latent knowledge, and that it reduces unintended side effects compared to non-localized unlearning. Additionally, we analyze the strengths and weaknesses of different automated (rather than manual) interpretability methods for guiding unlearning, finding that the resulting unlearned models require smaller edit sizes to achieve unlearning but are much less robust.
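As a rough illustration of the localized-unlearning idea described in the abstract, the sketch below restricts a gradient-ascent-style unlearning update to a single hand-picked component while also training on a retain set to limit side effects. Everything here is a placeholder, not the authors' method: the TinyLM module, the choice of mlp_early as the "localized" component, and the specific forget/retain loss are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of unlearning restricted to a
# localized subset of model components; all names below are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyLM(nn.Module):
    """Stand-in for a transformer LM: two MLP 'components' plus an unembedding."""
    def __init__(self, vocab=100, d=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.mlp_early = nn.Linear(d, d)  # hypothetical component implicated in factual recall
        self.mlp_late = nn.Linear(d, d)
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens):
        h = self.emb(tokens)
        h = h + F.relu(self.mlp_early(h))
        h = h + F.relu(self.mlp_late(h))
        return self.unembed(h)

model = TinyLM()

# Localization step (assumed to come from an interpretability analysis):
# only parameters of the selected component receive unlearning updates.
localized = {"mlp_early.weight", "mlp_early.bias"}
for name, p in model.named_parameters():
    p.requires_grad_(name in localized)

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2
)

# Toy batches: a "forget" set whose completions should be unlearned and a
# "retain" set to preserve general behavior (random placeholders here).
forget_tokens = torch.randint(0, 100, (4, 8))
forget_targets = torch.randint(0, 100, (4, 8))
retain_tokens = torch.randint(0, 100, (4, 8))
retain_targets = torch.randint(0, 100, (4, 8))

for step in range(10):
    optimizer.zero_grad()
    # Gradient ascent on the forget set (negated cross-entropy) ...
    forget_loss = -F.cross_entropy(
        model(forget_tokens).flatten(0, 1), forget_targets.flatten()
    )
    # ... plus a standard language-modeling loss on the retain set.
    retain_loss = F.cross_entropy(
        model(retain_tokens).flatten(0, 1), retain_targets.flatten()
    )
    (forget_loss + retain_loss).backward()
    optimizer.step()  # only the localized component is updated
```

In this toy setup, "localization" simply amounts to freezing all parameters outside the chosen component; the paper's contribution concerns how such components are identified (manually versus with automated interpretability methods) and how robust the resulting unlearning is.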