Tutorial on Mechanistic Interpretability for Language Models
Ziyu Yao · Daking Rai
Mechanistic interpretability (MI) is an emerging subfield of interpretability that seeks to understand a neural network by reverse-engineering its internal computations. MI has recently garnered significant attention for interpreting transformer-based language models (LMs), yielding many novel insights while also introducing new challenges. Given how quickly this topic is attracting the ML/AI community's attention, the goal of this tutorial is to provide a comprehensive overview of MI for LMs, including its historical context, the techniques for implementing and evaluating MI, the findings and applications built on MI, and open challenges for future work. The tutorial is organized around a Beginner's Roadmap carefully curated by the presenters, designed to help researchers new to MI quickly get up to speed in this field and apply MI techniques to their own LM applications.