Tutorial on Mechanistic Interpretability for Language Models
Ziyu Yao · Daking Rai
Mechanistic interpretability (MI) is an emerging subfield of interpretability that seeks to understand a neural network by reverse-engineering its internal computations. MI has recently garnered significant attention for interpreting transformer-based language models (LMs), yielding many novel insights while also introducing new challenges. Given how quickly this topic is attracting the ML/AI community's attention, the goal of this tutorial is to provide a comprehensive overview of MI for LMs, including its historical context, the techniques for implementing and evaluating MI, the findings and applications built on MI, and open challenges for future work. The tutorial is organized around a Beginner's Roadmap carefully curated by the presenters, designed to help researchers new to MI quickly get up to speed in this field and apply MI techniques to their own LM applications.