Workshop on Mechanistic Interpretability
Andrew Lee ⋅ Ivan Arcuschin Moreno
Abstract
We propose a third Workshop on Mechanistic Interpretability, the study of how neural networks function, following highly successful workshops at ICML 2024 and NeurIPS 2025, the latter of which attracted over 600 attendees. Mechanistic Interpretability is a cross-cutting area with relevance to multiple topics at ICML: anyone who has trained or interacted with neural networks has likely wondered how they work, and our current lack of understanding causes significant issues for safety and scientific progress. We have designed our program to foster discussion of the emerging debate between pragmatic and ambitious approaches in the field, in addition to showcasing and sharing knowledge on emerging methodologies.