Course description
What is the internal structure of modern neural networks and how can we study it? This course provides a broad and deep introduction to interpretability, the subfield of machine learning concerned with understanding precisely how models process information and why they produce the outputs they do. We will cover topics such as probing, steering, causal abstraction, and sparse autoencoders, with a particular emphasis on causal methods and large language models. The course will include guest lectures from leading interpretability labs across academia and industry.
Staff
Please contact the course staff at cs221m-spr2526-staff@lists.stanford.edu.
Logistics
- Time: Mondays and Wednesdays, 2:30pm–3:50pm
- Location: McMurtry Art Building, Oshman
- Office hours: By appointment; email the course staff to schedule (cs221m-spr2526-staff@lists.stanford.edu)
Coursework
The course will consist of five weeks of notebook-guided lectures, four weeks of guest lectures, and one week of final presentations. Students will be graded on lecture participation and a final project.
- Participation: Students are expected to attend all lectures and engage with the course materials; please notify the staff in advance if you are unable to attend a lecture.
- Final project: Students will reproduce and present the key results of a recent paper in mechanistic interpretability.