Course description
What is the internal structure of modern neural networks and how can we study it? This course provides a broad and deep introduction to interpretability, the subfield of machine learning concerned with understanding precisely how models process information and why they produce the outputs they do. We will cover topics such as probing, steering, causal abstraction, and sparse autoencoders, with a particular emphasis on causal methods and large language models. The course will include guest lectures from leading interpretability labs across academia and industry.
Staff
Please contact the course staff at cs221m-spr2526-staff@lists.stanford.edu.
Logistics
- Time: Mondays and Wednesdays, 2:30pm–3:50pm
- Location: McMurtry Art Building, Oshman
- Office hours: By appointment; email the course staff to schedule (cs221m-spr2526-staff@lists.stanford.edu)
Coursework
The course will consist of five weeks of notebook-guided lectures, four weeks of guest lectures, and one week of final presentations. Students will be graded on lecture participation and a final project.
- Participation: Students are expected to attend all lectures and engage with the course materials; please notify the staff in advance if you are unable to attend a lecture.
- Final project: Students will reproduce and present the key results of a recent paper in mechanistic interpretability.