Course description

What is the internal structure of modern neural networks and how can we study it? This course provides a broad and deep introduction to interpretability, the subfield of machine learning concerned with understanding precisely how models process information and why they produce the outputs they do. We will cover topics such as probing, steering, causal abstraction, and sparse autoencoders, with a particular emphasis on causal methods and large language models. The course will include guest lectures from leading interpretability labs across academia and industry.

Staff

Thomas Icard (Stanford), Instructor
Atticus Geiger (Goodfire), Instructor
Amir Zur (Stanford), Instructor
Jing Huang (Stanford), Instructor
Junyi Tao (Stanford), Teaching Assistant
Siri Vatsavaya (Goodfire), Course Manager

To contact the staff, email cs221m-spr2526-staff@lists.stanford.edu.

Logistics

Coursework

The course consists of five weeks of notebook-guided lectures, four weeks of guest lectures, and one week of final presentations. Students will be graded on participation in lectures and on their final project.


Schedule

Note: schedule is subject to change.

Date | Lesson | Readings | Materials
Week 1, Mon. March 30 | Introduction | |
Week 1, Wed. April 1 | Review of language models | |
Week 2, Mon. April 6 | Behavioral analysis and input attribution | |
Week 2, Wed. April 8 | Probes for decoding activations | |
Week 3, Mon. April 13 | Interventions for steering activations | |
Week 3, Wed. April 15 | Causal mediation analysis | |
Week 4, Mon. April 20 | Theory of causal abstraction I | |
Week 4, Wed. April 22 | Designing counterfactuals | |
Week 5, Mon. April 27 | Automated causal interpretability | Davies et al. 2023; Cao et al. 2020, 2022; Geiger et al. 2023 (DAS); Wu et al. 2023 (boundless DAS) |
Week 5, Wed. April 29 | Theory of causal abstraction II | |
Week 6, Mon. May 4 | | |
Week 6, Wed. May 6 | | |
Week 7, Mon. May 11 | | |
Week 7, Wed. May 13 | | |
Week 8, Mon. May 18 | | |
Week 8, Wed. May 20 | | |
Week 9, Mon. May 25 | | |
Week 9, Wed. May 27 | | |
Week 10, Mon. June 1 | Project presentations | |
Week 10, Wed. June 3 | Project presentations | |