
# Mech Interp reading list

> Progress in AI is birthing a new kind of intelligence, reminiscent of our own in some ways but entirely alien in others. Understanding the nature of this intelligence is a profound scientific challenge, which has the potential to reshape our conception of what it means to think.
>
> On the Biology of a Large Language Model, Lindsey et al., 2025

### Introduction to LLMs

- [Karpathy's Neural Networks: Zero to Hero YouTube series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
- [3Blue1Brown's Neural Networks series](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
- [Essence of linear algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab)

### Tooling

- [Neuronpedia](https://www.neuronpedia.org/)
- [EleutherAI](https://github.com/EleutherAI) for feature autointerp
- [EasySteer](https://github.com/ZJU-REAL/EasySteer)

### Introduction to mech interp

- [x] [The Urgency of Interpretability, Dario Amodei, 2025](https://www.darioamodei.com/post/the-urgency-of-interpretability#the-dangers-of-ignorance)
- [x] [How To Become A Mechanistic Interpretability Researcher, Nanda, 2025](https://www.alignmentforum.org/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher)

### Meta reads

- [ ] [Emergent Introspective Awareness in Large Language Models, Lindsey, 2025](https://transformer-circuits.pub/2025/introspection/index.html)

### Sparse Autoencoders (SAEs)

- [x] [An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability, Karvonen, 2024](https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html)
- [x] [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features)
- [x] [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton et al., 2024](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
  - [Comment](https://www.lesswrong.com/posts/zzmhsKx5dBpChKhry/comments-on-anthropic-s-scaling-monosemanticity) on that paper

### Crosscoders

- [ ] [Sparse Crosscoders for Cross-Layer Features and Model Diffing, Lindsey et al., 2024](https://transformer-circuits.pub/2024/crosscoders/index.html)

### Attribution graphs

- [x] [On the Biology of a Large Language Model, Lindsey et al., 2025](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
- [x] [How a failed experiment broke (and fixed) my view on feature labels, Bottazzi, 2026](https://www.lesswrong.com/posts/zDcrmqdqvh3KmsBuF/how-a-failed-experiment-broke-and-fixed-my-view-on-feature)
- [ ] [Circuit Tracing: Revealing Computational Graphs in Language Models, Ameisen et al., 2025](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)

### Natural language autoencoders

- [x] [Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations, Fraser-Taliente et al., 2026](https://transformer-circuits.pub/2026/nla/index.html)

### Discovering thinking patterns

- [ ] [Language Models Use Trigonometry to Do Addition, Kantamneni and Tegmark, 2025](https://arxiv.org/abs/2502.00873)