# Mech Interp reading list
> Progress in AI is birthing a new kind of intelligence, reminiscent of our own in some ways but entirely alien in others. Understanding the nature of this intelligence is a profound scientific challenge, which has the potential to reshape our conception of what it means to think.
>
> On the Biology of a Large Language Model, Lindsey et al., 2025
### Introduction to LLMs
- [Karpathy's Neural Networks: Zero to Hero YouTube series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
- [3Blue1Brown's Neural Networks series](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
- [Essence of linear algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab)
### Tooling
- [Neuronpedia](https://www.neuronpedia.org/)
- [EleutherAI](https://github.com/EleutherAI) for feature autointerp
- [EasySteer](https://github.com/ZJU-REAL/EasySteer)
### Introduction to mech interp
- [x] [The Urgency of Interpretability, Amodei, 2025](https://www.darioamodei.com/post/the-urgency-of-interpretability#the-dangers-of-ignorance)
- [x] [How To Become A Mechanistic Interpretability Researcher, Nanda, 2025](https://www.alignmentforum.org/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher)
### Meta reads
- [ ] [Emergent Introspective Awareness in Large Language Models, Lindsey, 2025](https://transformer-circuits.pub/2025/introspection/index.html)
### Sparse Autoencoders (SAEs)
- [x] [An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability, Karvonen, 2024](https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html)
- [x] [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features)
- [x] [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton et al., 2024](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
- [Comments on Anthropic's Scaling Monosemanticity](https://www.lesswrong.com/posts/zzmhsKx5dBpChKhry/comments-on-anthropic-s-scaling-monosemanticity), a LessWrong commentary on the Templeton et al. paper above
### Crosscoders
- [ ] [Sparse Crosscoders for Cross-Layer Features and Model Diffing, Lindsey et al., 2024](https://transformer-circuits.pub/2024/crosscoders/index.html)
### Attribution graphs
- [x] [On the Biology of a Large Language Model, Lindsey et al., 2025](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
- [x] [How a failed experiment broke (and fixed) my view on feature labels, Bottazzi, 2026](https://www.lesswrong.com/posts/zDcrmqdqvh3KmsBuF/how-a-failed-experiment-broke-and-fixed-my-view-on-feature)
- [ ] [Circuit Tracing: Revealing Computational Graphs in Language Models, Ameisen et al., 2025](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)
### Natural language autoencoders
- [x] [Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations, Fraser-Taliente et al., 2026](https://transformer-circuits.pub/2026/nla/index.html)
### Discovering thinking patterns
- [ ] [Language Models Use Trigonometry to Do Addition, Kantamneni and Tegmark, 2025](https://arxiv.org/abs/2502.00873)