# Mech Interp reading list

> Progress in AI is birthing a new kind of intelligence, reminiscent of our own in some ways but entirely alien in others. Understanding the nature of this intelligence is a profound scientific challenge, which has the potential to reshape our conception of what it means to think. 
> 
> On the Biology of a Large Language Model, Lindsey et al., 2025

### Introduction to LLMs

- [Karpathy's Neural Networks: Zero to Hero YouTube series](https://www.youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ)
- [3Blue1Brown's Neural Networks series](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi)
- [3Blue1Brown's Essence of linear algebra series](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab)

### Tooling

- [Neuronpedia](https://www.neuronpedia.org/)
- [EleutherAI](https://github.com/EleutherAI) for feature autointerp
- [EasySteer](https://github.com/ZJU-REAL/EasySteer)

### Introduction to mech interp

- [x] [The Urgency of Interpretability, Amodei, 2025](https://www.darioamodei.com/post/the-urgency-of-interpretability#the-dangers-of-ignorance)
- [x] [How To Become A Mechanistic Interpretability Researcher, Nanda, 2025](https://www.alignmentforum.org/posts/jP9KDyMkchuv6tHwm/how-to-become-a-mechanistic-interpretability-researcher)

### Meta reads

- [ ] [Emergent Introspective Awareness in Large Language Models, Lindsey, 2025](https://transformer-circuits.pub/2025/introspection/index.html)

### Sparse Autoencoders (SAEs)

- [x] [An Intuitive Explanation of Sparse Autoencoders for LLM Interpretability, Karvonen, 2024](https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html)
- [x] [Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al., 2023](https://transformer-circuits.pub/2023/monosemantic-features) 
- [x] [Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton et al., 2024](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)
    - [Comments](https://www.lesswrong.com/posts/zzmhsKx5dBpChKhry/comments-on-anthropic-s-scaling-monosemanticity) on Scaling Monosemanticity

### Crosscoders

- [ ] [Sparse Crosscoders for Cross-Layer Features and Model Diffing, Lindsey et al., 2024](https://transformer-circuits.pub/2024/crosscoders/index.html)

### Attribution graphs

- [x] [On the Biology of a Large Language Model, Lindsey et al., 2025](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
- [x] [How a failed experiment broke (and fixed) my view on feature labels, Bottazzi, 2026](https://www.lesswrong.com/posts/zDcrmqdqvh3KmsBuF/how-a-failed-experiment-broke-and-fixed-my-view-on-feature)
- [ ] [Circuit Tracing: Revealing Computational Graphs in Language Models, Ameisen et al., 2025](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)


### Natural language autoencoders

- [x] [Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations, Fraser-Taliente et al., 2026](https://transformer-circuits.pub/2026/nla/index.html)


### Discovering thinking patterns

- [ ] [Language Models Use Trigonometry to Do Addition, Kantamneni and Tegmark, 2025](https://arxiv.org/abs/2502.00873)