A collection of simple, reproducible experiments to build intuition for mechanistic interpretability in transformer-based language models. This repo accompanies the blog post "Elements Of Mechanistic Interpretability: From Observation to Causation" on sifal.social, where we explore three key techniques: Logit Lens, Probing Classifiers, and Activation Patching.
These experiments use small, accessible models (like SmolLM2-360M-Instruct and GPT-2 Small) to demonstrate how to peek inside LLMs, locate concepts, and prove causal relationships in their internal computations.
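For instance, "peeking inside" can start with nothing more than the per-layer hidden states that Hugging Face Transformers exposes. The sketch below is illustrative rather than the notebooks' exact code (the prompt is made up; the model id is the public SmolLM2-360M-Instruct checkpoint):

```python
# Minimal sketch: load SmolLM2-360M-Instruct and grab per-layer hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "HuggingFaceTB/SmolLM2-360M-Instruct"  # public Hugging Face checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One tensor per layer (plus the embeddings), each of shape [batch, seq, hidden].
print(len(out.hidden_states), out.hidden_states[-1].shape)
```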
Mechanistic interpretability aims to reverse-engineer the "circuits" inside neural networks to understand how they process information. This repo provides standalone Jupyter notebooks for three foundational experiments:
- Logit Lens: Watch a model's "thought process" evolve layer by layer (first sketch below).
- Probing Classifiers: Locate where specific concepts (e.g., part-of-speech tags) are represented (second sketch below).
- Activation Patching: Perform "causal surgery" to identify which attention heads drive behavior (sketch further below).
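To make the list above concrete, here is a minimal Logit Lens sketch using TransformerLens and GPT-2 Small. The prompt and the layer loop are illustrative, not the notebook's exact code: each layer's residual stream is pushed through the final LayerNorm and the unembedding to read off the model's current top token.

```python
# Logit Lens sketch: decode the residual stream after every layer.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small
tokens = model.to_tokens("The Eiffel Tower is located in the city of")

with torch.no_grad():
    _, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer]             # [batch, pos, d_model]
    logits = model.unembed(model.ln_final(resid))  # [batch, pos, d_vocab]
    top_id = logits[0, -1].argmax().item()         # top token at the last position
    print(f"layer {layer:2d} -> {model.tokenizer.decode([top_id])!r}")
```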
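A probing classifier can be sketched in the same spirit. The notebook probes part-of-speech tags with spaCy-derived labels; this illustrative version instead uses a tiny hand-labeled verb/noun set, an arbitrarily chosen middle layer, and a scikit-learn logistic regression (all assumptions, not the notebook's setup), and it skips the held-out split a real probe would need.

```python
# Probing sketch: fit a linear probe on one layer's activations.
import torch
from sklearn.linear_model import LogisticRegression
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Tiny hand-labeled toy dataset: 1 = verb, 0 = noun.
words = ["run", "jump", "eat", "sleep", "table", "river", "house", "stone"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

layer = 6  # arbitrary middle layer; a real experiment would sweep over layers
features = []
with torch.no_grad():
    for word in words:
        _, cache = model.run_with_cache(model.to_tokens(word))
        # Residual-stream activation at the word's position.
        features.append(cache["resid_post", layer][0, -1].cpu().numpy())

probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))  # trivially high on 8 points
```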
Each notebook is self-contained, generates its own visualizations, and includes explanatory comments. The experiments are built around toy problems such as factual recall and Indirect Object Identification (IOI) to keep things simple and focused.
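For the IOI task in particular, activation patching can be sketched as follows. The prompt pair is a standard IOI-style example, and layer 9, head 9 is one of the "name mover" heads reported in the IOI paper; both choices are illustrative here, not the notebook's exact configuration. The idea: overwrite one head's output in the corrupted run with its activation from the clean run and check whether the answer logit recovers.

```python
# Activation-patching sketch: patch one attention head on an IOI prompt pair.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

clean_prompt = "When John and Mary went to the store, Mary gave a drink to"
corrupt_prompt = "When John and Mary went to the store, John gave a drink to"
answer = model.to_single_token(" John")  # correct completion for the clean prompt

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

with torch.no_grad():
    _, clean_cache = model.run_with_cache(clean_tokens)  # cache the clean run

layer, head = 9, 9  # one "name mover" head from the IOI paper (illustrative choice)

def patch_head(z, hook):
    # z: [batch, pos, head_index, d_head]; replace this head's output
    # in the corrupted run with its clean-run activation.
    z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
    return z

with torch.no_grad():
    corrupt_logits = model(corrupt_tokens)
    patched_logits = model.run_with_hooks(
        corrupt_tokens,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", patch_head)],
    )

# If this head drives the behavior, patching should push the " John" logit back up.
print("corrupted:", corrupt_logits[0, -1, answer].item())
print("patched:  ", patched_logits[0, -1, answer].item())
```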
For more details, read the blog post.
Each notebook can be opened and run locally in Jupyter Notebook or JupyterLab, or uploaded to Google Colab for cloud execution. The experiments work on CPU but run faster on a CUDA-enabled GPU.
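If you want to check which device you are on, something like the following works (an optional, illustrative snippet; the notebooks handle their own setup):

```python
# Optional: pick a CUDA GPU when available, otherwise fall back to CPU.
import torch
from transformer_lens import HookedTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2", device=device)
print(f"Loaded GPT-2 Small on {device}")
```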
This project is licensed under the MIT License. See LICENSE for details.
- Inspired by resources like Callum McDougall's Mechanistic Interpretability Course, Anthropic's Transformer Circuits, and the IOI Paper.
- Thanks to libraries: TransformerLens, Hugging Face Transformers, and spaCy.
- Models from Hugging Face: SmolLM2, GPT-2.