A Rust port of Karpathy's microgpt that trains and runs inference entirely on an ESP32.
The model learns to generate human-like names from scratch — no pre-trained weights, no cloud API, just 4,192 parameters training on a microcontroller.
```
step   0/1000: loss = 3.3071
step 100/1000: loss = 2.4193
step 500/1000: loss = 1.9888
step 999/1000: loss = 2.0980

--- Generated names (temperature=0.5) ---
arona, raeli, cealin, malie, sunaya, arishel, mosile ...
```
A 1-layer GPT transformer matching the original Python implementation:
| Spec | Value |
|---|---|
| Parameters | 4,192 (16.4 KB) |
| Embedding dim | 16 |
| Attention heads | 4 |
| Layers | 1 |
| Block size | 16 |
| Vocab | 27 tokens (a-z + BOS) |
| Normalization | RMSNorm (initial + pre-attention + pre-MLP, no learnable params) |
| Optimizer | Adam (lr=0.01, beta1=0.85, beta2=0.99) |
| Training | 1,000 steps on 32K names |
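The table notes that the RMSNorm layers carry no learnable parameters, so normalization is a pure function of the input. A minimal host-side sketch (the epsilon value is an assumption, not taken from the port):

```rust
/// RMSNorm without a learnable scale: divide each element by the
/// root-mean-square of the whole vector.
fn rmsnorm(x: &[f32]) -> Vec<f32> {
    let eps = 1e-5; // hypothetical epsilon; the port's actual value may differ
    let mean_sq: f32 = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv = 1.0 / (mean_sq + eps).sqrt();
    x.iter().map(|v| v * inv).collect()
}

fn main() {
    // RMS of [3, 4] is sqrt(12.5) ≈ 3.5355
    println!("{:?}", rmsnorm(&[3.0, 4.0]));
}
```

With no gain vector there is nothing for the optimizer to update here, which keeps both the parameter count and the Adam state small.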
The Python microgpt uses a scalar-level autograd (one graph node per multiply/add). For a single forward pass, this creates ~30K-50K nodes consuming 1-2 MB — more than the ESP32's entire 520 KB of SRAM.
Instead, this port uses explicit matrix-level forward and backward passes, storing only the activations needed for backprop (~25 KB). The backward pass is hand-derived and verified against numerical gradients.
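Verifying a hand-derived backward pass against numerical gradients boils down to comparing the analytic gradient with a central-difference estimate. A toy sketch of the idea on a scalar loss (the loss function and names here are illustrative, not the port's):

```rust
// Toy loss f(w) = (w0*w1 - 1)^2 with a hand-derived gradient, checked
// against central finite differences: (f(w+h) - f(w-h)) / 2h.
fn loss(w: &[f32; 2]) -> f32 {
    let d = w[0] * w[1] - 1.0;
    d * d
}

fn analytic_grad(w: &[f32; 2]) -> [f32; 2] {
    let d = w[0] * w[1] - 1.0;
    [2.0 * d * w[1], 2.0 * d * w[0]]
}

fn numerical_grad(w: &[f32; 2]) -> [f32; 2] {
    let h = 1e-3;
    let mut g = [0.0f32; 2];
    for i in 0..2 {
        let mut wp = *w;
        let mut wm = *w;
        wp[i] += h;
        wm[i] -= h;
        g[i] = (loss(&wp) - loss(&wm)) / (2.0 * h);
    }
    g
}

fn main() {
    let w = [0.5, 3.0];
    println!("analytic {:?} vs numerical {:?}", analytic_grad(&w), numerical_grad(&w));
}
```

The same pattern scales to the real model: perturb one parameter at a time, rerun the forward pass, and confirm the hand-written backward pass agrees to within finite-difference error.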
| Memory | Size |
|---|---|
| Model parameters | 17 KB |
| Gradients | 17 KB |
| Adam state (m + v) | 34 KB |
| Activation cache | 25 KB |
| Dataset (in flash, not RAM) | 0 KB |
| Total SRAM | ~100 KB of the ~300 KB available to the application |
```
src/
  main.rs        Training loop + inference entry point
  model.rs       GPT forward pass, parameter layout, KV cache
  backward.rs    Manual backward pass with gradient accumulation
  optimizer.rs   Adam optimizer
  tensor.rs      Vector-matrix math, RMSNorm
  tokenizer.rs   Character-level encode/decode (a-z + BOS)
  rng.rs         Xorshift32 PRNG + Box-Muller for Gaussian init
data/
  names.txt      32K training names (embedded in flash via include_str!)
```
~1,000 lines of Rust. All core logic is platform-independent and testable on the host.
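The weight-initialization scheme named in `rng.rs` combines two classic tricks: xorshift32 for cheap uniform randoms and Box-Muller to turn pairs of uniforms into Gaussians. A sketch using the standard Marsaglia shift triple (the port's seed handling and exact constants may differ):

```rust
// Xorshift32 PRNG (Marsaglia's 13/17/5 triple) plus Box-Muller to
// produce standard-normal samples for Gaussian weight init.
struct Xorshift32 {
    state: u32, // must be nonzero
}

impl Xorshift32 {
    fn next_u32(&mut self) -> u32 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        self.state = x;
        x
    }

    /// Uniform in (0, 1] — strictly positive so ln() below is safe.
    fn next_f32(&mut self) -> f32 {
        self.next_u32() as f32 / u32::MAX as f32
    }

    /// One standard-normal sample via the Box-Muller transform.
    fn next_gauss(&mut self) -> f32 {
        let u1 = self.next_f32();
        let u2 = self.next_f32();
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f32::consts::PI * u2).cos()
    }
}

fn main() {
    let mut rng = Xorshift32 { state: 42 };
    println!("{} {} {}", rng.next_gauss(), rng.next_gauss(), rng.next_gauss());
}
```

Both pieces are a handful of integer/float ops with no heap allocation, which is exactly what you want on a microcontroller with no OS entropy source.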
- Rust ESP32 toolchain (`espup install`)
- `espflash` for flashing (`cargo install espflash`)
- An ESP32 dev board (any ESP32-WROOM-32 variant)
```sh
# Run tests on host
make test

# Build for ESP32
make build

# Flash and monitor serial output
make flash

# Just monitor (already flashed)
make monitor
```

The project compiles and runs natively for development:
```sh
RUST_LOG=info RUSTUP_TOOLCHAIN=stable cargo run --target aarch64-apple-darwin
```

Training: For each of 1,000 steps, a random name is sampled from the dataset, encoded as tokens, and fed through the transformer. The cross-entropy loss is backpropagated through every operation — attention, FFN, embeddings — and Adam updates the weights.
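The gradient that kicks off that backward pass has a well-known closed form: for softmax followed by cross-entropy, the gradient at the logits is `softmax(logits) - one_hot(target)`. A sketch (function name is illustrative, not the port's API):

```rust
/// Gradient of cross-entropy loss w.r.t. the logits, computed directly
/// as softmax(logits) - one_hot(target). Subtracting the max before
/// exponentiating keeps the softmax numerically stable.
fn cross_entropy_grad(logits: &[f32], target: usize) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut grad: Vec<f32> = exps.iter().map(|e| e / sum).collect();
    grad[target] -= 1.0; // subtract the one-hot target
    grad
}

fn main() {
    println!("{:?}", cross_entropy_grad(&[0.0, 0.0, 0.0], 0));
}
```

Fusing softmax and cross-entropy this way avoids materializing a Jacobian, which matters when every kilobyte of activation cache counts.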
Inference: Starting from the BOS token, the model autoregressively samples one character at a time (with temperature scaling) until it produces another BOS or hits the block size limit.
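Temperature scaling just divides the logits by the temperature before the softmax, then a token is drawn from the resulting distribution. A sketch of the two steps (names are illustrative; the port's sampling code may differ):

```rust
/// Softmax over logits divided by the temperature. Lower temperature
/// sharpens the distribution; temperature 1.0 leaves it unchanged.
fn softmax_with_temperature(logits: &[f32], temp: f32) -> Vec<f32> {
    let scaled: Vec<f32> = logits.iter().map(|l| l / temp).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Inverse-CDF sampling: walk the cumulative distribution until it
/// passes u, a uniform draw from [0, 1).
fn sample(probs: &[f32], u: f32) -> usize {
    let mut cum = 0.0;
    for (i, p) in probs.iter().enumerate() {
        cum += p;
        if u < cum {
            return i;
        }
    }
    probs.len() - 1
}

fn main() {
    let probs = softmax_with_temperature(&[1.0, 2.0, 3.0], 0.5);
    println!("token {}", sample(&probs, 0.3));
}
```

At the generation temperature of 0.5 used above, the model favors its most likely next characters while still varying enough to produce distinct names.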
The hard part is the attention backward pass: position t's query attends to all keys/values at positions 0..t, so key and value gradients accumulate contributions from every future position. Processing backward through the sequence ensures each position's KV gradients are complete before they're used.
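The value-gradient half of that accumulation is the easiest piece to see in isolation: since the attention output at position t is a weighted sum of values at positions 0..=t, the chain rule gives d_v[s] a contribution `attn[t][s] * d_out[t]` from every t >= s. A simplified single-head sketch (shapes and names are illustrative):

```rust
/// Accumulate value gradients for causal attention: d_v[s] sums
/// attn[t][s] * d_out[t] over all query positions t >= s.
/// attn[t] holds the softmax weights of query t over positions 0..=t.
fn attention_backward_values(
    attn: &[Vec<f32>],  // attn[t][s], s <= t (causal mask)
    d_out: &[Vec<f32>], // gradient of the loss w.r.t. attention output at t
    dim: usize,
) -> Vec<Vec<f32>> {
    let seq_len = attn.len();
    let mut d_v = vec![vec![0.0f32; dim]; seq_len];
    // Walking t from the end mirrors the port's backward order: by the
    // time earlier positions run their own backward step, d_v (and d_k)
    // have already received every contribution from later queries.
    for t in (0..seq_len).rev() {
        for s in 0..=t {
            for d in 0..dim {
                d_v[s][d] += attn[t][s] * d_out[t][d];
            }
        }
    }
    d_v
}

fn main() {
    let attn = vec![vec![1.0], vec![0.5, 0.5]];
    let d_out = vec![vec![1.0], vec![2.0]];
    println!("{:?}", attention_backward_values(&attn, &d_out, 1));
}
```

The key gradients follow the same accumulation shape, with an extra softmax-Jacobian term per query position.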