# esp32gpt

A Rust port of Karpathy's microgpt that trains and runs inference entirely on an ESP32.

The model learns to generate human-like names from scratch — no pre-trained weights, no cloud API, just 4,192 parameters training on a microcontroller.

```text
step    0/1000: loss = 3.3071
step  100/1000: loss = 2.4193
step  500/1000: loss = 1.9888
step  999/1000: loss = 2.0980
--- Generated names (temperature=0.5) ---
  arona, raeli, cealin, malie, sunaya, arishel, mosile ...
```

## Architecture

A 1-layer GPT transformer matching the original Python implementation:

| Setting | Value |
|---|---|
| Parameters | 4,192 (16.4 KB) |
| Embedding dim | 16 |
| Attention heads | 4 |
| Layers | 1 |
| Block size | 16 |
| Vocab | 27 tokens (a-z + BOS) |
| Normalization | RMSNorm (initial + pre-attention + pre-MLP, no learnable params) |
| Optimizer | Adam (lr=0.01, beta1=0.85, beta2=0.99) |
| Training | 1,000 steps on 32K names |
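The 4,192 figure follows from this configuration. A quick sanity check, assuming the usual microgpt layout (learned token and position embeddings, one attention block, a 4x-expansion MLP with no biases, an untied output head, and RMSNorm with no learnable parameters); function names here are illustrative, not the crate's API:

```rust
// Hypothetical parameter-count check for the table above.
fn param_count(vocab: usize, dim: usize, block: usize) -> usize {
    let tok_emb = vocab * dim;     // 27 * 16 = 432
    let pos_emb = block * dim;     // 16 * 16 = 256
    let attn = 4 * dim * dim;      // Wq, Wk, Wv, Wo: 4 * 256 = 1024
    let mlp = 2 * dim * (4 * dim); // 16 -> 64 and 64 -> 16: 2048
    let lm_head = vocab * dim;     // untied output projection: 432
    tok_emb + pos_emb + attn + mlp + lm_head
}

fn main() {
    let total = param_count(27, 16, 16);
    // 4,192 params * 4 bytes (f32) = 16,768 bytes = 16.4 KB
    println!("{} params, {:.1} KB as f32", total, (total * 4) as f64 / 1024.0);
}
```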

## Why not just port the autograd engine?

The Python microgpt uses a scalar-level autograd (one graph node per multiply/add). For a single forward pass, this creates ~30K-50K nodes consuming 1-2 MB — more than the ESP32's entire 520 KB of SRAM.

Instead, this port uses explicit matrix-level forward and backward passes, storing only the activations needed for backprop (~25 KB). The backward pass is hand-derived and verified against numerical gradients.
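The numerical-gradient verification can be illustrated on RMSNorm, which (per the table above) carries no learnable parameters. A minimal sketch of the technique, not the repo's actual code: derive the backward by hand, then compare it element-wise against central finite differences.

```rust
// RMSNorm forward: y_i = x_i / sqrt(mean(x^2) + eps).
fn rmsnorm(x: &[f64]) -> Vec<f64> {
    let n = x.len() as f64;
    let r = (x.iter().map(|v| v * v).sum::<f64>() / n + 1e-8).sqrt();
    x.iter().map(|v| v / r).collect()
}

// Hand-derived backward: given upstream gradient g = dL/dy,
// dL/dx_i = (g_i - x_i * (x . g) / (n * r^2)) / r.
fn rmsnorm_backward(x: &[f64], g: &[f64]) -> Vec<f64> {
    let n = x.len() as f64;
    let r = (x.iter().map(|v| v * v).sum::<f64>() / n + 1e-8).sqrt();
    let xg: f64 = x.iter().zip(g).map(|(a, b)| a * b).sum();
    x.iter().zip(g).map(|(xi, gi)| (gi - xi * xg / (n * r * r)) / r).collect()
}

fn main() {
    let x = vec![0.3, -1.2, 0.7, 2.0];
    let g = vec![1.0, -0.5, 0.25, 0.8]; // arbitrary upstream gradient
    let analytic = rmsnorm_backward(&x, &g);
    let eps = 1e-6;
    for i in 0..x.len() {
        let (mut xp, mut xm) = (x.clone(), x.clone());
        xp[i] += eps;
        xm[i] -= eps;
        // Loss L = g . rmsnorm(x), so dL/dx_i is exactly what backward returns.
        let lp: f64 = rmsnorm(&xp).iter().zip(&g).map(|(a, b)| a * b).sum();
        let lm: f64 = rmsnorm(&xm).iter().zip(&g).map(|(a, b)| a * b).sum();
        let numeric = (lp - lm) / (2.0 * eps);
        assert!((analytic[i] - numeric).abs() < 1e-6);
    }
    println!("gradcheck ok");
}
```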

## Memory budget

| Item | Size |
|---|---|
| Model parameters | 17 KB |
| Gradients | 17 KB |
| Adam state (m + v) | 34 KB |
| Activation cache | 25 KB |
| Dataset (in flash, not RAM) | 0 KB |
| **Total SRAM** | **~100 KB of ~300 KB available** |

## Project structure

```text
src/
  main.rs         Training loop + inference entry point
  model.rs        GPT forward pass, parameter layout, KV cache
  backward.rs     Manual backward pass with gradient accumulation
  optimizer.rs    Adam optimizer
  tensor.rs       Vector-matrix math, RMSNorm
  tokenizer.rs    Character-level encode/decode (a-z + BOS)
  rng.rs          Xorshift32 PRNG + Box-Muller for Gaussian init
data/
  names.txt       32K training names (embedded in flash via include_str!)
```

~1,000 lines of Rust. All core logic is platform-independent and testable on the host.
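As an illustration of what `tokenizer.rs` has to do, a character-level encode/decode over a 27-token vocab might look like the sketch below. The choice of 26 for the BOS id is an assumption for this example, not necessarily the repo's convention.

```rust
// Illustrative character-level tokenizer: 'a'..'z' map to 0..25,
// and BOS is assumed (for this sketch) to take the remaining id, 26.
const BOS: usize = 26;

fn encode(name: &str) -> Vec<usize> {
    name.bytes().map(|b| (b - b'a') as usize).collect()
}

fn decode(tokens: &[usize]) -> String {
    tokens
        .iter()
        .take_while(|&&t| t != BOS) // stop at the first BOS
        .map(|&t| (b'a' + t as u8) as char)
        .collect()
}

fn main() {
    let toks = encode("emma");
    assert_eq!(toks, vec![4, 12, 12, 0]);
    assert_eq!(decode(&toks), "emma");
    println!("{toks:?}");
}
```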

## Prerequisites

- Rust ESP32 toolchain (`espup install`)
- `espflash` for flashing (`cargo install espflash`)
- An ESP32 dev board (any ESP32-WROOM-32 variant)

## Usage

```sh
# Run tests on host
make test

# Build for ESP32
make build

# Flash and monitor serial output
make flash

# Just monitor (already flashed)
make monitor
```

## Running on host (no ESP32 needed)

The project compiles and runs natively for development:

```sh
RUST_LOG=info RUSTUP_TOOLCHAIN=stable cargo run --target aarch64-apple-darwin
```

## How it works

**Training:** For each of the 1,000 steps, a random name is sampled from the dataset, encoded as tokens, and fed through the transformer. The cross-entropy loss is backpropagated through every operation — attention, FFN, embeddings — and Adam updates the weights.
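The weight update at the end of each step can be sketched with the hyperparameters from the table (lr=0.01, beta1=0.85, beta2=0.99). This is textbook bias-corrected Adam; treating that as this port's exact formulation is an assumption.

```rust
// Minimal Adam step over flat parameter/gradient/state slices.
// lr, beta1, beta2 come from the README table; eps and the
// bias-correction form are standard-Adam assumptions.
fn adam_step(w: &mut [f32], g: &[f32], m: &mut [f32], v: &mut [f32], t: i32) {
    let (lr, b1, b2, eps) = (0.01f32, 0.85f32, 0.99f32, 1e-8f32);
    for i in 0..w.len() {
        m[i] = b1 * m[i] + (1.0 - b1) * g[i];         // first moment
        v[i] = b2 * v[i] + (1.0 - b2) * g[i] * g[i];  // second moment
        let m_hat = m[i] / (1.0 - b1.powi(t));        // bias correction
        let v_hat = v[i] / (1.0 - b2.powi(t));
        w[i] -= lr * m_hat / (v_hat.sqrt() + eps);
    }
}

fn main() {
    let mut w = vec![0.5f32, -0.5];
    let (mut m, mut v) = (vec![0.0f32; 2], vec![0.0f32; 2]);
    let g = vec![0.2f32, -0.1];
    adam_step(&mut w, &g, &mut m, &mut v, 1);
    // On step 1, bias correction makes the update ~lr against each gradient's sign.
    assert!((w[0] - 0.49).abs() < 1e-4 && (w[1] + 0.49).abs() < 1e-4);
    println!("{w:?}");
}
```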

**Inference:** Starting from the BOS token, the model autoregressively samples one character at a time (with temperature scaling) until it produces another BOS or hits the block size limit.
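The sampling step above amounts to a temperature-scaled softmax followed by inverse-CDF sampling. A sketch (the uniform random input `u` stands in for the repo's xorshift PRNG):

```rust
// Sample a token index from logits at a given temperature, using a
// uniform draw u in [0, 1). Lower temperature sharpens the distribution.
fn sample(logits: &[f32], temperature: f32, u: f32) -> usize {
    // Softmax with temperature; subtract the max for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    // Invert the CDF at u.
    let mut acc = 0.0;
    for (i, e) in exps.iter().enumerate() {
        acc += e / sum;
        if u < acc {
            return i;
        }
    }
    exps.len() - 1 // guard against floating-point rounding
}

fn main() {
    // At a low temperature the largest logit dominates almost surely.
    let logits = [1.0f32, 3.0, 0.5];
    assert_eq!(sample(&logits, 0.1, 0.5), 1);
    println!("picked {}", sample(&logits, 0.5, 0.5));
}
```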

The hard part is the attention backward pass: position t's query attends to all keys/values at positions 0..t, so key and value gradients accumulate contributions from every future position. Processing backward through the sequence ensures each position's KV gradients are complete before they're used.
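The accumulation pattern this describes can be isolated to the value gradients (treating the attention weights as constants for clarity; the real backward also differentiates through the weights, keys, and queries). Scalars stand in for the per-position value vectors:

```rust
// Value-gradient accumulation for one head with scalar values.
// out[t] = sum_{s <= t} a[t][s] * v[s], so given d_out = dL/d(out),
// dV[s] = sum_{t >= s} a[t][s] * d_out[t]: every future query position
// contributes to each key/value position's gradient.
fn dv_accumulate(a: &[Vec<f32>], d_out: &[f32]) -> Vec<f32> {
    let t_len = d_out.len();
    let mut dv = vec![0.0f32; t_len];
    // Walk the sequence backward: by the time position s is processed as a
    // query, every later position t > s has already deposited its share.
    for t in (0..t_len).rev() {
        for s in 0..=t {
            dv[s] += a[t][s] * d_out[t];
        }
    }
    dv
}

fn main() {
    // Row t holds the causal attention weights of query position t.
    let a = vec![
        vec![1.0],
        vec![0.4, 0.6],
        vec![0.2, 0.3, 0.5],
    ];
    let d_out = vec![1.0f32, 1.0, 1.0];
    let dv = dv_accumulate(&a, &d_out);
    // dv[0] = 1.0 + 0.4 + 0.2, dv[1] = 0.6 + 0.3, dv[2] = 0.5
    assert!((dv[0] - 1.6).abs() < 1e-6);
    assert!((dv[1] - 0.9).abs() < 1e-6);
    assert!((dv[2] - 0.5).abs() < 1e-6);
    println!("{dv:?}");
}
```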
