Learn to build the hardware that runs AI — from writing your first CUDA kernel to designing a custom AI chip.
Every AI model — GPT, Stable Diffusion, your self-driving car — runs on specialized hardware. Someone has to build that hardware, write the software that drives it, and make the two work together efficiently.
This is a free, community-driven curriculum that teaches you to do exactly that. It covers the full stack from the AI application at the top down to the chip design at the bottom — organized as a self-paced learning roadmap with guides, projects, and curated resources.
You will learn to:
- Write GPU kernels and parallel code that runs at hardware speed
- Deploy AI models on real embedded hardware (NVIDIA Jetson, Xilinx FPGA)
- Understand how ML compilers turn PyTorch into chip instructions
- Read and reason about chip architecture — the way AI accelerators are designed
| Background | What you'll get from this |
|---|---|
| Software engineer wanting to go deeper into AI infrastructure | CUDA, parallel computing, ML compilers, GPU runtimes |
| ML / AI engineer who wants to understand the hardware | How chips work, why quantization matters, how to optimize inference |
| Embedded / firmware engineer moving into AI products | AI workloads, edge deployment, Jetson, sensor fusion |
| Computer science student aiming at AI hardware roles | A structured curriculum from foundations to specialization |
| Hardware engineer adding AI/software skills | Neural networks, CUDA, ML frameworks, model optimization |
A chip that runs AI isn't just silicon. It's 8 layers of technology that must work together. Think of it like a building: the foundation (silicon) holds up the floors above it (firmware, OS, drivers), which hold up the penthouse (your AI application).
┌─────────────────────────────────────┐
│ L1 AI App & Framework │ ← PyTorch model, your code runs here
│ L2 ML Compiler │ ← turns model into chip instructions
│ L3 Runtime & Driver │ ← OS talks to the GPU/chip
│ L4 Firmware & OS │ ← boots the device, manages resources
│ L5 Hardware Architecture │ ← the chip's blueprint (systolic arrays, HBM)
│ L6 RTL & Logic Design │ ← describes the chip in hardware language
│ L7 Physical Implementation │ ← places transistors on silicon
│ L8 Fabrication & Packaging │ ← the foundry makes the physical chip
└─────────────────────────────────────┘
| Layer | Plain English | Technologies |
|---|---|---|
| L1 | Where your AI model lives and runs | PyTorch, ONNX, TensorRT, MLOps |
| L2 | Translates the model into efficient chip instructions | MLIR, TVM, LLVM, Triton |
| L3 | The bridge between software and the chip | CUDA runtime, kernel drivers, APIs |
| L4 | The firmware that boots and controls the device | FreeRTOS, embedded Linux, bootloaders |
| L5 | How the chip is architected internally | Systolic arrays, HBM memory, NoC |
| L6 | Writing the chip's logic in hardware code | SystemVerilog, FPGA, verification |
| L7 | Physically placing circuits on a chip | Place & route, timing, EDA tools |
| L8 | Sending to a foundry and getting chips back | TSMC process, CoWoS, packaging |
L1–L6: Full hands-on projects throughout this curriculum. L7–L8: Conceptual with guided labs (OpenROAD, TinyTapeout).
Pick your entry point based on where you are today:
Coming from software / ML?
→ Start at Phase 1 (C++ & Parallel Computing), then Phase 3 (AI)
Coming from embedded / firmware?
→ Start at Phase 1 (Computer Architecture), then Phase 2 (Embedded Systems)
Already know CUDA and ML frameworks?
→ Jump to Phase 4 (your track: FPGA, Jetson, or ML Compiler)
Targeting chip design?
→ Follow Phase 1 → 2 → 4A → 5F in order
Learn the language of hardware. Go from logic gates to writing GPU code.
| Module | What you'll learn |
|---|---|
| Digital Design & HDL | How digital logic works; write Verilog, simulate circuits |
| Computer Architecture | How CPUs and GPUs work internally — pipelines, caches, memory |
| Operating Systems | Processes, memory, scheduling, device drivers |
| C++ & Parallel Computing | SIMD, OpenMP, oneTBB, CUDA, ROCm, OpenCL/SYCL |
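The parallel-computing module teaches the same decomposition pattern everywhere it appears — OpenMP's `parallel for` reductions, CUDA grid-stride loops, oneTBB pipelines: split the data, reduce chunks concurrently, combine the partials. As a minimal sketch (Python stands in for C++ here; CPython's GIL means this shows the structure, not the speedup):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, workers=4):
    """Chunk-then-reduce: the same shape as an OpenMP reduction
    or a CUDA block-level reduction, expressed with threads."""
    n = max(1, len(data) // workers)
    chunks = [data[i:i + n] for i in range(0, len(data), n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum, chunks)   # one partial sum per chunk
    return sum(partials)                   # final reduction on the host

print(parallel_sum(list(range(1000))))     # 499500
```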
Get hands-on with real hardware: microcontrollers, sensors, and embedded Linux.
| Module | What you'll learn |
|---|---|
| Embedded Software | ARM Cortex-M, FreeRTOS, communication buses (SPI/I2C/CAN), power management |
| Embedded Linux | Build custom Linux for embedded devices with Yocto and PetaLinux |
Understand the AI workloads your hardware must run. Two tracks — pick one or both.
Core (everyone does these):
| Module | What you'll learn |
|---|---|
| Neural Networks | How neural networks learn — backprop, CNNs, transformers from scratch |
| Deep Learning Frameworks | micrograd → PyTorch → tinygrad: understand what frameworks actually do |
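The core of what micrograd (and, at scale, PyTorch) does fits in a few lines: every value remembers its parents and the local derivative toward each, and `backward` walks that graph applying the chain rule. A toy sketch of the idea (real frameworks topologically sort the graph instead of recursing per path):

```python
class Value:
    """A scalar that records how it was computed, micrograd-style."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one depends on
        self._local_grads = local_grads  # d(self)/d(parent) for each

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, grad=1.0):
        """Chain rule: accumulate, then push gradient to parents."""
        self.grad += grad
        for p, g in zip(self._parents, self._local_grads):
            p.backward(grad * g)

# d(x*y + x)/dx = y + 1 = 4,  d(x*y + x)/dy = x = 2
x, y = Value(2.0), Value(3.0)
out = x * y + x
out.backward()
print(x.grad, y.grad)   # 4.0 2.0
```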
Track A — Hardware & Edge AI (leads to Phase 4A/B)
| Module | What you'll learn |
|---|---|
| Computer Vision | Object detection, segmentation, 3D vision, OpenCV |
| Sensor Fusion | Fuse camera + LiDAR + IMU; Kalman filters, BEVFusion |
| Voice AI | Speech-to-text (Whisper), TTS, wake-word detection |
| Edge AI & Optimization | Quantization, pruning, deploying models on constrained devices |
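Quantization is why "why quantization matters" keeps coming up in edge deployment: storing weights as int8 instead of float32 cuts memory 4× and lets integer MAC units do the work. The simplest variant, symmetric per-tensor quantization, is just a scale factor:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats in
    [-max|w|, +max|w|] onto the integers [-127, 127] with one scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]   # int8 codes
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.01]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
# round-to-nearest keeps every weight within half a quantization step
assert all(abs(a - b) <= s / 2 for a, b in zip(w, w_hat))
```

Production toolchains (TensorRT, ONNX Runtime) add per-channel scales, zero points, and calibration data, but the error/size trade-off is the same one this sketch shows.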
Track B — Agentic AI & ML Engineering (leads to Phase 4C / Phase 5)
| Module | What you'll learn |
|---|---|
| Agentic AI & GenAI | Build LLM agents, RAG systems, tool-using AI |
| ML Engineering & MLOps | Training pipelines, model serving, monitoring |
| LLM Application Development | Fine-tuning, RAG architecture, production LLM apps |
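The retrieval step at the heart of every RAG system — rank documents by similarity to the query, feed the top hits to the LLM — can be sketched in miniature. Real systems embed text with a neural model and search a vector index; bag-of-words cosine similarity stands in for both here:

```python
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=1):
    """Return the k documents most similar to the query -- the 'R' in RAG."""
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(d.lower().split())), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:k]]

docs = ["CUDA kernels run on the GPU",
        "Yocto builds embedded Linux images",
        "TensorRT optimizes inference on Jetson"]
print(retrieve("how do I build embedded linux", docs))
```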
Deploy AI on real chips. Three specialized tracks — choose based on your target role.
Design hardware accelerators and deploy AI on programmable chips.
| Module | What you'll learn |
|---|---|
| FPGA Development | Vivado, IP cores, timing constraints, hardware debugging |
| Zynq MPSoC | Combine ARM CPU + FPGA fabric on one chip |
| Advanced FPGA Design | Clock domain crossing, floorplanning, power |
| HLS (High-Level Synthesis) | Write C++ → get hardware automatically |
| Runtime & Drivers | Linux driver for your FPGA, DMA, Vitis AI |
| Projects | Build a 4K wireless video pipeline end-to-end |
Ship AI products on NVIDIA's embedded GPU platform.
| Module | What you'll learn |
|---|---|
| Jetson Platform | JetPack, L4T, GPU on Orin — get up and running |
| Carrier Board Design | Design your own PCB that hosts a Jetson module |
| L4T Customization | Custom Linux kernel, device tree, OTA updates |
| Firmware (FSP) | FreeRTOS on the safety co-processor |
| AI Application Dev | ML inference, ROS 2, real-time video on Jetson |
| Security & OTA | Secure boot, encrypted storage, over-the-air updates |
| Manufacturing | FCC/CE compliance, production flashing, DFM |
| TensorRT & DLA | Optimize models for Jetson's GPU and neural accelerator |
Learn how AI models are compiled and optimized into chip instructions.
| Module | What you'll learn |
|---|---|
| Compiler Fundamentals | How MLIR, TVM, and LLVM work; build a custom backend |
| DL Inference Optimization | Triton kernels, Flash-Attention, TensorRT-LLM, quantization |
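A taste of the kind of trick this track covers: Flash-Attention works because softmax can be computed in one streaming pass, so attention scores never need to be materialized in full. The building block is "online softmax" — keep a running max `m` and running normalizer `d`, rescaling `d` whenever a new max appears:

```python
import math

def online_softmax(xs):
    """One-pass, numerically stable softmax: the core idea that lets
    Flash-Attention process attention scores tile by tile."""
    m, d = float("-inf"), 0.0
    for x in xs:
        new_m = max(m, x)
        d = d * math.exp(m - new_m) + math.exp(x - new_m)
        m = new_m
    return [math.exp(x - m) / d for x in xs]

probs = online_softmax([1.0, 2.0, 3.0])
assert abs(sum(probs) - 1.0) < 1e-9   # valid probability distribution
```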
Go deep in one area. These tracks are ongoing and expand continuously.
| Track | What you'll specialize in | Guide |
|---|---|---|
| GPU Infrastructure | Multi-GPU systems, NVLink, NCCL, AMD ROCm/HIP, MI300X | → |
| High-Performance Computing | 40+ CUDA-X libraries: cuBLAS, cuDNN, NVSHMEM and more | → |
| Edge AI | Efficient model architectures, Holoscan, real-time pipelines | → |
| Robotics | ROS 2, Nav2, MoveIt, motion planning | → |
| Autonomous Vehicles | openpilot, BEV perception, functional safety, hardware debug | → |
| AI Chip Design | Systolic arrays, dataflow architectures, tinygrad↔hardware, ASIC flow | → |
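The AI Chip Design track centers on systolic arrays — the matrix-multiply structure Google's TPU popularized. A cycle-by-cycle sketch of the output-stationary schedule (each cell accumulates one output element; operands arrive skewed by one cycle per row/column) can be simulated in a few lines, with simplifying assumptions (square matrices, no pipelining of loads):

```python
def systolic_matmul(A, B):
    """Simulate an n x n output-stationary systolic array computing A @ B.
    Cell (i, j) sees A[i][k] from the left and B[k][j] from the top
    at cycle t = i + j + k, doing one MAC per cycle."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # total cycles for all wavefronts
        for i in range(n):
            for j in range(n):
                k = t - i - j           # which operand pair arrives now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))   # [[19, 22], [43, 50]]
```

The point of the hardware version is that after the pipeline fills, every cell does useful work every cycle — n² MACs per cycle from a structure with only local, neighbor-to-neighbor wiring.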
| Target Role | Key Layers | Recommended Path |
|---|---|---|
| ML Inference Engineer | L1 | Phase 3 → Phase 4C |
| Edge AI Engineer | L1 | Phase 3 Track A → Phase 4B |
| AI Compiler Engineer | L2 | Phase 1 → Phase 4C → Phase 5B |
| GPU Runtime Engineer | L3 | Phase 1 (CUDA) → Phase 4A/B §Runtime |
| Firmware / Embedded Engineer | L4 | Phase 1 → Phase 2 → Phase 4B |
| AI Accelerator Architect | L5 | Phase 1 → Phase 4A → Phase 5F |
| RTL / FPGA Design Engineer | L6 | Phase 1 (HDL) → Phase 4A |
| Autonomous Vehicles Engineer | L1–L4 | Phase 3 Track A → Phase 4B → Phase 5E |
| AI Hardware Engineer (Full-Stack) | L1–L6 | Full curriculum — the signature role this roadmap targets |
| Project | Why it's used |
|---|---|
| tinygrad | A tiny DL framework (~2,500 lines) — shows exactly how frameworks, compilers, and hardware backends connect |
| openpilot | Real-world ADAS software — shows how perception, ML, and hardware work together in production |
- Roles & Market Analysis — 23 sub-roles, salary data, job postings, remote %, hiring priorities
A community-driven educational roadmap for AI hardware engineering.
⭐ Star this repo if you find it useful — it helps others discover it.
