A local AI agent that builds a Tetris clone from scratch — no cloud, no API fees, running on hardware you already own.
Most guides assume you have a high-end GPU. This project proves otherwise. The experiment documented here runs on two everyday machines — a laptop and a modest desktop — both far below what most people consider the "minimum" for local AI. No RTX 3090. No $500 GPU. Just real hardware, real results, and every step documented so you can reproduce it.
If your machine can run Windows 11, there's a good chance it can run a local AI agent. The question is not whether it's possible. It's how far you can push it.
📖 Follow the full experiment — iterations, results, and discoveries — on the blog: blog.nikodindon.dpdns.org
A fully local AI agent system capable of:
- Reading and writing files on disk
- Executing terminal commands and reading their output
- Iterating on its own output without human intervention
- Generating a complete game (Tetris with sound) from a single prompt
Zero lines sent to an external server. The model runs on your machine. The agent acts on your filesystem. You own the entire stack.
This project documents the same experiment running in parallel on two very different setups. The goal: find the real floor for local AI agent work in 2026.
| Spec | Value |
|---|---|
| Device | Asus VivoBook 15 |
| CPU | AMD Ryzen 5 5500U (6 cores / 12 threads) |
| RAM | 20 GB |
| GPU | Integrated graphics only — no CUDA, no Vulkan |
| Mode | CPU-only inference |
| Model | Qwen2.5-Coder-7B-Instruct Q4_K_M (~4.3 GB) |
| Measured speed | 4.81 tok/s generation (performance mode, charger plugged in) |
| RAM usage | ~7.4 GB (4460 MB model + 2976 MB repack buffer) |
| Context | 16K tokens |
| llama.cpp threads | 10 / 12 |
💡 Important: Always plug in your charger and switch Windows to Performance mode before running inference. Running on battery saver cost ~15% speed in our tests (4.09 tok/s on battery vs 4.81 tok/s plugged in).
| Spec | Value |
|---|---|
| Device | Custom desktop |
| CPU | AMD Ryzen 5 1600 AF |
| RAM | 32 GB |
| GPU | NVIDIA GTX 1650 Super — 4 GB VRAM |
| Mode | Partial GPU offload (~28–30 layers on GPU, rest on CPU) |
| Model | Qwen2.5-Coder-7B-Instruct Q4_K_M (~4.3 GB) |
| Expected speed | ~20–28 tok/s |
| Context | 32K tokens |
| Status | 🔄 Setup in progress |
The GTX 1650 Super has 4 GB of VRAM — technically not enough to fit the full model. llama.cpp handles this gracefully: as many layers as possible go to the GPU, and the rest fall back to system RAM. The result is noticeably faster than CPU-only inference, on a GPU that cost under $150 used.
The project that inspired this experiment (Octopus Invaders by @sudoingX) used an RTX 3060 with 12 GB of VRAM — already considered a modest setup. We're going further down. If a Ryzen 5 laptop with no GPU can run an AI coding agent, the barrier to entry for local AI is lower than almost anyone thinks.
local-agent-tetris/
│
├── README.md
├── agent.py # Our custom agent harness (see why below)
├── docs/
│ └── screenshots/
│ └── hermes_agent1.png # Hermes Agent running on the VivoBook
│
├── setup/
├── prompts/
│ ├── phase-0-init.md
│ ├── phase-1-bugfix.md
│ └── phase-2-polish.md
│
├── tetris/ # Code generated by the agent
│ ├── index.html
│ ├── game.js
│ ├── pieces.js
│ ├── audio.js
│ ├── ui.js
│ └── style.css
│
└── iterations/
├── v1-blank-screen/
├── v2-working/
└── v3-polished/
| Component | Tool | Role |
|---|---|---|
| Model server | llama.cpp b8664 | Run the LLM locally with performance flags |
| Model | Qwen2.5-Coder-7B-Instruct Q4_K_M | 7.62B parameters, quantized to 4.36 GiB |
| Agent harness | agent.py (custom, ~80 lines) | Tool loop with real-time debug output |
Open PowerShell and run:
Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores
Get-WmiObject Win32_ComputerSystem | Select-Object @{N='RAM_GB';E={[math]::Round($_.TotalPhysicalMemory/1GB,1)}}
Get-WmiObject Win32_VideoController | Select-Object Name, AdapterRAM

Divide AdapterRAM by 1073741824 to get GB.
| VRAM | Mode | Recommended model | File size | Est. speed |
|---|---|---|---|---|
| No GPU — CPU only | CPU inference | Qwen2.5-Coder-7B Q4_K_M | 4.3 GB | 3–8 tok/s |
| 4 GB (GTX 1650…) | Partial offload | Qwen2.5-Coder-7B Q4_K_M | 4.3 GB | 20–28 tok/s |
| 8 GB | Full GPU | Qwen2.5-Coder-7B Q4_K_M | 4.3 GB | 50–55 tok/s |
| 12 GB | Full GPU | Qwen2.5-Coder-14B Q4_K_M | 9.0 GB | ~30 tok/s |
| 24 GB | Full GPU | Qwen2.5-Coder-32B Q4_K_M | 19.9 GB | ~25 tok/s |
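If you want to sanity-check a model file against your card before downloading, the fit logic is easy to script. A minimal sketch in Python — `pick_mode` and the ~1.5 GB overhead figure for KV cache and buffers are our assumptions, not measurements:

```python
# Rough heuristic for choosing an inference mode from available VRAM.
# The 1.5 GB overhead estimate (KV cache + buffers) is an assumption;
# real usage depends on context size and KV-cache quantization.
def pick_mode(vram_gb: float, model_file_gb: float, overhead_gb: float = 1.5) -> str:
    if vram_gb <= 0:
        return "CPU-only inference (-ngl 0)"
    if vram_gb >= model_file_gb + overhead_gb:
        return "full GPU offload (-ngl 99)"
    return "partial offload (-ngl 99, remaining layers fall back to RAM)"

print(pick_mode(4, 4.3))   # GTX 1650 Super -> partial offload
print(pick_mode(8, 4.3))   # 8 GB card     -> full GPU offload
```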
- Go to github.com/ggerganov/llama.cpp/releases
- Download `llama-bXXXX-bin-win-cpu-x64.zip`
- Extract to `C:\llama.cpp\` and add it to PATH
Same steps, but download `llama-bXXXX-bin-win-cuda-cu12.x.x-x64.zip`.
Requires CUDA 12.x from developer.nvidia.com/cuda-downloads.
llama-server --version
# Expected: version: 8664 (or newer)

`huggingface-cli` is often missing from PATH on Windows. Use Python directly:
mkdir C:\models
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id='bartowski/Qwen2.5-Coder-7B-Instruct-GGUF',
filename='Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf',
local_dir='C:/models/'
)
"4.3 GB. No HuggingFace account needed — bartowski's repos are always public.
Why bartowski? The official Qwen GGUF repos sometimes require authentication. Bartowski's builds are public, well-maintained, and use importance-matrix quantization for better quality at the same size.
For the VivoBook (CPU-only):

llama-server `
-m C:\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf `
-ngl 0 `
-c 16384 `
-np 1 `
-t 10 `
--host 0.0.0.0 `
--port 8080

For the desktop (GPU, partial offload):

llama-server `
-m C:\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf `
-ngl 99 `
-c 32768 `
-np 1 `
-fa on `
--cache-type-k q4_0 `
--cache-type-v q4_0 `
--host 0.0.0.0 `
--port 8080

| Flag | Effect |
|---|---|
| `-ngl 99` | Load all layers to GPU (automatic partial offload if VRAM is insufficient) |
| `-ngl 0` | CPU-only, no GPU |
| `-t 10` | 10 threads — the sweet spot for the Ryzen 5 5500U |
| `-c 16384` | 16K context window |
| `-np 1` | Single slot — saves ~190 MB of VRAM |
| `-fa on` | Flash attention — constant speed as context grows |
| `--cache-type-k/v q4_0` | Quantize the KV cache — essential for large context on small VRAM |
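Before pointing an agent at the server, confirm it's actually up and the model has finished loading. A minimal check in Python against llama-server's `/health` endpoint, assuming the default host and port from the commands above:

```python
# Poll llama-server's /health endpoint until the model is loaded.
# Assumes the default port 8080 from the launch commands above.
import time

import requests

while True:
    try:
        status = requests.get("http://localhost:8080/health", timeout=5).json()
        print(status)  # {"status": "ok"} once the model is ready
        if status.get("status") == "ok":
            break
    except requests.exceptions.ConnectionError:
        print("server not up yet...")
    time.sleep(2)
```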
Always plug in your charger and set Windows to Performance mode. On the VivoBook this alone pushed generation from 4.09 → 4.81 tok/s (+17%).
Model: Qwen2.5-Coder-7B-Instruct Q4_K_M
Machine: Asus VivoBook 15 — Ryzen 5 5500U, 20 GB RAM
Power mode: Performance (charger plugged in)
Threads: 10 / 12
prompt eval: 18.30 tok/s
generation: 4.81 tok/s ← our bottleneck
RAM used: ~7.4 GB
At 4.81 tok/s, a 500-token response takes ~1m44s. Slow by cloud standards. Completely functional for agent work.
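You can reproduce a rough tok/s figure without digging through llama.cpp's log output by timing a request yourself. A quick sketch — wall-clock based, so it includes prompt processing and will read slightly lower than llama.cpp's own timings:

```python
# Crude generation-speed check via the OpenAI-compatible endpoint.
import time

import requests

start = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Count from 1 to 50."}],
          "max_tokens": 200},
    timeout=600,
)
generated = resp.json()["usage"]["completion_tokens"]
print(f"{generated / (time.time() - start):.2f} tok/s")
```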
The project that inspired this experiment used Hermes Agent by NousResearch as the tool harness. We tried it. Here's what happened and why we moved on.
git clone https://github.com/NousResearch/hermes-agent
cd hermes-agent
pip install -r requirements.txt
copy .env.example .env
Add-Content .env "OPENROUTER_API_KEY=sk-or-placeholder"
python cli.py `
--model "qwen2.5-coder-7b" `
--base_url "http://localhost:8080/v1" `
--api_key "sk-placeholder" `
--toolsets "terminal"
Hermes Agent v0.7.0 launched successfully on the VivoBook.
Hermes launched fine and the model correctly generated tool call JSON. But the tools never actually executed. The session summary always showed 0 tool calls despite the model producing valid-looking JSON.
The root cause is a tool call format mismatch:
- Hermes expects tool calls in a specific parsed format that depends on the model family
- Qwen2.5-Coder-7B generates tool calls wrapped in `<tools>` tags (its native format)
- Without the `--jinja` flag on llama-server, the chat template isn't applied correctly and the model improvises its own format
- Hermes's Qwen parser is calibrated for the newer Qwen3 models, not Qwen2.5
Additionally, Hermes loads ~4K tokens of system prompt just for tool definitions. On a 16K context budget, that's 25% gone before the first user message. On a small model doing complex code generation, every token counts.
The author of the original experiment, @sudoingX, used Qwen 3.5 9B (a newer model with better instruction following) on a 12 GB GPU — more VRAM means a larger context, which leaves headroom for Hermes's system prompt. The combination of a more capable model and more resources made the tool call parsing reliable.
Rather than debugging Hermes internals, we wrote a ~80-line Python agent that:
- Calls llama-server directly via the OpenAI-compatible API
- Parses tool calls using a simple `<tool>{...}</tool>` format defined in the system prompt
- Executes Windows commands natively via `subprocess`
- Shows real-time debug output for every turn
Validation test result:
Turn 1: agent ran "dir C:\local-agent-tetris" → got file listing ✅
Turn 2: agent created test.txt with correct content ✅
Turn 3: agent recognized task was complete, stopped ✅
Tool calls: 2 / 2 executed successfully
This is the agent we use for the Tetris generation.
Save agent.py at the root of the project. The full source is in the repo.
Key design decisions (a minimal sketch of the loop follows below):
- The system prompt tells the model to use a `<tool>{"command": "..."}` format — simple and unambiguous
- Windows-native commands (`dir`, `echo >>`, `mkdir`) — no Linux assumptions
- Real-time debug output: tok/s, context size, tool outputs, file detection
- `max_turns=40` to give the model enough room for a multi-file project
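Here is that sketch — an illustration of the pattern, not the repo's actual agent.py. `API_URL`, `SYSTEM_PROMPT`, and `run_tool` are simplified stand-ins, and the only external dependency is `requests`:

```python
# Minimal agent-loop sketch (illustrative, not the repo's agent.py).
# Assumes llama-server is running on localhost:8080.
import json
import re
import subprocess

import requests

API_URL = "http://localhost:8080/v1/chat/completions"
SYSTEM_PROMPT = (
    'You are a coding agent on Windows. To run a shell command, answer with '
    '<tool>{"command": "..."}</tool> and nothing else. '
    'When the task is complete, answer DONE.'
)

def run_tool(command: str) -> str:
    # Execute a Windows shell command and return combined output,
    # truncated so tool results don't eat the 16K-token context budget.
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=120)
    return (result.stdout + result.stderr)[:4000]

def run_agent(task: str, max_turns: int = 40) -> None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for turn in range(max_turns):
        resp = requests.post(API_URL, json={"messages": messages}, timeout=600)
        text = resp.json()["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": text})
        match = re.search(r'<tool>\s*(\{.*?\})\s*</tool>', text, re.DOTALL)
        if not match:
            # No tool call: the model considers the task finished.
            print(f"[turn {turn}] agent stopped: {text[:200]}")
            return
        command = json.loads(match.group(1))["command"]
        print(f"[turn {turn}] running: {command}")
        messages.append({"role": "user",
                         "content": f"Tool output:\n{run_tool(command)}"})
```

Truncating tool output (here to 4,000 characters) is what keeps a 16K-token budget workable: a single verbose command can otherwise flood the context and crowd out the actual task.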
Launch:
cd C:\local-agent-tetris
python agent.py

The task sent to the agent:
Build a Tetris clone as a self-contained web project.
Create all files in C:\local-agent-tetris\tetris\
Files to create:
- index.html (game entry point, loads all scripts)
- game.js (game loop, collision detection, line clearing, scoring)
- pieces.js (all 7 tetrominoes: I, O, T, S, Z, J, L with colors)
- audio.js (sound effects using Web Audio API only, no external files)
- ui.js (score, level, next piece preview, start/pause screen)
- style.css (dark theme, pixel font, centered layout)
Gameplay:
- All 7 tetrominoes with distinct colors
- Speed increases every 10 lines cleared
- Ghost piece showing drop position
- Score: 100/300/500/800 for 1/2/3/4 lines
- High score in localStorage
- Pause with P key
Audio using Web Audio API only (no external files):
- Click sound on move/rotate
- Rising tone on line clear
- Game over jingle
Phase 0 — Generate and test
- The agent creates the files in `tetris/`
- Open `index.html` in a browser
- Open the console (F12), note every error
Phase 1 — Fix with precision
The game has the following issues. Fix all of them:
1. [EXACT console error]
2. [OBSERVED vs EXPECTED behavior]
Do not rewrite files that work. Only fix what is broken.
Phase 2 — Polish
The game works. Now polish it:
- Flash animation on line clear
- Ghost piece opacity 20%
- Monospace font for score
- "READY?" countdown before start
| Action | Agent | You |
|---|---|---|
| Design the file architecture | ✅ | |
| Write the game loop | ✅ | |
| Implement Web Audio API | ✅ | |
| Detect its own variable scope bugs | ✅ | |
| Know what to test in the browser | | ✅ |
| Read console errors and rephrase them | | ✅ |
| Decide when the result is good enough | | ✅ |
| Machine | Model | Mode | Speed | Context | Status |
|---|---|---|---|---|---|
| Asus VivoBook 15 — Ryzen 5 5500U, 20 GB RAM | Qwen2.5-Coder-7B Q4_K_M | CPU only | 4.81 tok/s | 16K | ✅ Agent running |
| Desktop — Ryzen 5 1600 AF, GTX 1650 Super 4 GB | Qwen2.5-Coder-7B Q4_K_M | Partial GPU offload | TBD | 32K | 🔄 In progress |
This README documents the setup. The full experiment — iterations, failures, discoveries, and final results — is covered in detail on the blog: blog.nikodindon.dpdns.org
- llama.cpp — local inference engine
- Hermes Agent — NousResearch — tool harness (tested, see notes above)
- Qwen2.5-Coder GGUF — bartowski — the model
- Octopus Invaders — @sudoingX — the experiment that inspired this project
MIT — do what you want with it, document what you find, share what works.