A local AI agent that builds a Tetris clone from scratch — no cloud, no API fees, running on hardware you already own.
Most guides assume you have a high-end GPU. This project proves otherwise. The experiment documented here runs on two everyday machines — a laptop and a modest desktop — both far below what most people consider the "minimum" for local AI. No RTX 3090. No $500 GPU. Just real hardware, real results, and every step documented so you can reproduce it.
If your machine can run Windows 11, there's a good chance it can run a local AI agent. The question is not whether it's possible. It's how far you can push it.
📖 Follow the full experiment — iterations, results, and discoveries — on the blog: blog.nikodindon.dpdns.org
A fully local AI agent system capable of:
- Reading and writing files on disk
- Executing terminal commands and reading their output
- Iterating on its own output without human intervention
- Generating a complete game (Tetris with sound) from a single prompt
Zero lines sent to an external server. The model runs on your machine. The agent acts on your filesystem. You own the entire stack.
This project documents the same experiment running in parallel on two very different setups. The goal: find the real floor for local AI agent work in 2026.
| Spec | Value |
|---|---|
| Device | Asus VivoBook 15 |
| CPU | AMD Ryzen 5 5500U (6 cores / 12 threads) |
| RAM | 20 GB |
| GPU | Integrated graphics only — no CUDA, no Vulkan |
| Mode | CPU-only inference |
| Model | Qwen2.5-Coder-7B-Instruct Q4_K_M (~4.3 GB) |
| Measured speed | 4.81 tok/s generation (performance mode, charger plugged in) |
| RAM usage | ~7.4 GB (4460 MB model + 2976 MB repack buffer) |
| Context | 16K tokens |
| llama.cpp threads | 10 / 12 |
💡 Important: Always plug in your charger and switch Windows to Performance mode before running inference. Running on battery saver cost ~15% speed in our tests (4.09 tok/s on battery vs 4.81 tok/s plugged in).
| Spec | Value |
|---|---|
| Device | Custom desktop |
| CPU | AMD Ryzen 5 1600 AF |
| RAM | 32 GB |
| GPU | NVIDIA GTX 1650 Super — 4 GB VRAM |
| Mode | Partial GPU offload (~28–30 layers on GPU, rest on CPU) |
| Model | Qwen2.5-Coder-7B-Instruct Q4_K_M (~4.3 GB) |
| Expected speed | ~20–28 tok/s |
| Context | 32K tokens |
| Status | 🔄 Setup in progress |
The GTX 1650 Super has 4 GB of VRAM — technically not enough to fit the full model. llama.cpp handles this gracefully: as many layers as possible go to the GPU, and the rest fall back to system RAM. The result is noticeably faster than CPU-only inference, on a GPU that cost under $150 used.
The project that inspired this experiment (Octopus Invaders by @sudoingX) used an RTX 3060 with 12 GB of VRAM — already considered a modest setup. We're going further down. If a Ryzen 5 laptop with no GPU can run an AI coding agent, the barrier to entry for local AI is lower than almost anyone thinks.
local-agent-tetris/
│
├── README.md
├── agent.py # Our custom agent harness (see why below)
├── docs/
│ └── screenshots/
│ └── hermes_agent1.png # Hermes Agent running on the VivoBook
│
├── setup/
├── prompts/
│ ├── phase-0-init.md
│ ├── phase-1-bugfix.md
│ └── phase-2-polish.md
│
├── tetris/ # Code generated by the agent
│ ├── index.html
│ ├── game.js
│ ├── pieces.js
│ ├── audio.js
│ ├── ui.js
│ └── style.css
│
└── iterations/
├── v1-blank-screen/
├── v2-working/
└── v3-polished/
| Component | Tool | Role |
|---|---|---|
| Model server | llama.cpp b8664 | Run the LLM locally with performance flags |
| Model | Qwen2.5-Coder-7B-Instruct Q4_K_M | 7.62B parameters, quantized to 4.36 GiB |
| Agent harness | agent.py (custom, ~80 lines) | Tool loop with real-time debug output |
Open PowerShell and run:
Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores
Get-WmiObject Win32_ComputerSystem | Select-Object @{N='RAM_GB';E={[math]::Round($_.TotalPhysicalMemory/1GB,1)}}
Get-WmiObject Win32_VideoController | Select-Object Name, AdapterRAM

Divide AdapterRAM by 1073741824 to get GB.
| VRAM | Mode | Recommended model | File size | Est. speed |
|---|---|---|---|---|
| No GPU — CPU only | CPU inference | Qwen2.5-Coder-7B Q4_K_M | 4.3 GB | 3–8 tok/s |
| 4 GB (GTX 1650…) | Partial offload | Qwen2.5-Coder-7B Q4_K_M | 4.3 GB | 20–28 tok/s |
| 8 GB | Full GPU | Qwen2.5-Coder-7B Q4_K_M | 4.3 GB | 50–55 tok/s |
| 12 GB | Full GPU | Qwen2.5-Coder-14B Q4_K_M | 9.0 GB | ~30 tok/s |
| 24 GB | Full GPU | Qwen2.5-Coder-32B Q4_K_M | 19.9 GB | ~25 tok/s |
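If you want to sanity-check a model file against your card before downloading, the fit logic is easy to script. A minimal sketch in Python — `pick_mode` and the ~1.5 GB overhead figure for KV cache and buffers are our assumptions, not measurements:

```python
# Rough heuristic for choosing an inference mode from available VRAM.
# The 1.5 GB overhead estimate (KV cache + buffers) is an assumption;
# real usage depends on context size and KV-cache quantization.
def pick_mode(vram_gb: float, model_file_gb: float, overhead_gb: float = 1.5) -> str:
    if vram_gb <= 0:
        return "CPU-only inference (-ngl 0)"
    if vram_gb >= model_file_gb + overhead_gb:
        return "full GPU offload (-ngl 99)"
    return "partial offload (-ngl 99, remaining layers fall back to RAM)"

print(pick_mode(4, 4.3))   # GTX 1650 Super -> partial offload
print(pick_mode(8, 4.3))   # 8 GB card     -> full GPU offload
```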
- Go to github.com/ggerganov/llama.cpp/releases
- Download `llama-bXXXX-bin-win-cpu-x64.zip`
- Extract to `C:\llama.cpp\` and add it to PATH
Same steps, but download `llama-bXXXX-bin-win-cuda-cu12.x.x-x64.zip`.
Requires CUDA 12.x from developer.nvidia.com/cuda-downloads.
llama-server --version
# Expected: version: 8664 (or newer)

`huggingface-cli` is often missing from PATH on Windows. Use Python directly:
mkdir C:\models
python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
repo_id='bartowski/Qwen2.5-Coder-7B-Instruct-GGUF',
filename='Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf',
local_dir='C:/models/'
)
"4.3 GB. No HuggingFace account needed — bartowski's repos are always public.
Why bartowski? The official Qwen GGUF repos sometimes require authentication. Bartowski's builds are public, well-maintained, and use importance-matrix quantization for better quality at the same size.
For the VivoBook (CPU-only):

llama-server `
-m C:\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf `
-ngl 0 `
-c 16384 `
-np 1 `
-t 10 `
--host 0.0.0.0 `
--port 8080

For the desktop (GPU, partial offload):

llama-server `
-m C:\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf `
-ngl 99 `
-c 32768 `
-np 1 `
-fa on `
--cache-type-k q4_0 `
--cache-type-v q4_0 `
--host 0.0.0.0 `
--port 8080

| Flag | Effect |
|---|---|
| `-ngl 99` | Load all layers to GPU (automatic partial offload if VRAM is insufficient) |
| `-ngl 0` | CPU-only, no GPU |
| `-t 10` | 10 threads — the sweet spot for the Ryzen 5 5500U |
| `-c 16384` | 16K context window |
| `-np 1` | Single slot — saves ~190 MB of VRAM |
| `-fa on` | Flash attention — constant speed as context grows |
| `--cache-type-k/v q4_0` | Quantize the KV cache — essential for large context on small VRAM |
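Before pointing an agent at the server, confirm it's actually up and the model has finished loading. A minimal check in Python against llama-server's `/health` endpoint, assuming the default host and port from the commands above:

```python
# Poll llama-server's /health endpoint until the model is loaded.
# Assumes the default port 8080 from the launch commands above.
import time

import requests

while True:
    try:
        status = requests.get("http://localhost:8080/health", timeout=5).json()
        print(status)  # {"status": "ok"} once the model is ready
        if status.get("status") == "ok":
            break
    except requests.exceptions.ConnectionError:
        print("server not up yet...")
    time.sleep(2)
```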
Always plug in your charger and set Windows to Performance mode. On the VivoBook this alone pushed generation from 4.09 → 4.81 tok/s (+17%).
Model: Qwen2.5-Coder-7B-Instruct Q4_K_M
Machine: Asus VivoBook 15 — Ryzen 5 5500U, 20 GB RAM
Power mode: Performance (charger plugged in)
Threads: 10 / 12
prompt eval: 18.30 tok/s
generation: 4.81 tok/s ← our bottleneck
RAM used: ~7.4 GB
At 4.81 tok/s, a 500-token response takes ~1m44s. Slow by cloud standards. Completely functional for agent work.
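You can reproduce a rough tok/s figure without digging through llama.cpp's log output by timing a request yourself. A quick sketch — wall-clock based, so it includes prompt processing and will read slightly lower than llama.cpp's own timings:

```python
# Crude generation-speed check via the OpenAI-compatible endpoint.
import time

import requests

start = time.time()
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Count from 1 to 50."}],
          "max_tokens": 200},
    timeout=600,
)
generated = resp.json()["usage"]["completion_tokens"]
print(f"{generated / (time.time() - start):.2f} tok/s")
```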
The project that inspired this experiment used Hermes Agent by NousResearch as the tool harness. We tried it. Here's what happened and why we moved on.
git clone https://github.com/NousResearch/hermes-agent
cd hermes-agent
pip install -r requirements.txt
copy .env.example .env
Add-Content .env "OPENROUTER_API_KEY=sk-or-placeholder"
python cli.py `
--model "qwen2.5-coder-7b" `
--base_url "http://localhost:8080/v1" `
--api_key "sk-placeholder" `
--toolsets "terminal"
Hermes Agent v0.7.0 launched successfully on the VivoBook.
Hermes launched fine and the model correctly generated tool call JSON. But the tools never actually executed. The session summary always showed 0 tool calls despite the model producing valid-looking JSON.
The root cause is a tool call format mismatch:
- Hermes expects tool calls in a specific parsed format that depends on the model family
- Qwen2.5-Coder-7B generates tool calls wrapped in `<tools>` tags (its native format)
- Without the `--jinja` flag on llama-server, the chat template isn't applied correctly and the model improvises its own format
- Hermes's Qwen parser is calibrated for the newer Qwen3 models, not Qwen2.5
Additionally, Hermes loads ~4K tokens of system prompt just for tool definitions. On a 16K context budget, that's 25% gone before the first user message. On a small model doing complex code generation, every token counts.
The author of the original experiment, @sudoingX, used Qwen 3.5 9B (a newer model with better instruction following) on a 12 GB GPU — more VRAM means a larger context, which leaves headroom for Hermes's system prompt. The combination of a more capable model and more resources made the tool call parsing reliable.
Rather than debugging Hermes internals, we wrote a ~80-line Python agent that:
- Calls llama-server directly via the OpenAI-compatible API
- Parses tool calls using a simple `<tool>{...}</tool>` format defined in the system prompt
- Executes Windows commands natively via `subprocess`
- Shows real-time debug output for every turn
Validation test result:
Turn 1: agent ran "dir C:\local-agent-tetris" → got file listing ✅
Turn 2: agent created test.txt with correct content ✅
Turn 3: agent recognized task was complete, stopped ✅
Tool calls: 2 / 2 executed successfully
This is the agent we use for the Tetris generation.
Save agent.py at the root of the project. The full source is in the repo.
Key design decisions (a minimal sketch of the loop follows below):
- The system prompt tells the model to use a `<tool>{"command": "..."}` format — simple and unambiguous
- Windows-native commands (`dir`, `echo >>`, `mkdir`) — no Linux assumptions
- Real-time debug output: tok/s, context size, tool outputs, file detection
- `max_turns=40` to give the model enough room for a multi-file project
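Here is that sketch — an illustration of the pattern, not the repo's actual agent.py. `API_URL`, `SYSTEM_PROMPT`, and `run_tool` are simplified stand-ins, and the only external dependency is `requests`:

```python
# Minimal agent-loop sketch (illustrative, not the repo's agent.py).
# Assumes llama-server is running on localhost:8080.
import json
import re
import subprocess

import requests

API_URL = "http://localhost:8080/v1/chat/completions"
SYSTEM_PROMPT = (
    'You are a coding agent on Windows. To run a shell command, answer with '
    '<tool>{"command": "..."}</tool> and nothing else. '
    'When the task is complete, answer DONE.'
)

def run_tool(command: str) -> str:
    # Execute a Windows shell command and return combined output,
    # truncated so tool results don't eat the 16K-token context budget.
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=120)
    return (result.stdout + result.stderr)[:4000]

def run_agent(task: str, max_turns: int = 40) -> None:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": task}]
    for turn in range(max_turns):
        resp = requests.post(API_URL, json={"messages": messages}, timeout=600)
        text = resp.json()["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": text})
        match = re.search(r'<tool>\s*(\{.*?\})\s*</tool>', text, re.DOTALL)
        if not match:
            # No tool call: the model considers the task finished.
            print(f"[turn {turn}] agent stopped: {text[:200]}")
            return
        command = json.loads(match.group(1))["command"]
        print(f"[turn {turn}] running: {command}")
        messages.append({"role": "user",
                         "content": f"Tool output:\n{run_tool(command)}"})
```

Truncating tool output (here to 4,000 characters) is what keeps a 16K-token budget workable: a single verbose command can otherwise flood the context and crowd out the actual task.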
Launch:
cd C:\local-agent-tetris
python agent.py

The task sent to the agent:
Build a Tetris clone as a self-contained web project.
Create all files in C:\local-agent-tetris\tetris\
Files to create:
- index.html (game entry point, loads all scripts)
- game.js (game loop, collision detection, line clearing, scoring)
- pieces.js (all 7 tetrominoes: I, O, T, S, Z, J, L with colors)
- audio.js (sound effects using Web Audio API only, no external files)
- ui.js (score, level, next piece preview, start/pause screen)
- style.css (dark theme, pixel font, centered layout)
Gameplay:
- All 7 tetrominoes with distinct colors
- Speed increases every 10 lines cleared
- Ghost piece showing drop position
- Score: 100/300/500/800 for 1/2/3/4 lines
- High score in localStorage
- Pause with P key
Audio using Web Audio API only (no external files):
- Click sound on move/rotate
- Rising tone on line clear
- Game over jingle
Phase 0 — Generate and test
- The agent creates the files in `tetris/`
- Open `index.html` in a browser
- Open the console (F12), note every error
Phase 1 — Fix with precision
The game has the following issues. Fix all of them:
1. [EXACT console error]
2. [OBSERVED vs EXPECTED behavior]
Do not rewrite files that work. Only fix what is broken.
Phase 2 — Polish
The game works. Now polish it:
- Flash animation on line clear
- Ghost piece opacity 20%
- Monospace font for score
- "READY?" countdown before start
| Action | Agent | You |
|---|---|---|
| Design the file architecture | ✅ | |
| Write the game loop | ✅ | |
| Implement Web Audio API | ✅ | |
| Detect its own variable scope bugs | ✅ | |
| Know what to test in the browser | | ✅ |
| Read console errors and rephrase them | | ✅ |
| Decide when the result is good enough | | ✅ |
| Machine | Model | Mode | Speed | Context | Status |
|---|---|---|---|---|---|
| Asus VivoBook 15 — Ryzen 5 5500U, 20 GB RAM | Qwen2.5-Coder-7B Q4_K_M | CPU only | 4.81 tok/s | 16K | ✅ Agent running |
| Desktop — Ryzen 5 1600 AF, GTX 1650 Super 4 GB | Qwen2.5-Coder-7B Q4_K_M | Partial GPU offload | TBD | 32K | 🔄 In progress |
This README documents the setup. The full experiment — iterations, failures, discoveries, and final results — is covered in detail on the blog: blog.nikodindon.dpdns.org
- llama.cpp — local inference engine
- Hermes Agent — NousResearch — tool harness (tested, see notes above)
- Qwen2.5-Coder GGUF — bartowski — the model
- Octopus Invaders — @sudoingX — the experiment that inspired this project
MIT — do what you want with it, document what you find, share what works.