
🤖 local-agent-tetris

A local AI agent that builds a Tetris clone from scratch — no cloud, no API fees, running on hardware you already own.

You don't need a powerful machine to get started with local AI.

Most guides assume you have a high-end GPU. This project proves otherwise. The experiment documented here runs on two everyday machines — a laptop and a modest desktop — both far below what most people consider the "minimum" for local AI. No RTX 3090. No $500 GPU. Just real hardware, real results, and every step documented so you can reproduce it.

If your machine can run Windows 11, there's a good chance it can run a local AI agent. The question is not whether it's possible. It's how far you can push it.

📖 Follow the full experiment — iterations, results, and discoveries — on the blog: blog.nikodindon.dpdns.org


🎯 What we're building

A fully local AI agent system capable of:

  • Reading and writing files on disk
  • Executing terminal commands and reading their output
  • Iterating on its own output without human intervention
  • Generating a complete game (Tetris with sound) from a single prompt

Zero tokens sent to an external server. The model runs on your machine. The agent acts on your filesystem. You own the entire stack.


🖥️ The two test machines

This project documents the same experiment running in parallel on two very different setups. The goal: find the real floor for local AI agent work in 2026.

Machine 1 — The Laptop (CPU only)

Device:             Asus VivoBook 15
CPU:                AMD Ryzen 5 5500U (6 cores / 12 threads)
RAM:                20 GB
GPU:                Integrated graphics only (no CUDA, no Vulkan)
Mode:               CPU-only inference
Model:              Qwen2.5-Coder-7B-Instruct Q4_K_M (~4.3 GB)
Measured speed:     4.81 tok/s generation (performance mode, charger plugged in)
RAM usage:          ~7.4 GB (4460 MB model + 2976 MB repack buffer)
Context:            16K tokens
llama.cpp threads:  10 / 12

💡 Important: Always plug in your charger and switch Windows to Performance mode before running inference. Battery saver mode cost ~15% speed in our tests (4.09 → 4.81 tok/s).

Machine 2 — The Desktop (GPU partial offload)

Device:          Custom desktop
CPU:             AMD Ryzen 5 1600 AF
RAM:             32 GB
GPU:             NVIDIA GTX 1650 Super (4 GB VRAM)
Mode:            Partial GPU offload (~28–30 layers on GPU, rest on CPU)
Model:           Qwen2.5-Coder-7B-Instruct Q4_K_M (~4.3 GB)
Expected speed:  ~20–28 tok/s
Context:         32K tokens
Status:          🔄 Setup in progress

The GTX 1650 Super has 4 GB of VRAM, technically not enough to fit the full model. llama.cpp handles this gracefully: as many layers as possible go to the GPU, and the rest fall back to RAM. The result is noticeably faster than CPU-only inference, on a GPU that cost under $150 used.

Why this matters

The project that inspired this experiment (Octopus Invaders by @sudoingX) used an RTX 3060 with 12 GB of VRAM — already considered a modest setup. We're going further down. If a Ryzen 5 laptop with no GPU can run an AI coding agent, the barrier to entry for local AI is lower than almost anyone thinks.


🗂️ Repository structure

local-agent-tetris/
│
├── README.md
├── agent.py                   # Our custom agent harness (see why below)
├── docs/
│   └── screenshots/
│       └── hermes_agent1.png  # Hermes Agent running on the VivoBook
│
├── setup/
├── prompts/
│   ├── phase-0-init.md
│   ├── phase-1-bugfix.md
│   └── phase-2-polish.md
│
├── tetris/                    # Code generated by the agent
│   ├── index.html
│   ├── game.js
│   ├── pieces.js
│   ├── audio.js
│   ├── ui.js
│   └── style.css
│
└── iterations/
    ├── v1-blank-screen/
    ├── v2-working/
    └── v3-polished/

⚙️ Tech stack

Component       Tool                               Role
Model server    llama.cpp b8664                    Runs the LLM locally with performance flags
Model           Qwen2.5-Coder-7B-Instruct Q4_K_M   7.62B parameters, quantized to 4.36 GiB
Agent harness   agent.py (custom, ~80 lines)       Tool loop with real-time debug output

🔍 Step 0 — Identify your hardware

Open PowerShell and run:

Get-WmiObject Win32_Processor | Select-Object Name, NumberOfCores
Get-WmiObject Win32_ComputerSystem | Select-Object @{N='RAM_GB';E={[math]::Round($_.TotalPhysicalMemory/1GB,1)}}
Get-WmiObject Win32_VideoController | Select-Object Name, AdapterRAM

Divide AdapterRAM by 1073741824 (2^30) to get GB; for example, 4294967296 / 1073741824 = 4 GB. Note that AdapterRAM is a 32-bit WMI field, so cards with more than 4 GB of VRAM report at most 4 GB.

Which model for which hardware?

VRAM                Mode             Recommended model          File size   Est. speed
No GPU (CPU only)   CPU inference    Qwen2.5-Coder-7B Q4_K_M    4.3 GB      3–8 tok/s
4 GB (GTX 1650…)    Partial offload  Qwen2.5-Coder-7B Q4_K_M    4.3 GB      20–28 tok/s
8 GB                Full GPU         Qwen2.5-Coder-7B Q4_K_M    4.3 GB      50–55 tok/s
12 GB               Full GPU         Qwen2.5-Coder-14B Q4_K_M   ~9.0 GB     25–35 tok/s
24 GB               Full GPU         Qwen2.5-Coder-32B Q4_K_M   ~19.9 GB    30–35 tok/s

🛠️ Step 1 — Install llama.cpp (Windows)

CPU-only machine

  1. Go to github.com/ggerganov/llama.cpp/releases
  2. Download llama-bXXXX-bin-win-cpu-x64.zip
  3. Extract to C:\llama.cpp\ and add to PATH

NVIDIA GPU machine

Same steps but download llama-bXXXX-bin-win-cuda-cu12.x.x-x64.zip. Requires CUDA 12.x from developer.nvidia.com/cuda-downloads.

llama-server --version
# Expected: version: 8664 (or newer)

📥 Step 2 — Download the model

huggingface-cli is often missing from PATH on Windows. Use Python directly:

mkdir C:\models

python -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    repo_id='bartowski/Qwen2.5-Coder-7B-Instruct-GGUF',
    filename='Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf',
    local_dir='C:/models/'
)
"

4.3 GB. No HuggingFace account needed — bartowski's repos are always public.
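
Optional sanity check once the download finishes (a minimal sketch using the path from the command above; expect roughly the 4.36 GiB listed in the tech stack table):

from pathlib import Path

p = Path("C:/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf")
print(f"{p.stat().st_size / 2**30:.2f} GiB")  # a truncated download shows up as a smaller file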

Why bartowski? The official Qwen GGUF repos sometimes require authentication. Bartowski's builds are public, well-maintained, and use importance-matrix quantization for better quality at the same size.


🚀 Step 3 — Launch llama-server

CPU-only (Laptop / VivoBook)

llama-server `
  -m C:\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf `
  -ngl 0 `
  -c 16384 `
  -np 1 `
  -t 10 `
  --host 0.0.0.0 `
  --port 8080

GPU partial offload (Desktop / GTX 1650 Super)

llama-server `
  -m C:\models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf `
  -ngl 99 `
  -c 32768 `
  -np 1 `
  -fa on `
  --cache-type-k q4_0 `
  --cache-type-v q4_0 `
  --host 0.0.0.0 `
  --port 8080
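
Before wiring up the agent, confirm the server answers. A minimal check against llama-server's OpenAI-compatible endpoint (assumes the host/port flags above and the requests package, pip install requests):

import requests

r = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Reply with one word: ready"}],
          "max_tokens": 8},
)
print(r.json()["choices"][0]["message"]["content"])  # any coherent reply means the stack works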

Flag reference

Flag                     Effect
-ngl 99                  Load all layers to GPU (automatic partial offload if VRAM is insufficient)
-ngl 0                   CPU-only, no GPU
-t 10                    10 threads, the sweet spot for the Ryzen 5 5500U
-c 16384                 16K context window
-np 1                    Single slot, saves ~190 MB of VRAM
-fa on                   Flash attention, keeps speed from collapsing as context grows
--cache-type-k/v q4_0    Quantize the KV cache, essential for large context on small VRAM
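
To see why the KV-cache flags matter on a 4 GB card, here is a rough size estimate. It assumes Qwen2.5-7B's published architecture (28 layers, GQA with 4 KV heads of dim 128) and treats q4_0 as ~4.5 bits per element; back-of-envelope numbers, not measurements:

# KV cache stores one K and one V vector per token, per layer
layers, kv_heads, head_dim, ctx = 28, 4, 128, 32768
elems = 2 * layers * kv_heads * head_dim * ctx
print(f"f16 : {elems * 2 / 2**30:.2f} GiB")        # ~1.75 GiB — nearly half the card
print(f"q4_0: {elems * 4.5 / 8 / 2**30:.2f} GiB")  # ~0.49 GiB — leaves room for layers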

⚡ Performance tip

Always plug in your charger and set Windows to Performance mode. On the VivoBook this alone pushed generation from 4.09 → 4.81 tok/s (+17%).


📊 Measured benchmark — Laptop (CPU only)

Model:        Qwen2.5-Coder-7B-Instruct Q4_K_M
Machine:      Asus VivoBook 15 — Ryzen 5 5500U, 20 GB RAM
Power mode:   Performance (charger plugged in)
Threads:      10 / 12

prompt eval:  18.30 tok/s
generation:    4.81 tok/s   ← our bottleneck
RAM used:     ~7.4 GB

At 4.81 tok/s, a 500-token response takes ~1m44s. Slow by cloud standards. Completely functional for agent work.


🧠 Step 4 — Why we didn't use Hermes Agent

The project that inspired this experiment used Hermes Agent by NousResearch as the tool harness. We tried it. Here's what happened and why we moved on.

What we tried

git clone https://github.com/NousResearch/hermes-agent
cd hermes-agent
pip install -r requirements.txt
copy .env.example .env
Add-Content .env "OPENROUTER_API_KEY=sk-or-placeholder"

python cli.py `
  --model "qwen2.5-coder-7b" `
  --base_url "http://localhost:8080/v1" `
  --api_key "sk-placeholder" `
  --toolsets "terminal"

Hermes Agent v0.7.0 launched successfully on the VivoBook (screenshot: docs/screenshots/hermes_agent1.png).

What went wrong

Hermes launched fine and the model correctly generated tool call JSON. But the tools never actually executed. The session summary always showed 0 tool calls despite the model producing valid-looking JSON.

The root cause is a tool call format mismatch:

  • Hermes expects tool calls in a specific parsed format depending on the model family
  • Qwen2.5-Coder-7B generates tool calls wrapped in <tools> tags (its native format)
  • Without the --jinja flag on llama-server, the chat template isn't applied correctly and the model improvises its own format
  • Hermes's Qwen parser is calibrated for newer Qwen3 models, not Qwen2.5

Additionally, Hermes loads ~4K tokens of system prompt just for tool definitions. On a 16K context budget, that's 25% gone before the first user message. On a small model doing complex code generation, every token counts.

Why @sudoingX succeeded with Hermes

He used Qwen 3.5 9B (a newer model with better instruction following) on a 12 GB GPU (more VRAM = larger context = more headroom for Hermes's system prompt). The combination of a more capable model and more resources made the tool call parsing reliable.

Our solution: a minimal custom agent

Rather than debugging Hermes internals, we wrote a ~80-line Python agent that:

  • Calls llama-server directly via the OpenAI-compatible API
  • Parses tool calls using a simple <tool>{...}</tool> format defined in the system prompt
  • Executes Windows commands natively via subprocess
  • Shows real-time debug output for every turn

Validation test result:

Turn 1: agent ran "dir C:\local-agent-tetris" → got file listing ✅
Turn 2: agent created test.txt with correct content ✅  
Turn 3: agent recognized task was complete, stopped ✅
Tool calls: 2 / 2 executed successfully

This is the agent we use for the Tetris generation.


🚀 Step 5 — The custom agent

Save agent.py at the root of the project. The full source is in the repo.

Key design decisions:

  • System prompt tells the model to use the <tool>{"command": "..."}</tool> format: simple and unambiguous
  • Windows-native commands (dir, echo >>, mkdir) — no Linux assumptions
  • Real-time debug output: tok/s, context size, tool outputs, file detection
  • max_turns=40 to give the model enough room for a multi-file project
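
Putting those decisions together, here is a minimal sketch of the loop. It is illustrative only: the prompt wording and variable names are not the exact ones in agent.py (see the repo for the full source):

import json, re, subprocess, requests

SYSTEM = ('You are a coding agent on Windows. To run a command, reply with '
          '<tool>{"command": "..."}</tool>. When the task is done, reply DONE.')
messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "Create test.txt containing hello."}]

for turn in range(40):  # max_turns=40: room for a multi-file project
    r = requests.post("http://localhost:8080/v1/chat/completions",
                      json={"messages": messages, "temperature": 0.2})
    reply = r.json()["choices"][0]["message"]["content"]
    print(f"[turn {turn}] {reply}")                      # debug output for every turn
    messages.append({"role": "assistant", "content": reply})
    m = re.search(r'<tool>\s*(\{.*?\})\s*</tool>', reply, re.DOTALL)
    if not m:
        break                                            # no tool call: model considers the task done
    cmd = json.loads(m.group(1))["command"]
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    messages.append({"role": "user",
                     "content": f"Tool output:\n{out.stdout}{out.stderr}"})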

Launch:

cd C:\local-agent-tetris
python agent.py

🎮 Step 6 — The Tetris generation prompt

The task sent to the agent:

Build a Tetris clone as a self-contained web project.
Create all files in C:\local-agent-tetris\tetris\

Files to create:
- index.html (game entry point, loads all scripts)
- game.js (game loop, collision detection, line clearing, scoring)
- pieces.js (all 7 tetrominoes: I, O, T, S, Z, J, L with colors)
- audio.js (sound effects using Web Audio API only, no external files)
- ui.js (score, level, next piece preview, start/pause screen)
- style.css (dark theme, pixel font, centered layout)

Gameplay:
- All 7 tetrominoes with distinct colors
- Speed increases every 10 lines cleared
- Ghost piece showing drop position
- Score: 100/300/500/800 for 1/2/3/4 lines
- High score in localStorage
- Pause with P key

Audio using Web Audio API only (no external files):
- Click sound on move/rotate
- Rising tone on line clear
- Game over jingle
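
For reference, the scoring and speed rules in that prompt reduce to something like this (a Python sketch of the spec, not the JavaScript the agent generates):

LINE_SCORES = {1: 100, 2: 300, 3: 500, 4: 800}  # points per simultaneous line clear

def score_for(lines_cleared: int) -> int:
    return LINE_SCORES.get(lines_cleared, 0)

def level_for(total_lines_cleared: int) -> int:
    return total_lines_cleared // 10  # speed steps up every 10 lines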

🔁 Step 7 — Iteration workflow

Phase 0 — Generate and test

  1. Agent creates files in tetris/
  2. Open index.html in browser
  3. Open console (F12), note every error

Phase 1 — Fix with precision

The game has the following issues. Fix all of them:
1. [EXACT console error]
2. [OBSERVED vs EXPECTED behavior]
Do not rewrite files that work. Only fix what is broken.

Phase 2 — Polish

The game works. Now polish it:
- Flash animation on line clear
- Ghost piece opacity 20%
- Monospace font for score
- "READY?" countdown before start

📊 What the agent does vs what you guide

Action                                   Agent   You
Design the file architecture             ✅
Write the game loop                      ✅
Implement Web Audio API                  ✅
Detect its own variable scope bugs       ❌      ✅
Know what to test in the browser                 ✅
Read console errors and rephrase them            ✅
Decide when the result is good enough            ✅

📈 Benchmark results

Machine                                          Model                     Mode                 Speed       Context  Status
Asus VivoBook 15 (Ryzen 5 5500U, 20 GB RAM)      Qwen2.5-Coder-7B Q4_K_M   CPU only             4.81 tok/s  16K      ✅ Agent running
Desktop (Ryzen 5 1600 AF, GTX 1650 Super 4 GB)   Qwen2.5-Coder-7B Q4_K_M   Partial GPU offload  TBD         32K      🔄 In progress

📖 Follow the experiment

This README documents the setup. The full experiment — iterations, failures, discoveries, and final results — is covered in detail on the blog:

👉 blog.nikodindon.dpdns.org


🔗 Resources

  • The full experiment blog: blog.nikodindon.dpdns.org
  • llama.cpp: github.com/ggerganov/llama.cpp
  • Hermes Agent: github.com/NousResearch/hermes-agent
  • Qwen2.5-Coder GGUF quantizations: huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF

📝 License

MIT — do what you want with it, document what you find, share what works.
