Skip to content

yunbow/ai-dev-os-benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AI Dev OS Benchmark

Same prompt, same model, different rules — measuring AI code quality impact

License: MIT

Does giving your AI coding rules actually improve output? This benchmark measures the impact of AI Dev OS guidelines on AI-generated code quality.

The Key Finding: Generate → Check → Fix

Architecture Score vs Baseline
No guidelines (baseline) ~84
Static 10 files in CLAUDE.md ~85-91 ≈ 0 (within variance)
Check+fix only (no CLAUDE.md) 95.2 +9.9
3 files in CLAUDE.md + check+fix 96.9 +12.5

Static guidelines alone don't improve total score. Post-generation verification is the real quality mechanism. The optimal architecture:

CLAUDE.md: 3 project-specific files (~8K tokens)
  → AI generates code
  → /ai-dev-os-check verifies against checklist
  → AI fixes all violations
  → Score: 96.9/100

The "Less is More" Principle — Confirmed 3 Times

Test What was added Result Details
001: Guideline Impact +28 guideline files (+73K tokens) Worse than baseline (−4.8 pts) Attention dilution
002: Before/After Examples +280 lines of code examples (+55%) No improvement Prose already sufficient
003: YAML Checklist +165 lines of YAML/Quick Rules (+33%) No improvement Metadata adds no value

Specificity Fixes — 0% → 100% on All 4 Items

Items that failed in every run of Test 001 were fixed by making guidelines more specific (not longer):

Test Item Before (0% Pass) After (100% Pass) What changed
004 V9: Date range "Cross-field constraints" "MUST validate using .refine()" Named the method
005 N7: Handler naming No rule "MUST use handle + noun + verb" Added ❌/✅ examples
006 T5: Exhaustive check No rule "MUST use never in default case" Showed the exact pattern
007 P4: Dynamic import 1-line example only SHOULD + candidate list (> 50KB) Added decision criteria

Key insight: The AI already knows these patterns. What it needs is where and how to apply them — not what they are.

Test Results

001: Guideline Impact (3 runs × 3 conditions)

Condition Avg Score vs Baseline
A: No guidelines 84.1
B: All 28 files (~75K tokens) 79.3 −4.8
C: Curated 10 files (~24K tokens) 84.9 +0.8

→ Details & analysis

002: Before/After Examples (5 tasks × 2 conditions)

D (no examples) E (with examples)
Wins 3 0
Ties 2 2
Total Pass 18/18 17/18

→ Details & analysis

003: YAML Checklist Format (5 tasks × 2 conditions)

A (prose-only) C (YAML+QR)
Total Pass 18/18 17/18

→ Details & analysis

011: Dynamic Check Effect (generate → check → fix)

Condition Score Delta
baseline-only 85.3
baseline-then-check 95.2 +9.9

Post-generation check+fix produced the largest improvement of any test. → Details

012: Minimal Static + Check (the optimal architecture)

Condition Score Delta
check-only (0 files) 87.4
3 files + check 96.9 +9.5

96.9 — highest score ever. 3 project-specific files + check+fix is the optimal architecture. → Details

Reproduce

git clone https://github.com/yunbow/ai-dev-os-benchmark.git
cd ai-dev-os-benchmark

# Each test has its own assets and prompts
cat tests/001_guideline-impact/assets/prompts/curated-guidelines.md
Repository Structure
ai-dev-os-benchmark/
├── spec/
│   └── benchmark_design.md
├── tests/
│   ├── 001_guideline-impact/
│   │   ├── README.md                  # Summary (EN + JA)
│   │   ├── assets/                    # Prompts, guidelines, eval sheet
│   │   └── results/                   # 3 runs × 3 conditions + analysis.md
│   ├── 002_before-after-effect/
│   │   ├── README.md
│   │   ├── assets/                    # 10 prompts (D/E × 5 tasks)
│   │   └── results/                   # 5 tasks × 2 conditions + analysis.md
│   ├── 003_checklist-format/
│   │   ├── README.md
│   │   ├── assets/                    # 30 prompts (A/B/C × 10 tasks)
│   │   └── results/                   # 5 tasks × 2 conditions + analysis.md
│   └── 004_xxx/                       # Future tests...
└── LICENSE

Related

Repository Description
ai-dev-os Framework specification and theory
rules-typescript TypeScript / Next.js / Node.js guidelines
rules-python Python / FastAPI guidelines
plugin-claude-code Skills, Hooks, and Agents for Claude Code
plugin-kiro Steering Rules and Hooks for Kiro
plugin-cursor Cursor Rules (.mdc)
cli npx ai-dev-os init

License

MIT


Languages: English | 日本語

About

Benchmark: how AI coding guidelines affect code quality — 3 conditions × 9 dimensions × 100 points

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages