MIT
g023 (https://github.com/g023/harnessharvest/)
A self-learning, self-correcting, LLM-powered harness creation and management system with FAISS-powered RAG, sandboxed execution, and autonomous improvement modes. Powered by Ollama and offline models.
HarnessHarvester generates executable Python harnesses from natural language task descriptions, executes them in a sandboxed environment, reviews them with multi-faceted LLM judges, and repairs failures using branching strategies. It includes two autonomous modes: autolearn (a continuous discovery loop) and autoimprove (iterative enhancement of existing harnesses). It is designed as an offline-first harness/scaffolding builder: you keep the harness itself instead of just calling some remote API.
tl;dr: You give it a task -> it makes a Python "harness" to do the task -> it runs the harness with some safety measures attempted (but you never know, so it's probably safer to run it in a container) -> it reviews how well it did -> if it failed, it tries to fix itself -> successful harnesses get stored in a knowledge base for future generations to learn from -> final deliverables can be found in the ./HH/sandbox/ directory.
Usually the normal process is:
- You run "Write a function that implements binary search on a sorted array"
- Some frontend sends it off to some witchcraft in the backend (which most likely happens to be an agentic harness), and once it completes
- you get the deliverable, but you do not get the harness.
That's what this project is about. You get the harness (the agentic code), and the deliverable (the output of the agentic code).
Who knows what kind of potential one might have with different models and different tasks.
Try it out:

```bash
python main.py full "Create a text file analyzer that counts words, sentences, and paragraphs"
```

This runs the full pipeline (generate → execute → review → repair) in one command.
```
HH/
├── main.py # CLI entry point
├── config.json # System configuration
├── _new_ollama.py # Ollama API wrapper (provided)
├── core/ # Core infrastructure
│ ├── constants.py # All system constants
│ ├── config.py # Thread-safe configuration manager
│ ├── storage.py # Atomic file I/O with locking
│ ├── logging_setup.py # Rotating log setup
│ └── utils.py # LLM wrapper, code extraction, JSON parsing
├── rag/ # FAISS-powered RAG system
│ ├── embeddings.py # TF-IDF vectorizer + BM25 scorer
│ ├── engine.py # FAISS vector index + hybrid search engine
│ ├── ranking.py # Weighted fusion ranking with MMR diversity
│ ├── snippets.py # Code snippet store
│ ├── errors.py # Error pattern store
│ ├── prompts.py # Prompt template store
│ ├── management.py # Unified RAG manager
│ └── api.py # HTTP REST API for RAG (port 8420)
├── harness/ # Harness management
│ ├── models.py # Data models (metadata, versions, reviews, etc.)
│ ├── orchestrator.py # Cognitive workflow with stage-gated permissions
│ ├── generator.py # LLM-based code generation
│ ├── executor.py # Sandboxed subprocess execution
│ ├── reviewer.py # Multi-faceted LLM review system
│ ├── repairer.py # Self-repair with branching strategy
│ └── runtime.py # Agentic runtime base class
├── modes/ # Autonomous modes
│ ├── autolearn.py # Continuous discovery loop
│ └── autoimprove.py # Iterative harness enhancement
├── TESTS/ # Test suite (34 tests)
│ └── test_all.py
├── db/ # Flat-file database
│ ├── harnesses/ # Generated harness storage
│ ├── rag/ # RAG indexes and entries
│ │ ├── snippets/
│ │ ├── errors/
│ │ └── prompts/
│ ├── metrics/ # Performance metrics
│ └── autolearn/sessions/ # Autolearn session state
├── sandbox/ # Sandboxed execution directory
├── api/ # API package
├── logs/ # Log files
└── projects/ # Project deliverables
```
- Python 3.10+
- Ollama (running locally on port 11434)
Dependencies: `faiss-cpu`, `numpy`, `json-repair`, `requests`

```bash
# install dependencies:
pip install faiss-cpu numpy json-repair requests
```

Configure models in config.json. Defaults:
| Role | Model | Context |
|---|---|---|
| Judge | hf.co/g023/Qwen3-1.77B-g023-GGUF:Q8_0 | 40,000 |
| Reasoner | qwen3.5:2b | 240,000 |
| Generator | qwen3.5:4b-q4_K_M | 32,000 |
```bash
python main.py harvest "Write a function that implements binary search on a sorted array"
python main.py full "Create a text file analyzer that counts words, sentences, and paragraphs"
python main.py run <harness_id> --timeout 120
python main.py review <harness_id>
python main.py repair <harness_id>
```

Continuous autonomous discovery loop that generates tasks, creates harnesses, executes, reviews, and self-reflects:

```bash
python main.py autolearn --max-iterations 10
python main.py autolearn --max-duration 3600    # Run for 1 hour
python main.py autolearn --session <session_id> # Resume session
```

Analyze and improve an existing harness through iterative enhancement:
```bash
python main.py autoimprove <harness_id> --max-iterations 5
```

```bash
python main.py rag search "binary search algorithm"
python main.py rag stats
python main.py rag add --title "My Snippet" --file snippet.py
```

```bash
python main.py api --port 8420
```

REST endpoints:
- `GET /health` - Health check
- `GET /snippets/search?q=<query>` - Search snippets
- `POST /snippets` - Add snippet
- `GET /errors/search?q=<query>` - Search error patterns
- `POST /errors` - Add error pattern
- `GET /prompts/search?q=<query>` - Search prompts

```bash
python main.py list
python main.py list --status active --source autolearn
python main.py report <harness_id> --show-code
```

Combines TF-IDF cosine similarity, BM25 relevance, FAISS vector search, quality scores, usage frequency, and recency for optimal retrieval.
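The weighted fusion step can be sketched roughly like this (the weight names and values here are illustrative, not the actual `rag.ranking_weights` defaults):

```python
# Minimal sketch of weighted score fusion across retrieval signals.
# Each candidate carries per-signal scores normalized to [0, 1];
# missing signals simply contribute nothing.
def fuse_scores(candidates, weights=None):
    """candidates: doc_id -> {signal: score}; returns ranked (doc_id, score) pairs."""
    weights = weights or {"tfidf": 0.3, "bm25": 0.3, "vector": 0.2,
                          "quality": 0.1, "usage": 0.05, "recency": 0.05}
    fused = {
        doc_id: sum(weights.get(sig, 0.0) * score
                    for sig, score in signals.items())
        for doc_id, signals in candidates.items()
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

A candidate that scores moderately on several signals can outrank one that scores perfectly on a single signal, which is the point of the fusion.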
- Path restriction (files confined to sandbox directory)
- Configurable timeouts
- Dangerous operation detection (subprocess, network access)
- Environment variable sanitization
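Dangerous-operation detection amounts to scanning harness source for blocked patterns before running it. A minimal sketch, assuming a regex-based blocklist (the real pattern list is configured under `sandbox` in config.json; these patterns are examples):

```python
import re

# Example blocked patterns (assumptions, not the shipped list).
BLOCKED_PATTERNS = [
    r"\bsubprocess\b",   # spawning child processes
    r"\bsocket\b",       # raw network access
    r"\bos\.system\b",   # shelling out
    r"\beval\s*\(",      # dynamic code execution
]

def find_dangerous_ops(source: str) -> list[str]:
    """Return every blocked pattern that matches the harness source."""
    return [p for p in BLOCKED_PATTERNS if re.search(p, source)]
```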
Harnesses can call `checkpoint(state_dict, step_name)` during execution. Failed runs can resume from the last checkpoint.
When a harness fails, the repairer:
- Analyzes the failure using LLM reasoning
- Generates 2-3 repair strategies
- Creates a branch for each strategy
- Quick-scores each repair
- Promotes the best-scoring branch
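The branch-score-promote loop above can be sketched as follows, with the LLM-backed pieces stubbed out as callables (function names here are illustrative, not the repairer's actual API):

```python
# Sketch of the branching repair strategy: generate candidate fixes,
# quick-score each branch, promote the winner.
def repair(failed_code: str, generate_strategies, quick_score):
    """generate_strategies(code) -> list of candidate repaired sources;
    quick_score(code) -> float in [0, 1]. Returns the best branch or None."""
    branches = generate_strategies(failed_code)   # typically 2-3 candidates
    if not branches:
        return None
    scored = [(quick_score(code), code) for code in branches]
    best_score, best_code = max(scored, key=lambda pair: pair[0])
    return best_code if best_score > 0 else None
```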
Agentic harnesses follow mandatory stages with permission-gated tools:
| Stage | Permissions |
|---|---|
| Reflect | Read-only, LLM, RAG query |
| Brainstorm | Read-only, LLM, RAG query |
| Plan | Read-only, LLM, RAG query |
| Create Task List | Read-only, LLM, Read-write |
| Track Progress | Read-only, Read-write, LLM |
| Execute | All permissions |
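The stage gating from the table above boils down to a stage-to-permissions lookup. A minimal sketch (the stage keys and permission tokens are assumptions mirroring the table, not the orchestrator's actual names):

```python
# Illustrative stage -> permission map following the table above.
STAGE_PERMISSIONS = {
    "reflect":          {"read", "llm", "rag"},
    "brainstorm":       {"read", "llm", "rag"},
    "plan":             {"read", "llm", "rag"},
    "create_task_list": {"read", "llm", "write"},
    "track_progress":   {"read", "write", "llm"},
    "execute":          {"read", "write", "llm", "rag", "exec"},  # all permissions
}

def check_permission(stage: str, tool: str) -> None:
    """Raise if a tool is invoked outside its permitted stage."""
    allowed = STAGE_PERMISSIONS.get(stage, set())
    if tool not in allowed:
        raise PermissionError(f"tool '{tool}' not allowed in stage '{stage}'")
```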
Successful harnesses are automatically stored in RAG. Error patterns and fixes are captured. Future generations benefit from accumulated knowledge.
All settings in config.json:
- ollama: Host, models, context windows, option profiles (default/creative/precise/judge)
- paths: Database, sandbox, RAG directories
- sandbox: Allowed imports, blocked patterns, execution timeout
- harness: Max repair attempts, branching limits, auto-repair threshold
- rag: Embedding dimensions, ranking weights, BM25 parameters
- autolearn: Reflection interval, minimum score thresholds
- autoimprove: Iteration limits, improvement thresholds
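A trimmed config.json sketch showing the section layout (the section names match the list above; the individual keys and values are illustrative, not the full schema):

```json
{
  "ollama": {
    "host": "http://localhost:11434",
    "models": { "generator": "qwen3.5:4b-q4_K_M" }
  },
  "paths": { "sandbox": "./sandbox" },
  "sandbox": { "execution_timeout": 120 },
  "harness": { "max_repair_attempts": 3 },
  "rag": { "ranking_weights": { "tfidf": 0.3, "bm25": 0.3 } }
}
```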
```bash
cd HH
python -m pytest TESTS/test_all.py::test_core TESTS/test_all.py::test_rag TESTS/test_all.py::test_harness TESTS/test_all.py::test_cli -v
```

4 test suites covering: core infrastructure, RAG system (embeddings, FAISS, BM25, engine, stores, ranking), harness system (models, orchestrator, executor, sandbox, checkpoints), and CLI.
The full pipeline (generate → execute → review → repair) has been validated across 4 domains:
| Domain | Task | Score | Repair Needed |
|---|---|---|---|
| Game | Text-based number guessing game | 1.00 | No |
| Book | Short story generator (genre + chapters) | 0.50 | Yes (0.00 → 0.50) |
| Application | CLI todo list with JSON persistence | 1.00 | No |
| Website | Portfolio site generator (HTML/CSS/JS) | 0.90 | No |
Critical fixes:
- autoimprove.py: Fixed hardcoded version numbers (`"v1"`/`"v2"`) in `_improve_iteration()` — now captures `version_info` from `create_version()` and uses the actual version string
- engine.py (VectorIndex): Fixed `save()` writing `None` entries after `remove()` — now compacts before saving
- engine.py (VectorIndex): Initialized `_needs_rebuild` flag in `__init__` instead of relying on `hasattr()`
- _new_ollama.py: Fixed `llm_stream()` unconditionally overriding `num_ctx` with the global 40k constant — now respects per-model context windows passed via options
- executor.py: Fixed sandbox wrapper ordering (`os.makedirs` now runs before `os.chdir`)
- executor.py: Replaced fragile string-based duration calculation with direct `time.time()` delta
- repairer.py: Fixed `_select_repair_model()` referencing undefined config roles (`repair_syntax`, etc.) — now maps to existing `generator`/`reasoner` roles
- reviewer.py: Fixed `quick_score()` crash when LLM returns `None` content
- utils.py: Fixed file handle leak in `check_user_md()` (added `with` statement)
- config.py: Added missing config defaults for `sandbox`, `autolearn`, `autoimprove`, `metrics`, and `rag.ranking_weights` sections
All data uses flat-file JSON storage with:
- Atomic writes (temp file + rename) for crash safety
- `fcntl` file locking for concurrent access
- JSONL append for logs and metrics
- Content hashing (SHA-256) for deduplication
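The atomic-write-plus-lock pattern can be sketched like this (a minimal POSIX-only illustration; `storage.py` may differ in details):

```python
import fcntl
import json
import os
import tempfile

def atomic_write_json(path: str, data: dict) -> None:
    """Write JSON via temp file + rename so readers never see a partial file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            fcntl.flock(f, fcntl.LOCK_EX)  # guard against concurrent writers
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())           # durable before the rename
        os.replace(tmp, path)              # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)                 # don't leave temp debris behind
        raise
```

The lock protects concurrent writers of the same entry; the rename guarantees a crash mid-write leaves the old file intact rather than a truncated one.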