Note: If you're looking for a hosted, plug-and-play version of HALO, please sign up for inference.net.
HALO (Hierarchical Agent Loop Optimization) is a methodology for building recursively self-improving agent harnesses using RLMs. This repository contains:
- Information on the HALO methodology.
- A Python package that implements the core HALO-RLM engine. View on PyPI.
- A demo project that shows how to build HALO loops for your agents using the Python package. View demo.
- Benchmarking examples that apply HALO to popular agent benchmarks. View AppWorld.
The core HALO loop is surprisingly simple:
- Collect execution traces from your agent harness. HALO uses OpenTelemetry-compatible tracing.
- Feed the traces into the HALO-RLM engine.
- The engine decomposes the traces to understand common failure modes across harness executions and produces a report with its findings.
- This report is fed into a coding agent like Cursor or Claude Code to generate and apply a set of changes to your harness.
- The harness is then re-deployed, more traces are gathered, and the cycle repeats.
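The steps above can be sketched as a plain loop. This is a minimal illustration, not the HALO API: LoopSteps, run_halo_loop, and the four callables are hypothetical names introduced here, and each callable stands in for whatever your own setup does (running the harness, invoking the engine, handing the report to a coding agent, redeploying).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopSteps:
    """Hypothetical hooks for one HALO cycle; all four are supplied by the caller."""
    collect_traces: Callable[[], str]      # runs the harness, returns a JSONL trace path
    analyze: Callable[[str], str]          # trace path -> findings report (e.g. the HALO engine)
    apply_changes: Callable[[str], None]   # report -> harness edits (e.g. via a coding agent)
    redeploy: Callable[[], None]           # ships the updated harness


def run_halo_loop(steps: LoopSteps, iterations: int) -> list[str]:
    """Run the collect -> analyze -> apply -> redeploy cycle a fixed number of times."""
    reports = []
    for _ in range(iterations):
        trace_path = steps.collect_traces()
        report = steps.analyze(trace_path)
        steps.apply_changes(report)
        steps.redeploy()
        reports.append(report)
    return reports
```

Keeping the four steps as injected callables makes it easy to swap in a manual review step before changes are applied.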
HALO is well suited to finding issues in production agent deployments. We find that high-traffic environments tend to generate more data with higher variance across executions, which produces the kind of systemic issues that HALO is designed to identify.
A general-purpose harness like Claude Code is the wrong tool for trace analysis. This isn’t because the model isn’t smart, but because traces can get extremely long, and you need a specialized toolkit to make observations about systemic agent behavior. In our testing, harnesses like Claude Code would often overfit to an error present in one or a few traces rather than generalize to harness-level problems. This led us to create a specialized form of an RLM.
Install the HALO engine + CLI from PyPI:

    pip install halo-engine
    # Verify installation
    halo --help

- Integrate Tracing
- Collect traces by running your agent
- Run the HALO engine

    export OPENAI_API_KEY=...
    # Optional: point HALO at another OpenAI-compatible provider.
    export OPENAI_BASE_URL=https://openrouter.ai/api/v1
    halo path_to_your_traces.jsonl -p "Diagnose errors you find and suggest fixes"

HALO uses the canonical OpenAI env vars: OPENAI_API_KEY for credentials and OPENAI_BASE_URL for OpenAI-compatible providers. If OPENAI_BASE_URL is unset, HALO uses https://api.openai.com/v1. Run halo --help to see all CLI options. The CLI mirrors the model/provider settings exposed by the Python SDK's ModelConfig and ModelProviderConfig.
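The fallback order described above (explicit flag, then environment variable, then the OpenAI default) can be sketched as follows. This is an illustration only, not HALO's implementation: resolve_provider_settings and its parameters are hypothetical names, and treating an empty string the same as an unset variable is an assumption.

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "https://api.openai.com/v1"


def resolve_provider_settings(
    cli_api_key: Optional[str] = None,
    cli_base_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Hypothetical helper: flag value first, then env var, then the default URL."""
    env = os.environ if env is None else env
    return {
        # Assumption: an empty string is treated the same as "unset".
        "api_key": cli_api_key or env.get("OPENAI_API_KEY"),
        "base_url": cli_base_url or env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL,
    }
```

Passing env explicitly, rather than always reading os.environ, keeps the helper easy to test.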
| Flag | Default | Description |
|---|---|---|
| TRACE_PATH | required | JSONL trace file |
| --prompt, -p | required | User prompt sent to the root agent |
| --model, -m | gpt-5.4-mini | Model name for root, subagent, synthesis, and compaction calls |
| --max-depth | 2 | Max subagent recursion depth |
| --max-turns | 20 | Max turns per agent |
| --max-parallel | 10 | Max concurrent subagents |
| --base-url | OPENAI_BASE_URL / https://api.openai.com/v1 | OpenAI-compatible API base URL |
| --api-key | OPENAI_API_KEY | Provider API key |
| --header, -H | unset | Provider header as NAME: VALUE. Repeat for multiple headers, matching curl's -H convention |
| --temperature | provider default | Sampling temperature forwarded to the model |
| --max-output-tokens | provider default | Maximum output tokens forwarded to the model |
| --parallel-tool-calls / --no-parallel-tool-calls | enabled | Allow models to issue parallel tool calls |
| --refusal-retries | 0 | Retry an agent model request this many times when the model refuses |
| --reasoning-effort | model/provider default | Reasoning effort for root, subagent, and synthesis calls. Compaction never uses reasoning |
| --telemetry | off | Emit OpenInference traces of HALO's own LLM, tool, and agent activity |
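The NAME: VALUE convention used by --header can be illustrated with a small parser. This is a sketch of the curl-style format only, not HALO's actual parsing code: parse_header and parse_headers are hypothetical names, and splitting on the first colon (so that URL values keep their own colons) is an assumption.

```python
def parse_header(raw: str) -> tuple[str, str]:
    """Split a curl-style 'NAME: VALUE' string on the first colon."""
    name, sep, value = raw.partition(":")
    if not sep or not name.strip():
        raise ValueError(f"expected 'NAME: VALUE', got {raw!r}")
    return name.strip(), value.strip()


def parse_headers(raw_headers: list[str]) -> dict[str, str]:
    """Combine repeated header flags into one mapping; later values win."""
    return dict(parse_header(raw) for raw in raw_headers)
```

Splitting on the first colon only matters for values such as URLs, where a naive split would cut the value short.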
For example, to send requests through OpenRouter with an extra provider header:

    halo path_to_your_traces.jsonl \
      -p "Diagnose errors you find and suggest fixes" \
      --base-url https://openrouter.ai/api/v1 \
      -H "HTTP-Referer: https://example.com"

HALO can emit OpenInference-shaped traces of its own LLM, tool, and agent activity. Telemetry is off by default; nothing is emitted unless you pass --telemetry.

    halo TRACE_PATH --prompt "..." --telemetry

When telemetry is enabled, setting CATALYST_OTLP_TOKEN uploads spans to inference.net Catalyst over OTLP. If it is unset, spans are written to a local JSONL file at ./halo-telemetry-{run_id}.jsonl in the current working directory.
| Var | Default | Purpose |
|---|---|---|
| CATALYST_OTLP_TOKEN | unset | If set, uploads to Catalyst over OTLP. If unset, writes JSONL locally |
| CATALYST_OTLP_ENDPOINT | catalyst-tracing default | OTLP endpoint base URL, for example https://telemetry.inference.net |
| CATALYST_DEBUG | unset | Set to 1 to surface OTLP export errors at WARNING level |
| CATALYST_TRACING_RUN_ID | unset | Uses this HALO run id instead of a generated uuid |
| CATALYST_TRACING_* | unset | Generic catalyst-tracing passthrough |
| HALO_TELEMETRY_PATH | ./halo-telemetry-{run_id}.jsonl | Local fallback file path. Only used when CATALYST_OTLP_TOKEN is unset |
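The way these variables interact can be summarized in a short decision function. This is a sketch of the documented behavior under stated assumptions, not HALO's code: telemetry_destination is a hypothetical name, the returned tuples are an illustrative shape, and whether HALO_TELEMETRY_PATH supports a {run_id} placeholder is not specified, so the override is used verbatim here.

```python
import uuid
from typing import Mapping, Optional, Tuple


def telemetry_destination(env: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: decide where spans go when --telemetry is passed."""
    # Use the provided run id if present, otherwise generate a uuid.
    run_id = env.get("CATALYST_TRACING_RUN_ID") or str(uuid.uuid4())
    if env.get("CATALYST_OTLP_TOKEN"):
        # Token set: upload over OTLP. None means "use the catalyst-tracing default endpoint".
        return ("otlp", env.get("CATALYST_OTLP_ENDPOINT"))
    # Token unset: fall back to a local JSONL file.
    path = env.get("HALO_TELEMETRY_PATH") or f"./halo-telemetry-{run_id}.jsonl"
    return ("jsonl", path)
```

Note that HALO_TELEMETRY_PATH has no effect once a token is set, which matches the "only used when CATALYST_OTLP_TOKEN is unset" row above.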
We have provided a simple demo and an AppWorld demo.
The engine exposes six entry points from engine.main: three async functions and a sync counterpart for each. Use whichever matches the trade-off you want between observability and code simplicity. The yielded types (AgentOutputItem and AgentTextDelta) are defined in engine/models/engine_output.py:
| Function | Sync / async | Returns | When to use |
|---|---|---|---|
| stream_engine_async | async | AsyncIterator[AgentOutputItem \| AgentTextDelta] | You want every event including streaming-token deltas (live UI, custom rendering). |
| stream_engine_output_async | async | AsyncIterator[AgentOutputItem] | You want to log / persist each completed step (assistant message, tool call, tool result) as it lands. |
| run_engine_async | async | list[AgentOutputItem] | You want the final list at the end and don't care about per-step observability. |
| stream_engine | sync | Iterator[AgentOutputItem \| AgentTextDelta] | Sync generator; yields every event including deltas. Drives the async iterator on a private event loop. |
| stream_engine_output | sync | Iterator[AgentOutputItem] | Sync generator; yields completed items only. Same shape as the async variant for sync callers. |
| run_engine | sync | list[AgentOutputItem] | Sync, collects to a list. Pure convenience over asyncio.run(run_engine_async(...)). |
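The phrase "drives the async iterator on a private event loop" refers to a general asyncio pattern, sketched below. This is not the engine's source: iterate_sync and fake_engine_stream are hypothetical names, and the fake stream yields plain strings in place of AgentOutputItem objects.

```python
import asyncio
from typing import AsyncIterator, Iterator, TypeVar

T = TypeVar("T")


def iterate_sync(async_iter: AsyncIterator[T]) -> Iterator[T]:
    """Expose an async iterator as a sync generator using a dedicated event loop."""
    loop = asyncio.new_event_loop()
    try:
        while True:
            try:
                # Pull one item at a time so the caller sees each event as it lands.
                yield loop.run_until_complete(async_iter.__anext__())
            except StopAsyncIteration:
                break
    finally:
        # Clean up any async generators still open on this loop, then close it.
        loop.run_until_complete(loop.shutdown_asyncgens())
        loop.close()


async def fake_engine_stream() -> AsyncIterator[str]:
    """Stand-in for an engine stream; yields labels instead of real output items."""
    for step in ("assistant message", "tool call", "tool result"):
        await asyncio.sleep(0)
        yield step


steps = list(iterate_sync(fake_engine_stream()))
```

A wrapper like this cannot be called from code that is already running inside an event loop, which is one reason to prefer the async variants in async applications.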

    import logging
    from engine.main import stream_engine_output_async

    logger = logging.getLogger(__name__)

    async def log_steps(messages, cfg, trace_path):
        async for item in stream_engine_output_async(messages, cfg, trace_path):
            logger.info("step", extra={"sequence": item.sequence, "agent": item.agent_name})
            # item.item is an AgentMessage (assistant / tool / etc.)

HALO consistently drives improvements on benchmarks solely by optimizing the harness.
We applied HALO to the AppWorld benchmark, a set of agentic tasks that assess the LLM’s ability to use multi-app services like Spotify, Venmo, file systems, and phone contacts. We tested HALO’s ability to improve harnesses for both Gemini 3 Flash and Sonnet 4.6. We iterated on the harness using the dev split, and then used the test_normal split as a proxy to verify that improvements did not come from overfitting.
The feedback from the HALO engine surfaced failures in the harnesses such as hallucinated tool calls, redundant arguments in tools, refusal loops, and semantic correctness issues. Each issue mapped cleanly to a direct prompt edit. We independently verified HALO’s claims against the source trace files, and the findings held up under scrutiny.
The peak improvements over baseline were substantial for both models. For Gemini 3 Flash, dev SGC (Scenario Goal Completion) went from 36.8% to 52.6% (+15.8 points) and test_normal SGC went from 37.5% to 48.2% (+10.7 points). For Sonnet 4.6, dev SGC went from 73.7% to 89.5% (+15.8 points) and test_normal SGC went from 62.5% to 73.2% (+10.7 points).

Local development against this repo uses uv for dependency management and go-task as the task runner.

    git clone https://github.com/context-labs/HALO
    cd HALO
    task env:setup

task env:setup installs uv (if missing), syncs the venv from uv.lock, and configures the repo's git hooks. After that, the halo CLI is available via uv run halo ... (or directly, after activating .venv/).
Run task --list for the full list. The ones you'll use most:
| Task | What it does |
|---|---|
| task check | Run all pre-commit checks: pinned-versions, lint, format, typecheck, unit tests |
| task check:fix | Same, but auto-fix lint/format issues |
| task test:unit | Unit tests under tests/unit/ |
| task test:integration | Integration tests under tests/integration/ |
Contributions are welcome! Please feel free to submit a pull request.

