feat(runner): library-first batch rollout runner (agentix.runner) + CLI + example #60
Merged
…hin CLI

New uv-workspace member `plugins/runner` exposing `agentix.runner`:

- `run_rollouts(...)` / `rollout_one(...)` run an agent over a dataset of instances, each in its own sandbox; the agent phase and the scoring phase each get a fresh sandbox. Built only on the stable surface: `provider.session(config)` for sandboxes and `sandbox.remote(fn, ...)` for in-sandbox calls (+ `bundle`).
- Generic `Dataset` / `Agent` / `Provider` Protocols + `AgentResult` / `Rollout` dataclasses; the runner has no benchmark- or agent-specific logic. Per-instance failures surface as `Rollout.error` and never abort the batch; results return in input order, with concurrency bounded by `n_concurrent`.
- Thin `agentix-run` CLI (the library is the real interface; RL/eval calls `run_rollouts` directly) that resolves `module:attr` dataset/agent adapters and a provider backend.
- 9 unit tests via an in-process fake provider (no docker). ruff + pyright clean; wired into `[tool.uv.workspace]` members and `[tool.pyright]`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
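For readers unfamiliar with `module:attr` adapter specs, here is a minimal sketch of how such a string can be resolved with the standard library. The helper name `resolve_adapter` is hypothetical and is not taken from `agentix-run`; the real CLI may validate or instantiate adapters differently.

```python
import importlib
from typing import Any


def resolve_adapter(spec: str) -> Any:
    """Resolve a 'package.module:attr' string to the object it names.

    Hypothetical helper for illustration; not the agentix-run implementation.
    """
    module_name, sep, attr_path = spec.partition(":")
    if not sep or not module_name or not attr_path:
        raise ValueError(f"expected 'module:attr', got {spec!r}")
    obj: Any = importlib.import_module(module_name)
    # Allow dotted attribute paths such as "module:Outer.Inner".
    for part in attr_path.split("."):
        obj = getattr(obj, part)
    return obj
```

With this shape, a spec such as `os.path:join` resolves to the `os.path.join` function, and a spec without a colon fails early with a `ValueError` rather than an opaque import error.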
`examples/run-swe-rollouts` is the `eval-cc-swe` flow expressed through the reusable runner: a `SweDataset` + `ClaudeCodeAgent` adapter plus one `run_rollouts(...)` call replace the hand-written per-instance orchestration. `--ground-truth` swaps in a `GroundTruthAgent` that submits each row's gold patch, reusing the identical scoring path. Patch extraction goes through `agentix.bash.run`; the real provider call stays host-side via the bridge gateway. Additive (a new standalone example, not a workspace member); ruff-clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
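To illustrate why a ground-truth agent can reuse the scoring path unchanged, here is a rough sketch of an agent that returns the row's reference patch without doing any work in the sandbox. The `AgentResult.patch` field and the `"patch"` row key are assumptions made for this sketch; the example's actual `GroundTruthAgent` may differ.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class AgentResult:
    patch: str  # assumed field name; the real dataclass may carry more


class GroundTruthAgent:
    """Submits the row's gold patch instead of running a model (sketch)."""

    async def solve(
        self, sandbox: Any, inst: Mapping[str, Any], *, model: str
    ) -> AgentResult:
        # The sandbox and model are unused: the "solution" is already in the row.
        # The "patch" key is an assumption about the row layout.
        return AgentResult(patch=inst["patch"])
```

Because it satisfies the same `solve(sandbox, inst, *, model) -> AgentResult` shape as any other agent, the runner needs no special case for it.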
This was referenced May 29, 2026
Meirtz added a commit that referenced this pull request May 30, 2026
Documents the batch rollout runner merged in #60: `run_rollouts(...)`, the `Dataset`/`Agent` adapter Protocols, the `Rollout` result, and the `agentix-run` CLI, with a pointer to the `examples/run-swe-rollouts` worked example. Sits in the How-to group beside integrate-agent / integrate-dataset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What

New standalone uv-workspace member `plugins/runner` exposing `agentix.runner`: a generic, library-first batch rollout runner.

`run_rollouts(...)` runs an agent over a dataset of instances, each in its own sandbox: an agent phase (setup → solve), then a fresh scoring sandbox (so scoring starts from a clean task image). It returns one typed `Rollout` per instance, in input order, with concurrency bounded by `n_concurrent`. Per-instance failures are isolated as `Rollout.error` and never abort the batch.

Built only on the stable surface: `provider.session(config)` for sandboxes and `sandbox.remote(fn, ...)` for in-sandbox calls (+ `bundle`). It carries no benchmark- or agent-specific logic: datasets and agents plug in through two small Protocols.

- `Dataset`: `instances()`, `image(inst)`, `setup(sandbox, inst) -> bool`, `score(sandbox, inst, patch) -> dict`
- `Agent`: `solve(sandbox, inst, *, model) -> AgentResult`
- `Provider`: anything with `session(config) -> async context manager` (i.e. a `SandboxProvider`)

A thin `agentix-run` CLI wraps the same function. The library is the real interface: an RL/eval loop calls `run_rollouts(...)` directly.

Also includes: `examples/run-swe-rollouts`

A runnable example that expresses the `eval-cc-swe` flow through the runner: a `SweDataset` + `ClaudeCodeAgent` adapter plus one `run_rollouts(...)` call replace the hand-written per-instance orchestration (`--ground-truth` reuses the same scoring path via a `GroundTruthAgent`). It mirrors `eval-cc-swe`'s exact remote-call wiring, so it builds and runs the same way; `eval-cc-swe/runner.py`'s bespoke loop can later be reduced to this.

Gate

- `ruff check`, `pyright` (0 errors, full include set), `pytest`: all green.
- Wired into `[tool.uv.workspace]` members and `[tool.pyright]` include/extraPaths.

Scope note

Deliberately steers clear of the `agentix/gateway` (#41) and `agentix/orchestrator` (#2) areas: this is the batch-rollout layer that composes `client.remote`, complementary to the gateway's session/training bridge.
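The semantics described under "What" (two sandboxes per instance, input-order results, bounded concurrency, errors isolated per instance) can be sketched as below. This is an illustration under stated assumptions, not the code in `plugins/runner`: the `_sketch` function names, the `Rollout` and `AgentResult` fields, the `{"image": ...}` config shape, the `inst["id"]` key, and the choice to make `setup`/`score`/`solve` coroutines are all hypothetical. Only `provider.session(config)` as an async context manager and the Protocol method names come from the description. The fakes stand in for a real provider so the sketch runs without docker.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Any, Optional, Protocol


@dataclass
class AgentResult:
    patch: str  # assumed field


@dataclass
class Rollout:
    instance_id: str  # assumed field
    score: Optional[dict] = None
    error: Optional[str] = None  # set instead of raising


class Dataset(Protocol):
    def instances(self) -> list: ...
    def image(self, inst: Any) -> str: ...
    async def setup(self, sandbox: Any, inst: Any) -> bool: ...
    async def score(self, sandbox: Any, inst: Any, patch: str) -> dict: ...


class Agent(Protocol):
    async def solve(self, sandbox: Any, inst: Any, *, model: str) -> AgentResult: ...


async def rollout_one_sketch(provider, dataset: Dataset, agent: Agent, inst, *, model: str) -> Rollout:
    rollout = Rollout(instance_id=str(inst["id"]))
    try:
        config = {"image": dataset.image(inst)}  # assumed config shape
        # Agent phase: setup, then solve, in one sandbox.
        async with provider.session(config) as sandbox:
            if not await dataset.setup(sandbox, inst):
                raise RuntimeError("setup failed")
            result = await agent.solve(sandbox, inst, model=model)
        # Scoring phase: a fresh sandbox, so scoring starts from a clean image.
        async with provider.session(config) as scoring_sandbox:
            rollout.score = await dataset.score(scoring_sandbox, inst, result.patch)
    except Exception as exc:
        # Isolate the failure on this rollout; the batch keeps going.
        rollout.error = f"{type(exc).__name__}: {exc}"
    return rollout


async def run_rollouts_sketch(provider, dataset, agent, *, model, n_concurrent=4):
    limit = asyncio.Semaphore(n_concurrent)

    async def bounded(inst):
        async with limit:  # at most n_concurrent rollouts in flight
            return await rollout_one_sketch(provider, dataset, agent, inst, model=model)

    # gather() returns results in argument order, i.e. input order.
    return list(await asyncio.gather(*(bounded(i) for i in dataset.instances())))


# In-process fakes (hypothetical), so the sketch runs without a real provider.
class FakeProvider:
    def __init__(self):
        self.opened = 0

    @asynccontextmanager
    async def session(self, config):
        self.opened += 1
        yield object()


class FakeDataset:
    def instances(self):
        return [{"id": "a"}, {"id": "boom"}, {"id": "c"}]

    def image(self, inst):
        return "fake-image"

    async def setup(self, sandbox, inst):
        return True

    async def score(self, sandbox, inst, patch):
        return {"resolved": patch == "ok"}


class FakeAgent:
    async def solve(self, sandbox, inst, *, model):
        if inst["id"] == "boom":
            raise ValueError("agent crashed")
        return AgentResult(patch="ok")


def demo():
    provider = FakeProvider()
    results = asyncio.run(
        run_rollouts_sketch(provider, FakeDataset(), FakeAgent(), model="fake-model", n_concurrent=2)
    )
    return provider, results
```

In `demo()`, the failing `boom` instance ends up with `error` set while its neighbours are still scored, and the results keep the order of `instances()`. The fake provider counts sessions, which makes the two-sandboxes-per-successful-instance behaviour easy to check in a unit test.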