51 changes: 51 additions & 0 deletions examples/eval-tui/DESIGN.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Agentix TUI — design & rubrics

A modern, reactive [Textual](https://textual.textualize.io/) control room for
Agentix. The goal is a single TUI that exposes **every core Agentix surface**,
not just batch rollouts. It is built on the stable `client.remote` + `bundle`
APIs (plus `provider.session`) and degrades gracefully when no Docker/runtime
is present.

## Rubrics (v1 — scored 0–5, target ≥4, revisable)

| # | Dimension | "Advanced" looks like |
|---|-----------|------------------------|
| 1 | **Coverage** | Rollouts · plugin **Catalog** · **Sandboxes**/providers + remote-invoke · **Build**/bundle · **Observability** (traces + logs) |
| 2 | **Reactivity** | Fully async, live updates, bounded concurrency, never blocks the UI |
| 3 | **Navigation / IA** | Discoverable multi-area nav (tabs), command palette, help |
| 4 | **Visual design** | Cohesive theme, semantic color, responsive layout, dark/light |
| 5 | **Interaction** | Keybindings, mouse, search/filter, drill-down detail |
| 6 | **Robustness** | Graceful with no Docker (demo/empty states), error surfaces, cancellation |
| 7 | **Feedback** | Progress, throughput, status, notifications |
| 8 | **Code quality** | Typed, ruff-clean, modular, documented |
| 9 | **Verifiability** | Headless `run_test` pilots per screen; demo mode without infra |
| 10 | **Polish / UX** | Help screen, sensible defaults, onboarding |
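
Rubric 2 asks for bounded concurrency that never blocks the UI. The following is a minimal sketch of that pattern using `asyncio.Semaphore`, not the code `agentix.runner` actually uses. `run_bounded`, `demo`, and `fake_rollout` are hypothetical names introduced for illustration only.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_bounded(
    jobs: list[Callable[[], Awaitable[str]]], n_concurrent: int
) -> list[str]:
    """Run all jobs with at most `n_concurrent` in flight; results keep job order."""
    gate = asyncio.Semaphore(n_concurrent)

    async def guarded(job: Callable[[], Awaitable[str]]) -> str:
        async with gate:  # waiting here yields to the event loop instead of blocking it
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


def demo(n_jobs: int = 10, n_concurrent: int = 3) -> tuple[list[str], int]:
    """Run fake rollouts and report (results, peak number running at once)."""
    running = 0
    peak = 0

    async def fake_rollout(index: int) -> str:
        # Hypothetical stand-in for one setup/agent/score cycle.
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)
        running -= 1
        return f"demo__task-{index:03d}"

    jobs = [lambda i=i: fake_rollout(i) for i in range(n_jobs)]
    results = asyncio.run(run_bounded(jobs, n_concurrent))
    return results, peak
```

Because every wait is an `await`, a UI sharing the same event loop can keep repainting while rollouts queue behind the semaphore.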

## Architecture

```text
AgentixTUI(App) # shell: Header + TabbedContent + Footer, theme, palette
├── Rollouts (views/rollouts.py) # live batch-rollout dashboard over agentix.runner
├── Catalog (views/catalog.py) # installed agentix dists + entry points (no Docker)
├── Sandboxes (views/…) # providers + live sessions + remote-invoke [planned]
├── Build (views/…) # trigger & stream `agentix build` [planned]
└── Observability (views/…) # live /trace spans + /log streams [planned]
```

Each area is a self-contained view widget with its own demo/empty state, so the
app is useful (and testable headlessly) with no runtime attached.

## Rubric addendum (v2)

| # | Dimension | "Advanced" looks like |
|---|-----------|------------------------|
| 11 | **Aesthetics** | A landing dashboard that's genuinely beautiful: branded gradient banner, ecosystem stat cards, cohesive theme; a strong first impression |

## Iteration log

- **PR-A** — app shell (TabbedContent nav) + **Catalog** view (real entry-point /
distribution introspection) + this rubric doc + theming. Coverage 1→2, IA 1→4, Visual 3→4.
- **drill-down** — Rollouts instance detail pane (verdict/duration/score/error). Interaction 3→4.
- **Overview dashboard** — branded gradient banner + live ecosystem stat cards +
environment readiness as the landing tab; branded Textual theme. Aesthetics →4, Polish →4.
- **next** — Sandboxes, Build, Observability views; command palette; search/filter.
54 changes: 54 additions & 0 deletions examples/eval-tui/README.md
@@ -0,0 +1,54 @@
# eval-tui

A modern [Textual](https://textual.textualize.io/) **control room** for Agentix —
a tabbed TUI that surfaces each Agentix area in one place. See
[`DESIGN.md`](DESIGN.md) for the rubrics it's iterated against.

```text
┌─ Agentix · agent ↔ environment control room ───────────────────────────────┐
│ Rollouts │ Catalog │ Sandboxes │ Build │ Observability │
├─────────────────────────────────────────────────────────────────────────────┤
│ [████████████············] 18/40 done ✓ 11 ✗ 7 ⟳ 4 running 62.3/min │
│ Instance Status Time Result │ ▶ starting 40 rollouts │
│ demo__task-000 ✓ PASS 1.2s resolved│ ✓ PASS demo__task-000 · 1.2s │
│ demo__task-001 ⟳ scoring … … │ … │
└─────────────────────────────────────────────────────────────────────────────┘
q Quit
```

## Tabs

- **Rollouts** — live batch-rollout dashboard over
[`agentix.runner`](../../plugins/runner): per-instance phase grid (`pending →
setup → agent → scoring → PASS/FAIL/skip/error`), summary bar (progress /
resolved / failed / running / throughput), and an event log. Phase
transitions are observed by wrapping the dataset/agent adapters
(`_adapters.py`), so `agentix.runner` is unchanged.
- **Catalog** — the installed Agentix ecosystem: every `agentix*` distribution
plus `agentix.provider` (backends) and `agentix.nix` (agents/datasets shipping
a Nix closure) entry points. Pure introspection — no Docker.
- **Sandboxes · Build · Observability** — signposted; landing in follow-up PRs.
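
The wrapping approach used for the Rollouts tab can be sketched in isolation. This is a simplified illustration, not the contents of `_adapters.py`: `StubDataset`, `PhaseTracingDataset`, `trace_one_instance`, and the `{"resolved": ...}` result shape are hypothetical, chosen only to make the example self-contained.

```python
import asyncio
from collections.abc import Callable
from typing import Any

OnPhase = Callable[[str, str], None]  # (instance id, phase name)


class StubDataset:
    """Hypothetical dataset exposing the two async hooks the wrapper intercepts."""

    async def setup(self, sandbox: Any, instance: dict[str, Any]) -> bool:
        return True

    async def score(self, sandbox: Any, instance: dict[str, Any], patch: str) -> dict[str, Any]:
        return {"resolved": bool(patch)}


class PhaseTracingDataset:
    """Emit a phase event, then delegate to the wrapped dataset unchanged."""

    def __init__(self, inner: Any, on_phase: OnPhase) -> None:
        self._inner = inner
        self._on_phase = on_phase

    async def setup(self, sandbox: Any, instance: dict[str, Any]) -> bool:
        self._on_phase(instance["instance_id"], "setup")
        return await self._inner.setup(sandbox, instance)

    async def score(self, sandbox: Any, instance: dict[str, Any], patch: str) -> dict[str, Any]:
        self._on_phase(instance["instance_id"], "score")
        return await self._inner.score(sandbox, instance, patch)


def trace_one_instance() -> tuple[list[tuple[str, str]], dict[str, Any]]:
    """Drive one fake rollout and return (phase events, score result)."""
    events: list[tuple[str, str]] = []
    dataset = PhaseTracingDataset(StubDataset(), lambda iid, phase: events.append((iid, phase)))
    instance = {"instance_id": "demo__task-000"}

    async def rollout() -> dict[str, Any]:
        await dataset.setup(None, instance)
        return await dataset.score(None, instance, patch="diff --git ...")

    return events, asyncio.run(rollout())
```

The caller of the wrapper needs no changes: it still sees the same `setup`/`score` interface, while the dashboard receives one event per phase entry.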

## Run

```bash
cd examples/eval-tui
uv sync

# No-Docker synthetic demo:
uv run agentix-eval-tui --demo 40 --n-concurrent 6

# Real run — adapters resolved like `agentix-run`:
uv run agentix-eval-tui --dataset my_pkg:dataset --agent my_pkg:agent \
--provider docker --bundle eval:0.1.0 --model claude-3-5-sonnet-latest

# Bare launch — just browse the Catalog (no run):
uv run agentix-eval-tui
```
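
The `module:attr` strings passed to `--dataset` and `--agent` are resolved by import. The sketch below mirrors that behavior under a hypothetical name, `load_adapter`, and raises `ValueError` rather than exiting. The standard-library targets in the usage note are stand-ins for real adapters.

```python
import importlib
from typing import Any


def load_adapter(path: str) -> Any:
    """Resolve 'module:attr'; instantiate the attribute if it is a class."""
    module_name, sep, attr = path.partition(":")
    if not (module_name and sep and attr):
        raise ValueError(f"expected 'module:attr', got {path!r}")
    obj = getattr(importlib.import_module(module_name), attr)
    # A class is called with no arguments; an existing instance or value is returned as-is.
    return obj() if isinstance(obj, type) else obj
```

For example, `load_adapter("collections:OrderedDict")` returns a fresh `OrderedDict` instance because the attribute is a class, while `load_adapter("os:sep")` returns the existing string unchanged.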

## Test

```bash
uv sync --extra dev
uv run pytest # headless Textual run_test pilots — no Docker
```
8 changes: 8 additions & 0 deletions examples/eval-tui/eval_tui/__init__.py
@@ -0,0 +1,8 @@
"""Agentix TUI — a modern Textual control room for Agentix."""

from __future__ import annotations

from .app import AgentixTUI
from .models import RunSpec

__all__ = ["AgentixTUI", "RunSpec"]
6 changes: 6 additions & 0 deletions examples/eval-tui/eval_tui/__main__.py
@@ -0,0 +1,6 @@
from __future__ import annotations

from .cli import main

if __name__ == "__main__":
    raise SystemExit(main())
49 changes: 49 additions & 0 deletions examples/eval-tui/eval_tui/_adapters.py
@@ -0,0 +1,49 @@
"""Phase-tracing wrappers around a runner `Dataset` / `Agent`.

The runner exposes a per-instance `on_result` callback but no in-flight phase
hook. To drive a live UI we wrap the dataset/agent so the dashboard learns
when each instance enters `setup`, `agent`, and `score` — without changing
`agentix.runner` itself. Each wrapper simply emits a phase event, then
delegates to the wrapped object.
"""

from __future__ import annotations

from collections.abc import Callable
from typing import Any

OnPhase = Callable[[str, str], None]


def instance_id(instance: dict[str, Any]) -> str:
    return str(instance.get("instance_id") or instance.get("id") or "?")


class TracingDataset:
    def __init__(self, inner: Any, on_phase: OnPhase) -> None:
        self._inner = inner
        self._on_phase = on_phase

    def instances(self) -> Any:
        return self._inner.instances()

    def image(self, instance: dict[str, Any]) -> str:
        return self._inner.image(instance)

    async def setup(self, sandbox: Any, instance: dict[str, Any]) -> bool:
        self._on_phase(instance_id(instance), "setup")
        return await self._inner.setup(sandbox, instance)

    async def score(self, sandbox: Any, instance: dict[str, Any], patch: str) -> dict[str, Any]:
        self._on_phase(instance_id(instance), "score")
        return await self._inner.score(sandbox, instance, patch)


class TracingAgent:
    def __init__(self, inner: Any, on_phase: OnPhase) -> None:
        self._inner = inner
        self._on_phase = on_phase

    async def solve(self, sandbox: Any, instance: dict[str, Any], *, model: str | None) -> Any:
        self._on_phase(instance_id(instance), "agent")
        return await self._inner.solve(sandbox, instance, model=model)
128 changes: 128 additions & 0 deletions examples/eval-tui/eval_tui/app.py
@@ -0,0 +1,128 @@
"""Agentix TUI — a modern Textual control room for Agentix.

A tabbed shell that surfaces each Agentix area as its own view: an Overview
landing dashboard, live **Rollouts** over `agentix.runner`, a plugin
**Catalog**, **Sandboxes** readiness, a **Build** planner, and live
**Observability**. See `DESIGN.md` for the rubrics this iterates against.

Run a no-Docker demo with `agentix-eval-tui --demo 40`, point it at real
adapters like `agentix-run`, or launch it bare to browse the Catalog.
"""

from __future__ import annotations

from textual.app import App, ComposeResult
from textual.widgets import Footer, Header, TabbedContent, TabPane

from .models import RunSpec
from .views import (
    BuildView,
    CatalogView,
    ObservabilityView,
    OverviewView,
    RolloutsView,
    SandboxesView,
)


class AgentixTUI(App):
    """Tabbed control room for Agentix."""

    TITLE = "Agentix"
    SUB_TITLE = "agent ↔ environment control room"

    CSS = """
    TabbedContent { height: 1fr; }
    TabPane { padding: 0; }

    #rollouts-summary {
        height: 3;
        padding: 0 2;
        content-align: left middle;
        border: round $primary;
        background: $panel;
    }
    #rollouts-body { height: 1fr; }
    #rollouts-table { width: 3fr; height: 1fr; border: round $primary; }
    #rollouts-side { width: 2fr; height: 1fr; }
    #rollouts-detail { height: 2fr; border: round $primary; padding: 0 1; }
    #rollouts-log { height: 3fr; border: round $primary; padding: 0 1; }

    #catalog-title { height: 1; padding: 0 1; }
    #catalog-filter { margin: 0 1; }
    #catalog-table { height: 1fr; border: round $primary; }

    #ov-banner { height: auto; padding: 1 2; content-align: center middle; text-align: center; }
    #ov-cards { height: 7; padding: 0 1; }
    .ov-card {
        width: 1fr;
        height: 5;
        border: round $primary;
        padding: 1 1;
        margin: 0 1;
        content-align: center middle;
        text-align: center;
    }
    #ov-hints { height: auto; padding: 1 2; }

    #sb-title { height: 1; padding: 0 1; }
    #sb-table { height: 1fr; border: round $primary; }
    #sb-explainer { height: auto; padding: 1 1; }

    #obs-title { height: 1; padding: 0 1; }
    #obs-body { height: 1fr; }
    #obs-trace { width: 1fr; height: 1fr; border: round $primary; padding: 0 1; }
    #obs-log { width: 1fr; height: 1fr; border: round $primary; padding: 0 1; }

    #build-title { height: 1; padding: 0 1; }
    #build-path { margin: 0 1; }
    #build-cmd { height: auto; padding: 1 2; }
    #build-info { height: 1fr; padding: 0 2; }
    """

    BINDINGS = [
        ("1", "show_tab('overview')", "Overview"),
        ("2", "show_tab('rollouts')", "Rollouts"),
        ("3", "show_tab('catalog')", "Catalog"),
        ("4", "show_tab('sandboxes')", "Sandboxes"),
        ("5", "show_tab('build')", "Build"),
        ("6", "show_tab('observability')", "Obs"),
        ("q", "quit", "Quit"),
    ]

    def __init__(self, *, rollout_spec: RunSpec | None = None) -> None:
        super().__init__()
        self._spec = rollout_spec

    def on_mount(self) -> None:
        # Best-effort branded theme; falls back to the default if the running
        # Textual version's theme API differs.
        try:
            from textual.theme import Theme

            self.register_theme(
                Theme(name="agentix", primary="#cc785c", secondary="#a45a45", accent="#e08a6d", dark=True)
            )
            self.theme = "agentix"
        except Exception:
            pass

    def compose(self) -> ComposeResult:
        yield Header(show_clock=True)
        with TabbedContent(initial="overview"):
            with TabPane("Overview", id="overview"):
                yield OverviewView()
            with TabPane("Rollouts", id="rollouts"):
                yield RolloutsView(self._spec)
            with TabPane("Catalog", id="catalog"):
                yield CatalogView()
            with TabPane("Sandboxes", id="sandboxes"):
                yield SandboxesView()
            with TabPane("Build", id="build"):
                yield BuildView()
            with TabPane("Observability", id="observability"):
                yield ObservabilityView()
        yield Footer()

    def action_show_tab(self, tab: str) -> None:
        self.query_one(TabbedContent).active = tab
93 changes: 93 additions & 0 deletions examples/eval-tui/eval_tui/cli.py
@@ -0,0 +1,93 @@
"""CLI for the Agentix TUI.

- `agentix-eval-tui --demo 40` — synthetic, no-Docker rollouts.
- `agentix-eval-tui --dataset m:d --agent m:a --bundle eval:0.1.0` — real run,
adapters resolved like `agentix-run`.
- `agentix-eval-tui` — no run; browse the Catalog (and the planned tabs).
"""

from __future__ import annotations

import argparse
import importlib
import sys
from typing import Any

from .app import AgentixTUI
from .models import RunSpec


def _load(path: str) -> Any:
    module_name, sep, attr = path.partition(":")
    if not module_name or not sep or not attr:
        raise SystemExit(f"expected 'module:attr', got {path!r}")
    obj = getattr(importlib.import_module(module_name), attr)
    return obj() if isinstance(obj, type) else obj


def _load_provider(name_or_path: str) -> Any:
    if ":" in name_or_path:
        return _load(name_or_path)
    module = importlib.import_module(f"agentix.provider.{name_or_path}")
    classes = [
        value
        for key, value in vars(module).items()
        if isinstance(value, type) and key.endswith("Provider") and value.__module__ == module.__name__
    ]
    if len(classes) != 1:
        raise SystemExit(f"could not find a single *Provider class in agentix.provider.{name_or_path}")
    return classes[0]()


def _parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="agentix-eval-tui", description="Modern TUI control room for Agentix.")
    parser.add_argument("--demo", type=int, metavar="N", default=None, help="Run N synthetic instances (no Docker).")
    parser.add_argument("--dataset", help="Dataset adapter as 'module:attr'.")
    parser.add_argument("--agent", help="Agent adapter as 'module:attr'.")
    parser.add_argument("--provider", default="docker", help="Provider backend name or 'module:attr'.")
    parser.add_argument("--bundle", help="Agentix bundle reference (from `agentix build`).")
    parser.add_argument("--model", default=None)
    parser.add_argument("--n-concurrent", type=int, default=4)
    parser.add_argument("--limit", type=int, default=None)
    return parser.parse_args(argv)


def _build_spec(args: argparse.Namespace) -> RunSpec | None:
    if args.demo is not None:
        from .demo import DemoAgent, DemoDataset, DemoProvider

        dataset = DemoDataset(args.demo)
        return RunSpec(
            dataset=dataset,
            agent=DemoAgent(),
            provider=DemoProvider(),
            bundle="demo",
            instances=dataset.instances(),
            n_concurrent=args.n_concurrent,
        )

    given = [bool(args.dataset), bool(args.agent), bool(args.bundle)]
    if not any(given):
        return None  # bare launch: browse the Catalog / planned tabs
    if not all(given):
        raise SystemExit("--dataset, --agent and --bundle must be given together (or use --demo N)")

    dataset = _load(args.dataset)
    instances = list(dataset.instances())
    if args.limit is not None:
        instances = instances[: args.limit]
    return RunSpec(
        dataset=dataset,
        agent=_load(args.agent),
        provider=_load_provider(args.provider),
        bundle=args.bundle,
        model=args.model,
        instances=instances,
        n_concurrent=args.n_concurrent,
    )


def main(argv: list[str] | None = None) -> int:
    args = _parse_args(sys.argv[1:] if argv is None else argv)
    AgentixTUI(rollout_spec=_build_spec(args)).run()
    return 0