Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
75 changes: 75 additions & 0 deletions docs/docs/benchmarking.md
Original file line number Diff line number Diff line change
Expand Up @@ -130,6 +130,81 @@ To analyze results, see `scripts/benchmarking/` which contains:

---

## Prompt sweep

A *prompt sweep* runs N agent prompts through the RAGAS harness with everything
but the prompt held fixed — model, queries, retriever, and the RAGAS judge all
come from one base config — and ranks the prompts on a **leaderboard** by mean
RAGAS metric. It generalizes a 2-way A/B to the whole prompt field, which is how
we settle "which support prompt do we ship?" with data rather than by eyeballing
one pair at a time (the open question Q5 / Decision 3 in
`docs/docs/notes_response_tuning.md`).

Only `services.benchmarking.agent_md_file` varies across variants. The model and
RAGAS judge are deliberately held fixed; the leaderboard's `shared_context`
cross-checks this and flags any drift.

### 1. Write a manifest

A manifest names one base config and the prompts to sweep. See
`config/benchmarking/prompt_sweep.yaml` for a worked example:

```yaml
base_config: config/benchmarking/ragas.yaml # supplies model/queries/judge/retriever
out_dir: bench_out/sweep_configs # optional (this is the default)
primary_metric: faithfulness # leaderboard sort key (optional)
prompts:
- config/agents/fasrc-cannon-v1-strict.md
- config/agents/fasrc-cannon-v2-lean.md
- config/agents/fasrc-cannon-v3-cited.md
- config/agents/fasrc-cannon-v4-linked.md
```

`primary_metric` is one of `answer_relevancy`, `faithfulness`,
`context_precision`, `context_recall` (default `faithfulness` — grounding is the
load-bearing property for a "never guess" support bot). All four metrics are
reported per variant regardless; this only sets the ranking key.

### 2. Generate the per-prompt configs

```bash
python scripts/benchmarking/generate_prompt_sweep.py -m config/benchmarking/prompt_sweep.yaml
```

This writes one config per prompt into the sweep directory, each identical to
the base except `services.benchmarking.agent_md_file` (the prompt) and `.name`
(the prompt's filename stem). It refuses to write anything if a prompt path is
missing, so a bad manifest never leaves a partial set behind.

### 3. Run the sweep

```bash
archi evaluate --config-dir bench_out/sweep_configs --hostmode
```

The harness runs each config in turn (the existing `--config-dir` path) and,
because 2+ configs ran, emits a `leaderboard` block in the dump JSON.

### Reading the leaderboard

The dump JSON gains a `leaderboard` key:

- `rows` — one per variant: `name`, `agent_md_file`, the four mean RAGAS
`metrics`, `primary_score`, `rank`, `query_count`, and `incomplete`. Rows are
ranked best-first by `primary_metric`; ties share a rank. A variant that
failed to produce a metric (missing/NaN) has `None` for it, is marked
`incomplete: true`, and sorts after all complete variants — it is never
treated as a zero.
- `shared_context` — the model, provider, judge `evaluator_model`,
`queries_path`, and `corpus_snapshot_id` shared by all variants. If any of
these differ across the swept configs, the discrepancy is recorded in
`shared_context.warnings` (the sweep is no longer apples-to-apples).

The pairwise `ab_comparisons` are still produced alongside the leaderboard; the
leaderboard is computed independently from each config's aggregates.

---

## Human grading via Argilla

`archi evaluate --argilla` pushes benchmark results to a self-hosted [Argilla](https://argilla.io/) instance for independent human grading. This is the platform we use to answer the question "is config A better than config B for FASRC users?" with data we trust — RAGAS scores alone can't decide prompt or model choices because the judge LLM has its own biases.
Expand Down
165 changes: 165 additions & 0 deletions scripts/benchmarking/generate_prompt_sweep.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
"""Generate a RAGAS prompt-sweep: one benchmarking config per prompt variant.

A prompt sweep runs N agent prompts through the existing RAGAS harness with
everything but the prompt held fixed, so the only moving part is
`services.benchmarking.agent_md_file`. This script takes one base config plus a
list of prompt files (a sweep manifest) and writes one rendered config per
prompt into a sweep directory. Each generated config is byte-for-byte identical
to the base except for:

- services.benchmarking.agent_md_file -> the prompt's path
- services.benchmarking.name -> the prompt's filename stem
- services.benchmarking.primary_metric -> manifest primary_metric (if set)

Holding everything else constant is what makes the resulting leaderboard an
apples-to-apples comparison (the leaderboard's shared_context cross-check will
flag any drift). Run the sweep with:

archi evaluate --config-dir <sweep_dir> --hostmode ...

Manifest format (YAML):

base_config: config/benchmarking/ragas.yaml
out_dir: bench_out/sweep_configs # optional, this is the default
primary_metric: faithfulness # optional, default faithfulness
prompts:
- config/agents/fasrc-cannon-v1-strict.md
- config/agents/fasrc-cannon-v2-lean.md
- config/agents/fasrc-cannon-v3-cited.md
- config/agents/fasrc-cannon-v4-linked.md

Usage:
python scripts/benchmarking/generate_prompt_sweep.py --manifest config/benchmarking/prompt_sweep.yaml
"""

from __future__ import annotations

import argparse
import copy
import sys
from pathlib import Path
from typing import Any, Dict, List

import yaml

DEFAULT_OUT_DIR = "bench_out/sweep_configs"
DEFAULT_PRIMARY_METRIC = "faithfulness"
KNOWN_METRICS = {
"answer_relevancy",
"faithfulness",
"context_precision",
"context_recall",
}


def _load_yaml(path: Path) -> Dict[str, Any]:
with open(path, "r") as f:
data = yaml.safe_load(f)
if not isinstance(data, dict):
raise ValueError(f"Expected a YAML mapping at {path}, got {type(data).__name__}")
return data


def generate_sweep_configs(manifest_path: Path) -> List[Path]:
"""Render one benchmarking config per prompt listed in the manifest.

Validates that every prompt file exists BEFORE writing anything, so a bad
manifest never leaves a partial set of configs in the sweep directory.
Returns the list of written config paths.
"""
manifest = _load_yaml(manifest_path)

base_config_raw = manifest.get("base_config")
if not base_config_raw:
raise ValueError("Manifest missing required key 'base_config'.")
prompts_raw = manifest.get("prompts")
if not isinstance(prompts_raw, list) or not prompts_raw:
raise ValueError("Manifest 'prompts' must be a non-empty list of prompt file paths.")

primary_metric = str(manifest.get("primary_metric", DEFAULT_PRIMARY_METRIC))
if primary_metric not in KNOWN_METRICS:
raise ValueError(
f"Manifest primary_metric '{primary_metric}' is not a known RAGAS metric "
f"{sorted(KNOWN_METRICS)}."
)

# Relative paths resolve against the current working directory (the repo
# root — the documented place to run this from, and how the prompts in the
# example manifest are written). Absolute paths are used as-is.
repo_root = Path.cwd()

def _resolve(p: str) -> Path:
candidate = Path(p)
return candidate if candidate.is_absolute() else repo_root / candidate

base_config_path = _resolve(str(base_config_raw))
if not base_config_path.is_file():
raise ValueError(f"base_config not found: '{base_config_path}'")

prompt_paths = [_resolve(str(p)) for p in prompts_raw]
missing = [str(p) for p in prompt_paths if not p.is_file()]
if missing:
# Atomic: refuse before writing any config.
raise ValueError(f"Prompt file(s) not found, aborting (no configs written): {missing}")

out_dir = _resolve(str(manifest.get("out_dir", DEFAULT_OUT_DIR)))
out_dir.mkdir(parents=True, exist_ok=True)

base_config = _load_yaml(base_config_path)
if "services" not in base_config or "benchmarking" not in base_config.get("services", {}):
raise ValueError(
f"base_config '{base_config_path}' has no services.benchmarking section to sweep."
)

written: List[Path] = []
for prompt_path in prompt_paths:
cfg = copy.deepcopy(base_config)
bench = cfg["services"]["benchmarking"]
stem = prompt_path.stem
bench["agent_md_file"] = str(prompt_path)
bench["name"] = stem
bench["primary_metric"] = primary_metric

out_path = out_dir / f"{stem}.yaml"
with open(out_path, "w") as f:
yaml.safe_dump(cfg, f, sort_keys=False)
written.append(out_path)

return written


def main(argv: List[str] | None = None) -> int:
parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument(
"--manifest", "-m", required=True,
help="Path to a sweep manifest YAML (base_config + prompts).",
)
args = parser.parse_args(argv)

manifest_path = Path(args.manifest)
if not manifest_path.is_file():
print(f"Manifest not found: {manifest_path}", file=sys.stderr)
return 2

try:
written = generate_sweep_configs(manifest_path)
except ValueError as exc:
print(f"Error: {exc}", file=sys.stderr)
return 1

out_dir = written[0].parent
manifest = _load_yaml(manifest_path)
primary_metric = str(manifest.get("primary_metric", DEFAULT_PRIMARY_METRIC))

print(f"Wrote {len(written)} sweep config(s) to {out_dir}/:")
for p in written:
print(f" - {p.name}")
print()
print("Next: run the sweep (leaderboard ranks by "
f"'{primary_metric}'):")
print(f" archi evaluate --config-dir {out_dir} --hostmode")
return 0


if __name__ == "__main__":
raise SystemExit(main())
13 changes: 11 additions & 2 deletions src/archi/pipelines/agents/base_react.py
Original file line number Diff line number Diff line change
Expand Up @@ -1176,7 +1176,7 @@ def _prepare_agent_inputs(self, **kwargs) -> Dict[str, Any]:
"Invalid context window (%s), skipping trimming.",
context_window,
)
return {"messages": history_messages}
return {"messages": self._inject_forced_retrieval(history_messages)}

safety_margin = int(context_window * 0.15)
max_prompt_tokens = context_window - safety_margin
Expand Down Expand Up @@ -1220,7 +1220,16 @@ def _prepare_agent_inputs(self, **kwargs) -> Dict[str, Any]:
except Exception as e:
logger.debug("Token trimming skipped: %s", e)

return {"messages": history_messages}
return {"messages": self._inject_forced_retrieval(history_messages)}

def _inject_forced_retrieval(self, messages: List[BaseMessage]) -> List[BaseMessage]:
"""Hook for subclasses to force a retrieval before the model's first turn.

Base implementation is a no-op; agents with a vector retriever override
this to prefill a completed ``search_vectorstore_hybrid`` tool round so
retrieval happens regardless of whether the model chooses to call it.
"""
return messages

def _metadata_from_agent_output(self, answer_output: Dict[str, Any]) -> Dict[str, Any]:
"""Hook for subclasses to enrich metadata returned to callers."""
Expand Down
64 changes: 64 additions & 0 deletions src/archi/pipelines/agents/cms_comp_ops_agent.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,11 @@
from __future__ import annotations

import json
import uuid
from typing import Any, Callable, Dict, List

from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage

from src.utils.logging import get_logger
from src.utils.env import read_secret
from src.archi.pipelines.agents.base_react import BaseReActAgent
Expand Down Expand Up @@ -313,3 +316,64 @@ def _update_vector_retrievers(self, vectorstore: Any) -> None:
store_tool_input=getattr(self, "_store_tool_input", None),
)
)

def _inject_forced_retrieval(self, messages: List[BaseMessage]) -> List[BaseMessage]:
"""Force one ``search_vectorstore_hybrid`` call before the model answers.

The model can ignore an "always search first" prompt and answer from its
own weights, leaving ``source_documents`` empty (chat UI shows "Link
unavailable"). To enforce retrieval, prefill a completed tool round —
an ``AIMessage`` carrying the tool call plus the matching ``ToolMessage``
result — so the ReAct loop starts with real chunks already in context.
Invoking the existing retriever tool also runs its ``store_docs``
callback, so retrieved documents flow into ``source_documents``/links
exactly as a model-initiated search would. The model may still search
again. Gated by ``services.chat_app.force_initial_retrieval`` (default
on) so prompt-vs-enforcement variants can be A/B'd in the sweep.
"""
if not getattr(self, "enable_vector_tools", False):
logger.debug("Forced retrieval skipped: vector tools disabled")
return messages
if not self._chat_app_config.get("force_initial_retrieval", True):
logger.debug("Forced retrieval skipped: force_initial_retrieval=false")
return messages
tools = getattr(self, "_vector_tools", None)
if not tools:
logger.debug("Forced retrieval skipped: no vector tools built")
return messages
# Only force on a fresh user turn (the latest message is the question).
if not messages or not isinstance(messages[-1], HumanMessage):
last_type = type(messages[-1]).__name__ if messages else "none"
logger.debug("Forced retrieval skipped: last message is %s", last_type)
return messages
query = (self._message_content(messages[-1]) or "").strip()
if not query:
logger.debug("Forced retrieval skipped: empty query")
return messages

try:
result = tools[0].invoke({"query": query})
except Exception:
# Fail open: a retrieval error must not break the chat turn.
logger.warning("Forced initial retrieval failed", exc_info=True)
return messages

logger.info("Forced initial retrieval ran: query=%r -> %d chars", query, len(str(result)))

call_id = f"forced_search_{uuid.uuid4().hex}"
ai = AIMessage(
content="",
tool_calls=[
{
"name": "search_vectorstore_hybrid",
"args": {"query": query},
"id": call_id,
}
],
)
tool_msg = ToolMessage(
content=result if isinstance(result, str) else str(result),
tool_call_id=call_id,
name="search_vectorstore_hybrid",
)
return list(messages) + [ai, tool_msg]
Loading
Loading