fasrc · swinney · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/docs/docs/benchmarking.md b/docs/docs/benchmarking.md
@@ -130,6 +130,81 @@ To analyze results, see `scripts/benchmarking/` which contains:
 
 ---
 
+## Prompt sweep
+
+A *prompt sweep* runs N agent prompts through the RAGAS harness with everything
+but the prompt held fixed — model, queries, retriever, and the RAGAS judge all
+come from one base config — and ranks the prompts on a **leaderboard** by mean
+RAGAS metric. It generalizes a 2-way A/B to the whole prompt field, which is how
+we settle "which support prompt do we ship?" with data rather than by eyeballing
+one pair at a time (the open question Q5 / Decision 3 in
+`docs/docs/notes_response_tuning.md`).
+
+Only `services.benchmarking.agent_md_file` varies across variants. The model and
+RAGAS judge are deliberately held fixed; the leaderboard's `shared_context`
+cross-checks this and flags any drift.
+
+### 1. Write a manifest
+
+A manifest names one base config and the prompts to sweep. See
+`config/benchmarking/prompt_sweep.yaml` for a worked example:
+
+```yaml
+base_config: config/benchmarking/ragas.yaml   # supplies model/queries/judge/retriever
+out_dir: bench_out/sweep_configs              # optional (this is the default)
+primary_metric: faithfulness                  # leaderboard sort key (optional)
+prompts:
+  - config/agents/fasrc-cannon-v1-strict.md
+  - config/agents/fasrc-cannon-v2-lean.md
+  - config/agents/fasrc-cannon-v3-cited.md
+  - config/agents/fasrc-cannon-v4-linked.md
+```
+
+`primary_metric` is one of `answer_relevancy`, `faithfulness`,
+`context_precision`, `context_recall` (default `faithfulness` — grounding is the
+load-bearing property for a "never guess" support bot). All four metrics are
+reported per variant regardless; this only sets the ranking key.
+
+### 2. Generate the per-prompt configs
+
+```bash
+python scripts/benchmarking/generate_prompt_sweep.py -m config/benchmarking/prompt_sweep.yaml
+```
+
+This writes one config per prompt into the sweep directory, each identical to
+the base except `services.benchmarking.agent_md_file` (the prompt) and `.name`
+(the prompt's filename stem). It refuses to write anything if a prompt path is
+missing, so a bad manifest never leaves a partial set behind.
+
+### 3. Run the sweep
+
+```bash
+archi evaluate --config-dir bench_out/sweep_configs --hostmode
+```
+
+The harness runs each config in turn (the existing `--config-dir` path) and,
+because 2+ configs ran, emits a `leaderboard` block in the dump JSON.
+
+### Reading the leaderboard
+
+The dump JSON gains a `leaderboard` key:
+
+- `rows` — one per variant: `name`, `agent_md_file`, the four mean RAGAS
+  `metrics`, `primary_score`, `rank`, `query_count`, and `incomplete`. Rows are
+  ranked best-first by `primary_metric`; ties share a rank. A variant that
+  failed to produce a metric (missing/NaN) has `None` for it, is marked
+  `incomplete: true`, and sorts after all complete variants — it is never
+  treated as a zero.
+- `shared_context` — the model, provider, judge `evaluator_model`,
+  `queries_path`, and `corpus_snapshot_id` shared by all variants. If any of
+  these differ across the swept configs, the discrepancy is recorded in
+  `shared_context.warnings` (the sweep is no longer apples-to-apples).
+
+The pairwise `ab_comparisons` are still produced alongside the leaderboard; the
+leaderboard is computed independently from each config's aggregates.
+
+---
+
 ## Human grading via Argilla
 
 `archi evaluate --argilla` pushes benchmark results to a self-hosted [Argilla](https://argilla.io/) instance for independent human grading. This is the platform we use to answer the question "is config A better than config B for FASRC users?" with data we trust — RAGAS scores alone can't decide prompt or model choices because the judge LLM has its own biases.

diff --git a/scripts/benchmarking/generate_prompt_sweep.py b/scripts/benchmarking/generate_prompt_sweep.py
@@ -0,0 +1,165 @@
+"""Generate a RAGAS prompt-sweep: one benchmarking config per prompt variant.
+
+A prompt sweep runs N agent prompts through the existing RAGAS harness with
+everything but the prompt held fixed, so the only moving part is
+`services.benchmarking.agent_md_file`. This script takes one base config plus a
+list of prompt files (a sweep manifest) and writes one rendered config per
+prompt into a sweep directory. Each generated config is byte-for-byte identical
+to the base except for:
+
+  - services.benchmarking.agent_md_file  -> the prompt's path
+  - services.benchmarking.name           -> the prompt's filename stem
+  - services.benchmarking.primary_metric -> manifest primary_metric (if set)
+
+Holding everything else constant is what makes the resulting leaderboard an
+apples-to-apples comparison (the leaderboard's shared_context cross-check will
+flag any drift). Run the sweep with:
+
+    archi evaluate --config-dir <sweep_dir> --hostmode ...
+
+Manifest format (YAML):
+
+    base_config: config/benchmarking/ragas.yaml
+    out_dir: bench_out/sweep_configs        # optional, this is the default
+    primary_metric: faithfulness            # optional, default faithfulness
+    prompts:
+      - config/agents/fasrc-cannon-v1-strict.md
+      - config/agents/fasrc-cannon-v2-lean.md
+      - config/agents/fasrc-cannon-v3-cited.md
+      - config/agents/fasrc-cannon-v4-linked.md
+
+Usage:
+    python scripts/benchmarking/generate_prompt_sweep.py --manifest config/benchmarking/prompt_sweep.yaml
+"""
+
+from __future__ import annotations
+
+import argparse
+import copy
+import sys
+from pathlib import Path
+from typing import Any, Dict, List
+
+import yaml
+
+DEFAULT_OUT_DIR = "bench_out/sweep_configs"
+DEFAULT_PRIMARY_METRIC = "faithfulness"
+KNOWN_METRICS = {
+    "answer_relevancy",
+    "faithfulness",
+    "context_precision",
+    "context_recall",
+}
+
+
+def _load_yaml(path: Path) -> Dict[str, Any]:
+    with open(path, "r") as f:
+        data = yaml.safe_load(f)
+    if not isinstance(data, dict):
+        raise ValueError(f"Expected a YAML mapping at {path}, got {type(data).__name__}")
+    return data
+
+
+def generate_sweep_configs(manifest_path: Path) -> List[Path]:
+    """Render one benchmarking config per prompt listed in the manifest.
+
+    Validates that every prompt file exists BEFORE writing anything, so a bad
+    manifest never leaves a partial set of configs in the sweep directory.
+    Returns the list of written config paths.
+    """
+    manifest = _load_yaml(manifest_path)
+
+    base_config_raw = manifest.get("base_config")
+    if not base_config_raw:
+        raise ValueError("Manifest missing required key 'base_config'.")
+    prompts_raw = manifest.get("prompts")
+    if not isinstance(prompts_raw, list) or not prompts_raw:
+        raise ValueError("Manifest 'prompts' must be a non-empty list of prompt file paths.")
+
+    primary_metric = str(manifest.get("primary_metric", DEFAULT_PRIMARY_METRIC))
+    if primary_metric not in KNOWN_METRICS:
+        raise ValueError(
+            f"Manifest primary_metric '{primary_metric}' is not a known RAGAS metric "
+            f"{sorted(KNOWN_METRICS)}."
+        )
+
+    # Relative paths resolve against the current working directory (the repo
+    # root — the documented place to run this from, and how the prompts in the
+    # example manifest are written). Absolute paths are used as-is.
+    repo_root = Path.cwd()
+
+    def _resolve(p: str) -> Path:
+        candidate = Path(p)
+        return candidate if candidate.is_absolute() else repo_root / candidate
+
+    base_config_path = _resolve(str(base_config_raw))
+    if not base_config_path.is_file():
+        raise ValueError(f"base_config not found: '{base_config_path}'")
+
+    prompt_paths = [_resolve(str(p)) for p in prompts_raw]
+    missing = [str(p) for p in prompt_paths if not p.is_file()]
+    if missing:
+        # Atomic: refuse before writing any config.
+        raise ValueError(f"Prompt file(s) not found, aborting (no configs written): {missing}")
+
+    out_dir = _resolve(str(manifest.get("out_dir", DEFAULT_OUT_DIR)))
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    base_config = _load_yaml(base_config_path)
+    if "services" not in base_config or "benchmarking" not in base_config.get("services", {}):
+        raise ValueError(
+            f"base_config '{base_config_path}' has no services.benchmarking section to sweep."
+        )
+
+    written: List[Path] = []
+    for prompt_path in prompt_paths:
+        cfg = copy.deepcopy(base_config)
+        bench = cfg["services"]["benchmarking"]
+        stem = prompt_path.stem
+        bench["agent_md_file"] = str(prompt_path)
+        bench["name"] = stem
+        bench["primary_metric"] = primary_metric
+
+        out_path = out_dir / f"{stem}.yaml"
+        with open(out_path, "w") as f:
+            yaml.safe_dump(cfg, f, sort_keys=False)
+        written.append(out_path)
+
+    return written
+
+
+def main(argv: List[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
+    parser.add_argument(
+        "--manifest", "-m", required=True,
+        help="Path to a sweep manifest YAML (base_config + prompts).",
+    )
+    args = parser.parse_args(argv)
+
+    manifest_path = Path(args.manifest)
+    if not manifest_path.is_file():
+        print(f"Manifest not found: {manifest_path}", file=sys.stderr)
+        return 2
+
+    try:
+        written = generate_sweep_configs(manifest_path)
+    except ValueError as exc:
+        print(f"Error: {exc}", file=sys.stderr)
+        return 1
+
+    out_dir = written[0].parent
+    manifest = _load_yaml(manifest_path)
+    primary_metric = str(manifest.get("primary_metric", DEFAULT_PRIMARY_METRIC))
+
+    print(f"Wrote {len(written)} sweep config(s) to {out_dir}/:")
+    for p in written:
+        print(f"  - {p.name}")
+    print()
+    print("Next: run the sweep (leaderboard ranks by "
+          f"'{primary_metric}'):")
+    print(f"  archi evaluate --config-dir {out_dir} --hostmode")
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/src/archi/pipelines/agents/base_react.py b/src/archi/pipelines/agents/base_react.py
@@ -1176,7 +1176,7 @@ def _prepare_agent_inputs(self, **kwargs) -> Dict[str, Any]:
                     "Invalid context window (%s), skipping trimming.",
                     context_window,
                     )
-                    return {"messages": history_messages}
+                    return {"messages": self._inject_forced_retrieval(history_messages)}
 
                 safety_margin = int(context_window * 0.15)
                 max_prompt_tokens = context_window - safety_margin
@@ -1220,7 +1220,16 @@ def _prepare_agent_inputs(self, **kwargs) -> Dict[str, Any]:
         except Exception as e:
             logger.debug("Token trimming skipped: %s", e)
 
-        return {"messages": history_messages}
+        return {"messages": self._inject_forced_retrieval(history_messages)}
+
+    def _inject_forced_retrieval(self, messages: List[BaseMessage]) -> List[BaseMessage]:
+        """Hook for subclasses to force a retrieval before the model's first turn.
+
+        Base implementation is a no-op; agents with a vector retriever override
+        this to prefill a completed ``search_vectorstore_hybrid`` tool round so
+        retrieval happens regardless of whether the model chooses to call it.
+        """
+        return messages
 
     def _metadata_from_agent_output(self, answer_output: Dict[str, Any]) -> Dict[str, Any]:
         """Hook for subclasses to enrich metadata returned to callers."""

diff --git a/src/archi/pipelines/agents/cms_comp_ops_agent.py b/src/archi/pipelines/agents/cms_comp_ops_agent.py
@@ -1,8 +1,11 @@
 from __future__ import annotations
 
 import json
+import uuid
 from typing import Any, Callable, Dict, List
 
+from langchain_core.messages import AIMessage, BaseMessage, HumanMessage, ToolMessage
+
 from src.utils.logging import get_logger
 from src.utils.env import read_secret
 from src.archi.pipelines.agents.base_react import BaseReActAgent
@@ -313,3 +316,64 @@ def _update_vector_retrievers(self, vectorstore: Any) -> None:
                 store_tool_input=getattr(self, "_store_tool_input", None),
             )
         )
+
+    def _inject_forced_retrieval(self, messages: List[BaseMessage]) -> List[BaseMessage]:
+        """Force one ``search_vectorstore_hybrid`` call before the model answers.
+
+        The model can ignore an "always search first" prompt and answer from its
+        own weights, leaving ``source_documents`` empty (chat UI shows "Link
+        unavailable"). To enforce retrieval, prefill a completed tool round —
+        an ``AIMessage`` carrying the tool call plus the matching ``ToolMessage``
+        result — so the ReAct loop starts with real chunks already in context.
+        Invoking the existing retriever tool also runs its ``store_docs``
+        callback, so retrieved documents flow into ``source_documents``/links
+        exactly as a model-initiated search would. The model may still search
+        again. Gated by ``services.chat_app.force_initial_retrieval`` (default
+        on) so prompt-vs-enforcement variants can be A/B'd in the sweep.
+        """
+        if not getattr(self, "enable_vector_tools", False):
+            logger.debug("Forced retrieval skipped: vector tools disabled")
+            return messages
+        if not self._chat_app_config.get("force_initial_retrieval", True):
+            logger.debug("Forced retrieval skipped: force_initial_retrieval=false")
+            return messages
+        tools = getattr(self, "_vector_tools", None)
+        if not tools:
+            logger.debug("Forced retrieval skipped: no vector tools built")
+            return messages
+        # Only force on a fresh user turn (the latest message is the question).
+        if not messages or not isinstance(messages[-1], HumanMessage):
+            last_type = type(messages[-1]).__name__ if messages else "none"
+            logger.debug("Forced retrieval skipped: last message is %s", last_type)
+            return messages
+        query = (self._message_content(messages[-1]) or "").strip()
+        if not query:
+            logger.debug("Forced retrieval skipped: empty query")
+            return messages
+
+        try:
+            result = tools[0].invoke({"query": query})
+        except Exception:
+            # Fail open: a retrieval error must not break the chat turn.
+            logger.warning("Forced initial retrieval failed", exc_info=True)
+            return messages
+
+        logger.info("Forced initial retrieval ran: query=%r -> %d chars", query, len(str(result)))
+
+        call_id = f"forced_search_{uuid.uuid4().hex}"
+        ai = AIMessage(
+            content="",
+            tool_calls=[
+                {
+                    "name": "search_vectorstore_hybrid",
+                    "args": {"query": query},
+                    "id": call_id,
+                }
+            ],
+        )
+        tool_msg = ToolMessage(
+            content=result if isinstance(result, str) else str(result),
+            tool_call_id=call_id,
+            name="search_vectorstore_hybrid",
+        )
+        return list(messages) + [ai, tool_msg]