RAGAS prompt-sweep leaderboard, --config-dir fixes, and enforced retrieval by swinney · Pull Request #22 · fasrc/archi

swinney · 2026-06-10T13:07:02Z

Three feature-separated commits, each with tests.

1. `feat(benchmark)`: rank prompt variants on a RAGAS leaderboard

scripts/benchmarking/generate_prompt_sweep.py renders one benchmarking config per prompt variant from a manifest.
ResultHandler.build_leaderboard aggregates each variant's mean RAGAS metrics into a ranked leaderboard (with shared_context and incomplete-row handling), emitted in the dump JSON.
Model, queries, retriever, and judge are held fixed so only the prompt varies. Docs added to docs/docs/benchmarking.md.

2. `fix(cli)`: make `archi evaluate --config-dir` run multi-config sweeps

Four plumbing fixes, each surfaced by a live --config-dir run:

config_manager: exempt services.benchmarking from the cross-config consistency check (it's the sweep axis); global + other services.* stay strict.
templates_manager: render a distinct file per config in multi-config mode (configs sharing a top-level name no longer collide).
config_seed: fall back to the first *.yaml when config.yaml is absent, so the chatbot-oriented seed step no longer aborts a benchmark deployment.
templates_manager: stage every config's agent_md_file, not just the first.

3. `feat(agent)`: enforce initial vector retrieval in CMSCompOpsAgent

_inject_forced_retrieval prefills a completed search_vectorstore_hybrid tool round before the model's first turn, so retrieval always happens and source links populate (the model can otherwise answer from weights, leaving "Link unavailable").
The model may still search again. Gated by services.chat_app.force_initial_retrieval (default on).

Verification

37 new unit tests pass; full suite green apart from one pre-existing unrelated ingestion failure.
Pyright: zero new diagnostics vs baseline across all changed files.
End-to-end: a live 3-variant sweep produced a real 3-row leaderboard with populated shared_context.

🤖 Generated with Claude Code

Add a prompt-sweep workflow: generate_prompt_sweep.py renders one benchmarking config per prompt variant from a manifest, and ResultHandler.build_leaderboard aggregates each variant's mean RAGAS metrics into a ranked leaderboard (with a shared_context block and incomplete-row handling) emitted in the dump JSON. Model, queries, retriever and judge are held fixed so only the prompt varies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… sweeps Four plumbing fixes so a prompt sweep actually runs end-to-end, each surfaced by a live --config-dir run: - config_manager: exempt services.benchmarking from the cross-config consistency check (it is the sweep axis); global + other services.* stay strict. - templates_manager: render a distinct file per config in multi-config mode (configs sharing a top-level name no longer collide onto one file). - config_seed: fall back to the first *.yaml when config.yaml is absent, so the chatbot-oriented seed step no longer aborts a benchmark deployment. - templates_manager: stage every config's agent_md_file, not just the first, so all sweep variants are available in the container. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The model can ignore an 'always search first' prompt and answer from its own weights, leaving source_documents empty (chat UI shows 'Link unavailable'). _inject_forced_retrieval prefills a completed search_vectorstore_hybrid tool round before the model's first turn, so retrieval always happens and its store_docs callback populates the source links. The model may still search again. Gated by services.chat_app.force_initial_retrieval (default on). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Adds prompt-sweep support to the benchmarking workflow (including a RAGAS leaderboard), fixes several archi evaluate --config-dir multi-config deployment issues, and enforces an initial vector retrieval in CMSCompOpsAgent to ensure source links populate.

Changes:

Introduces a prompt-sweep generator script plus a leaderboard aggregator that ranks prompt variants by mean RAGAS metrics and emits results in the dump JSON.
Fixes multi-config sweep plumbing in config rendering/staging and config-seed resolution to support --config-dir runs reliably.
Adds a BaseReAct hook + CMSCompOpsAgent implementation to prefill a completed retrieval tool round before the model’s first turn (gated by config), with unit tests.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/unit/test_stage_agents_multiconfig.py	Verifies multi-config benchmarking stages all agent prompt markdown files.
tests/unit/test_render_config_filename.py	Verifies per-config rendered YAML filenames are unique in multi-config mode.
tests/unit/test_prompt_sweep_leaderboard.py	Unit coverage for leaderboard aggregation (ranking, ties, incomplete metrics, shared-context drift).
tests/unit/test_prompt_sweep_dump_gating.py	Ensures dump JSON only includes `leaderboard` when populated.
tests/unit/test_generate_prompt_sweep.py	Tests the sweep-config generator’s output invariants and atomic failure behavior.
tests/unit/test_forced_retrieval.py	Tests forced initial retrieval injection behavior and gating/fail-open semantics.
tests/unit/test_config_seed_resolve.py	Tests config-seed fallback behavior when `config.yaml` is absent in multi-config renders.
tests/unit/test_config_manager_benchmarking_variation.py	Ensures cross-config consistency checking allows `services.benchmarking` variation only.
src/cli/tools/config_seed.py	Adds `resolve_config_path` to tolerate multi-config rendered directories.
src/cli/managers/templates_manager.py	Fixes multi-config staging (agents) and per-config render naming to avoid collisions.
src/cli/managers/config_manager.py	Exempts `services.benchmarking` from cross-config equality checks for sweeps.
src/bin/service_benchmark.py	Adds leaderboard computation + dump emission and logs a ranked summary.
src/archi/pipelines/agents/cms_comp_ops_agent.py	Implements forced initial retrieval injection using a prefilled tool round.
src/archi/pipelines/agents/base_react.py	Adds subclass hook and integrates it into agent input preparation.
scripts/benchmarking/generate_prompt_sweep.py	New script to generate one benchmarking config per prompt variant from a manifest.
docs/docs/benchmarking.md	Documentation for running prompt sweeps and interpreting the leaderboard output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

swinney · 2026-06-10T18:38:28Z

+            for bench_config in context.config_manager.get_configs():
+                bench_services = bench_config.get("services", {}) or {}
+                benchmark_cfg = bench_services.get("benchmarking", {}) or {}
+                agent_md_file = benchmark_cfg.get("agent_md_file")
+                if not agent_md_file:


Fixed in 60dd655: _stage_agents now rejects two configs whose agent_md_file share a basename (they would overwrite each other when staged and when referenced by the rendered config) with a clear error instead of silently dropping one. Test: test_same_basename_different_files_is_rejected.

swinney · 2026-06-10T18:38:30Z

+    stem = str(benchmarking_name or top_level_name)
+    candidate = f"{stem}.yaml"
+    if candidate in used_names:
+        candidate = f"{stem}_{index}.yaml"
+    used_names.add(candidate)
+    return candidate


Fixed in 60dd655: render_config_target_name now loops, bumping the suffix until the name is unique against all previously-used names, so a {stem}{index} that is already taken can no longer overwrite an earlier rendered file. Test: test_repeated_collisions_stay_unique.

aggregate_* is a pandas .mean() that skips NaN, so a RAGAS judge timeout on a question silently shrinks that metric's sample without making the aggregate NaN — a mean over 4 of 9 answered questions looked fully-backed. Add scored_counts{metric: n} (non-NaN per-question count) to each leaderboard row, keep query_count as the answered count, log a warning for under-sampled metrics, and annotate the log table cells with @<n> when n < answered. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…out them The leaderboard unit tests import service_benchmark, whose top-level 'from datasets import Dataset' and 'from ragas import ...' pulled heavy benchmark-only deps absent from the lean unit-test CI env, breaking collection (ModuleNotFoundError: datasets). Move those imports into the two methods that use them (get_ragas_results, run) so the module imports for its pure helpers (ResultHandler.build_leaderboard/dump) without datasets/ragas. The benchmarking Docker image still has both at runtime. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…lisions Address Copilot review on PR #22: - _render_config_target_name now loops until the name is unique instead of disambiguating only once, so a config can never silently overwrite an earlier rendered file even if {stem}_{index} is already taken. - _stage_agents rejects two benchmark configs whose agent_md_file share a basename (they'd overwrite each other when staged and when referenced by the rendered config) with a clear error instead of silently dropping one. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot · 2026-06-10T18:10:20Z

Thanks @copilot — both findings were valid and are now fixed (commit 60dd655):

_render_config_target_name single-disambiguation — it now loops, bumping the suffix until the name is uniq...

Confirmed — both fixes look correct. All 16 unit tests covering the collision-hardening and agent-staging logic pass locally (commit 60dd655). The lazy-import fix for the CI collection error (7f9b419) also checks out.

After lazy-importing datasets/ragas, collection still failed in CI because importing service_benchmark ran module-level read_secret() calls and opened a Postgres connection pool (PostgresServiceFactory.from_env), which can't connect in the DB-less unit-test env. Move that initialization into _init_runtime(), called only from __main__, so importing the module for ResultHandler's pure helpers needs neither live secrets nor a database. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Austin Swinney and others added 3 commits June 9, 2026 22:26

swinney requested a review from Copilot June 10, 2026 14:05

Copilot AI reviewed Jun 10, 2026

View reviewed changes

Austin Swinney and others added 3 commits June 10, 2026 10:12

Copilot started work on behalf of swinney June 10, 2026 18:07 View session

Copilot finished work on behalf of swinney June 10, 2026 18:10

swinney merged commit b7bf345 into dev Jun 10, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RAGAS prompt-sweep leaderboard, --config-dir fixes, and enforced retrieval#22

RAGAS prompt-sweep leaderboard, --config-dir fixes, and enforced retrieval#22
swinney merged 7 commits into
devfrom
feat/ragas-prompt-sweep

swinney commented Jun 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

swinney Jun 10, 2026

Uh oh!

swinney Jun 10, 2026

Uh oh!

Copilot AI commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

swinney commented Jun 10, 2026

1. feat(benchmark): rank prompt variants on a RAGAS leaderboard

2. fix(cli): make archi evaluate --config-dir run multi-config sweeps

3. feat(agent): enforce initial vector retrieval in CMSCompOpsAgent

Verification

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

swinney Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

swinney Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI commented Jun 10, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

1. `feat(benchmark)`: rank prompt variants on a RAGAS leaderboard

2. `fix(cli)`: make `archi evaluate --config-dir` run multi-config sweeps

3. `feat(agent)`: enforce initial vector retrieval in CMSCompOpsAgent