Skip to content

RAGAS prompt-sweep leaderboard, --config-dir fixes, and enforced retrieval#22

Merged
swinney merged 7 commits into
devfrom
feat/ragas-prompt-sweep
Jun 10, 2026
Merged

RAGAS prompt-sweep leaderboard, --config-dir fixes, and enforced retrieval#22
swinney merged 7 commits into
devfrom
feat/ragas-prompt-sweep

Conversation

@swinney

@swinney swinney commented Jun 10, 2026

Copy link
Copy Markdown
Member

Three feature-separated commits, each with tests.

1. feat(benchmark): rank prompt variants on a RAGAS leaderboard

  • scripts/benchmarking/generate_prompt_sweep.py renders one benchmarking config per prompt variant from a manifest.
  • ResultHandler.build_leaderboard aggregates each variant's mean RAGAS metrics into a ranked leaderboard (with shared_context and incomplete-row handling), emitted in the dump JSON.
  • Model, queries, retriever, and judge are held fixed so only the prompt varies. Docs added to docs/docs/benchmarking.md.

2. fix(cli): make archi evaluate --config-dir run multi-config sweeps

Four plumbing fixes, each surfaced by a live --config-dir run:

  • config_manager: exempt services.benchmarking from the cross-config consistency check (it's the sweep axis); global + other services.* stay strict.
  • templates_manager: render a distinct file per config in multi-config mode (configs sharing a top-level name no longer collide).
  • config_seed: fall back to the first *.yaml when config.yaml is absent, so the chatbot-oriented seed step no longer aborts a benchmark deployment.
  • templates_manager: stage every config's agent_md_file, not just the first.

3. feat(agent): enforce initial vector retrieval in CMSCompOpsAgent

  • _inject_forced_retrieval prefills a completed search_vectorstore_hybrid tool round before the model's first turn, so retrieval always happens and source links populate (the model can otherwise answer from weights, leaving "Link unavailable").
  • The model may still search again. Gated by services.chat_app.force_initial_retrieval (default on).

Verification

  • 37 new unit tests pass; full suite green apart from one pre-existing unrelated ingestion failure.
  • Pyright: zero new diagnostics vs baseline across all changed files.
  • End-to-end: a live 3-variant sweep produced a real 3-row leaderboard with populated shared_context.

🤖 Generated with Claude Code

Austin Swinney and others added 3 commits June 9, 2026 22:26
Add a prompt-sweep workflow: generate_prompt_sweep.py renders one benchmarking
config per prompt variant from a manifest, and ResultHandler.build_leaderboard
aggregates each variant's mean RAGAS metrics into a ranked leaderboard (with a
shared_context block and incomplete-row handling) emitted in the dump JSON.
Model, queries, retriever and judge are held fixed so only the prompt varies.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… sweeps

Four plumbing fixes so a prompt sweep actually runs end-to-end, each surfaced by
a live --config-dir run:
- config_manager: exempt services.benchmarking from the cross-config consistency
  check (it is the sweep axis); global + other services.* stay strict.
- templates_manager: render a distinct file per config in multi-config mode
  (configs sharing a top-level name no longer collide onto one file).
- config_seed: fall back to the first *.yaml when config.yaml is absent, so the
  chatbot-oriented seed step no longer aborts a benchmark deployment.
- templates_manager: stage every config's agent_md_file, not just the first, so
  all sweep variants are available in the container.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The model can ignore an 'always search first' prompt and answer from its own
weights, leaving source_documents empty (chat UI shows 'Link unavailable').
_inject_forced_retrieval prefills a completed search_vectorstore_hybrid tool
round before the model's first turn, so retrieval always happens and its
store_docs callback populates the source links. The model may still search
again. Gated by services.chat_app.force_initial_retrieval (default on).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@swinney swinney requested a review from Copilot June 10, 2026 14:05

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds prompt-sweep support to the benchmarking workflow (including a RAGAS leaderboard), fixes several archi evaluate --config-dir multi-config deployment issues, and enforces an initial vector retrieval in CMSCompOpsAgent to ensure source links populate.

Changes:

  • Introduces a prompt-sweep generator script plus a leaderboard aggregator that ranks prompt variants by mean RAGAS metrics and emits results in the dump JSON.
  • Fixes multi-config sweep plumbing in config rendering/staging and config-seed resolution to support --config-dir runs reliably.
  • Adds a BaseReAct hook + CMSCompOpsAgent implementation to prefill a completed retrieval tool round before the model’s first turn (gated by config), with unit tests.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/unit/test_stage_agents_multiconfig.py Verifies multi-config benchmarking stages all agent prompt markdown files.
tests/unit/test_render_config_filename.py Verifies per-config rendered YAML filenames are unique in multi-config mode.
tests/unit/test_prompt_sweep_leaderboard.py Unit coverage for leaderboard aggregation (ranking, ties, incomplete metrics, shared-context drift).
tests/unit/test_prompt_sweep_dump_gating.py Ensures dump JSON only includes leaderboard when populated.
tests/unit/test_generate_prompt_sweep.py Tests the sweep-config generator’s output invariants and atomic failure behavior.
tests/unit/test_forced_retrieval.py Tests forced initial retrieval injection behavior and gating/fail-open semantics.
tests/unit/test_config_seed_resolve.py Tests config-seed fallback behavior when config.yaml is absent in multi-config renders.
tests/unit/test_config_manager_benchmarking_variation.py Ensures cross-config consistency checking allows services.benchmarking variation only.
src/cli/tools/config_seed.py Adds resolve_config_path to tolerate multi-config rendered directories.
src/cli/managers/templates_manager.py Fixes multi-config staging (agents) and per-config render naming to avoid collisions.
src/cli/managers/config_manager.py Exempts services.benchmarking from cross-config equality checks for sweeps.
src/bin/service_benchmark.py Adds leaderboard computation + dump emission and logs a ranked summary.
src/archi/pipelines/agents/cms_comp_ops_agent.py Implements forced initial retrieval injection using a prefilled tool round.
src/archi/pipelines/agents/base_react.py Adds subclass hook and integrates it into agent input preparation.
scripts/benchmarking/generate_prompt_sweep.py New script to generate one benchmarking config per prompt variant from a manifest.
docs/docs/benchmarking.md Documentation for running prompt sweeps and interpreting the leaderboard output.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +193 to +197
for bench_config in context.config_manager.get_configs():
bench_services = bench_config.get("services", {}) or {}
benchmark_cfg = bench_services.get("benchmarking", {}) or {}
agent_md_file = benchmark_cfg.get("agent_md_file")
if not agent_md_file:

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 60dd655: _stage_agents now rejects two configs whose agent_md_file share a basename (they would overwrite each other when staged and when referenced by the rendered config) with a clear error instead of silently dropping one. Test: test_same_basename_different_files_is_rejected.

Comment on lines +38 to +43
stem = str(benchmarking_name or top_level_name)
candidate = f"{stem}.yaml"
if candidate in used_names:
candidate = f"{stem}_{index}.yaml"
used_names.add(candidate)
return candidate

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 60dd655: render_config_target_name now loops, bumping the suffix until the name is unique against all previously-used names, so a {stem}{index} that is already taken can no longer overwrite an earlier rendered file. Test: test_repeated_collisions_stay_unique.

Austin Swinney and others added 3 commits June 10, 2026 10:12
aggregate_* is a pandas .mean() that skips NaN, so a RAGAS judge timeout on a
question silently shrinks that metric's sample without making the aggregate
NaN — a mean over 4 of 9 answered questions looked fully-backed. Add
scored_counts{metric: n} (non-NaN per-question count) to each leaderboard row,
keep query_count as the answered count, log a warning for under-sampled
metrics, and annotate the log table cells with @<n> when n < answered.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…out them

The leaderboard unit tests import service_benchmark, whose top-level
'from datasets import Dataset' and 'from ragas import ...' pulled heavy
benchmark-only deps absent from the lean unit-test CI env, breaking collection
(ModuleNotFoundError: datasets). Move those imports into the two methods that
use them (get_ragas_results, run) so the module imports for its pure helpers
(ResultHandler.build_leaderboard/dump) without datasets/ragas. The benchmarking
Docker image still has both at runtime.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lisions

Address Copilot review on PR #22:
- _render_config_target_name now loops until the name is unique instead of
  disambiguating only once, so a config can never silently overwrite an
  earlier rendered file even if {stem}_{index} is already taken.
- _stage_agents rejects two benchmark configs whose agent_md_file share a
  basename (they'd overwrite each other when staged and when referenced by
  the rendered config) with a clear error instead of silently dropping one.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI commented Jun 10, 2026

Copy link
Copy Markdown

Thanks @copilot — both findings were valid and are now fixed (commit 60dd655):

  1. _render_config_target_name single-disambiguation — it now loops, bumping the suffix until the name is uniq...

Confirmed — both fixes look correct. All 16 unit tests covering the collision-hardening and agent-staging logic pass locally (commit 60dd655). The lazy-import fix for the CI collection error (7f9b419) also checks out.

After lazy-importing datasets/ragas, collection still failed in CI because
importing service_benchmark ran module-level read_secret() calls and opened a
Postgres connection pool (PostgresServiceFactory.from_env), which can't connect
in the DB-less unit-test env. Move that initialization into _init_runtime(),
called only from __main__, so importing the module for ResultHandler's pure
helpers needs neither live secrets nor a database.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@swinney swinney merged commit b7bf345 into dev Jun 10, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants