RAGAS prompt-sweep leaderboard, --config-dir fixes, and enforced retrieval#22
Conversation
Add a prompt-sweep workflow: generate_prompt_sweep.py renders one benchmarking config per prompt variant from a manifest, and ResultHandler.build_leaderboard aggregates each variant's mean RAGAS metrics into a ranked leaderboard (with a shared_context block and incomplete-row handling) emitted in the dump JSON. Model, queries, retriever and judge are held fixed so only the prompt varies. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… sweeps Four plumbing fixes so a prompt sweep actually runs end-to-end, each surfaced by a live --config-dir run: - config_manager: exempt services.benchmarking from the cross-config consistency check (it is the sweep axis); global + other services.* stay strict. - templates_manager: render a distinct file per config in multi-config mode (configs sharing a top-level name no longer collide onto one file). - config_seed: fall back to the first *.yaml when config.yaml is absent, so the chatbot-oriented seed step no longer aborts a benchmark deployment. - templates_manager: stage every config's agent_md_file, not just the first, so all sweep variants are available in the container. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The model can ignore an 'always search first' prompt and answer from its own weights, leaving source_documents empty (chat UI shows 'Link unavailable'). _inject_forced_retrieval prefills a completed search_vectorstore_hybrid tool round before the model's first turn, so retrieval always happens and its store_docs callback populates the source links. The model may still search again. Gated by services.chat_app.force_initial_retrieval (default on). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds prompt-sweep support to the benchmarking workflow (including a RAGAS leaderboard), fixes several archi evaluate --config-dir multi-config deployment issues, and enforces an initial vector retrieval in CMSCompOpsAgent to ensure source links populate.
Changes:
- Introduces a prompt-sweep generator script plus a leaderboard aggregator that ranks prompt variants by mean RAGAS metrics and emits results in the dump JSON.
- Fixes multi-config sweep plumbing in config rendering/staging and config-seed resolution to support
--config-dirruns reliably. - Adds a BaseReAct hook + CMSCompOpsAgent implementation to prefill a completed retrieval tool round before the model’s first turn (gated by config), with unit tests.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/test_stage_agents_multiconfig.py | Verifies multi-config benchmarking stages all agent prompt markdown files. |
| tests/unit/test_render_config_filename.py | Verifies per-config rendered YAML filenames are unique in multi-config mode. |
| tests/unit/test_prompt_sweep_leaderboard.py | Unit coverage for leaderboard aggregation (ranking, ties, incomplete metrics, shared-context drift). |
| tests/unit/test_prompt_sweep_dump_gating.py | Ensures dump JSON only includes leaderboard when populated. |
| tests/unit/test_generate_prompt_sweep.py | Tests the sweep-config generator’s output invariants and atomic failure behavior. |
| tests/unit/test_forced_retrieval.py | Tests forced initial retrieval injection behavior and gating/fail-open semantics. |
| tests/unit/test_config_seed_resolve.py | Tests config-seed fallback behavior when config.yaml is absent in multi-config renders. |
| tests/unit/test_config_manager_benchmarking_variation.py | Ensures cross-config consistency checking allows services.benchmarking variation only. |
| src/cli/tools/config_seed.py | Adds resolve_config_path to tolerate multi-config rendered directories. |
| src/cli/managers/templates_manager.py | Fixes multi-config staging (agents) and per-config render naming to avoid collisions. |
| src/cli/managers/config_manager.py | Exempts services.benchmarking from cross-config equality checks for sweeps. |
| src/bin/service_benchmark.py | Adds leaderboard computation + dump emission and logs a ranked summary. |
| src/archi/pipelines/agents/cms_comp_ops_agent.py | Implements forced initial retrieval injection using a prefilled tool round. |
| src/archi/pipelines/agents/base_react.py | Adds subclass hook and integrates it into agent input preparation. |
| scripts/benchmarking/generate_prompt_sweep.py | New script to generate one benchmarking config per prompt variant from a manifest. |
| docs/docs/benchmarking.md | Documentation for running prompt sweeps and interpreting the leaderboard output. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for bench_config in context.config_manager.get_configs(): | ||
| bench_services = bench_config.get("services", {}) or {} | ||
| benchmark_cfg = bench_services.get("benchmarking", {}) or {} | ||
| agent_md_file = benchmark_cfg.get("agent_md_file") | ||
| if not agent_md_file: |
There was a problem hiding this comment.
Fixed in 60dd655: _stage_agents now rejects two configs whose agent_md_file share a basename (they would overwrite each other when staged and when referenced by the rendered config) with a clear error instead of silently dropping one. Test: test_same_basename_different_files_is_rejected.
| stem = str(benchmarking_name or top_level_name) | ||
| candidate = f"{stem}.yaml" | ||
| if candidate in used_names: | ||
| candidate = f"{stem}_{index}.yaml" | ||
| used_names.add(candidate) | ||
| return candidate |
There was a problem hiding this comment.
Fixed in 60dd655: render_config_target_name now loops, bumping the suffix until the name is unique against all previously-used names, so a {stem}{index} that is already taken can no longer overwrite an earlier rendered file. Test: test_repeated_collisions_stay_unique.
aggregate_* is a pandas .mean() that skips NaN, so a RAGAS judge timeout on a
question silently shrinks that metric's sample without making the aggregate
NaN — a mean over 4 of 9 answered questions looked fully-backed. Add
scored_counts{metric: n} (non-NaN per-question count) to each leaderboard row,
keep query_count as the answered count, log a warning for under-sampled
metrics, and annotate the log table cells with @<n> when n < answered.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…out them The leaderboard unit tests import service_benchmark, whose top-level 'from datasets import Dataset' and 'from ragas import ...' pulled heavy benchmark-only deps absent from the lean unit-test CI env, breaking collection (ModuleNotFoundError: datasets). Move those imports into the two methods that use them (get_ragas_results, run) so the module imports for its pure helpers (ResultHandler.build_leaderboard/dump) without datasets/ragas. The benchmarking Docker image still has both at runtime. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…lisions Address Copilot review on PR #22: - _render_config_target_name now loops until the name is unique instead of disambiguating only once, so a config can never silently overwrite an earlier rendered file even if {stem}_{index} is already taken. - _stage_agents rejects two benchmark configs whose agent_md_file share a basename (they'd overwrite each other when staged and when referenced by the rendered config) with a clear error instead of silently dropping one. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Confirmed — both fixes look correct. All 16 unit tests covering the collision-hardening and agent-staging logic pass locally (commit |
After lazy-importing datasets/ragas, collection still failed in CI because importing service_benchmark ran module-level read_secret() calls and opened a Postgres connection pool (PostgresServiceFactory.from_env), which can't connect in the DB-less unit-test env. Move that initialization into _init_runtime(), called only from __main__, so importing the module for ResultHandler's pure helpers needs neither live secrets nor a database. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three feature-separated commits, each with tests.
1.
feat(benchmark): rank prompt variants on a RAGAS leaderboardscripts/benchmarking/generate_prompt_sweep.pyrenders one benchmarking config per prompt variant from a manifest.ResultHandler.build_leaderboardaggregates each variant's mean RAGAS metrics into a ranked leaderboard (withshared_contextand incomplete-row handling), emitted in the dump JSON.docs/docs/benchmarking.md.2.
fix(cli): makearchi evaluate --config-dirrun multi-config sweepsFour plumbing fixes, each surfaced by a live
--config-dirrun:services.benchmarkingfrom the cross-config consistency check (it's the sweep axis);global+ otherservices.*stay strict.nameno longer collide).*.yamlwhenconfig.yamlis absent, so the chatbot-oriented seed step no longer aborts a benchmark deployment.agent_md_file, not just the first.3.
feat(agent): enforce initial vector retrieval in CMSCompOpsAgent_inject_forced_retrievalprefills a completedsearch_vectorstore_hybridtool round before the model's first turn, so retrieval always happens and source links populate (the model can otherwise answer from weights, leaving "Link unavailable").services.chat_app.force_initial_retrieval(default on).Verification
shared_context.🤖 Generated with Claude Code