Outcome 5 (2/2): eval run --replicate K with statistical report by DianaMeda · Pull Request #176 · jumbocontext/jumbo.cli

DianaMeda · 2026-06-20T08:22:40Z

Building block: stacked on #175

⚠️ Stacked on #175 (Outcome 5 1/2: replication statistics aggregator). Merge #175 first. Until then this PR's diff includes #175's commit; once #175 lands in main, this reduces to just the 7 files here.

Why

Completes Outcome 5: turns the pure aggregator (#175) into a single command that produces a statistically grounded comparison artifact (lift as mean ± SD with a one-SD significance flag). This satisfies the GOAL.md Constraint that replication be "automatable as a single eval run --replicate K command."

What's here

cli/commands/run.ts — --replicate <K> (default 1). K<1 errors; 2 ≤ K < 5 warns that K≥5 is the minimum for a credible signal. For each harness the A/B comparison runs K times; the K ComparisonResults are aggregated via aggregateReplications() into a ReplicationReport, persisted by runId, and printed. K=1 leaves the existing single-run output/behavior unchanged.
storage — saveReplicationReport(runId, report) / getReplicationReport(runId) on the ResultStore interface and the JSON implementation (runs/<runId>/replication.json, retrievable after the process exits).
output/replication-display.ts — formatReplicationReport(): per dimension, mean lift ± SD, applicable replications, t-statistic, and SIGNAL/none.
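
To make the K rules above concrete, here is a minimal sketch of the validation and the sequential replicate loop. The names `checkReplicate` and `runReplications` are hypothetical and are not taken from cli/commands/run.ts; rejecting non-integer values is also an assumption, since the description only states that K<1 errors.

```typescript
// Hypothetical result type: either a usable K (with an optional warning) or an error.
type ReplicateCheck =
  | { ok: true; k: number; warning?: string }
  | { ok: false; error: string };

// Applies the rules from the description: default 1, K<1 errors, 2 <= K < 5 warns.
function checkReplicate(raw: string | undefined): ReplicateCheck {
  const k = raw === undefined ? 1 : Number(raw);
  // Assumption: non-integer input is treated the same way as K<1.
  if (!Number.isInteger(k) || k < 1) {
    return { ok: false, error: `--replicate must be an integer >= 1 (got ${raw})` };
  }
  if (k >= 2 && k < 5) {
    return { ok: true, k, warning: `K=${k}: K>=5 is the minimum for a credible signal` };
  }
  // K=1 (single-run default) and K>=5 pass without a warning.
  return { ok: true, k };
}

// Runs one comparison K times, strictly one after another, and collects the results.
async function runReplications<T>(
  k: number,
  runOnce: (replication: number) => Promise<T>,
): Promise<T[]> {
  const results: T[] = [];
  for (let i = 0; i < k; i++) {
    results.push(await runOnce(i));
  }
  return results;
}
```

In this sketch the caller would pass the collected results to aggregateReplications() only when K is greater than 1, which keeps the K=1 path identical to a plain single run.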

Notes

Execution is sequential. Parallel/interleaved execution is GOAL.md's Challenge item and is deliberately deferred; it needs a design pass for non-API-harness resource contention.
Multi-harness + replicate persists one report per runId (last harness wins) but displays all; single-harness is the primary path.
Reconciliation: the K<5 warning fires only for 2 ≤ K < 5, so K=1 (single-run default) stays warning-free and unchanged.

Verification

New tests: --replicate 5 → 5 runner calls/harness + report persisted & retrievable by runId; --replicate 1 unchanged (no report); K<1 errors; plus a JsonResultStore round-trip that survives a fresh store instance.
Full unit suite 366/366; tsc --noEmit clean.

🤖 Generated with Claude Code

Pure aggregator for K replicated A/B comparisons of the same scenario/harness, the foundation for reporting lift with statistical confidence (GOAL.md Outcome 5) instead of single-point estimates.

- domain/replication.ts: ReplicationReport + DimensionLiftStat { dimension, k, applicableReplications, meanJumbo, meanBaseline, meanLift, sdLift, tStatistic, isSignal }, re-exported from the domain barrel.
- analysis/replication-stats.ts: aggregateReplications(comparisons) -> ReplicationReport. Per dimension present in every replication: meanLift is the mean of (jumbo - baseline) per replication, sdLift the sample (n-1) SD; isSignal only when |meanLift| > sdLift (K<2 -> 0 SD, not a signal); tStatistic reported against the K=5 one-tailed alpha=0.05 threshold (t>2.13, df=4). token-efficiency replications that were N/A (maxScore 0) are excluded and counted in applicableReplications.

Pure, deterministic, no I/O. 7 new tests; full unit suite 361/361 green; tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
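
The per-dimension statistics described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in analysis/replication-stats.ts: `LiftPair`, `LiftStat`, and `liftStat` are hypothetical names, the t-statistic is assumed to be the one-sample form mean / (SD / sqrt(K)) on the per-replication lifts, and returning t = 0 when the SD is 0 is also an assumption.

```typescript
// One replication's scores for a single dimension (hypothetical shape).
interface LiftPair {
  jumbo: number;
  baseline: number;
}

// Subset of the DimensionLiftStat fields named in the commit message.
interface LiftStat {
  k: number;
  meanLift: number;
  sdLift: number;
  tStatistic: number;
  isSignal: boolean;
}

function liftStat(pairs: LiftPair[]): LiftStat {
  const k = pairs.length;
  // Lift is computed per replication as jumbo - baseline.
  const lifts = pairs.map((p) => p.jumbo - p.baseline);
  const meanLift = k === 0 ? 0 : lifts.reduce((sum, x) => sum + x, 0) / k;
  // Sample SD with the n-1 denominator; defined as 0 when K<2.
  const sdLift =
    k < 2
      ? 0
      : Math.sqrt(lifts.reduce((sum, x) => sum + (x - meanLift) ** 2, 0) / (k - 1));
  // Assumption: one-sample t on the lifts, reported as 0 when the SD is 0.
  const tStatistic = sdLift === 0 ? 0 : meanLift / (sdLift / Math.sqrt(k));
  // One-SD flag: a signal only when |meanLift| > sdLift, and never for K<2.
  const isSignal = k >= 2 && Math.abs(meanLift) > sdLift;
  return { k, meanLift, sdLift, tStatistic, isSignal };
}
```

For example, five replications with lifts 2, 4, 6, 8, 10 give a mean lift of 6, a sample SD of about 3.16, and t of about 4.24, which exceeds the 2.13 threshold quoted above for df=4.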

Adds the single-command replication driver on top of the aggregator (1/2):

- run.ts: `--replicate <K>` (default 1). K<1 errors; 2<=K<5 warns that K>=5 is the minimum for a credible signal. For each harness the A/B comparison runs K times; the K ComparisonResults are aggregated via aggregateReplications() into a ReplicationReport, persisted by runId, and printed. K=1 leaves the existing single-run output and behavior unchanged.
- ResultStore: saveReplicationReport(runId, report) / getReplicationReport(runId) on the interface and the JSON implementation (stored as runs/<runId>/replication.json, retrievable after the process exits).
- output/replication-display.ts: formatReplicationReport() renders per dimension the mean lift +/- SD, applicable replications, t-statistic, and SIGNAL/none.

Tests: cli-commands (replicate 5 -> 5 runner calls/harness + report persisted & retrievable by runId; replicate 1 unchanged, no report; K<1 errors) and a JsonResultStore round-trip that survives a fresh store instance. 366 unit tests pass; tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
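
As a rough illustration of the per-dimension rendering described for formatReplicationReport(), the sketch below formats one row per DimensionLiftStat. The field names come from the commit message for #175; the row layout, the two-decimal rounding, and the helper names `formatRow` and `formatRows` are assumptions, as is reading applicableReplications as the number of replications that were included.

```typescript
// Field list as given in the #175 commit message.
interface DimensionLiftStat {
  dimension: string;
  k: number;
  applicableReplications: number;
  meanJumbo: number;
  meanBaseline: number;
  meanLift: number;
  sdLift: number;
  tStatistic: number;
  isSignal: boolean;
}

// Renders one dimension as: name, mean lift ± SD, applicable/K, t, SIGNAL or none.
function formatRow(stat: DimensionLiftStat): string {
  const sign = stat.meanLift >= 0 ? "+" : "";
  return [
    stat.dimension,
    `${sign}${stat.meanLift.toFixed(2)} ± ${stat.sdLift.toFixed(2)}`,
    `n=${stat.applicableReplications}/${stat.k}`,
    `t=${stat.tStatistic.toFixed(2)}`,
    stat.isSignal ? "SIGNAL" : "none",
  ].join("  ");
}

// Joins all dimension rows into a printable block, one row per line.
function formatRows(stats: DimensionLiftStat[]): string {
  return stats.map(formatRow).join("\n");
}
```

Keeping the formatter a pure string function, as in this sketch, lets the CLI print the same text for every harness even though only one report is persisted per runId.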

DianaMeda and others added 2 commits June 20, 2026 10:10

DianaMeda wants to merge 2 commits into jumbocontext:main from DianaMeda:outcome-5-replicate-command
