Outcome 5 (2/2): eval run --replicate K with statistical report #176

Open
DianaMeda wants to merge 2 commits into jumbocontext:main from DianaMeda:outcome-5-replicate-command

Conversation

@DianaMeda

Contributor

Building block: stacked on #175

⚠️ Stacked on #175 (Outcome 5 1/2: replication statistics aggregator). Merge #175 first. Until then this PR's diff includes #175's commit; once #175 lands in main, this reduces to just the 7 files here.

Why

Completes Outcome 5: turns the pure aggregator (#175) into a single command that produces a statistically grounded comparison artifact, reporting lift as mean ± SD with a one-SD significance flag. This satisfies the GOAL.md Constraint that replication be "automatable as a single eval run --replicate K command."

What's here

  • cli/commands/run.ts: --replicate <K> (default 1). K<1 errors; 2 ≤ K < 5 warns that K≥5 is the minimum for a credible signal. For each harness the A/B comparison runs K times; the K ComparisonResults are aggregated via aggregateReplications() into a ReplicationReport, persisted by runId, and printed. K=1 leaves the existing single-run output/behavior unchanged.
  • storage: saveReplicationReport(runId, report) / getReplicationReport(runId) on the ResultStore interface and the JSON implementation (runs/<runId>/replication.json, retrievable after the process exits).
  • output/replication-display.ts: formatReplicationReport() renders, per dimension, mean lift ± SD, applicable replications, t-statistic, and SIGNAL/none.
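
For reviewers, the --replicate handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in run.ts: the ComparisonResult shape, the Runner type, and the names validateReplicate and runReplicated are hypothetical, and the integer check is an added assumption beyond the "K<1 errors" rule.

```typescript
// Hypothetical, simplified shape; the real ComparisonResult is richer.
interface ComparisonResult {
  harness: string;
  lift: number; // jumbo score minus baseline score for one run
}

// Hypothetical runner: performs one A/B comparison for a harness.
type Runner = (harness: string) => Promise<ComparisonResult>;

// K < 1 is an error; 2 <= K < 5 only produces a warning string.
// Returns undefined when no warning applies (K = 1 or K >= 5).
function validateReplicate(k: number): string | undefined {
  // Assumption: non-integer values are rejected along with K < 1.
  if (!Number.isInteger(k) || k < 1) {
    throw new Error(`--replicate must be an integer >= 1, got ${k}`);
  }
  if (k >= 2 && k < 5) {
    return `--replicate ${k}: K>=5 is the minimum for a credible signal`;
  }
  return undefined;
}

// Runs the comparison K times, one after another (sequential execution).
async function runReplicated(
  harness: string,
  k: number,
  runner: Runner,
): Promise<ComparisonResult[]> {
  const results: ComparisonResult[] = [];
  for (let i = 0; i < k; i++) {
    results.push(await runner(harness));
  }
  return results;
}
```

Returning the warning instead of printing it keeps the check pure, so the K=1 path can be asserted to be warning-free in a unit test.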

Notes

  • Execution is sequential (parallel/interleaved is GOAL.md's Challenge item, deliberately deferred because it needs a design pass for non-API-harness resource contention).
  • Multi-harness + replicate persists one report per runId (last harness wins) but displays all; single-harness is the primary path.
  • Reconciliation: the K<5 warning fires only for 2 ≤ K < 5, so K=1 (single-run default) stays warning-free and unchanged.

Verification

  • New tests: --replicate 5 → 5 runner calls/harness + report persisted & retrievable by runId; --replicate 1 unchanged (no report); K<1 errors; plus a JsonResultStore round-trip that survives a fresh store instance.
  • Full unit suite 366/366; tsc --noEmit clean.

🤖 Generated with Claude Code

DianaMeda and others added 2 commits June 20, 2026 10:10
Pure aggregator for K replicated A/B comparisons of the same scenario/harness,
the foundation for reporting lift with statistical confidence (GOAL.md
Outcome 5) instead of single-point estimates.

- domain/replication.ts: ReplicationReport + DimensionLiftStat
  { dimension, k, applicableReplications, meanJumbo, meanBaseline, meanLift,
    sdLift, tStatistic, isSignal }, re-exported from the domain barrel.
- analysis/replication-stats.ts: aggregateReplications(comparisons) ->
  ReplicationReport. Per dimension present in every replication: meanLift is the
  mean of (jumbo - baseline) per replication, sdLift the sample (n-1) SD;
  isSignal only when |meanLift| > sdLift (K<2 -> 0 SD, not a signal); tStatistic
  reported against the K=5 one-tailed alpha=0.05 threshold (t>2.13, df=4).
  token-efficiency replications that were N/A (maxScore 0) are excluded and
  counted in applicableReplications. Pure, deterministic, no I/O.
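
A minimal sketch of the per-dimension arithmetic described above, for one dimension only. The ScorePair and LiftStat shapes and the liftStat name are hypothetical. The t-statistic formula is also an assumption, since the commit message does not spell it out: this sketch uses the one-sample form meanLift / (sdLift / sqrt(K)) and reports 0 when the SD is 0.

```typescript
// Hypothetical input: one (jumbo, baseline) score pair per replication.
interface ScorePair {
  jumbo: number;
  baseline: number;
}

// Hypothetical subset of the DimensionLiftStat fields.
interface LiftStat {
  k: number;
  meanLift: number;
  sdLift: number;
  tStatistic: number;
  isSignal: boolean;
}

function liftStat(pairs: ScorePair[]): LiftStat {
  const k = pairs.length;
  if (k === 0) throw new Error("need at least one replication");

  // Lift per replication is jumbo minus baseline.
  const lifts = pairs.map((p) => p.jumbo - p.baseline);
  const meanLift = lifts.reduce((a, b) => a + b, 0) / k;

  // Sample (n-1) standard deviation; taken as 0 when K < 2.
  const sumSq = lifts.reduce((a, x) => a + (x - meanLift) ** 2, 0);
  const sdLift = k < 2 ? 0 : Math.sqrt(sumSq / (k - 1));

  // Assumed one-sample t; 0 when SD is 0 to avoid dividing by zero.
  const tStatistic = sdLift === 0 ? 0 : meanLift / (sdLift / Math.sqrt(k));

  // One-SD flag: never a signal with fewer than 2 replications.
  const isSignal = k >= 2 && Math.abs(meanLift) > sdLift;

  return { k, meanLift, sdLift, tStatistic, isSignal };
}
```

With lifts of 1, 2, 3, 4, 5 this gives a mean of 3 and a sample SD of sqrt(2.5) ≈ 1.58, so the one-SD flag is set, and the assumed t-statistic of about 4.24 clears the quoted 2.13 threshold for df=4.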

7 new tests; full unit suite 361/361 green; tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds the single-command replication driver on top of the aggregator (1/2):

- run.ts: `--replicate <K>` (default 1). K<1 errors; 2<=K<5 warns that K>=5 is
  the minimum for a credible signal. For each harness the A/B comparison runs K
  times; the K ComparisonResults are aggregated via aggregateReplications() into
  a ReplicationReport, persisted by runId, and printed. K=1 leaves the existing
  single-run output and behavior unchanged.
- ResultStore: saveReplicationReport(runId, report) / getReplicationReport(runId)
  on the interface and the JSON implementation (stored as runs/<runId>/
  replication.json, retrievable after the process exits).
- output/replication-display.ts: formatReplicationReport() renders per dimension
  the mean lift +/- SD, applicable replications, t-statistic, and SIGNAL/none.
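
To make the rendered fields concrete, here is a rough sketch of formatting a single dimension. The line layout, the formatLiftLine name, and the "accuracy" dimension in the usage note are hypothetical; only the field names come from DimensionLiftStat as listed in the aggregator commit.

```typescript
// Subset of the DimensionLiftStat fields named in the aggregator commit.
interface DimensionLiftStat {
  dimension: string;
  k: number;
  applicableReplications: number;
  meanLift: number;
  sdLift: number;
  tStatistic: number;
  isSignal: boolean;
}

// Hypothetical one-line layout: lift ± SD, applicable count, t, and flag.
function formatLiftLine(s: DimensionLiftStat): string {
  const sign = s.meanLift >= 0 ? "+" : ""; // negative values carry their own sign
  const flag = s.isSignal ? "SIGNAL" : "none";
  return (
    `${s.dimension}: ${sign}${s.meanLift.toFixed(2)} ± ${s.sdLift.toFixed(2)} ` +
    `(n=${s.applicableReplications}/${s.k}, t=${s.tStatistic.toFixed(2)}) ${flag}`
  );
}
```

For a stat with dimension "accuracy", k 5, applicableReplications 5, meanLift 3, sdLift 1.5, tStatistic 4.2426 and isSignal true, this returns `accuracy: +3.00 ± 1.50 (n=5/5, t=4.24) SIGNAL`.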

Tests: cli-commands (replicate 5 -> 5 runner calls/harness + report persisted &
retrievable by runId; replicate 1 unchanged, no report; K<1 errors) and a
JsonResultStore round-trip that survives a fresh store instance.

366 unit tests pass; tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>