Outcome 5 (2/2): eval run --replicate K with statistical report #176
Open
DianaMeda wants to merge 2 commits into
Pure aggregator for K replicated A/B comparisons of the same scenario/harness,
the foundation for reporting lift with statistical confidence (GOAL.md
Outcome 5) instead of single-point estimates.
- domain/replication.ts: ReplicationReport + DimensionLiftStat
{ dimension, k, applicableReplications, meanJumbo, meanBaseline, meanLift,
sdLift, tStatistic, isSignal }, re-exported from the domain barrel.
- analysis/replication-stats.ts: aggregateReplications(comparisons) ->
ReplicationReport. Per dimension present in every replication: meanLift is the
mean of (jumbo - baseline) per replication, sdLift the sample (n-1) SD;
isSignal only when |meanLift| > sdLift (K<2 -> 0 SD, not a signal); tStatistic
reported against the K=5 one-tailed alpha=0.05 threshold (t>2.13, df=4).
Token-efficiency replications that were N/A (maxScore 0) are excluded from the
statistics, and the exclusion is reflected in applicableReplications. Pure,
deterministic, no I/O.
7 new tests; full unit suite 361/361 green; tsc --noEmit clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
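The per-dimension statistics described in this commit message can be sketched as follows. This is a minimal illustration, not the code in analysis/replication-stats.ts: the `ScorePair` shape and the `liftStat` helper are hypothetical, and the t-statistic formula (mean lift divided by its standard error) is an assumption, since the commit message only says the statistic is compared against the K=5 threshold of 2.13.

```typescript
// Hypothetical input shape: one jumbo/baseline score pair per replication.
interface ScorePair {
  jumbo: number;
  baseline: number;
}

// Simplified subset of the DimensionLiftStat fields named in the commit message.
interface LiftStat {
  dimension: string;
  k: number;
  meanJumbo: number;
  meanBaseline: number;
  meanLift: number;
  sdLift: number;
  tStatistic: number;
  isSignal: boolean;
}

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

// Sample (n-1) standard deviation; defined as 0 when fewer than 2 values.
function sampleSd(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  const sumSquares = xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
  return Math.sqrt(sumSquares / (xs.length - 1));
}

function liftStat(dimension: string, pairs: ScorePair[]): LiftStat {
  // Lift is computed per replication, then averaged.
  const lifts = pairs.map((p) => p.jumbo - p.baseline);
  const k = lifts.length;
  const meanLift = mean(lifts);
  const sdLift = sampleSd(lifts);
  // Assumed formula: one-sample t on the lifts. Returning 0 when the SD is 0
  // is a choice made for this sketch only.
  const tStatistic = sdLift > 0 ? meanLift / (sdLift / Math.sqrt(k)) : 0;
  // K<2 yields an SD of 0 and is never treated as a signal.
  const isSignal = k >= 2 && Math.abs(meanLift) > sdLift;
  return {
    dimension,
    k,
    meanJumbo: mean(pairs.map((p) => p.jumbo)),
    meanBaseline: mean(pairs.map((p) => p.baseline)),
    meanLift,
    sdLift,
    tStatistic,
    isSignal,
  };
}
```

With five replications whose lifts are 1, 2, 3, 4 and 5, this gives a mean lift of 3 and a sample SD of about 1.58, so the one-SD rule flags a signal; a single replication never does.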
Adds the single-command replication driver on top of the aggregator (1/2):
- run.ts: `--replicate <K>` (default 1). K<1 errors; 2<=K<5 warns that K>=5 is the minimum for a credible signal. For each harness the A/B comparison runs K times; the K ComparisonResults are aggregated via aggregateReplications() into a ReplicationReport, persisted by runId, and printed. K=1 leaves the existing single-run output and behavior unchanged.
- ResultStore: saveReplicationReport(runId, report) / getReplicationReport(runId) on the interface and the JSON implementation (stored as runs/<runId>/replication.json, retrievable after the process exits).
- output/replication-display.ts: formatReplicationReport() renders per dimension the mean lift +/- SD, applicable replications, t-statistic, and SIGNAL/none.
Tests: cli-commands (replicate 5 -> 5 runner calls/harness + report persisted & retrievable by runId; replicate 1 unchanged, no report; K<1 errors) and a JsonResultStore round-trip that survives a fresh store instance.
366 unit tests pass; tsc --noEmit clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
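The `--replicate <K>` rules stated above (default 1, K<1 errors, 2<=K<5 warns) could be checked along these lines. This is a sketch only: `checkReplicate`, the `ReplicateCheck` type and the message wording are hypothetical, and rejecting non-integer values is an assumption that the commit message does not mention.

```typescript
// Hypothetical result type for validating the raw --replicate value.
type ReplicateCheck =
  | { ok: true; k: number; warning?: string }
  | { ok: false; error: string };

function checkReplicate(raw: string | undefined): ReplicateCheck {
  // An omitted flag falls back to the documented default of 1.
  const k = raw === undefined ? 1 : Number(raw);
  // K<1 is an error per the commit message; non-integers are rejected by assumption.
  if (!Number.isInteger(k) || k < 1) {
    return { ok: false, error: `--replicate must be an integer >= 1, got "${raw}"` };
  }
  // The warning fires only for 2 <= K < 5, so K=1 stays warning-free.
  if (k >= 2 && k < 5) {
    return { ok: true, k, warning: "K>=5 is the minimum for a credible signal" };
  }
  return { ok: true, k };
}
```

Keeping the check as a pure function that returns a result, instead of printing or exiting, would let the command decide how to surface the warning and makes the three cases easy to unit test.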
Building block: stacked on #175
Why
Completes Outcome 5: turns the pure aggregator (#175) into a single command that produces a statistically-grounded comparison artifact (lift as mean ± SD with a one-SD significance flag), exactly the GOAL.md Constraint that replication be "automatable as a single `eval run --replicate K` command."
What's here
- `cli/commands/run.ts`: `--replicate <K>` (default 1). `K<1` errors; `2 ≤ K < 5` warns that K≥5 is the minimum for a credible signal. For each harness the A/B comparison runs K times; the K `ComparisonResult`s are aggregated via `aggregateReplications()` into a `ReplicationReport`, persisted by `runId`, and printed. K=1 leaves the existing single-run output/behavior unchanged.
- `storage`: `saveReplicationReport(runId, report)` / `getReplicationReport(runId)` on the `ResultStore` interface and the JSON implementation (`runs/<runId>/replication.json`, retrievable after the process exits).
- `output/replication-display.ts`: `formatReplicationReport()`: per dimension, mean lift ± SD, applicable replications, t-statistic, and `SIGNAL`/`none`.
Notes
- `runId` (last harness wins) but displays all; single-harness is the primary path.
- `K<5` warning fires only for `2 ≤ K < 5`, so `K=1` (single-run default) stays warning-free and unchanged.
Verification
- `--replicate 5` → 5 runner calls/harness + report persisted & retrievable by `runId`; `--replicate 1` unchanged (no report); `K<1` errors; plus a `JsonResultStore` round-trip that survives a fresh store instance.
- `tsc --noEmit` clean.
🤖 Generated with Claude Code
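For illustration, one line of the per-dimension output described under "What's here" might be rendered as below. The real layout produced by `formatReplicationReport()` is not shown in this PR description, so the `LiftRow` shape, the `formatRow` helper, the two-decimal rounding and the "applicable/k" presentation are all assumptions made for this sketch.

```typescript
// Hypothetical row shape, using field names from the DimensionLiftStat description.
interface LiftRow {
  dimension: string;
  k: number;
  applicableReplications: number;
  meanLift: number;
  sdLift: number;
  tStatistic: number;
  isSignal: boolean;
}

// Renders: "<dimension>: <mean lift> ± <SD> (<applicable>/<k> applicable, t=<t>) SIGNAL|none"
function formatRow(r: LiftRow): string {
  // Negative numbers already carry their sign; add "+" for non-negative lifts.
  const sign = r.meanLift >= 0 ? "+" : "";
  const lift = `${sign}${r.meanLift.toFixed(2)} ± ${r.sdLift.toFixed(2)}`;
  const detail = `${r.applicableReplications}/${r.k} applicable, t=${r.tStatistic.toFixed(2)}`;
  return `${r.dimension}: ${lift} (${detail}) ${r.isSignal ? "SIGNAL" : "none"}`;
}

console.log(
  formatRow({
    dimension: "accuracy",
    k: 5,
    applicableReplications: 5,
    meanLift: 3,
    sdLift: 1.5811,
    tStatistic: 4.2426,
    isSignal: true,
  }),
);
// prints "accuracy: +3.00 ± 1.58 (5/5 applicable, t=4.24) SIGNAL"
```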