v0.35.2.0 feat(eval): temporal-aware contradiction probe + verdict enum #1052
Open
garrytan wants to merge 11 commits into
Field report on residual HIGH findings from gbrain eval suspected-contradictions and proposal for a 4-phase fix (Phase 1 = judge prompt + verdict enum is the recommended starting point). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lane A1 of the temporal-contradiction-probe wave. Threads page-level effective_date through the search projection into the contradiction judge so the LLM can reason about supersession instead of treating every dated pair as a contradiction.
Changes:
- SearchResult interface adds optional effective_date + effective_date_source fields; rowToSearchResult populates them from the row data with date-only YYYY-MM-DD normalization (handles both postgres.js Date and PGLite string).
- 8 SELECT projection sites (3 in postgres-engine, 5 in pglite-engine) now carry p.effective_date + p.effective_date_source through their inner CTEs and outer SELECTs so search results expose the field on both engines.
- PairMember (eval-contradictions/types.ts) gets the two fields as required (string | null) so the type forces every constructor to think about temporal anchoring. Runner's searchResultToMember + takeToMember handle the normalization; takes inherit the chunk's page-level date.
- buildJudgePrompt emits `Statement A (from: YYYY-MM-DD)` when effective_date is non-null, else `(date unknown)`. Prompt instructions explain the tag so the model knows what to do with it.
- PROMPT_VERSION bumps '1' → '2'. Cache-key tuple shape unchanged; old rows miss naturally on first run against the new prompt.
Test fixtures in 5 files updated to include the new required fields. All 205 eval-contradictions unit tests + 101 search-related tests pass. Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
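The date normalization and prompt tag described in Lane A1 could look roughly like the sketch below. It is illustrative only: `toDateOnly` and `statementHeader` are hypothetical names (the commit names `rowToSearchResult` and `buildJudgePrompt` but does not show their bodies), and the exact wording of the undated header is an assumption based on the `(date unknown)` tag mentioned above.

```typescript
// Hypothetical sketch; not the code from this PR.
type RawDate = Date | string | null | undefined;

/** Normalize a driver value (Date object or string) to YYYY-MM-DD, or null. */
function toDateOnly(raw: RawDate): string | null {
  if (raw == null) return null;
  if (raw instanceof Date) {
    // toISOString() is UTC-based; an invalid Date yields null instead of throwing.
    return Number.isNaN(raw.getTime()) ? null : raw.toISOString().slice(0, 10);
  }
  // Strings: keep only a leading date part, dropping any time component.
  const match = /^(\d{4}-\d{2}-\d{2})/.exec(raw.trim());
  return match?.[1] ?? null;
}

/** Render the header line the judge sees for one side of a pair. */
function statementHeader(label: "A" | "B", effectiveDate: string | null): string {
  return effectiveDate
    ? `Statement ${label} (from: ${effectiveDate})`
    : `Statement ${label} (date unknown)`;
}
```

Keeping the normalization separate from the rendering means both engines can feed the same prompt builder regardless of which date representation the driver returns.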
Lane A2 of the temporal-contradiction-probe wave. Expands the judge's classification vocabulary from a binary contradicts:bool to a six-member verdict enum so the probe can distinguish "this changed" from "this is wrong".
Verdict taxonomy:
- no_contradiction: drop from findings
- contradiction: genuine conflict at same point in time
- temporal_supersession: newer claim updates/replaces older; not an error
- temporal_regression: metric/status went backwards over time (signal)
- temporal_evolution: legitimate change, neither supersession nor regression
- negation_artifact: judge misread an explicit negation
Changes:
- types.ts: Verdict union (6 members); Severity gains 'info'; ResolutionKind extended with temporal_supersede, flag_for_review, log_timeline_change; JudgeVerdict.contradicts → verdict; ContradictionFinding now carries verdict; ProbeReport adds queries_with_any_finding + verdict_breakdown (additive).
- judge.ts: parseResolutionKind + parseVerdict guards; normalizeVerdict reads the new field and applies the C1 confidence floor only to verdict='contradiction' (the new verdicts are informational classifications, no floor). Prompt rubric rewritten to ask for verdict + extended severity scale.
- severity-classify.ts: 'info' joins the rank with value 0; defaultSeverityForVerdict maps each verdict to its baseline severity (D7: supersession=info, regression=high, etc.). parseSeverity gains a fallback param so consumers can override the 'low' default.
- auto-supersession.ts: classifyResolution + renderResolutionCommand handle the three new resolution kinds. Probe still NEVER auto-mutates; the new kinds render paste-ready commands or informational lines.
- cache.ts: isJudgeVerdict shape check matches the new verdict field; old v1 rows fail the guard and are treated as misses.
- runner.ts: emit predicate at cache-hit and judge-success branches changes from `verdict.contradicts` to `verdict.verdict !== 'no_contradiction'`. Without this, the new verdicts vanish from the report. Added per-verdict tally + queriesWithAnyFinding alongside the strict queriesWithContradiction.
- trends.ts: latest run verdict breakdown surfaces in the trend chart.
Test fixtures updated across 8 test files. All 210 eval-contradictions unit tests pass. Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
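As a rough illustration of the Lane A2 shape, the sketch below models the six-member verdict union, a guard for untrusted judge output, a baseline severity mapping, and the emit predicate. The function names follow the commit message, but the bodies are assumptions. In particular, only supersession=info, regression=high, and the 'medium' fallback for contradiction are stated in this PR; the mappings for `temporal_evolution` and `no_contradiction` are guesses, and `negation_artifact` is set to 'info' because the PR summary calls it informational.

```typescript
// Hypothetical sketch; not the code from this PR.
type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

type Severity = "info" | "low" | "medium" | "high";

const VERDICTS: readonly string[] = [
  "no_contradiction",
  "contradiction",
  "temporal_supersession",
  "temporal_regression",
  "temporal_evolution",
  "negation_artifact",
];

/** Guard for raw judge output: unknown strings and non-strings yield null. */
function parseVerdict(raw: unknown): Verdict | null {
  return typeof raw === "string" && VERDICTS.includes(raw) ? (raw as Verdict) : null;
}

/** Baseline severity per verdict (partly assumed, see the lead-in). */
function defaultSeverityForVerdict(verdict: Verdict): Severity {
  switch (verdict) {
    case "contradiction":
      return "medium"; // stated fallback for garbage severity input
    case "temporal_regression":
      return "high"; // stated in D7
    case "temporal_supersession":
      return "info"; // stated in D7
    case "negation_artifact":
      return "info"; // described as informational in the PR summary
    case "temporal_evolution":
      return "info"; // assumption
    case "no_contradiction":
      return "info"; // assumption; this verdict is never emitted anyway
  }
}

/** Emit predicate: every verdict except no_contradiction becomes a finding. */
function shouldEmit(verdict: Verdict): boolean {
  return verdict !== "no_contradiction";
}
```

Because `Verdict` is a closed union, the `switch` is exhaustive, so adding a seventh member would fail typecheck until a severity is chosen for it.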
Lane B of the temporal-contradiction-probe wave. The v1 date pre-filter
skipped pairs whose chunk-text-extracted dates differed by >30 days as a
cost-saving heuristic. That heuristic silently killed exactly the cases the
new verdict taxonomy exists to surface — role transitions across years
(e.g. a 2017 historical record vs. a 2025 current state), MRR claims years
apart, status changes recorded over time.
Lane A1+A2 made temporal supersession explicit and cheap to classify. The
filter no longer needs to skip these pairs; the judge can label them.
Changes:
- date-filter.ts: shouldSkipForDateMismatch accepts optional effectiveDateA
and effectiveDateB. When BOTH are non-null, returns skip=false with the new
'both_have_effective_date' reason — the judge will see the dates via the
(from: YYYY-MM-DD) prompt tag from Lane A1. Other rules (same-paragraph
dual-date override, missing-date fallback) preserved verbatim and still
run first.
- runner.ts: threads pair.{a,b}.effective_date into the date-filter call.
Pairs that previously vanished into the skip bucket now reach the judge.
Tests (R1 IRON RULE regression suite, 6 new cases):
- both sides effective_date → not skipped
- both sides effective_date overrides >30d chunk-text rule
- rule 1 (same-paragraph dual-date) still wins over effective_date relaxation
- rule 2 (missing chunk dates) still applies when effective_date partially present
- undefined effective_dates fall through to v1 behavior (back-compat)
- empty-string effective_date treated as missing (only real dates enable the relaxation)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
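The relaxed filter from Lane B can be sketched as follows. This is a simplified illustration, not the code in date-filter.ts: rule 1 (same-paragraph dual-date override) is left out entirely, rule 2 is assumed to mean "a missing chunk-text date never causes a skip", and every reason string except 'both_have_effective_date' is made up for the example.

```typescript
// Hypothetical sketch; not the code from this PR.
interface SkipDecision {
  skip: boolean;
  reason: string;
}

const MAX_GAP_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Only a well-formed YYYY-MM-DD string counts; "", null, undefined do not. */
function isRealDate(value: string | null | undefined): value is string {
  return typeof value === "string" && /^\d{4}-\d{2}-\d{2}$/.test(value);
}

function shouldSkipForDateMismatch(
  chunkDateA: string | null,
  chunkDateB: string | null,
  effectiveDateA?: string | null,
  effectiveDateB?: string | null,
): SkipDecision {
  // Rule 1 (same-paragraph dual-date override) is omitted from this sketch.

  // Rule 2 (assumed semantics): without both chunk-text dates there is no gap to measure.
  if (!isRealDate(chunkDateA) || !isRealDate(chunkDateB)) {
    return { skip: false, reason: "missing_chunk_date" };
  }

  // New relaxation: both sides carry a page-level date, so let the judge decide.
  if (isRealDate(effectiveDateA) && isRealDate(effectiveDateB)) {
    return { skip: false, reason: "both_have_effective_date" };
  }

  // Rule 3 (v1 behavior): skip pairs whose chunk-text dates are more than 30 days apart.
  const gapDays = Math.abs(Date.parse(chunkDateA) - Date.parse(chunkDateB)) / MS_PER_DAY;
  return gapDays > MAX_GAP_DAYS
    ? { skip: true, reason: "date_gap_exceeds_30d" }
    : { skip: false, reason: "dates_within_window" };
}
```

Calling it without the two optional arguments reproduces the v1 path, which is the back-compat property the regression suite above pins.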
Lane C of the temporal-contradiction-probe wave. Three layers of cost
guardrail, all stacked:
(a) cost-estimate prompt at probe-run-time. Before the runner spends any
tokens after a PROMPT_VERSION change, eval-suspected-contradictions
reads the most recent persisted prompt_version from
eval_contradictions_runs and compares. When they differ:
- TTY: prints an upper-bound estimate + Ctrl-C window (default 10s,
override via GBRAIN_PROBE_PROMPT_GRACE_SECONDS).
- non-TTY: prints the estimate + auto-proceeds (autopilot path).
- --yes override or GBRAIN_NO_PROBE_PROMPT=1: skip entirely.
Mirrors the v0.32.7 runPostUpgradeReembedPrompt pattern.
(b) --budget-usd N hard cap (pre-existing; PreFlightBudgetError surfaces
when the estimate alone exceeds the cap, and CostTracker halts the
run mid-flight when cumulative cost exceeds it). Documented in the
help text alongside (a).
(c) Judge model now routes through resolveModel() with configKey
'models.eval.contradictions_judge', tier 'utility' (Haiku-class
default), and env var GBRAIN_CONTRADICTIONS_JUDGE_MODEL. The legacy
--judge CLI flag still wins as the highest-precedence override.
Doctor's model touchpoint registry (src/commands/models.ts:50) carries
the new key so `gbrain models` and `gbrain models doctor` surface it.
Also in this lane:
- CLI: --severity accepts 'info' (the new Severity member from Lane A2).
- CLI: --severity output shows [verdict] tag alongside slug pairs so
operators distinguish genuine contradictions from temporal classifications.
- Human summary: prints the new queries_with_any_finding metric and the
per-verdict breakdown table.
- Help text: explains the cost-prompt + budget-cap + model-routing
interactions in one paragraph.
New tests (9 cases on the cost-prompt helper):
- --yes override skips
- GBRAIN_NO_PROBE_PROMPT=1 skips
- prompt_version unchanged → skips
- non-TTY auto-proceeds with stderr note
- TTY proceeds after grace
- TTY aborts on Ctrl-C
- fresh brain (no prior runs) fires the prompt
- GBRAIN_PROBE_PROMPT_GRACE_SECONDS override honored
- estimate banner contains query count + judge model + dollar amount
All 225 eval-contradictions tests + 25 model-config tests pass. Typecheck clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
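The decision logic behind guardrail (a) can be expressed as a small pure function, which is roughly what the nine helper tests exercise. The sketch below is an assumption about structure, not the helper from this PR: `decideCostPrompt`, `PromptInputs`, and `PromptAction` are hypothetical names, and printing the estimate or waiting for Ctrl-C is left to the caller.

```typescript
// Hypothetical sketch; not the code from this PR.
type PromptAction =
  | { kind: "skip"; why: "override" | "prompt_version_unchanged" }
  | { kind: "auto_proceed" } // non-TTY: print the estimate, then continue
  | { kind: "grace_window"; seconds: number }; // TTY: print the estimate, wait for Ctrl-C

interface PromptInputs {
  yesFlag: boolean;
  env: Record<string, string | undefined>;
  isTTY: boolean;
  currentPromptVersion: string;
  lastPersistedPromptVersion: string | null; // null means no prior runs (fresh brain)
}

const DEFAULT_GRACE_SECONDS = 10;

function decideCostPrompt(input: PromptInputs): PromptAction {
  // Explicit overrides win first.
  if (input.yesFlag || input.env["GBRAIN_NO_PROBE_PROMPT"] === "1") {
    return { kind: "skip", why: "override" };
  }
  // Same prompt version as the last persisted run: the cache is still warm.
  if (input.lastPersistedPromptVersion === input.currentPromptVersion) {
    return { kind: "skip", why: "prompt_version_unchanged" };
  }
  // Version changed, or no prior runs at all: the prompt fires.
  if (!input.isTTY) {
    return { kind: "auto_proceed" };
  }
  const raw = input.env["GBRAIN_PROBE_PROMPT_GRACE_SECONDS"];
  const parsed = raw ? Number(raw) : Number.NaN;
  const seconds = Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_GRACE_SECONDS;
  return { kind: "grace_window", seconds };
}
```

Passing `env` and `isTTY` in as data, rather than reading them globally, keeps each branch testable without touching the real terminal.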
Lane D of the temporal-contradiction-probe wave. Lanes A1/A2/B/C
landed the behavior; this lane pins the regressions that protect the wave
against future drift.
R4 (runner emit predicate): five new tests, one per non-no_contradiction
verdict, prove the runner.ts emit rule surfaces each one as a finding with
the correct verdict tag, and that:
- queries_with_contradiction (Wilson-CI denominator) ONLY counts verdict
='contradiction' — the strict metric is preserved
- queries_with_any_finding counts every non-no_contradiction verdict
- verdict_breakdown tallies correctly
Plus one negative case: verdict='no_contradiction' produces zero findings.
Without R4, a future runner refactor could collapse the new verdicts back
to /dev/null and the report would silently shrink.
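To make the three R4 metrics concrete, here is a minimal sketch of how they could be tallied from per-query judge verdicts. The field names match the report fields named above, but `tallyVerdicts` and its input shape are hypothetical, and leaving `no_contradiction` out of `verdict_breakdown` is an assumption.

```typescript
// Hypothetical sketch; not the code from this PR.
type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

interface Tally {
  queries_with_contradiction: number; // strict metric: verdict === 'contradiction' only
  queries_with_any_finding: number; // any non-no_contradiction verdict
  verdict_breakdown: Partial<Record<Verdict, number>>;
}

/** Each inner array holds the judge verdicts for the pairs of one query. */
function tallyVerdicts(verdictsPerQuery: readonly (readonly Verdict[])[]): Tally {
  const tally: Tally = {
    queries_with_contradiction: 0,
    queries_with_any_finding: 0,
    verdict_breakdown: {},
  };
  for (const verdicts of verdictsPerQuery) {
    // Emit predicate: drop no_contradiction, keep everything else.
    const emitted = verdicts.filter((v) => v !== "no_contradiction");
    if (emitted.length > 0) tally.queries_with_any_finding += 1;
    if (emitted.includes("contradiction")) tally.queries_with_contradiction += 1;
    for (const v of emitted) {
      tally.verdict_breakdown[v] = (tally.verdict_breakdown[v] ?? 0) + 1;
    }
  }
  return tally;
}
```

A query whose only verdict is `temporal_supersession` raises `queries_with_any_finding` but leaves the strict `queries_with_contradiction` count untouched, which is the separation R4 protects.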
R5 (cache key shape): direct shape assertion on buildCacheKey output. The
key tuple is exactly 5 fields (chunk_a_hash, chunk_b_hash, model_id,
prompt_version, truncation_policy). Adding a 6th field would silently break
every operator's brain (no migration path).
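A shape assertion of the kind R5 describes might look like the sketch below. The five field names come from the text above; `hasExactCacheKeyShape` and `serializeCacheKey` are hypothetical helpers, and the separator used for serialization is an arbitrary choice for the example rather than the format `buildCacheKey` actually produces.

```typescript
// Hypothetical sketch; not the code from this PR.
const CACHE_KEY_FIELDS = [
  "chunk_a_hash",
  "chunk_b_hash",
  "model_id",
  "prompt_version",
  "truncation_policy",
] as const;

type CacheKey = Record<(typeof CACHE_KEY_FIELDS)[number], string>;

/** True only when the object has exactly the five expected fields. */
function hasExactCacheKeyShape(candidate: Record<string, unknown>): boolean {
  const keys = Object.keys(candidate).sort();
  const expected = [...CACHE_KEY_FIELDS].sort();
  return keys.length === expected.length && keys.every((k, i) => k === expected[i]);
}

/** Join the fields in a fixed order so equal keys serialize identically. */
function serializeCacheKey(key: CacheKey): string {
  return CACHE_KEY_FIELDS.map((field) => key[field]).join("\u0000");
}
```

With this layout, bumping `prompt_version` from '1' to '2' changes the serialized key, so old rows simply miss; no tuple-shape change is needed to invalidate them.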
R6 (contradiction severity unchanged): four tests on normalizeVerdict pin
the legacy semantics — judge-supplied severity wins (whether 'high' or
'low'), and on garbage severity input the fallback is 'medium' (per
defaultSeverityForVerdict('contradiction')) NOT 'low'. The contradiction
verdict's severity must never default to 'low', which would silently mask
genuine conflicts as cosmetic naming issues. The temporal_regression case
is included for parity (garbage → 'high' since regressions are real
investor red flags).
236 eval-contradictions tests pass (216 + 6 R4 + 1 R5 + 4 R6 + 9 cost-prompt
from Lane C).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the residual TODO from the temporal-contradiction-probe wave's plan: prevent the bug class where an RFC lands in docs/proposals/ with PII that should never appear in a public technical artifact. The original RFC had to be scrubbed at force-push time (Step 0); this lint catches the same patterns at CI time so the next one can't slip through.
Sibling to scripts/check-privacy.sh:
- check-privacy.sh: bans the literal "Wintermute" repo-wide.
- check-proposal-pii.sh: focuses on docs/proposals/*.md and the OTHER PII classes: personal-relationship vocabulary, private repo refs.
Design contract: the denylist names PATTERNS, not real people. Naming specific real names (deceased relatives, therapist first names, dealflow contacts) inside this script would leak PII into the repo just by appearing here. The structural patterns below catch the SURROUNDING vocabulary that always accompanies such content in personal RFC prose. Trade-off: a future RFC that names a real person without any contextual markers won't be caught; this is accepted as residual risk handled by human review.
Patterns flagged in docs/proposals/*.md:
- garrytan/brain (private repo reference)
- trial separation, permanent separation
- couples session, couples therapist
- divorce attorney(s)
- grandmother's funeral, aunt's funeral
- wintermute (also caught by check-privacy.sh; listed here for proposal-scoped clarity)
Bare common words (separation, funeral) are NOT banned; only the combined personal-context phrases are. "Separation of concerns" and other software vocabulary survives.
Wired into:
- `bun run verify` (gates every push)
- `bun run check:all`
- `bun run check:proposal-pii` (standalone)
Tests: 15 cases in test/scripts/check-proposal-pii.test.ts.
- Each pattern flagged when present, plus exit-code + stderr signal.
- Two negative cases (separation-of-concerns, funeral metaphor) prove the lint doesn't false-positive on legitimate software prose.
- No-proposals-dir → exit 0 (not a failure).
- Multi-hit case proves all patterns surface together with a summary count.
- The two test fixtures that name "Wintermute" / "WINTERMUTE" as sentinel literals are allowlisted in check-test-real-names.sh per the same meta-rule-enforcement exception as check-privacy.sh itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
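The actual lint is a shell script; the TypeScript sketch below only illustrates the matching idea (combined phrases rather than bare words) using the patterns listed in this commit message. `findPiiHits` is a hypothetical name, and case-insensitive matching is an assumption suggested by the "Wintermute" / "WINTERMUTE" fixtures.

```typescript
// Hypothetical sketch; the real check lives in scripts/check-proposal-pii.sh.
const PROPOSAL_PII_PATTERNS: readonly RegExp[] = [
  /garrytan\/brain/i,
  /trial separation/i,
  /permanent separation/i,
  /couples session/i,
  /couples therapist/i,
  /divorce attorneys?/i,
  /grandmother's funeral/i,
  /aunt's funeral/i,
  /wintermute/i,
];

/** Return one "line: pattern" entry per match; an empty array means clean. */
function findPiiHits(text: string): string[] {
  const hits: string[] = [];
  text.split("\n").forEach((line, index) => {
    for (const pattern of PROPOSAL_PII_PATTERNS) {
      // No /g flag on the patterns, so test() carries no state between calls.
      if (pattern.test(line)) hits.push(`${index + 1}: ${pattern.source}`);
    }
  });
  return hits;
}
```

Because only the combined phrases are listed, a sentence about "separation of concerns" produces no hits, mirroring the two negative test cases.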
check-privacy.sh bans the literal Wintermute repo-wide. The two new files from the v0.34 privacy lint (scripts/check-proposal-pii.sh and its test) necessarily name the token to do their job. Same meta-rule-enforcement exception as scripts/check-privacy.sh itself, scripts/check-test-real-names.sh, test/recency-decay.test.ts, and the existing entries — describing what the rule forbids requires naming it. Without this allowlist, `bun run verify` fails on check:privacy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Temporal-contradiction-probe wave — Phase 1 of the RFC at docs/proposals/temporal-contradiction-probe.md. Headline: the contradiction probe now classifies pairs into a 6-member verdict enum (no_contradiction, contradiction, temporal_supersession, temporal_regression, temporal_evolution, negation_artifact) and sees the page-level effective_date for each chunk via a (from: YYYY-MM-DD) tag in the prompt. The pre-judge date filter no longer skips dated wide-gap pairs, so the role-transition class (e.g. a 2017 historical record vs. a 2025 current state) reaches the judge and gets classified as temporal_supersession instead of vanishing into the skip bucket. PROMPT_VERSION bumped 1 → 2 (cache fully invalidated). Three-layer cost guardrail: TTY-only cost-estimate prompt with Ctrl-C window, --budget-usd hard cap, Haiku-tier routing via new models.eval.contradictions_judge config key. Also adds a CI privacy lint (scripts/check-proposal-pii.sh) wired into bun run verify that catches PII patterns in docs/proposals/*.md so future RFCs can't ship with personal-context vocabulary the way this wave's source RFC did at draft time. Phases 2-4 deferred to follow-up RFCs per the plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…diction-probe # Conflicts: # CHANGELOG.md
Summary
Phase 1 of the temporal-contradiction-probe RFC (docs/proposals/temporal-contradiction-probe.md). Teaches the contradiction-probe judge to reason about time. The headline cases this addresses:
- … `contradiction`; now classifies as `temporal_supersession`.
- … `temporal_evolution`.
- … `negation_artifact` (informational, no false-positive in the contradiction count).
- … `temporal_regression` rather than buried in noise.
Per the plan, Phases 2-4 (structured claims substrate, trajectory view, founder scorecard) are deferred to follow-up RFCs.
Lanes (bisect-friendly):
- Thread `effective_date` to judge prompt; bump `PROMPT_VERSION` '1' → '2'. Eight SQL projection sites updated (3 in postgres-engine, 5 in pglite-engine) so SearchResult carries the page-level date.
- Replace `contradicts: boolean` with `verdict: enum` (6 members). Runner emit predicate now fires for every non-`no_contradiction` verdict. Severity gains `'info'`. ResolutionKind extended with `temporal_supersede`, `flag_for_review`, `log_timeline_change`.
- Relax `date-filter.ts` rule 3 when both sides have explicit `effective_date` (the role-transition case now reaches the judge instead of being silently skipped).
- Cost-estimate prompt, `--budget-usd` hard cap, judge model routes through `resolveModel({configKey: 'models.eval.contradictions_judge', tier: 'utility'})`. New CLI flag, new config key, new doctor touchpoint.
- `scripts/check-proposal-pii.sh` catches the PII patterns this wave's own RFC had at draft time (structural patterns, not real names). Wired into `bun run verify`.
Test Coverage
- `tsc --noEmit`: clean
- `bun run verify`: clean (privacy + proposal-pii + jsonb + source-id-projection + progress + test-isolation + wasm + admin-build + scope-drift + cli-exec + system-of-record + eval-glossary + typecheck)
Six pinned IRON-RULE regressions:
- … `result.contradicts` readers left (tsc gate)
- … `no_contradiction` verdict (6 cases)
- `contradiction` verdict severity unchanged (4 cases on `normalizeVerdict`)
Plus 9 cases on the new cost-prompt helper, 15 cases on the privacy lint.
Pre-Landing Review
Plan-eng-review ran on the original plan and surfaced 12 architectural decisions (D1-D12, all locked). Codex outside-voice review on the original plan surfaced 19 findings; the plan was revised to address all 19 (10 by inclusion in this wave, 9 by deferral to a follow-up RFC). The revised plan locks 10 decisions (D1-D10), all implemented as specified.
Privacy
The source RFC originally contained personal-context PII (real names, personal life events, private repo references). It was scrubbed at force-push time as Step 0 of the implementation. The new `scripts/check-proposal-pii.sh` catches the same patterns at CI time so future RFCs can't ship with similar content. Patterns flagged: `garrytan/brain`, `trial separation`, `permanent separation`, `couples session`, `couples therapist`, `divorce attorney`, `grandmother's funeral`, `aunt's funeral`, `wintermute`.
Test plan
- `bun run verify`: typecheck + 13 shell pre-checks pass
- `bun run test`: 6600 unit tests pass (parallel + serial)
- `bun run test:e2e` against Postgres: 574 tests / 85 files pass
- `gbrain eval suspected-contradictions run --budget-usd 5` against production brain to measure the actual FP-drop
🤖 Generated with Claude Code