v0.35.2.0 feat(eval): temporal-aware contradiction probe + verdict enum by garrytan · Pull Request #1052 · garrytan/gbrain

garrytan · 2026-05-16T00:15:41Z

Summary

Phase 1 of the temporal-contradiction-probe RFC (docs/proposals/temporal-contradiction-probe.md). Teaches the contradiction-probe judge to reason about time. The headline cases this addresses:

Temporal supersession — a 2017 role record vs. 2025 current state was a HIGH contradiction; now classifies as temporal_supersession.
Trial → confirmed status change — got flagged as conflict; now classifies as temporal_evolution.
Negation parsing artifacts — "NOT X" parsed as positive X; now classifies as negation_artifact (informational, no false-positive in the contradiction count).
Metric regression — MRR going backwards over time gets surfaced as temporal_regression rather than buried in noise.

Per the plan, Phases 2-4 (structured claims substrate, trajectory view, founder scorecard) are deferred to follow-up RFCs.

Lanes (bisect-friendly):

A1 — pass effective_date to judge prompt; bump PROMPT_VERSION '1' → '2'. Eight SQL projection sites updated (3 in postgres-engine, 5 in pglite-engine) so SearchResult carries the page-level date.
A2 — replace contradicts: boolean with verdict: enum (6 members). Runner emit predicate now fires for every non-no_contradiction verdict. Severity gains 'info'. ResolutionKind extended with temporal_supersede, flag_for_review, log_timeline_change.
B — relax date-filter.ts rule 3 when both sides have explicit effective_date (the role-transition case now reaches the judge instead of being silently skipped).
C — cost-estimate prompt at PROMPT_VERSION change, --budget-usd hard cap, judge model routes through resolveModel({configKey: 'models.eval.contradictions_judge', tier: 'utility'}). New CLI flag, new config key, new doctor touchpoint.
D — R4/R5/R6 IRON-RULE regression tests (emit predicate, cache shape, severity-unchanged contract).
Privacy lint — scripts/check-proposal-pii.sh catches the PII patterns this wave's own RFC had at draft time (structural patterns, not real names). Wired into bun run verify.

Test Coverage

Unit: 6600 pass / 0 fail / 0 skip (parallel + serial fast loop)
E2E (real Postgres, port 5434): 574 pass / 0 fail across 85 files
Typecheck (tsc --noEmit): clean
Verify (bun run verify): clean (privacy + proposal-pii + jsonb + source-id-projection + progress + test-isolation + wasm + admin-build + scope-drift + cli-exec + system-of-record + eval-glossary + typecheck)

Six pinned IRON-RULE regressions:

R1 — date-filter rule 3 relaxation preserves rules 1+2 (6 cases)
R2 — PROMPT_VERSION bump invalidates the cache cleanly
R3 — verdict-enum migration has zero result.contradicts readers left (tsc gate)
R4 — runner emits findings for every non-no_contradiction verdict (6 cases)
R5 — cache key tuple stays a 5-field shape
R6 — contradiction verdict severity unchanged (4 cases on normalizeVerdict)
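R2 and R5 both lean on the cache key: the key keeps exactly five fields, and bumping PROMPT_VERSION changes one of them so old rows miss. A minimal sketch of what such a shape check could look like follows. The field names come from the Lane D commit message; the `buildCacheKey` signature and the `hasFiveFieldShape` helper are hypothetical stand-ins, not the code in cache.ts.

```typescript
// Hypothetical stand-in for the probe's cache key. Field names follow the
// Lane D commit message; the real buildCacheKey signature may differ.
interface CacheKey {
  chunk_a_hash: string;
  chunk_b_hash: string;
  model_id: string;
  prompt_version: string;
  truncation_policy: string;
}

const PROMPT_VERSION = "2";

function buildCacheKey(
  chunkAHash: string,
  chunkBHash: string,
  modelId: string,
  truncationPolicy: string,
): CacheKey {
  return {
    chunk_a_hash: chunkAHash,
    chunk_b_hash: chunkBHash,
    model_id: modelId,
    prompt_version: PROMPT_VERSION, // a bump here makes every old row miss
    truncation_policy: truncationPolicy,
  };
}

// R5-style shape check: exactly these five fields, no more, no fewer.
function hasFiveFieldShape(key: object): boolean {
  const expected = [
    "chunk_a_hash",
    "chunk_b_hash",
    "model_id",
    "prompt_version",
    "truncation_policy",
  ].sort();
  const actual = Object.keys(key).sort();
  return (
    actual.length === expected.length &&
    expected.every((field, i) => actual[i] === field)
  );
}
```

Comparing sorted key names rather than only counting them means a renamed field fails the check as well as an added one.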

Plus 9 cases on the new cost-prompt helper, 15 cases on the privacy lint.

Pre-Landing Review

Plan-eng-review ran on the original plan and surfaced 12 architectural decisions (D1-D12, all locked). Codex outside-voice review on the original plan surfaced 19 findings; the plan was revised to address all 19 (10 by inclusion in this wave, 9 by deferral to a follow-up RFC). The revised plan locks 10 decisions (D1-D10), all implemented as specified.

Privacy

The source RFC originally contained personal-context PII (real names, personal life events, private repo references). It was scrubbed at force-push time as Step 0 of the implementation. The new scripts/check-proposal-pii.sh catches the same patterns at CI time so future RFCs can't ship with similar content. Patterns flagged: garrytan/brain, trial separation, permanent separation, couples session, couples therapist, divorce attorney, grandmother's funeral, aunt's funeral, wintermute.

Test plan

bun run verify — typecheck + 13 shell pre-checks pass
bun run test — 6600 unit tests pass (parallel + serial)
bun run test:e2e against Postgres — 574 tests / 85 files pass
R4 emit-predicate regressions pass for all 6 verdicts
R5 cache-key shape regression locked
R6 contradiction-severity-unchanged regression locked
Privacy lint passes against the scrubbed RFC; fails (as designed) against pre-scrub fixture
Post-merge: re-run gbrain eval suspected-contradictions run --budget-usd 5 against production brain to measure the actual FP-drop

🤖 Generated with Claude Code

Field report on residual HIGH findings from gbrain eval suspected-contradictions and proposal for a 4-phase fix (Phase 1 = judge prompt + verdict enum is the recommended starting point). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lane A1 of the temporal-contradiction-probe wave. Threads page-level effective_date through the search projection into the contradiction judge so the LLM can reason about supersession instead of treating every dated pair as a contradiction.

Changes:
- SearchResult interface adds optional effective_date + effective_date_source fields; rowToSearchResult populates them from the row data with date-only YYYY-MM-DD normalization (handles both postgres.js Date and PGLite string).
- 8 SELECT projection sites (3 in postgres-engine, 5 in pglite-engine) now carry p.effective_date + p.effective_date_source through their inner CTEs and outer SELECTs so search results expose the field on both engines.
- PairMember (eval-contradictions/types.ts) gets the two fields as required (string | null) so the type forces every constructor to think about temporal anchoring. Runner's searchResultToMember + takeToMember handle the normalization; takes inherit the chunk's page-level date.
- buildJudgePrompt emits `Statement A (from: YYYY-MM-DD)` when effective_date is non-null, else `(date unknown)`. Prompt instructions explain the tag so the model knows what to do with it.
- PROMPT_VERSION bumps '1' → '2'. Cache-key tuple shape unchanged; old rows miss naturally on first run against the new prompt.

Test fixtures in 5 files updated to include the new required fields. All 205 eval-contradictions unit tests + 101 search-related tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
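The date normalization and prompt tag described in Lane A1 can be illustrated roughly as follows. This is a sketch under assumptions, not the code in the PR: `toDateOnly` and `statementHeader` are hypothetical names, the exact placement of `(date unknown)` after the statement label is a guess, and Date values are formatted in UTC.

```typescript
// Normalize a page-level date to YYYY-MM-DD. The commit notes that the two
// engines return different shapes (Date from postgres.js, string from PGLite).
// Assumption: Date values are rendered in UTC via toISOString().
function toDateOnly(value: Date | string | null | undefined): string | null {
  if (value == null) return null;
  if (value instanceof Date) {
    return Number.isNaN(value.getTime())
      ? null
      : value.toISOString().slice(0, 10);
  }
  // Accept "YYYY-MM-DD" optionally followed by a time component.
  const match = /^(\d{4}-\d{2}-\d{2})/.exec(value.trim());
  return match ? match[1] : null;
}

// Build the statement header the judge sees. A null (or empty) date falls
// back to the "(date unknown)" tag mentioned in the commit message.
function statementHeader(label: "A" | "B", effectiveDate: string | null): string {
  return effectiveDate
    ? `Statement ${label} (from: ${effectiveDate})`
    : `Statement ${label} (date unknown)`;
}
```

For example, `statementHeader("A", toDateOnly("2017-01-15T00:00:00Z"))` yields `Statement A (from: 2017-01-15)`.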

Lane A2 of the temporal-contradiction-probe wave. Expands the judge's classification vocabulary from a binary contradicts:bool to a six-member verdict enum so the probe can distinguish "this changed" from "this is wrong".

Verdict taxonomy:
- no_contradiction: drop from findings
- contradiction: genuine conflict at same point in time
- temporal_supersession: newer claim updates/replaces older; not an error
- temporal_regression: metric/status went backwards over time (signal)
- temporal_evolution: legitimate change, neither supersession nor regression
- negation_artifact: judge misread an explicit negation

Changes:
- types.ts: Verdict union (6 members); Severity gains 'info'; ResolutionKind extended with temporal_supersede, flag_for_review, log_timeline_change; JudgeVerdict.contradicts → verdict; ContradictionFinding now carries verdict; ProbeReport adds queries_with_any_finding + verdict_breakdown (additive).
- judge.ts: parseResolutionKind + parseVerdict guards; normalizeVerdict reads the new field and applies the C1 confidence floor only to verdict='contradiction' (the new verdicts are informational classifications, no floor). Prompt rubric rewritten to ask for verdict + extended severity scale.
- severity-classify.ts: 'info' joins the rank with value 0; defaultSeverityForVerdict maps each verdict to its baseline severity (D7: supersession=info, regression=high, etc.). parseSeverity gains a fallback param so consumers can override the 'low' default.
- auto-supersession.ts: classifyResolution + renderResolutionCommand handle the three new resolution kinds. Probe still NEVER auto-mutates; the new kinds render paste-ready commands or informational lines.
- cache.ts: isJudgeVerdict shape check matches the new verdict field; old v1 rows fail the guard and are treated as misses.
- runner.ts: emit predicate at cache-hit and judge-success branches changes from `verdict.contradicts` to `verdict.verdict !== 'no_contradiction'`. Without this, the new verdicts vanish from the report. Added per-verdict tally + queriesWithAnyFinding alongside the strict queriesWithContradiction.
- trends.ts: latest run verdict breakdown surfaces in the trend chart.

Test fixtures updated across 8 test files. All 210 eval-contradictions unit tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
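To make the emit predicate and the two query-level counters concrete, here is a small sketch. The verdict names and the `verdict !== 'no_contradiction'` rule are taken from the commit message; `QueryResult`, `ProbeTally`, `tallyVerdicts`, and this `parseVerdict` body are hypothetical, and the sketch assumes the breakdown counts emitted findings only.

```typescript
type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

const VERDICTS: readonly Verdict[] = [
  "no_contradiction",
  "contradiction",
  "temporal_supersession",
  "temporal_regression",
  "temporal_evolution",
  "negation_artifact",
];

// Guard for untrusted judge output: anything outside the enum is rejected.
function parseVerdict(raw: unknown): Verdict | null {
  return typeof raw === "string" && VERDICTS.some((v) => v === raw)
    ? (raw as Verdict)
    : null;
}

// Emit rule from the commit: every verdict except no_contradiction is a finding.
function shouldEmit(verdict: Verdict): boolean {
  return verdict !== "no_contradiction";
}

interface QueryResult {
  query: string;
  verdicts: Verdict[]; // one verdict per judged pair
}

interface ProbeTally {
  queries_with_contradiction: number; // strict metric, contradiction only
  queries_with_any_finding: number; // any emitted verdict
  verdict_breakdown: Record<Verdict, number>;
}

function tallyVerdicts(results: QueryResult[]): ProbeTally {
  const breakdown: Record<Verdict, number> = {
    no_contradiction: 0,
    contradiction: 0,
    temporal_supersession: 0,
    temporal_regression: 0,
    temporal_evolution: 0,
    negation_artifact: 0,
  };
  let withContradiction = 0;
  let withAnyFinding = 0;
  for (const result of results) {
    const emitted = result.verdicts.filter(shouldEmit);
    for (const v of emitted) breakdown[v] += 1;
    if (emitted.length > 0) withAnyFinding += 1;
    if (emitted.some((v) => v === "contradiction")) withContradiction += 1;
  }
  return {
    queries_with_contradiction: withContradiction,
    queries_with_any_finding: withAnyFinding,
    verdict_breakdown: breakdown,
  };
}
```

Keeping the strict counter separate from the any-finding counter is what lets the informational verdicts show up in the report without inflating the contradiction rate.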

Lane B of the temporal-contradiction-probe wave. The v1 date pre-filter skipped pairs whose chunk-text-extracted dates differed by >30 days as a cost-saving heuristic. That heuristic silently killed exactly the cases the new verdict taxonomy exists to surface: role transitions across years (e.g. a 2017 historical record vs. a 2025 current state), MRR claims years apart, status changes recorded over time.

Lane A1+A2 made temporal supersession explicit and cheap to classify. The filter no longer needs to skip these pairs; the judge can label them.

Changes:
- date-filter.ts: shouldSkipForDateMismatch accepts optional effectiveDateA and effectiveDateB. When BOTH are non-null, returns skip=false with the new 'both_have_effective_date' reason; the judge will see the dates via the (from: YYYY-MM-DD) prompt tag from Lane A1. Other rules (same-paragraph dual-date override, missing-date fallback) preserved verbatim and still run first.
- runner.ts: threads pair.{a,b}.effective_date into the date-filter call. Pairs that previously vanished into the skip bucket now reach the judge.

Tests (R1 IRON RULE regression suite, 6 new cases):
- both sides effective_date → not skipped
- both sides effective_date overrides >30d chunk-text rule
- rule 1 (same-paragraph dual-date) still wins over effective_date relaxation
- rule 2 (missing chunk dates) still applies when effective_date partially present
- undefined effective_dates fall through to v1 behavior (back-compat)
- empty-string effective_date treated as missing (only real dates enable the relaxation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
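The rule ordering in Lane B can be sketched as below. Only the function name, the optional effective-date parameters, the 30-day threshold, and the 'both_have_effective_date' reason come from the commit message. Everything else is an assumption made for illustration: the input shape, the precomputed `sameParagraphDualDate` flag, the other reason strings, and the guess that rules 1 and 2 both resolve to "do not skip".

```typescript
interface DateFilterInput {
  chunkDateA: string | null; // date extracted from chunk text (assumed YYYY-MM-DD)
  chunkDateB: string | null;
  sameParagraphDualDate: boolean; // assumed precomputed signal for rule 1
  effectiveDateA?: string | null; // page-level date from Lane A1
  effectiveDateB?: string | null;
}

interface DateFilterDecision {
  skip: boolean;
  reason: string;
}

const MAX_GAP_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Empty strings and unparseable values count as missing.
function isRealDate(value: string | null | undefined): value is string {
  return typeof value === "string" && value.length > 0 && !Number.isNaN(Date.parse(value));
}

function shouldSkipForDateMismatch(input: DateFilterInput): DateFilterDecision {
  const { chunkDateA, chunkDateB, effectiveDateA, effectiveDateB } = input;

  // Rule 1 (assumed outcome): same-paragraph dual-date override runs first.
  if (input.sameParagraphDualDate) {
    return { skip: false, reason: "same_paragraph_dual_date" };
  }
  // Rule 2 (assumed outcome): without both chunk dates, do not skip.
  if (!isRealDate(chunkDateA) || !isRealDate(chunkDateB)) {
    return { skip: false, reason: "missing_date" };
  }
  // Lane B relaxation: both page-level dates present, so let the judge decide.
  if (isRealDate(effectiveDateA) && isRealDate(effectiveDateB)) {
    return { skip: false, reason: "both_have_effective_date" };
  }
  // Rule 3: the v1 cost-saving heuristic on chunk-text dates.
  const gapDays = Math.abs(Date.parse(chunkDateA) - Date.parse(chunkDateB)) / MS_PER_DAY;
  return gapDays > MAX_GAP_DAYS
    ? { skip: true, reason: "date_gap_exceeds_30d" }
    : { skip: false, reason: "within_window" };
}
```

With this ordering, a 2017 vs. 2025 pair is skipped only when at least one side lacks a real effective_date, which matches the back-compat and empty-string cases in the R1 suite.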

Lane C of the temporal-contradiction-probe wave. Three layers of cost guardrail, all stacked:

(a) cost-estimate prompt at probe-run-time. Before the runner spends any tokens after a PROMPT_VERSION change, eval-suspected-contradictions reads the most recent persisted prompt_version from eval_contradictions_runs and compares. When they differ:
- TTY: prints an upper-bound estimate + Ctrl-C window (default 10s, override via GBRAIN_PROBE_PROMPT_GRACE_SECONDS).
- non-TTY: prints the estimate + auto-proceeds (autopilot path).
- --yes override or GBRAIN_NO_PROBE_PROMPT=1: skip entirely.
Mirrors the v0.32.7 runPostUpgradeReembedPrompt pattern.

(b) --budget-usd N hard cap (pre-existing; PreFlightBudgetError surfaces when the estimate alone exceeds the cap, and CostTracker halts the run mid-flight when cumulative cost exceeds it). Documented in the help text alongside (a).

(c) Judge model now routes through resolveModel() with configKey 'models.eval.contradictions_judge', tier 'utility' (Haiku-class default), and env var GBRAIN_CONTRADICTIONS_JUDGE_MODEL. The legacy --judge CLI flag still wins as the highest-precedence override. Doctor's model touchpoint registry (src/commands/models.ts:50) carries the new key so `gbrain models` and `gbrain models doctor` surface it.

Also in this lane:
- CLI: --severity accepts 'info' (the new Severity member from Lane A2).
- CLI: --severity output shows [verdict] tag alongside slug pairs so operators distinguish genuine contradictions from temporal classifications.
- Human summary: prints the new queries_with_any_finding metric and the per-verdict breakdown table.
- Help text: explains the cost-prompt + budget-cap + model-routing interactions in one paragraph.

New tests (9 cases on the cost-prompt helper):
- --yes override skips
- GBRAIN_NO_PROBE_PROMPT=1 skips
- prompt_version unchanged → skips
- non-TTY auto-proceeds with stderr note
- TTY proceeds after grace
- TTY aborts on Ctrl-C
- fresh brain (no prior runs) fires the prompt
- GBRAIN_PROBE_PROMPT_GRACE_SECONDS override honored
- estimate banner contains query count + judge model + dollar amount

All 225 eval-contradictions tests + 25 model-config tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
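The decision table behind layer (a) is small enough to write as a pure function. The sketch below mirrors the cases listed in the tests; the environment variable names and the 10-second default come from the commit message, while `decidePromptAction`, `graceSeconds`, and the `PromptInputs` shape are hypothetical, and treating a blank or non-numeric override as "use the default" is an assumption.

```typescript
type PromptAction = "skip" | "auto_proceed" | "prompt_with_grace";

interface PromptInputs {
  yesFlag: boolean; // --yes on the command line
  env: Record<string, string | undefined>;
  isTTY: boolean;
  lastPromptVersion: string | null; // null means a fresh brain with no prior runs
  currentPromptVersion: string;
}

const DEFAULT_GRACE_SECONDS = 10;

function decidePromptAction(inputs: PromptInputs): PromptAction {
  // Explicit opt-outs win over everything else.
  if (inputs.yesFlag || inputs.env.GBRAIN_NO_PROBE_PROMPT === "1") return "skip";
  // Unchanged prompt version: the cache is still valid, nothing to warn about.
  if (inputs.lastPromptVersion === inputs.currentPromptVersion) return "skip";
  // Version changed, or no prior run: show the estimate. Only a TTY gets
  // the Ctrl-C window; non-TTY prints the estimate and proceeds.
  return inputs.isTTY ? "prompt_with_grace" : "auto_proceed";
}

function graceSeconds(env: Record<string, string | undefined>): number {
  const raw = env.GBRAIN_PROBE_PROMPT_GRACE_SECONDS;
  if (raw === undefined || raw.trim() === "") return DEFAULT_GRACE_SECONDS;
  const parsed = Number(raw);
  // Assumption: negative or non-numeric overrides fall back to the default.
  return Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_GRACE_SECONDS;
}
```

Separating the decision from the I/O (printing the banner, sleeping, handling Ctrl-C) keeps each of the nine listed cases testable without a real terminal.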

Lane D of the temporal-contradiction-probe wave. Lanes A1/A2/B/C landed the behavior; this lane pins the regressions that protect the wave against future drift.

R4 (runner emit predicate): five new tests, one per non-no_contradiction verdict, prove the runner.ts emit rule surfaces each one as a finding with the correct verdict tag, and that:
- queries_with_contradiction (Wilson-CI denominator) ONLY counts verdict='contradiction'; the strict metric is preserved
- queries_with_any_finding counts every non-no_contradiction verdict
- verdict_breakdown tallies correctly
Plus one negative case: verdict='no_contradiction' produces zero findings. Without R4, a future runner refactor could collapse the new verdicts back to /dev/null and the report would silently shrink.

R5 (cache key shape): direct shape assertion on buildCacheKey output. The key tuple is exactly 5 fields (chunk_a_hash, chunk_b_hash, model_id, prompt_version, truncation_policy). Adding a 6th field would silently break every operator's brain (no migration path).

R6 (contradiction severity unchanged): four tests on normalizeVerdict pin the legacy semantics: judge-supplied severity wins (whether 'high' or 'low'), and on garbage severity input the fallback is 'medium' (per defaultSeverityForVerdict('contradiction')), NOT 'low'. The contradiction verdict's severity must never default to 'low', which would silently mask genuine conflicts as cosmetic naming issues. The temporal_regression case is included for parity (garbage → 'high' since regressions are real investor red flags).

236 eval-contradictions tests pass (216 + 6 R4 + 1 R5 + 4 R6 + 9 cost-prompt from Lane C).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
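The R6 contract (judge-supplied severity wins; garbage falls back to the verdict's baseline, never to 'low' for a contradiction) can be sketched like this. The function names `defaultSeverityForVerdict` and `parseSeverity`, the 'info' rank of 0, and the baselines for contradiction, temporal_regression, temporal_supersession, and negation_artifact follow the PR text. The baselines for temporal_evolution and no_contradiction, the remaining rank values, and the `resolveSeverity` wrapper are assumptions.

```typescript
type Severity = "info" | "low" | "medium" | "high";

type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

// 'info' ranks at 0 per the Lane A2 commit; the other values are assumed.
const SEVERITY_RANK: Record<Severity, number> = { info: 0, low: 1, medium: 2, high: 3 };

function defaultSeverityForVerdict(verdict: Verdict): Severity {
  switch (verdict) {
    case "contradiction":
      return "medium"; // R6: never 'low' by default
    case "temporal_regression":
      return "high"; // regressions are treated as real signal
    case "temporal_supersession":
    case "negation_artifact":
      return "info";
    case "temporal_evolution": // assumption: informational
    case "no_contradiction": // assumption: never emitted, so the value is moot
      return "info";
  }
}

// Accept only known severity strings; anything else yields the fallback.
function parseSeverity(raw: unknown, fallback: Severity = "low"): Severity {
  return raw === "info" || raw === "low" || raw === "medium" || raw === "high"
    ? raw
    : fallback;
}

// Hypothetical wrapper: judge-supplied severity wins, else the verdict baseline.
function resolveSeverity(verdict: Verdict, rawSeverity: unknown): Severity {
  return parseSeverity(rawSeverity, defaultSeverityForVerdict(verdict));
}
```

Passing the verdict baseline as the fallback is what keeps a malformed judge response from quietly downgrading a genuine contradiction.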

Captures the residual TODO from the temporal-contradiction-probe wave's plan: prevent the bug class where an RFC lands in docs/proposals/ with PII that should never appear in a public technical artifact. The original RFC had to be scrubbed at force-push time (Step 0); this lint catches the same patterns at CI time so the next one can't slip through.

Sibling to scripts/check-privacy.sh:
- check-privacy.sh: bans the literal "Wintermute" repo-wide.
- check-proposal-pii.sh: focuses on docs/proposals/*.md and the OTHER PII classes: personal-relationship vocabulary, private repo refs.

Design contract: the denylist names PATTERNS, not real people. Naming specific real names (deceased relatives, therapist first names, dealflow contacts) inside this script would leak PII into the repo just by appearing here. The structural patterns below catch the SURROUNDING vocabulary that always accompanies such content in personal RFC prose. Trade-off: a future RFC that names a real person without any contextual markers won't be caught; accepted as residual risk handled by human review.

Patterns flagged in docs/proposals/*.md:
- garrytan/brain (private repo reference)
- trial separation, permanent separation
- couples session, couples therapist
- divorce attorney(s)
- grandmother's funeral, aunt's funeral
- wintermute (also caught by check-privacy.sh; listed here for proposal-scoped clarity)

Bare common words (separation, funeral) are NOT banned; only the combined personal-context phrases are. "Separation of concerns" and other software vocabulary survives.

Wired into:
- `bun run verify` (gates every push)
- `bun run check:all`
- `bun run check:proposal-pii` (standalone)

Tests: 15 cases in test/scripts/check-proposal-pii.test.ts.
- Each pattern flagged when present, plus exit-code + stderr signal.
- Two negative cases (separation-of-concerns, funeral metaphor) prove the lint doesn't false-positive on legitimate software prose.
- No-proposals-dir → exit 0 (not a failure).
- Multi-hit case proves all patterns surface together with a summary count.
- The two test fixtures that name "Wintermute" / "WINTERMUTE" as sentinel literals are allowlisted in check-test-real-names.sh per the same meta-rule-enforcement exception as check-privacy.sh itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
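The phrase-level matching described above can be illustrated in TypeScript. The real lint is a shell script, so this is only a sketch of the idea: the phrases are the ones listed in the commit message, while `PROPOSAL_PII_PATTERNS`, `findPiiHits`, case-insensitive matching, and the per-line scan are assumptions.

```typescript
// Illustrative port only; scripts/check-proposal-pii.sh is the real lint.
// Patterns match combined phrases, never the bare words on their own.
const PROPOSAL_PII_PATTERNS: readonly RegExp[] = [
  /garrytan\/brain\b/i,
  /trial separation/i,
  /permanent separation/i,
  /couples session/i,
  /couples therapist/i,
  /divorce attorneys?/i,
  /grandmother['’]s funeral/i,
  /aunt['’]s funeral/i,
  /wintermute/i,
];

interface PiiHit {
  line: number; // 1-based line number within the proposal
  pattern: string; // source of the regex that matched
}

function findPiiHits(markdown: string): PiiHit[] {
  const hits: PiiHit[] = [];
  markdown.split(/\r?\n/).forEach((text, index) => {
    for (const pattern of PROPOSAL_PII_PATTERNS) {
      // No 'g' flag on the patterns, so test() carries no state between calls.
      if (pattern.test(text)) {
        hits.push({ line: index + 1, pattern: pattern.source });
      }
    }
  });
  return hits;
}
```

A CI wrapper would then exit non-zero when `findPiiHits` returns any entries, and exit 0 when there is no proposals directory to scan.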

…diction-probe

check-privacy.sh bans the literal Wintermute repo-wide. The two new files from the v0.34 privacy lint (scripts/check-proposal-pii.sh and its test) necessarily name the token to do their job. Same meta-rule-enforcement exception as scripts/check-privacy.sh itself, scripts/check-test-real-names.sh, test/recency-decay.test.ts, and the existing entries — describing what the rule forbids requires naming it. Without this allowlist, `bun run verify` fails on check:privacy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Temporal-contradiction-probe wave: Phase 1 of the RFC at docs/proposals/temporal-contradiction-probe.md.

Headline: the contradiction probe now classifies pairs into a 6-member verdict enum (no_contradiction, contradiction, temporal_supersession, temporal_regression, temporal_evolution, negation_artifact) and sees the page-level effective_date for each chunk via a (from: YYYY-MM-DD) tag in the prompt. The pre-judge date filter no longer skips dated wide-gap pairs, so the role-transition class (e.g. a 2017 historical record vs. a 2025 current state) reaches the judge and gets classified as temporal_supersession instead of vanishing into the skip bucket. PROMPT_VERSION bumped 1 → 2 (cache fully invalidated).

Three-layer cost guardrail: TTY-only cost-estimate prompt with Ctrl-C window, --budget-usd hard cap, Haiku-tier routing via new models.eval.contradictions_judge config key.

Also adds a CI privacy lint (scripts/check-proposal-pii.sh) wired into bun run verify that catches PII patterns in docs/proposals/*.md so future RFCs can't ship with personal-context vocabulary the way this wave's source RFC did at draft time. Phases 2-4 deferred to follow-up RFCs per the plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…diction-probe # Conflicts: # CHANGELOG.md

garrytan-agents and others added 10 commits May 15, 2026 17:12

Merge remote-tracking branch 'origin/master' into rfc/temporal-contra…

86dd762

…diction-probe

garrytan changed the title ~~rfc: temporal axis for contradiction probe~~ v0.35.1.0 feat(eval): temporal-aware contradiction probe + verdict enum May 16, 2026

Merge remote-tracking branch 'origin/master' into rfc/temporal-contra…

f8412f0

…diction-probe # Conflicts: # CHANGELOG.md

garrytan changed the title ~~v0.35.1.0 feat(eval): temporal-aware contradiction probe + verdict enum~~ v0.35.2.0 feat(eval): temporal-aware contradiction probe + verdict enum May 16, 2026

garrytan wants to merge 11 commits into master from rfc/temporal-contradiction-probe
