v0.35.2.0 feat(eval): temporal-aware contradiction probe + verdict enum #1052

Open: garrytan wants to merge 11 commits into master from rfc/temporal-contradiction-probe

garrytan (Owner) commented May 16, 2026

Summary

Phase 1 of the temporal-contradiction-probe RFC (docs/proposals/temporal-contradiction-probe.md). Teaches the contradiction-probe judge to reason about time. The headline cases this addresses:

  • Temporal supersession — a 2017 role record vs. a 2025 current state was flagged as a HIGH contradiction; it is now classified as temporal_supersession.
  • Trial → confirmed status change — was flagged as a conflict; it is now classified as temporal_evolution.
  • Negation parsing artifacts — "NOT X" was parsed as a positive X; it is now classified as negation_artifact (informational, so it adds no false positive to the contradiction count).
  • Metric regression — MRR going backwards over time is surfaced as temporal_regression rather than buried in noise.

Per the plan, Phases 2-4 (structured claims substrate, trajectory view, founder scorecard) are deferred to follow-up RFCs.

Lanes (bisect-friendly):

  1. A1 — pass effective_date to judge prompt; bump PROMPT_VERSION '1' → '2'. Eight SQL projection sites updated (3 in postgres-engine, 5 in pglite-engine) so SearchResult carries the page-level date.
  2. A2 — replace contradicts: boolean with verdict: enum (6 members). Runner emit predicate now fires for every non-no_contradiction verdict. Severity gains 'info'. ResolutionKind extended with temporal_supersede, flag_for_review, log_timeline_change.
  3. B — relax date-filter.ts rule 3 when both sides have explicit effective_date (the role-transition case now reaches the judge instead of being silently skipped).
  4. C — cost-estimate prompt at PROMPT_VERSION change, --budget-usd hard cap, judge model routes through resolveModel({configKey: 'models.eval.contradictions_judge', tier: 'utility'}). New CLI flag, new config key, new doctor touchpoint.
  5. D — R4/R5/R6 IRON-RULE regression tests (emit predicate, cache shape, severity-unchanged contract).
  6. Privacy lint: scripts/check-proposal-pii.sh catches the PII patterns that this wave's own RFC contained at draft time (structural patterns, not real names). Wired into bun run verify.

Test Coverage

  • Unit: 6600 pass / 0 fail / 0 skip (parallel + serial fast loop)
  • E2E (real Postgres, port 5434): 574 pass / 0 fail across 85 files
  • Typecheck (tsc --noEmit): clean
  • Verify (bun run verify): clean (privacy + proposal-pii + jsonb + source-id-projection + progress + test-isolation + wasm + admin-build + scope-drift + cli-exec + system-of-record + eval-glossary + typecheck)

Six pinned IRON-RULE regressions:

  • R1 — date-filter rule 3 relaxation preserves rules 1+2 (6 cases)
  • R2 — PROMPT_VERSION bump invalidates the cache cleanly
  • R3 — verdict-enum migration has zero result.contradicts readers left (tsc gate)
  • R4 — runner emits findings for every non-no_contradiction verdict (6 cases)
  • R5 — cache key tuple stays a 5-field shape
  • R6 — contradiction verdict severity unchanged (4 cases on normalizeVerdict)

Plus 9 cases on the new cost-prompt helper, 15 cases on the privacy lint.

Pre-Landing Review

Plan-eng-review ran on the original plan and surfaced 12 architectural decisions (D1-D12, all locked). Codex outside-voice review on the original plan surfaced 19 findings; the plan was revised to address all 19 (10 by inclusion in this wave, 9 by deferral to a follow-up RFC). The revised plan locks 10 decisions (D1-D10), all implemented as specified.

Privacy

The source RFC originally contained personal-context PII (real names, personal life events, private repo references). It was scrubbed at force-push time as Step 0 of the implementation. The new scripts/check-proposal-pii.sh catches the same patterns at CI time so future RFCs can't ship with similar content. Patterns flagged: garrytan/brain, trial separation, permanent separation, couples session, couples therapist, divorce attorney, grandmother's funeral, aunt's funeral, wintermute.

Test plan

  • bun run verify — all 13 checks pass (typecheck plus the 12 shell pre-checks listed under Test Coverage)
  • bun run test — 6600 unit tests pass (parallel + serial)
  • bun run test:e2e against Postgres — 574 tests / 85 files pass
  • R4 emit-predicate regressions pass for all 6 verdicts
  • R5 cache-key shape regression locked
  • R6 contradiction-severity-unchanged regression locked
  • Privacy lint passes against the scrubbed RFC; fails (as designed) against pre-scrub fixture
  • Post-merge: re-run gbrain eval suspected-contradictions run --budget-usd 5 against production brain to measure the actual FP-drop

🤖 Generated with Claude Code

garrytan-agents and others added 10 commits May 15, 2026 17:12
Field report on residual HIGH findings from gbrain eval suspected-contradictions
and proposal for a 4-phase fix (Phase 1 = judge prompt + verdict enum is the
recommended starting point).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lane A1 of the temporal-contradiction-probe wave. Threads page-level
effective_date through the search projection into the contradiction judge so
the LLM can reason about supersession instead of treating every dated pair as
a contradiction.

Changes:
- SearchResult interface adds optional effective_date + effective_date_source
  fields; rowToSearchResult populates them from the row data with date-only
  YYYY-MM-DD normalization (handles both postgres.js Date and PGLite string).
- 8 SELECT projection sites (3 in postgres-engine, 5 in pglite-engine) now
  carry p.effective_date + p.effective_date_source through their inner CTEs
  and outer SELECTs so search results expose the field on both engines.
- PairMember (eval-contradictions/types.ts) gets the two fields as required
  (string | null) so the type forces every constructor to think about temporal
  anchoring. Runner's searchResultToMember + takeToMember handle the
  normalization; takes inherit the chunk's page-level date.
- buildJudgePrompt emits `Statement A (from: YYYY-MM-DD)` when effective_date
  is non-null, else `(date unknown)`. Prompt instructions explain the tag so
  the model knows what to do with it.
- PROMPT_VERSION bumps '1' → '2'. Cache-key tuple shape unchanged; old rows
  miss naturally on first run against the new prompt.
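
To make the date-tag behavior concrete, here is a minimal sketch of the normalization and header formatting described above. The helper names `toDateOnly` and `statementHeader` are hypothetical and are not taken from the repository; only the `(from: YYYY-MM-DD)` and `(date unknown)` tag formats come from this commit message.

```typescript
// Sketch only: toDateOnly and statementHeader are hypothetical names.
// The "(from: YYYY-MM-DD)" and "(date unknown)" tags come from the commit message.

type EffectiveDate = string | null;

/** Normalize a driver value (a Date from postgres.js, a string from PGLite) to YYYY-MM-DD. */
function toDateOnly(value: Date | string | null | undefined): EffectiveDate {
  if (value === null || value === undefined) return null;
  if (value instanceof Date) {
    // toISOString() is UTC-based, so the result is the UTC calendar day.
    return Number.isNaN(value.getTime()) ? null : value.toISOString().slice(0, 10);
  }
  // Accept "2025-03-01" or a longer timestamp that starts with a date; reject the rest.
  const match = /^(\d{4}-\d{2}-\d{2})/.exec(value.trim());
  return match?.[1] ?? null;
}

/** Build the per-statement header line that carries the temporal anchor. */
function statementHeader(label: "A" | "B", effectiveDate: EffectiveDate): string {
  const tag = effectiveDate !== null ? `(from: ${effectiveDate})` : "(date unknown)";
  return `Statement ${label} ${tag}`;
}
```

Keeping the normalization in one place means both engines feed the prompt builder the same date-only string, whichever driver type they return.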

Test fixtures in 5 files updated to include the new required fields. All 205
eval-contradictions unit tests + 101 search-related tests pass. Typecheck
clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lane A2 of the temporal-contradiction-probe wave. Expands the judge's
classification vocabulary from a binary contradicts:bool to a six-member
verdict enum so the probe can distinguish "this changed" from "this is wrong".

Verdict taxonomy:
  no_contradiction       — drop from findings
  contradiction          — genuine conflict at same point in time
  temporal_supersession  — newer claim updates/replaces older; not an error
  temporal_regression    — metric/status went backwards over time (signal)
  temporal_evolution     — legitimate change, neither supersession nor regression
  negation_artifact      — judge misread an explicit negation
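
The taxonomy above can be written as a TypeScript union with a runtime guard. This is a sketch, not the code in types.ts or judge.ts: the six member names come from this commit message, while the `VERDICTS` constant and the choice to return `null` on unrecognized input are assumptions made for illustration.

```typescript
// Sketch only: the six names are from the commit message; VERDICTS and the
// null-on-failure behavior of this parseVerdict are illustrative assumptions.

const VERDICTS = [
  "no_contradiction",
  "contradiction",
  "temporal_supersession",
  "temporal_regression",
  "temporal_evolution",
  "negation_artifact",
] as const;

type Verdict = (typeof VERDICTS)[number];

/** Accept only known verdict strings; anything else (including the old boolean) is rejected. */
function parseVerdict(raw: unknown): Verdict | null {
  if (typeof raw !== "string") return null;
  return (VERDICTS as readonly string[]).includes(raw) ? (raw as Verdict) : null;
}
```

Deriving the type from the constant array keeps the compile-time union and the runtime check from drifting apart when a member is added.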

Changes:
- types.ts: Verdict union (6 members); Severity gains 'info'; ResolutionKind
  extended with temporal_supersede, flag_for_review, log_timeline_change;
  JudgeVerdict.contradicts → verdict; ContradictionFinding now carries verdict;
  ProbeReport adds queries_with_any_finding + verdict_breakdown (additive).
- judge.ts: parseResolutionKind + parseVerdict guards; normalizeVerdict reads
  the new field and applies the C1 confidence floor only to verdict='contradiction'
  (the new verdicts are informational classifications, no floor). Prompt rubric
  rewritten to ask for verdict + extended severity scale.
- severity-classify.ts: 'info' joins the rank with value 0; defaultSeverityForVerdict
  maps each verdict to its baseline severity (D7 — supersession=info, regression=high,
  etc.). parseSeverity gains a fallback param so consumers can override 'low' default.
- auto-supersession.ts: classifyResolution + renderResolutionCommand handle the
  three new resolution kinds. Probe still NEVER auto-mutates — the new kinds
  render paste-ready commands or informational lines.
- cache.ts: isJudgeVerdict shape check matches the new verdict field; old v1
  rows fail the guard and are treated as misses.
- runner.ts: emit predicate at cache-hit and judge-success branches changes
  from `verdict.contradicts` to `verdict.verdict !== 'no_contradiction'`.
  Without this, the new verdicts vanish from the report. Added per-verdict
  tally + queriesWithAnyFinding alongside the strict queriesWithContradiction.
- trends.ts: latest run verdict breakdown surfaces in the trend chart.
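
The emit predicate and the two query-level counters can be sketched as follows. The `JudgedPair` shape and the `tallyFindings` name are hypothetical; the predicate itself and the field names `queries_with_contradiction`, `queries_with_any_finding`, and `verdict_breakdown` come from this wave. Whether the real breakdown also counts `no_contradiction` is not stated, so this sketch tallies emitted findings only.

```typescript
// Sketch only: JudgedPair and tallyFindings are hypothetical; the predicate
// `verdict !== 'no_contradiction'` is the rule named in the commit message.

type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

interface JudgedPair {
  queryId: string;
  verdict: Verdict;
}

interface FindingTally {
  findings: JudgedPair[];
  queries_with_contradiction: number; // strict metric: verdict === 'contradiction' only
  queries_with_any_finding: number; // every non-no_contradiction verdict
  verdict_breakdown: Partial<Record<Verdict, number>>;
}

const shouldEmit = (verdict: Verdict): boolean => verdict !== "no_contradiction";

function tallyFindings(pairs: JudgedPair[]): FindingTally {
  const findings = pairs.filter((pair) => shouldEmit(pair.verdict));
  const strictQueries = new Set<string>();
  const anyQueries = new Set<string>();
  const breakdown: Partial<Record<Verdict, number>> = {};
  for (const finding of findings) {
    anyQueries.add(finding.queryId);
    if (finding.verdict === "contradiction") strictQueries.add(finding.queryId);
    breakdown[finding.verdict] = (breakdown[finding.verdict] ?? 0) + 1;
  }
  return {
    findings,
    queries_with_contradiction: strictQueries.size,
    queries_with_any_finding: anyQueries.size,
    verdict_breakdown: breakdown,
  };
}
```

Counting queries through sets keeps a query with several findings from inflating either metric.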

Test fixtures updated across 8 test files. All 210 eval-contradictions unit
tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lane B of the temporal-contradiction-probe wave. The v1 date pre-filter
skipped pairs whose chunk-text-extracted dates differed by >30 days as a
cost-saving heuristic. That heuristic silently killed exactly the cases the
new verdict taxonomy exists to surface — role transitions across years
(e.g. a 2017 historical record vs. a 2025 current state), MRR claims years
apart, status changes recorded over time.

Lanes A1 and A2 made temporal supersession explicit and cheap to classify. The
filter no longer needs to skip these pairs; the judge can label them.

Changes:
- date-filter.ts: shouldSkipForDateMismatch accepts optional effectiveDateA
  and effectiveDateB. When BOTH are non-null, returns skip=false with the new
  'both_have_effective_date' reason — the judge will see the dates via the
  (from: YYYY-MM-DD) prompt tag from Lane A1. Other rules (same-paragraph
  dual-date override, missing-date fallback) preserved verbatim and still
  run first.
- runner.ts: threads pair.{a,b}.effective_date into the date-filter call.
  Pairs that previously vanished into the skip bucket now reach the judge.

Tests (R1 IRON RULE regression suite, 6 new cases):
- both sides effective_date → not skipped
- both sides effective_date overrides >30d chunk-text rule
- rule 1 (same-paragraph dual-date) still wins over effective_date relaxation
- rule 2 (missing chunk dates) still applies when effective_date partially present
- undefined effective_dates fall through to v1 behavior (back-compat)
- empty-string effective_date treated as missing (only real dates enable the relaxation)
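
The rule ordering exercised by the R1 cases can be sketched like this. It is not the code in date-filter.ts: the input object, the `sameParagraphDualDate` flag standing in for rule 1, and every reason string except `both_have_effective_date` are assumptions made for illustration. The sketch also assumes that rules 1 and 2 resolve to "do not skip".

```typescript
// Sketch only: input shape, rule-1 flag, and all reason strings other than
// 'both_have_effective_date' are hypothetical stand-ins.

interface DateFilterInput {
  chunkDateA: string | null; // ISO date extracted from chunk text
  chunkDateB: string | null;
  sameParagraphDualDate: boolean; // assumed signal for rule 1
  effectiveDateA?: string | null; // page-level date from Lane A1
  effectiveDateB?: string | null;
}

interface DateFilterResult {
  skip: boolean;
  reason: string;
}

const MAX_GAP_DAYS = 30;
const MS_PER_DAY = 86_400_000;

/** Empty strings and unparseable values count as missing. */
const isRealDate = (d: string | null | undefined): d is string =>
  typeof d === "string" && d.length > 0 && !Number.isNaN(Date.parse(d));

function shouldSkipForDateMismatch(input: DateFilterInput): DateFilterResult {
  const { chunkDateA, chunkDateB, effectiveDateA, effectiveDateB } = input;
  // Rule 1 runs first and wins over the relaxation.
  if (input.sameParagraphDualDate) return { skip: false, reason: "same_paragraph_dual_date" };
  // Rule 2: with a chunk date missing there is nothing to compare.
  if (!isRealDate(chunkDateA) || !isRealDate(chunkDateB)) return { skip: false, reason: "missing_date" };
  // Relaxation: both sides carry a real effective_date, so let the judge decide.
  if (isRealDate(effectiveDateA) && isRealDate(effectiveDateB)) {
    return { skip: false, reason: "both_have_effective_date" };
  }
  // Rule 3: the v1 cost-saving skip for chunk dates more than 30 days apart.
  const gapDays = Math.abs(Date.parse(chunkDateA) - Date.parse(chunkDateB)) / MS_PER_DAY;
  return gapDays > MAX_GAP_DAYS
    ? { skip: true, reason: "date_gap_over_30d" }
    : { skip: false, reason: "within_window" };
}
```

Because the relaxation sits between rule 2 and rule 3, callers that pass no effective dates get the v1 behavior unchanged.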

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lane C of the temporal-contradiction-probe wave. Three layers of cost
guardrail, all stacked:

(a) cost-estimate prompt at probe-run-time. Before the runner spends any
    tokens after a PROMPT_VERSION change, eval-suspected-contradictions
    reads the most recent persisted prompt_version from
    eval_contradictions_runs and compares. When they differ:
      - TTY: prints an upper-bound estimate + Ctrl-C window (default 10s,
        override via GBRAIN_PROBE_PROMPT_GRACE_SECONDS).
      - non-TTY: prints the estimate + auto-proceeds (autopilot path).
      - --yes override or GBRAIN_NO_PROBE_PROMPT=1: skip entirely.
    Mirrors the v0.32.7 runPostUpgradeReembedPrompt pattern.

(b) --budget-usd N hard cap (pre-existing; PreFlightBudgetError surfaces
    when the estimate alone exceeds the cap, and CostTracker halts the
    run mid-flight when cumulative cost exceeds it). Documented in the
    help text alongside (a).

(c) Judge model now routes through resolveModel() with configKey
    'models.eval.contradictions_judge', tier 'utility' (Haiku-class
    default), and env var GBRAIN_CONTRADICTIONS_JUDGE_MODEL. The legacy
    --judge CLI flag still wins as the highest-precedence override.
    Doctor's model touchpoint registry (src/commands/models.ts:50) carries
    the new key so `gbrain models` and `gbrain models doctor` surface it.
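
Layer (b) has two enforcement points: a pre-flight check on the estimate and a running check on cumulative spend. The sketch below illustrates that split under stated assumptions. `BudgetGuard` and `BudgetExceededError` are hypothetical names, not the repository's `CostTracker` or `PreFlightBudgetError`, whose actual APIs are not shown in this PR.

```typescript
// Sketch only: BudgetGuard and BudgetExceededError are hypothetical stand-ins
// for the two enforcement points described in (b).

type BudgetPhase = "preflight" | "mid_flight";

class BudgetExceededError extends Error {
  readonly phase: BudgetPhase;
  readonly costUsd: number;
  readonly capUsd: number;

  constructor(phase: BudgetPhase, costUsd: number, capUsd: number) {
    super(`${phase}: $${costUsd.toFixed(2)} exceeds the $${capUsd.toFixed(2)} cap`);
    this.name = "BudgetExceededError";
    this.phase = phase;
    this.costUsd = costUsd;
    this.capUsd = capUsd;
  }
}

class BudgetGuard {
  private spentUsd = 0;
  private readonly capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  /** Refuse to start when the estimate alone is over the cap. */
  preflight(estimateUsd: number): void {
    if (estimateUsd > this.capUsd) {
      throw new BudgetExceededError("preflight", estimateUsd, this.capUsd);
    }
  }

  /** Record one judge call and halt once cumulative cost passes the cap. */
  record(costUsd: number): void {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.capUsd) {
      throw new BudgetExceededError("mid_flight", this.spentUsd, this.capUsd);
    }
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```

The pre-flight check avoids spending anything on a run that cannot fit, while the running check covers the case where actual cost drifts above the estimate.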

Also in this lane:
- CLI: --severity accepts 'info' (the new Severity member from Lane A2).
- CLI: --severity output shows [verdict] tag alongside slug pairs so
  operators distinguish genuine contradictions from temporal classifications.
- Human summary: prints the new queries_with_any_finding metric and the
  per-verdict breakdown table.
- Help text: explains the cost-prompt + budget-cap + model-routing
  interactions in one paragraph.

New tests (9 cases on the cost-prompt helper):
- --yes override skips
- GBRAIN_NO_PROBE_PROMPT=1 skips
- prompt_version unchanged → skips
- non-TTY auto-proceeds with stderr note
- TTY proceeds after grace
- TTY aborts on Ctrl-C
- fresh brain (no prior runs) fires the prompt
- GBRAIN_PROBE_PROMPT_GRACE_SECONDS override honored
- estimate banner contains query count + judge model + dollar amount
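
The nine cases above reduce to a small decision function. This is a sketch under assumptions, not the helper in the repository: `decideCostPrompt`, the `PromptInputs` shape, and the action names are hypothetical, and the order in which the skip conditions are checked is a guess. The flag, the two environment variables, and the 10-second default come from this commit message.

```typescript
// Sketch only: function, input shape, and action names are hypothetical.
// --yes, GBRAIN_NO_PROBE_PROMPT, GBRAIN_PROBE_PROMPT_GRACE_SECONDS, and the
// 10s default are taken from the commit message.

type PromptAction =
  | { kind: "skip"; reason: "yes_flag" | "env_opt_out" | "prompt_version_unchanged" }
  | { kind: "auto_proceed" } // non-TTY: print the estimate, then continue
  | { kind: "wait_for_ctrl_c"; graceSeconds: number }; // TTY: print, then wait

interface PromptInputs {
  yes: boolean;
  env: Record<string, string | undefined>;
  isTTY: boolean;
  currentPromptVersion: string;
  lastPersistedPromptVersion: string | null; // null models a fresh brain with no prior runs
}

const DEFAULT_GRACE_SECONDS = 10;

function decideCostPrompt(inputs: PromptInputs): PromptAction {
  if (inputs.yes) return { kind: "skip", reason: "yes_flag" };
  if (inputs.env["GBRAIN_NO_PROBE_PROMPT"] === "1") return { kind: "skip", reason: "env_opt_out" };
  if (inputs.lastPersistedPromptVersion === inputs.currentPromptVersion) {
    return { kind: "skip", reason: "prompt_version_unchanged" };
  }
  if (!inputs.isTTY) return { kind: "auto_proceed" };
  const raw = inputs.env["GBRAIN_PROBE_PROMPT_GRACE_SECONDS"];
  const parsed = raw === undefined || raw.trim() === "" ? Number.NaN : Number(raw);
  const graceSeconds = Number.isFinite(parsed) && parsed >= 0 ? parsed : DEFAULT_GRACE_SECONDS;
  return { kind: "wait_for_ctrl_c", graceSeconds };
}
```

Passing `env` and `isTTY` in as plain values keeps the decision pure, so each of the listed cases can be tested without touching the real process state.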

All 225 eval-contradictions tests + 25 model-config tests pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lane D of the temporal-contradiction-probe wave. Lanes A1/A2/B/C landed the
behavior; this lane pins the regressions that protect the wave against
future drift.

R4 (runner emit predicate): five new tests, one per non-no_contradiction
verdict, prove the runner.ts emit rule surfaces each one as a finding with
the correct verdict tag, and that:
  - queries_with_contradiction (Wilson-CI denominator) ONLY counts verdict
    ='contradiction' — the strict metric is preserved
  - queries_with_any_finding counts every non-no_contradiction verdict
  - verdict_breakdown tallies correctly
Plus one negative case: verdict='no_contradiction' produces zero findings.
Without R4, a future runner refactor could collapse the new verdicts back
to /dev/null and the report would silently shrink.

R5 (cache key shape): direct shape assertion on buildCacheKey output. The
key tuple is exactly 5 fields (chunk_a_hash, chunk_b_hash, model_id,
prompt_version, truncation_policy). Adding a 6th field would silently break
every operator's brain (no migration path).
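
A shape assertion of the kind R5 describes might look like the sketch below. The five field names are from this commit message; the `buildCacheKey` signature, the `hasExactShape` helper, and the sample values are assumptions, since the real function's inputs are not shown here.

```typescript
// Sketch only: field names are from the commit message; the signature of this
// buildCacheKey and the hasExactShape helper are hypothetical.

interface CacheKey {
  chunk_a_hash: string;
  chunk_b_hash: string;
  model_id: string;
  prompt_version: string;
  truncation_policy: string;
}

const EXPECTED_FIELDS: readonly string[] = [
  "chunk_a_hash",
  "chunk_b_hash",
  "model_id",
  "prompt_version",
  "truncation_policy",
];

function buildCacheKey(input: CacheKey): CacheKey {
  // Copy the five fields explicitly so stray properties on the input never leak into the key.
  return {
    chunk_a_hash: input.chunk_a_hash,
    chunk_b_hash: input.chunk_b_hash,
    model_id: input.model_id,
    prompt_version: input.prompt_version,
    truncation_policy: input.truncation_policy,
  };
}

/** True only when the object has exactly the five expected fields, no more and no fewer. */
function hasExactShape(key: object): boolean {
  const actual = Object.keys(key).sort();
  const expected = [...EXPECTED_FIELDS].sort();
  return actual.length === expected.length && actual.every((field, i) => field === expected[i]);
}
```

The same tuple also explains the PROMPT_VERSION behavior from Lane A1: two keys that differ only in `prompt_version` are different keys, so rows written under version '1' miss under version '2' without any change to the key shape.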

R6 (contradiction severity unchanged): four tests on normalizeVerdict pin
the legacy semantics — judge-supplied severity wins (whether 'high' or
'low'), and on garbage severity input the fallback is 'medium' (per
defaultSeverityForVerdict('contradiction')) NOT 'low'. The contradiction
verdict's severity must never default to 'low', which would silently mask
genuine conflicts as cosmetic naming issues. The temporal_regression case
is included for parity (garbage → 'high' since regressions are real
investor red flags).
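
The R6 contract can be illustrated with a per-verdict fallback. This is a sketch, not the code in severity-classify.ts or judge.ts. The defaults for `contradiction` ('medium'), `temporal_regression` ('high'), and `temporal_supersession` ('info') are stated in this wave; the remaining defaults, the four-member severity list, and the `resolveSeverity` helper are assumptions made for illustration.

```typescript
// Sketch only: three defaults are from the commit messages (contradiction,
// temporal_regression, temporal_supersession); everything else is assumed.

type Verdict =
  | "no_contradiction"
  | "contradiction"
  | "temporal_supersession"
  | "temporal_regression"
  | "temporal_evolution"
  | "negation_artifact";

const SEVERITIES = ["info", "low", "medium", "high"] as const;
type Severity = (typeof SEVERITIES)[number];

function defaultSeverityForVerdict(verdict: Verdict): Severity {
  switch (verdict) {
    case "contradiction":
      return "medium"; // stated: must never default to 'low'
    case "temporal_regression":
      return "high"; // stated
    case "temporal_supersession":
      return "info"; // stated (D7)
    case "temporal_evolution":
    case "negation_artifact":
    case "no_contradiction":
      return "info"; // assumption: treated as informational in this sketch
  }
}

/** Judge-supplied severity wins when valid; otherwise use the caller's fallback. */
function parseSeverity(raw: unknown, fallback: Severity = "low"): Severity {
  return typeof raw === "string" && (SEVERITIES as readonly string[]).includes(raw)
    ? (raw as Severity)
    : fallback;
}

function resolveSeverity(verdict: Verdict, rawSeverity: unknown): Severity {
  return parseSeverity(rawSeverity, defaultSeverityForVerdict(verdict));
}
```

Routing the fallback through the verdict is what keeps garbage severity input on a genuine contradiction from collapsing to 'low'.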

236 eval-contradictions tests pass (225 after Lane C, which already includes
its 9 cost-prompt cases, + 6 R4 + 1 R5 + 4 R6).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the residual TODO from the temporal-contradiction-probe wave's
plan: prevent the bug class where an RFC lands in docs/proposals/ with
PII that should never appear in a public technical artifact. The
original RFC had to be scrubbed at force-push time (Step 0); this lint
catches the same patterns at CI time so the next one can't slip through.

Sibling to scripts/check-privacy.sh:
- check-privacy.sh: bans the literal "Wintermute" repo-wide.
- check-proposal-pii.sh: focuses on docs/proposals/*.md and the OTHER
  PII classes — personal-relationship vocabulary, private repo refs.

Design contract: the denylist names PATTERNS, not real people. Naming
specific real names (deceased relatives, therapist first names,
dealflow contacts) inside this script would leak PII into the repo
just by appearing here. The structural patterns below catch the
SURROUNDING vocabulary that always accompanies such content in
personal RFC prose. Trade-off: a future RFC that names a real person
without any contextual markers won't be caught — accepted as residual
risk handled by human review.

Patterns flagged in docs/proposals/*.md:
- garrytan/brain (private repo reference)
- trial separation, permanent separation
- couples session, couples therapist
- divorce attorney(s)
- grandmother's funeral, aunt's funeral
- wintermute (also caught by check-privacy.sh; listed here for
  proposal-scoped clarity)

Bare common words (separation, funeral) are NOT banned — only the
combined personal-context phrases. "Separation of concerns" and other
software vocabulary survives.
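
The matching idea behind the denylist can be sketched in TypeScript. The real lint is a shell script whose implementation is not shown here, so this only illustrates phrase-level matching: `findProposalPii` and the hit shape are hypothetical, and case-insensitive matching is an assumption.

```typescript
// Sketch only: illustrates phrase-level matching; not the shell script itself.
// Case-insensitivity and the PiiHit shape are assumptions.

const PROPOSAL_PII_PATTERNS: readonly RegExp[] = [
  /garrytan\/brain/i,
  /trial separation/i,
  /permanent separation/i,
  /couples session/i,
  /couples therapist/i,
  /divorce attorney/i, // also matches the plural "attorneys"
  /grandmother's funeral/i,
  /aunt's funeral/i,
  /wintermute/i,
];

interface PiiHit {
  line: number; // 1-based line number within the proposal
  pattern: string;
  text: string;
}

function findProposalPii(markdown: string): PiiHit[] {
  const hits: PiiHit[] = [];
  markdown.split(/\r?\n/).forEach((text, index) => {
    for (const pattern of PROPOSAL_PII_PATTERNS) {
      // No 'g' flag on the patterns, so test() carries no state between lines.
      if (pattern.test(text)) hits.push({ line: index + 1, pattern: pattern.source, text });
    }
  });
  return hits;
}
```

Matching whole phrases rather than bare words is what lets ordinary software prose through while still catching the personal-context vocabulary.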

Wired into:
- `bun run verify` (gates every push)
- `bun run check:all`
- `bun run check:proposal-pii` (standalone)

Tests: 15 cases in test/scripts/check-proposal-pii.test.ts.
- Each pattern flagged when present, plus exit-code + stderr signal.
- Two negative cases (separation-of-concerns, funeral metaphor) prove
  the lint doesn't false-positive on legitimate software prose.
- No-proposals-dir → exit 0 (not a failure).
- Multi-hit case proves all patterns surface together with a summary
  count.
- The two test fixtures that name "Wintermute" / "WINTERMUTE" as
  sentinel literals are allowlisted in check-test-real-names.sh per
  the same meta-rule-enforcement exception as check-privacy.sh itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
check-privacy.sh bans the literal Wintermute repo-wide. The two new files
from the v0.34 privacy lint (scripts/check-proposal-pii.sh and its test)
necessarily name the token to do their job. Same meta-rule-enforcement
exception as scripts/check-privacy.sh itself, scripts/check-test-real-names.sh,
test/recency-decay.test.ts, and the existing entries — describing what
the rule forbids requires naming it.

Without this allowlist, `bun run verify` fails on check:privacy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Temporal-contradiction-probe wave — Phase 1 of the RFC at
docs/proposals/temporal-contradiction-probe.md.

Headline: the contradiction probe now classifies pairs into a 6-member
verdict enum (no_contradiction, contradiction, temporal_supersession,
temporal_regression, temporal_evolution, negation_artifact) and sees the
page-level effective_date for each chunk via a (from: YYYY-MM-DD) tag in
the prompt. The pre-judge date filter no longer skips dated wide-gap pairs,
so the role-transition class (e.g. a 2017 historical record vs. a 2025
current state) reaches the judge and gets classified as
temporal_supersession instead of vanishing into the skip bucket.

PROMPT_VERSION bumped 1 → 2 (cache fully invalidated). Three-layer cost
guardrail: TTY-only cost-estimate prompt with Ctrl-C window, --budget-usd
hard cap, Haiku-tier routing via new models.eval.contradictions_judge
config key.

Also adds a CI privacy lint (scripts/check-proposal-pii.sh) wired into
bun run verify that catches PII patterns in docs/proposals/*.md so future
RFCs can't ship with personal-context vocabulary the way this wave's
source RFC did at draft time.

Phases 2-4 deferred to follow-up RFCs per the plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan changed the title from "rfc: temporal axis for contradiction probe" to "v0.35.1.0 feat(eval): temporal-aware contradiction probe + verdict enum" on May 16, 2026
garrytan changed the title from "v0.35.1.0 feat(eval): temporal-aware contradiction probe + verdict enum" to "v0.35.2.0 feat(eval): temporal-aware contradiction probe + verdict enum" on May 16, 2026