
[codex] Add lightweight watchlist trigger evals #25

Merged
dd3ok merged 2 commits into main from codex/watchlist-runtime-smoke-trigger-eval
Jun 15, 2026

@dd3ok commented Jun 15, 2026

Owner

Summary

  • Add a lightweight deterministic trigger corpus in evals/trigger_cases.json with 20 balanced trigger/no-trigger cases.
  • Wire the existing semantic checker to validate trigger corpus size, schema, decision balance, required reasons, reason polarity, and non-empty string IDs.
  • Keep runtime smoke documentation compact and manual-only by explicitly forbidding transcripts, screenshots, raw logs, and long runtime output.
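The checks listed above can be sketched roughly as follows. This is a minimal illustration, not the code in evals/check_semantic_cases.py: the field names (`id`, `prompt`, `decision`, `reason`), the decision labels, and the reason vocabularies are all assumptions made for the example, since the PR description does not show the actual schema.

```python
from collections import Counter

# Hypothetical schema and vocabularies; the real trigger_cases.json may differ.
REQUIRED_FIELDS = {"id", "prompt", "decision", "reason"}
TRIGGER_REASONS = {"explicit_watchlist_request", "deferred_followup"}
NO_TRIGGER_REASONS = {"unrelated_task", "one_off_question"}
EXPECTED_CASES = 20


def validate_trigger_corpus(cases):
    """Return a list of problems; an empty list means the corpus passed."""
    errors = []
    if len(cases) != EXPECTED_CASES:
        errors.append(f"expected {EXPECTED_CASES} cases, found {len(cases)}")

    decisions = Counter()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            errors.append(f"case #{index}: not an object")
            continue
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            errors.append(f"case #{index}: missing fields {sorted(missing)}")
            continue
        if not isinstance(case["id"], str) or not case["id"].strip():
            errors.append(f"case #{index}: id must be a non-empty string")
        decision, reason = case["decision"], case["reason"]
        if decision not in ("trigger", "no-trigger"):
            errors.append(f"case #{index}: unknown decision {decision!r}")
            continue
        decisions[decision] += 1
        # Reason polarity: a trigger case must cite a trigger reason, and vice versa.
        allowed = TRIGGER_REASONS if decision == "trigger" else NO_TRIGGER_REASONS
        if not isinstance(reason, str) or reason not in allowed:
            errors.append(f"case #{index}: reason polarity mismatch for {decision!r}")

    if decisions["trigger"] != decisions["no-trigger"]:
        errors.append(
            f"unbalanced corpus: {decisions['trigger']} trigger vs "
            f"{decisions['no-trigger']} no-trigger"
        )
    return errors
```

Collecting every problem into a list, rather than raising on the first one, keeps the check deterministic and lets a single run report all corpus issues at once.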

Runtime weight

  • No changes under .agents/skills/watchlist-md/.
  • No runtime bundle files, references, scripts, Python files, transcripts, or smoke logs were added.
  • Runtime smoke remains repo-only under docs/.
  • Trigger eval remains repo-only under evals/ and deterministic; it performs no LLM, runtime, browser, GitHub API, or network calls.

Validation

  • PYTHONDONTWRITEBYTECODE=1 python -m unittest discover -s evals -p 'test_*.py' - 95 tests passed
  • python evals/check_semantic_cases.py - 30 semantic cases; 20 trigger cases
  • python evals/check_skill_package.py
  • python evals/check_policy_markers.py
  • python evals/check_release_metadata.py
  • python evals/check_watchlist.py examples/WATCHLIST.example.md --strict-format --strict-safety --require-archive-section
  • python evals/check_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md --strict-format --strict-safety --require-archive-section
  • python tools/validate_watchlist.py examples/WATCHLIST.example.md --strict-format --strict-safety --require-archive-section
  • python tools/validate_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md --strict-format --strict-safety --require-archive-section
  • runtime bundle scan: NO_RUNTIME_PYTHON_OR_TOOLING_MATCHES
  • staged scope scan: NO_STAGED_RUNTIME_OR_ANCHOR_CHANGES
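The PR does not show how the runtime bundle scan is implemented. As a rough sketch of the idea only, a scan of this kind could walk the skill directory and report any file with a tooling-like suffix. The suffix list and the `RUNTIME_MATCHES` failure line below are hypothetical; only the `NO_RUNTIME_PYTHON_OR_TOOLING_MATCHES` marker comes from the output above.

```python
from pathlib import Path

# Hypothetical suffix list; the PR does not say which patterns its scan matches.
FORBIDDEN_SUFFIXES = {".py", ".pyc", ".sh"}


def scan_runtime_bundle(bundle_root):
    """Return sorted relative paths of files under bundle_root with a forbidden suffix."""
    root = Path(bundle_root)
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix in FORBIDDEN_SUFFIXES
    )


def scan_summary(bundle_root):
    """Format the scan result as a single status line."""
    matches = scan_runtime_bundle(bundle_root)
    if not matches:
        return "NO_RUNTIME_PYTHON_OR_TOOLING_MATCHES"
    # Hypothetical failure format, invented for this sketch.
    return "RUNTIME_MATCHES: " + ", ".join(matches)
```

Under these assumptions, `scan_summary(".agents/skills/watchlist-md")` would print the success marker only when the bundle contains no matching files.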

Review follow-up

  • Added explicit trigger case id validation before using the value as case_id.
  • Added invalid-id regression coverage for non-string IDs and leading/trailing whitespace.
  • Runtime-weight review remains unchanged: no .agents/skills/watchlist-md diff and no runtime bundle additions.
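One way to picture the id validation described in this follow-up is the sketch below. The helper name `validated_case_id` and its error messages are hypothetical and do not reflect the merged code; the point is only that the value is checked for type, emptiness, and surrounding whitespace before it is used as `case_id` in later messages.

```python
def validated_case_id(case, index):
    """Return the case id if it is a clean, non-empty string; otherwise raise ValueError."""
    raw = case.get("id")
    if not isinstance(raw, str):
        raise ValueError(
            f"trigger case #{index}: id must be a string, got {type(raw).__name__}"
        )
    if not raw:
        raise ValueError(f"trigger case #{index}: id must not be empty")
    if raw != raw.strip():
        raise ValueError(
            f"trigger case #{index}: id has leading or trailing whitespace: {raw!r}"
        )
    return raw
```

Falling back to the positional index in the error message avoids formatting an untrusted value as if it were a valid identifier, which is what makes messages confusing when the id is a number, a list, or missing.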

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new trigger evaluation corpus (trigger_cases.json) along with validation logic in check_semantic_cases.py and corresponding unit tests in test_check_watchlist.py. Additionally, the manual smoke check documentation in runtime-smoke.md is updated to forbid storing raw logs or transcripts. Feedback on the changes suggests explicitly validating that the id field in each trigger case is a non-empty string before using it, preventing potential type errors or confusing validation messages.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread evals/check_semantic_cases.py Outdated
@dd3ok dd3ok marked this pull request as ready for review June 15, 2026 10:36
@dd3ok dd3ok merged commit 41ecad6 into main Jun 15, 2026
4 checks passed
