[codex] Add lightweight watchlist trigger evals #25
Code Review
This pull request introduces a new trigger evaluation corpus (trigger_cases.json) along with validation logic in check_semantic_cases.py and corresponding unit tests in test_check_watchlist.py. Additionally, the manual smoke check documentation in runtime-smoke.md is updated to forbid storing raw logs or transcripts. Feedback on the changes suggests explicitly validating that the id field in each trigger case is a non-empty string before using it, preventing potential type errors or confusing validation messages.
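The review feedback can be illustrated with a minimal sketch. It assumes each trigger case is a JSON object with an `id` key; the helper name `extract_case_id` and the error wording are hypothetical and are not taken from check_semantic_cases.py.

```python
from typing import Any, List, Optional, Tuple


def extract_case_id(case: Any, index: int) -> Tuple[Optional[str], List[str]]:
    """Return (case_id, errors); case_id is None when `id` is unusable.

    Hypothetical helper: validates `id` before it is used as a case_id
    in later error messages.
    """
    if not isinstance(case, dict):
        kind = type(case).__name__
        return None, [f"trigger case #{index}: expected an object, got {kind}"]
    raw_id = case.get("id")
    # Reject missing, non-string, and blank ids up front so later checks
    # never format a message around None, a number, or an empty string.
    if not isinstance(raw_id, str) or not raw_id.strip():
        return None, [f"trigger case #{index}: 'id' must be a non-empty string"]
    return raw_id, []
```

A caller would skip the remaining per-case checks when the returned `case_id` is `None`, so a malformed `id` yields one clear message instead of a type error or a confusing follow-on report.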
Summary
evals/trigger_cases.json with 20 balanced trigger/no-trigger cases.
Runtime weight
.agents/skills/watchlist-md/.
docs/.
evals/ and deterministic; it performs no LLM, runtime, browser, GitHub API, or network calls.
Validation
PYTHONDONTWRITEBYTECODE=1 python -m unittest discover -s evals -p 'test_*.py' - 95 tests passed
python evals/check_semantic_cases.py - 30 semantic cases; 20 trigger cases
python evals/check_skill_package.py
python evals/check_policy_markers.py
python evals/check_release_metadata.py
python evals/check_watchlist.py examples/WATCHLIST.example.md --strict-format --strict-safety --require-archive-section
python evals/check_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md --strict-format --strict-safety --require-archive-section
python tools/validate_watchlist.py examples/WATCHLIST.example.md --strict-format --strict-safety --require-archive-section
python tools/validate_watchlist.py .agents/skills/watchlist-md/assets/WATCHLIST.template.md --strict-format --strict-safety --require-archive-section
NO_RUNTIME_PYTHON_OR_TOOLING_MATCHES
NO_STAGED_RUNTIME_OR_ANCHOR_CHANGES
Review follow-up
id validation before using the value as case_id.
.agents/skills/watchlist-md diff and no runtime bundle additions.
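For reference, the Summary's "20 balanced trigger/no-trigger cases" could be checked offline along these lines. This is only a sketch under stated assumptions: the boolean field name `should_trigger`, the helper name `check_balance`, and the reading of "balanced" as an equal number of trigger and no-trigger cases are all hypothetical, since the actual schema of evals/trigger_cases.json is not shown here.

```python
from collections import Counter
from typing import Any, List


def check_balance(cases: List[Any], expected_total: int = 20) -> List[str]:
    """Return error strings; an empty list means the corpus looks balanced.

    Pure function over already-parsed JSON: no LLM, runtime, browser,
    GitHub API, or network calls.
    """
    errors: List[str] = []
    if len(cases) != expected_total:
        errors.append(f"expected {expected_total} trigger cases, found {len(cases)}")
    counts: Counter = Counter()
    for index, case in enumerate(cases):
        # `should_trigger` is an assumed field name, not the confirmed schema.
        label = case.get("should_trigger") if isinstance(case, dict) else None
        if not isinstance(label, bool):
            errors.append(f"case #{index}: 'should_trigger' must be a boolean")
            continue
        counts[label] += 1
    if counts[True] != counts[False]:
        errors.append(
            f"unbalanced corpus: {counts[True]} trigger vs {counts[False]} no-trigger"
        )
    return errors
```

Because the function only inspects parsed data, the same input always produces the same error list, which matches the deterministic, offline character described under Runtime weight.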