Real outputs produced by github-agent on real GitHub issues / PRs. Two
families of artifacts:
- End-to-end engineering run — what the tool emits when fixing an issue.
- Code-review run, v1 → v2 → v3 — what the tool emits, what changed when the prompt was hardened against hallucination, and what survived human curation. This is the artifact we'd actually send to a maintainer.
The review-family files are kept as a teaching artifact: they document both the failure mode (v1) and the workflow we recommend skeptical maintainers actually follow (raw → curated, with a human in the loop).
Real PR — "Add conftest.py to increase timeouts for slow tests" — 10 lines, single file, labelled needs-review. We chose it because it's small enough to fit in a single review and contains a real, subtle issue.
The raw output of an early version of the review prompt, before we hardened it against hallucination. Multiple findings turned out to be factually wrong when verified against the pytest and pytest-timeout source:
- claimed `pytest-timeout` might be missing — but `pyproject.toml` pins it;
- claimed the `--timeout` CLI flag overrides markers — wrong direction per the pytest-timeout docs;
- claimed an existing `@pytest.mark.timeout(N)` decorator would be overridden by the conftest — backwards: the decorator wins.
Kept here on purpose so the failure mode is visible in the repo itself, and so the diff between v1 and v2 demonstrates what the prompt fix actually changed.
Same PR, same tool, but:
- The system prompt now has explicit anti-hallucination rules (see `src/prompts/review.js`) — "never claim a dependency might be missing without checking the manifest", "never assert library precedence without citation, hedge instead", "prefer fewer correct findings to many shaky ones".
- The pipeline now also fetches dependency-manifest files (`pyproject.toml`, `package.json`, …) into the review's file context, so the "check the manifest first" rule can actually be satisfied.
The v2 output:
- does not speculate about missing dependencies (it can see `pyproject.toml`);
- explicitly hedges on pytest-timeout precedence ("I am not certain of the precedence rules… please confirm") instead of asserting in either direction;
- recommends a concrete fix that sidesteps the precedence ambiguity: check `if not item.get_closest_marker("timeout")` before adding the marker.
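A minimal sketch of that guard, assuming a conftest hook shaped roughly like the one in PR #894 — the timeout constant and hook body here are illustrative, not the PR's actual code:

```python
# Illustrative conftest.py guard; the constant and the hook body are
# assumptions, not the actual contents of PR #894.
import pytest

SLOW_TEST_TIMEOUT = 120  # seconds; hypothetical baseline


def pytest_collection_modifyitems(items):
    for item in items:
        # Only apply the blanket timeout to tests with no timeout marker of
        # their own, so the decorator-vs-conftest precedence never arises.
        if not item.get_closest_marker("timeout"):
            item.add_marker(pytest.mark.timeout(SLOW_TEST_TIMEOUT))
```

Because the conftest never touches already-marked tests, the review no longer needs to be right about precedence at all.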
This is the actual unedited file the tool wrote.
The hand-curated final, distilled from the v2 raw output. Every behavioural claim is verified against the pytest / pytest-timeout source code, with inline citations to the source files. Includes:
- the verdict (`NEEDS_DISCUSSION`) and why;
- one concrete, actionable suggestion (the marker guard) with a code snippet;
- one documentation nudge about the coupling to `pyproject.toml`'s baseline;
- a transparent table of what was cut from the raw output, and why.
This is what you would actually paste into the PR thread.
A 4-file standalone pytest project (a `conftest.py` mirroring PR #894 + `pyproject.toml` + a 3-test file + a marker-inspector script). Run `python verify_precedence.py` and it prints, for each test, all timeout markers attached to its item and which one `get_closest_marker` resolves to. The recorded output (a transcript pinned in that directory's README) makes the precedence claim empirically checkable, not just source-cited.
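The core of such an inspector can be sketched as a small helper — the function name is hypothetical and the repo's actual `verify_precedence.py` may be structured differently. It relies on two real pytest item APIs: `iter_markers(name=...)`, which yields matching markers from closest to furthest, and `get_closest_marker(name)`, which returns the first of those:

```python
# Hypothetical helper: summarise the timeout markers on each collected test
# item. `items` are pytest items (or anything with the same two methods).
def report_timeout_markers(items):
    lines = []
    for item in items:
        all_args = [m.args for m in item.iter_markers(name="timeout")]
        closest = item.get_closest_marker("timeout")
        lines.append(
            f"{item.nodeid}: markers={all_args} "
            f"closest={closest.args if closest else None}"
        )
    return lines
```

Seeing the full marker list next to the resolved one is what turns "the decorator wins" from an assertion into an observation.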
The workflow demonstrated by these three files is the workflow we recommend to maintainers who don't want AI noise in their PR threads:
`agent.review()` → raw output → human verifies behavioural claims against source → curated post
The agent's job is to surface possible concerns and structure them. The human's job is to verify, cut, and decide whether to post. By default the tool writes the raw output to disk and posts nothing — so this curation step is the natural workflow, not an afterthought.
End-to-end agentic run on issue #4 of this repo (a small, scoped request to expose `scripts/verify.js` as `npm run verify`).
Generated by:

```
node src/pipeline.js issue https://github.com/Hadar01/github-agents/issues/4 \
  --dry-run --max-cost=1.50
```

What the audit trail demonstrates:
- the human-readable section structure (Outcome / Safety gates / Files touched / Test runs / Timeline / Self-review / collapsed full transcript);
- the PR safety gate firing correctly: the self-review verdict was `NEEDS_DISCUSSION` and the agent never observed a passing test run, so the pipeline refused to open a PR. Adding `--force-pr` would have shipped it anyway; the gate is designed to make that an explicit, auditable choice.
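The gate's decision amounts to something like the following sketch — in Python for illustration only, since the real implementation lives in the JavaScript pipeline, and the passing-verdict string is an assumption (only `NEEDS_DISCUSSION` and `--force-pr` appear in this document):

```python
# Illustrative PR-gate logic; the real implementation is JavaScript.
# "APPROVE" as the passing verdict is an assumption.
def should_open_pr(verdict: str, saw_passing_tests: bool,
                   force_pr: bool = False) -> bool:
    if force_pr:
        return True  # explicit, auditable override (--force-pr)
    # Refuse unless the self-review passed AND a passing test run was seen.
    return verdict != "NEEDS_DISCUSSION" and saw_passing_tests
```

Both conditions failed on this run, so the refusal exercised the gate's whole logic, not just one branch.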
Cost: $0.0923 for 6 turns.