Sample artifacts

Real outputs produced by github-agent on live GitHub issues and PRs. Two families of artifacts:

  • End-to-end engineering run — what the tool emits when fixing an issue.
  • Code-review run, v1 → v2 → v3 — what the tool emits, what changed when the prompt was hardened against hallucination, and what survived human curation. This is the artifact we'd actually send to a maintainer.

The review-family files are kept as a teaching artifact: they document both the failure mode (v1) and the workflow we recommend skeptical maintainers actually follow (raw → curated, with a human in the loop).


Code review of tqec/tqec PR #894

Real PR — "Add conftest.py to increase timeouts for slow tests" — 10 lines, single file, labelled needs-review. We chose it because it's small enough to fit in a single review yet contains a genuinely subtle issue.

sample-review-tqec-pr894-v1-raw-flawed.md ⚠️ flawed by design

The raw output of an early version of the review prompt before we hardened it against hallucination. Multiple findings turned out to be factually wrong when verified against pytest and pytest-timeout source:

  • claimed pytest-timeout might be missing — but pyproject.toml pins it;
  • claimed the --timeout CLI option overrides markers — the wrong direction, per the pytest-timeout docs;
  • claimed an existing @pytest.mark.timeout(N) decorator would be overridden by the conftest — backwards: the decorator wins.

Kept here on purpose so the failure mode is visible in the repo itself, and so the diff between v1 and v2 demonstrates what the prompt fix actually changed.

sample-review-tqec-pr894-v2-raw.md

Same PR, same tool, but:

  1. The system prompt now has explicit anti-hallucination rules (see src/prompts/review.js) — "never claim a dependency might be missing without checking the manifest", "never assert library precedence without citation, hedge instead", "prefer fewer correct findings to many shaky ones".
  2. The pipeline now also fetches dependency-manifest files (pyproject.toml, package.json, …) into the review's file context, so the "check the manifest first" rule can actually be satisfied (a rough sketch follows this list).
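
A rough illustration of the manifest fetch. The real implementation lives in the Node pipeline and is not reproduced here; this Python sketch only shows the shape of the idea, using the public GitHub contents API, and every name in it is hypothetical:

import base64
import json
import urllib.error
import urllib.request

# Manifests worth pulling into review context; extend as needed.
MANIFESTS = ["pyproject.toml", "package.json"]

def fetch_manifests(owner, repo):
    """Fetch known dependency manifests so 'check the manifest first' can be obeyed."""
    found = {}
    for path in MANIFESTS:
        url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
        try:
            with urllib.request.urlopen(url) as resp:
                payload = json.load(resp)
            # The contents API returns file bodies base64-encoded.
            found[path] = base64.b64decode(payload["content"]).decode()
        except urllib.error.HTTPError:
            pass  # manifest absent in this repo; nothing to add to context
    return found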

The v2 output:

  • does not speculate about missing dependencies (it can see pyproject.toml);
  • explicitly hedges on pytest-timeout precedence ("I am not certain of the precedence rules…please confirm") instead of asserting in either direction;
  • recommends a concrete fix that sidesteps the precedence ambiguity: if not item.get_closest_marker("timeout") before adding the marker (sketched below).
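
A minimal sketch of what that guard can look like inside the PR's conftest.py. The PR's actual file isn't reproduced in this repo, so the hook shape and the SLOW_TEST_TIMEOUT value below are assumptions; only the get_closest_marker guard itself comes from the v2 review:

import pytest

SLOW_TEST_TIMEOUT = 600  # hypothetical baseline; the real value lives in the PR

def pytest_collection_modifyitems(items):
    for item in items:
        # Only attach a hook-level timeout when the test doesn't already
        # carry its own @pytest.mark.timeout(N); this sidesteps the
        # decorator-vs-conftest precedence question entirely.
        if not item.get_closest_marker("timeout"):
            item.add_marker(pytest.mark.timeout(SLOW_TEST_TIMEOUT))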

This is the actual unedited file the tool wrote.

sample-review-tqec-pr894-v3-curated.md — the version a human would post

The hand-curated final, distilled from the v2 raw output. Every behavioural claim is verified against pytest / pytest-timeout source code with inline citations to the source files. Includes:

  • the verdict (NEEDS_DISCUSSION) and why;
  • one concrete actionable suggestion (the marker guard) with a code snippet;
  • one documentation nudge about the coupling to pyproject.toml's baseline;
  • a transparent table of what was cut from the raw output, and why.

This is what you would actually paste into the PR thread.

verify-marker-precedence/ — runtime confirmation of the curated claim

A standalone four-file pytest project (a conftest.py mirroring PR #894, a pyproject.toml, a three-test file, and a marker inspector script). Run python verify_precedence.py and it prints, for each test, every timeout marker attached to its item and the one get_closest_marker resolves to. The recorded output (a transcript pinned in that directory's README) makes the precedence claim empirically checkable, not merely source-cited.
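
For readers who want the same check without cloning that directory, here is a hedged sketch of how such an inspector can be built. The real verify_precedence.py may be structured differently; only the pytest hooks and Item APIs used below are standard:

import pytest

class MarkerInspector:
    """Collect-only plugin: report every timeout marker on each test item."""

    def pytest_collection_finish(self, session):
        for item in session.items:
            markers = [m.args for m in item.iter_markers("timeout")]
            closest = item.get_closest_marker("timeout")
            print(f"{item.nodeid}: timeout markers={markers}, "
                  f"closest={closest.args if closest else None}")

if __name__ == "__main__":
    # --collect-only inspects the markers without actually running the tests
    pytest.main(["--collect-only", "-q"], plugins=[MarkerInspector()])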

The point

The workflow demonstrated by these three files is the workflow we recommend to maintainers who don't want AI noise in their PR threads:

agent.review() → raw output → human verifies behavioural claims against source → curated post

The agent's job is to surface possible concerns and structure them. The human's job is to verify, cut, and decide whether to post. By default the tool writes the raw output to disk and posts nothing — so this curation step is the natural workflow, not an afterthought.


End-to-end engineering run

sample-audit-trail-issue-4.md

End-to-end agentic run on issue #4 of this repo (a small, scoped request to expose scripts/verify.js as npm run verify).

Generated by:

node src/pipeline.js issue https://github.com/Hadar01/github-agents/issues/4 \
  --dry-run --max-cost=1.50

What the audit trail demonstrates:

  • the human-readable section structure (Outcome / Safety gates / Files touched / Test runs / Timeline / Self-review / collapsed full transcript);
  • the PR safety gate firing correctly: the self-review verdict was NEEDS_DISCUSSION and the agent never observed a passing test run, so the pipeline refused to open a PR. Adding --force-pr would have shipped it anyway; the gate is designed to make that an explicit, auditable choice (a sketch of the decision rule follows).
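
For intuition, the gate reduces to roughly the decision rule below — an illustrative Python sketch, not the actual src/pipeline.js code. The function name and arguments are hypothetical; only the verdict string and the two conditions come from the run described above:

def may_open_pr(verdict, saw_passing_tests, force_pr=False):
    if force_pr:
        # --force-pr turns shipping anyway into an explicit, auditable choice
        return True
    # Refuse unless the self-review is clean AND a passing test run was observed.
    return verdict != "NEEDS_DISCUSSION" and saw_passing_tests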

Cost: $0.0923 for 6 turns.