Agentic development orchestrator — a LangGraph
state machine that drives a PRD's work units, in dependency order, each through
plan → implement → test-gate → review → PR, with durable checkpointed state and
human approval gates. The Claude Agent SDK
is the per-node execution engine; blacksmith operates on a target repository via
isolated git clones, the repo's own test/lint toolchain, and gh.
The deliverable is LangGraph fluency — a working, inspectable grasp of its
state-machine / checkpoint / human-in-the-loop model. blacksmith now drives most of
its own development: it builds its features from Contract v1 PRDs, on itself. See
blacksmith-v0-prd.md for the original spec. Writing your own
PRD? The PRD authoring guide documents the contract
every PRD must conform to.
Status: well past the v0 spine — blacksmith builds its own features end to end, and
nearly everything below was shipped that way (write a PRD → blacksmith builds it → review
the PR). Current on main: multi-unit execution of a PRD's whole depends_on DAG
(topological order, independent units in parallel within a level, one combined PR);
clone-based isolation (each run/unit works in a throwaway git clone, never the real
checkout); a human-QA path (a human-gated unit opens a draft PR and ends
AWAITING_QA, branch preserved); tiered models (implement on Sonnet first, escalate to
Opus only on a gate failure); long-term memory (per-repo gate-failure lessons fed back
to the planner via a SQLite Store); a local observability suite — per-run/per-unit
metrics recording, a blacksmith runs history command, a built-in blacksmith dashboard,
and per-call agent transcripts; a rendered interactive CLI (Markdown plans, diffs,
live progress) that degrades to plain output when piped; fresh-run state isolation;
per-run token/cache instrumentation; and an org cost reporter. 300+ tests, CI-green.
Proven against its own repo and an external Node/TypeScript target.
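
The "topological order, independent units in parallel within a level" behaviour can be illustrated with the standard library alone. This is a minimal sketch, not blacksmith's actual scheduler: the function name `dependency_levels` and the unit ids are hypothetical, and it assumes `depends_on` is available as a plain mapping from unit id to the ids it depends on.

```python
from graphlib import TopologicalSorter


def dependency_levels(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group units into levels; every unit in a level has all its dependencies met."""
    sorter = TopologicalSorter(depends_on)  # mapping: unit -> its predecessors
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    levels: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything runnable right now
        levels.append(ready)
        sorter.done(*ready)  # finishing the whole level unlocks the next one
    return levels


# Hypothetical PRD: WU-02 and WU-03 depend on WU-01; WU-04 depends on both.
units = {
    "WU-01": [],
    "WU-02": ["WU-01"],
    "WU-03": ["WU-01"],
    "WU-04": ["WU-02", "WU-03"],
}
print(dependency_levels(units))
# prints [['WU-01'], ['WU-02', 'WU-03'], ['WU-04']]
```

Units that share a level (here WU-02 and WU-03) are the ones that could run in parallel; the levels themselves run in sequence.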
Python 3.12 · LangGraph · Claude Agent SDK · SQLite checkpointer · managed with uv
uv sync # provisions Python 3.12 + installs deps
cp .env.example .env # add your dedicated BLACKSMITH_ANTHROPIC_API_KEY
uv run pytest # run the test suite
uv run ruff check # lint

Install blacksmith as a user-wide tool so it's on your PATH and runnable from
inside any repo:
uv tool install . # from a local clone
uv tool install git+https://github.com/smith-and-web/blacksmith # from the git URL

Once installed, cd into the target repo and run blacksmith <prd> directly (no
uv run prefix). blacksmith discovers its blacksmith.config.toml by walking up from
the current directory to the git root, and — when [target].repo_path is omitted —
operates on that repo (the git root of where you're standing). See
Running.
- blacksmith.config.toml: blacksmith's own runtime config, covering model tiering, target repo, checkpointer, the long-term memory [store], the run-metrics [metrics] sink, agent [transcripts], and API auth. Parsed by blacksmith/config.py. It's discovered by walking up from the current directory to the git root, so a globally-installed blacksmith works from any nested path inside the repo.
- A separate blacksmith.toml lives in each target repo and defines that repo's toolchain: test_cmd, optional lint_cmd, and optional setup_cmd (a one-off provisioning step like npm ci, run before the gate). It is read by the test gate.
The dedicated Anthropic API key is read from the env var named in [api].key_env_var
(default BLACKSMITH_ANTHROPIC_API_KEY) — never from subscription auth, keeping
blacksmith's metered spend isolated.
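
As a rough sketch of that lookup (the helper name `read_api_key` is hypothetical, and blacksmith's real error handling may differ), the key is taken only from the configured variable, with no fallback to any other credential source:

```python
import os


def read_api_key(key_env_var: str = "BLACKSMITH_ANTHROPIC_API_KEY") -> str:
    """Return the key from the configured env var, failing loudly if it is unset."""
    key = os.environ.get(key_env_var, "").strip()
    if not key:
        # Deliberately no fallback: a missing dedicated key is an error.
        raise RuntimeError(f"{key_env_var} is not set; add it to .env or the environment")
    return key
```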
# from inside the target repo, with blacksmith installed globally:
blacksmith <path/to/prd.md> # run a PRD's work units (the whole DAG)
blacksmith <prd> --approve plan,pr # non-interactive (CI / headless)
# from a clone of blacksmith itself, without a global install:
uv run blacksmith <path/to/prd.md>

Run-inside-the-repo flow. When [target].repo_path is omitted from
blacksmith.config.toml (or the [target] section is dropped entirely), blacksmith
operates on the git root of the directory you run it from. So you can cd into any
repo, commit a blacksmith.config.toml (no repo_path needed) plus a blacksmith.toml,
and run blacksmith <prd> from anywhere inside it. Setting an explicit absolute
[target].repo_path still works unchanged and takes precedence.
The PRD path is the single positional argument. --config points at a non-default
blacksmith.config.toml (otherwise it's discovered by walking up to the git root);
--thread-id names the checkpointer thread; a fresh run resets a terminally-finished
thread and refuses a paused one, so a reused id won't resurrect prior state — a fresh id
per run is still the simplest habit.
Interactive runs pause for a y/n at the plan and PR gates; --auto-approve approves
both, and --approve plan,pr approves only the gates you name (an unlisted gate is
denied, halting the run there). --quiet silences the per-node progress stream.
Other entry points:
blacksmith validate <prd> # offline contract check — field-level errors, zero model spend
blacksmith resume --thread-id <id> # continue an interrupted run from its SQLite checkpoint
blacksmith runs [<thread-id>] # recorded run history; pass a thread-id to drill into one run
blacksmith dashboard # local read-only web dashboard over recorded run metrics (localhost)
blacksmith --issue <N> # scaffold a PRD skeleton from GitHub issue #N (PR links Closes #N)
blacksmith costs # org usage + cost from the Admin API (read-only; needs an admin key)

blacksmith operates on a target repo via git clones: nothing about blacksmith is compiled into it, and the repo needs no prior setup beyond being a git clone.

blacksmith only sees committed state. Each run cuts a fresh clone from HEAD, so anything it should use must be committed, not just staged, including blacksmith.toml and CLAUDE.md (a staged-but-uncommitted config reads as "no blacksmith.toml found"). That clone also starts with no installed dependencies (node_modules, virtualenvs, etc. are gitignored), which is what setup_cmd is for.
To point blacksmith at a new project:
- Point blacksmith at the clone. Either set [target] repo_path in blacksmith.config.toml to the local path, or omit repo_path and run blacksmith from inside the repo; it then targets the git root of your current directory (see Running). Set default_branch if it isn't main.
- Add a blacksmith.toml to the target repo so the test gate knows its toolchain. Because the gate runs in a fresh clone with no installed deps, a Node/TS target needs setup_cmd to install them; test_cmd = "npm test" alone dies with vitest: command not found:

      # committed in the TARGET repo (e.g. a Node/TypeScript MCP server)
      setup_cmd = "npm ci"        # optional; one-off provisioning, runs before the gate
      test_cmd = "npm test"
      lint_cmd = "npm run lint"   # optional; runs only if tests pass

  Commands run through a shell, so chains work directly (test_cmd = "npm ci && npm test") with no sh -c wrapper. cargo needs no setup_cmd; it fetches its own deps.
- Give it context: commit a CLAUDE.md. For a repo with no claude.ai Project, a root-level CLAUDE.md is how its conventions reach the agent: blacksmith reads the clone's CLAUDE.md and injects it into the implementer's system prompt as project context. For safety it does not load the repo's .claude/settings.json (permissions and hooks are never inherited from a target), and the PRD untouchables always override the repo's own guidance.
- Write a Contract v1 PRD for the work; see docs/prd-authoring-guide.md. Set primary_target_repo, declare layers, list untouchables, and define work_units. blacksmith runs the whole work_units DAG in dependency order on one shared branch and opens a single combined PR; use depends_on to order them (independent units at the same level run in parallel). Use an auto layer where the test gate should decide pass/fail end to end; a human layer instead opens a draft PR for manual QA.
- Have the claude CLI on PATH; the Agent SDK spawns it for live runs.
- Run it: uv run blacksmith path/to/your-prd.md, then approve at the plan and PR gates (or run headless with --approve; see Running).
Worked example: fixing an MCP spec violation. Given an Obsidian MCP server whose tool violates the MCP spec: add blacksmith.toml (npm ci setup + npm test / npm run lint), commit a CLAUDE.md capturing the server's conventions, and write a one-unit PRD whose work unit targets the offending tool's module with a test_contract that asserts the spec-conformant shape. blacksmith plans, implements in an isolated clone, gates on npm test, and opens a PR for review, with no claude.ai Project required.
- WU-01 — project scaffold + config loader
- WU-02 — PRD contract schema + validator
- WU-03 — state schema + graph skeleton + checkpointer
- WU-04 — Claude Agent SDK executor wrapper (mocked tests pass; live agent path confirmed via dogfood)
- WU-05 — clone manager
- WU-06 — toolchain-aware test gate
- WU-07 — HITL interrupt nodes (plan + PR)
- WU-08 — PR node
- WU-09 — plan node
- WU-10 — implement node (guard + diff/commit auto-tested; live agent-edit confirmed via dogfood)
- WU-11 — end-to-end wiring (happy path + human-halt-on-fail) + CLI
- Multi-unit DAG execution: topological ordering + parallel fan-out within a dependency level, accumulating onto one combined PR.
- Clone-based isolation: a throwaway git clone per run/unit; the real checkout is never touched.
- Human-QA path: a human-gated unit opens a draft PR and ends AWAITING_QA, branch preserved for manual review.
- Tiered models: implement on Sonnet first, escalate to Opus only on a gate failure.
- More CLIs: validate (offline contract check), resume (continue from a checkpoint), runs (recorded run history + per-run drill-down), dashboard (local metrics UI), --issue N (scaffold from a GitHub issue), costs (org Admin-API usage/cost), and global uv tool install.
- Long-term memory: per-repo gate-failure lessons persisted in a SQLite Store and fed back into the planner's context on later runs.
- Observability: per-run/per-unit metrics recorded to a local SQLite, a blacksmith runs history command, a built-in localhost blacksmith dashboard, and per-call agent transcripts linked from each run.
- Interactive CLI: rendered Markdown plans, diffs, and test output with a live progress indicator at the gates, degrading to plain, parseable output when piped or under --quiet.
- Fresh-run state isolation: a fresh run resets a terminally-finished thread and refuses a paused one, so a reused --thread-id can't resurrect a prior run's errors.
- Cost visibility: end-of-run cost total (summed across all units + escalations) plus per-run token + cache-hit instrumentation.
- Robustness: repo-consistency preflight, reason-accurate guard-block reporting, graceful executor failures, and a forward-migrating metrics store.