The goal is to keep making progress autonomously, without constant human monitoring.
Practical definition:
- No silent deadlocks
- No infinite loops on the same step with no state change
- Repeated failures are always converted to a different recovery path
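One way to make the "same step with no state change" condition checkable is sketched below. This is a minimal illustration, not the project's implementation: `ProgressGuard`, `max_stalled_attempts`, and the idea of a string state fingerprint are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ProgressGuard:
    """Counts consecutive attempts at one step that left the state unchanged."""

    max_stalled_attempts: int = 3  # hypothetical threshold
    _last_key: Optional[Tuple[str, str]] = field(default=None, init=False)
    _stalled: int = field(default=0, init=False)

    def record(self, step: str, state_fingerprint: str) -> bool:
        """Return True when the caller should switch to a recovery path."""
        key = (step, state_fingerprint)
        if key == self._last_key:
            # Same step, same state: no progress since the last attempt.
            self._stalled += 1
        else:
            # Either the step or the state changed, so restart the count.
            self._last_key = key
            self._stalled = 1
        return self._stalled >= self.max_stalled_attempts
```

A caller would invoke `record` after every attempt and route the task to recovery as soon as it returns `True`, rather than retrying the same step again.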
Completion policy:
- Do not assume completion can be guaranteed in every case, since some cases depend on external conditions
- Guarantee that the system does not intentionally stay stopped
- When progress degrades, force strategy change via recovery state transition
- Recovery-first over first-attempt success
- Idempotent control points (lease / run claim / dedupe signature)
- Backlog-first startup (prioritize consuming existing backlog over new generation)
- Explicit blocked reasons that are machine-recoverable
- Task lease prevents duplicate dispatch
- Runtime lock prevents duplicate execution
- Continuously reclaim dangling / expired / orphaned leases
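The lease rules above can be sketched with a small in-memory table. The names `LeaseTable`, `acquire`, `reclaim_expired`, and the TTL value are hypothetical, and a real system would presumably keep leases in durable storage with atomic updates.

```python
import time
from typing import Dict, List, Optional, Tuple


class LeaseTable:
    """In-memory task leases keyed by task id (illustrative only)."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl_seconds = ttl_seconds
        # task_id -> (owner, expires_at)
        self._leases: Dict[str, Tuple[str, float]] = {}

    def acquire(self, task_id: str, owner: str, now: Optional[float] = None) -> bool:
        """Grant the lease only if no unexpired lease exists for the task."""
        now = time.monotonic() if now is None else now
        current = self._leases.get(task_id)
        if current is not None and current[1] > now:
            return False  # still leased: refuse duplicate dispatch
        self._leases[task_id] = (owner, now + self.ttl_seconds)
        return True

    def reclaim_expired(self, now: Optional[float] = None) -> List[str]:
        """Drop expired leases and return the task ids that became free."""
        now = time.monotonic() if now is None else now
        expired = [tid for tid, (_, exp) in self._leases.items() if exp <= now]
        for tid in expired:
            del self._leases[tid]
        return expired
```

Running `reclaim_expired` on every scheduling pass, instead of only on a timer, keeps orphaned leases from blocking a task indefinitely.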
- Only unjudged successful runs are processed
- Claimed runs cannot be double-judged
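The two judge rules can be read as a single guarded claim operation. The sketch below assumes a hypothetical `Run` record and `claim_for_judgement` helper; in a real store the check and the write would need to be one atomic compare-and-set.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Run:
    run_id: str
    status: str                      # e.g. "success" or "failed" (assumed values)
    judged: bool = False
    judge_claim: Optional[str] = None


def claim_for_judgement(runs: Dict[str, Run], run_id: str, judge_id: str) -> bool:
    """Claim a run for judging; returns False if it is not eligible."""
    run = runs.get(run_id)
    # Only successful runs that have not been judged yet are eligible.
    if run is None or run.status != "success" or run.judged:
        return False
    # An existing claim blocks a second judge from taking the same run.
    if run.judge_claim is not None:
        return False
    run.judge_claim = judge_id
    return True
```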
- `awaiting_judge`
- `quota_wait`
- `needs_rework`
Runtime blocked state for planner issue-link ordering:
issue_linking
Convert a failed state into a recoverable one; never abandon it.
- Escalate to rework/autofix on repeated same failure signature
- On merge conflict after approve, branch to conflict autofix task when possible
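Escalating on a repeated failure signature needs a stable signature and a per-task counter. The following is a sketch under assumptions: how signatures are actually computed is not stated here, so `failure_signature`, `FailureEscalator`, and the threshold of 2 are illustrative only.

```python
import hashlib
from collections import Counter
from typing import Tuple


def failure_signature(step: str, error_message: str) -> str:
    """Hash the step plus a whitespace- and case-normalized error message."""
    normalized = " ".join(error_message.lower().split())
    digest = hashlib.sha256(f"{step}:{normalized}".encode("utf-8")).hexdigest()
    return digest[:16]


class FailureEscalator:
    """Switches from plain retry to rework once a signature repeats."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold  # hypothetical repeat limit
        self._counts: Counter[Tuple[str, str]] = Counter()

    def next_action(self, task_id: str, signature: str) -> str:
        self._counts[(task_id, signature)] += 1
        if self._counts[(task_id, signature)] >= self.threshold:
            return "needs_rework"  # same failure again: change strategy
        return "retry"
```

Counting per `(task_id, signature)` pair means a new kind of failure on the same task starts from a fresh count instead of triggering rework immediately.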
Recovery switch is event-driven, not fixed-time triggered:
- Repeated same failure signature -> `needs_rework` / rework split
- Non-approve circuit breaker -> autofix path
- Quota failure -> `quota_wait` -> cooldown requeue
- Missing judgable run -> restore `awaiting_judge` run context
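The event-to-recovery mapping above can be expressed as a lookup table that is consulted when an event arrives, rather than on a timer. The event keys and action labels below are hypothetical names chosen for this sketch; only the state names come from this page.

```python
from typing import Dict, Optional, Tuple

# event -> (blocked state to set, if any; recovery action to schedule)
RECOVERY_TABLE: Dict[str, Tuple[Optional[str], str]] = {
    "repeated_failure_signature": ("needs_rework", "rework_split"),
    "non_approve_circuit_breaker": (None, "autofix"),
    "quota_failure": ("quota_wait", "cooldown_requeue"),
    "missing_judgable_run": ("awaiting_judge", "restore_run_context"),
}


def on_event(event: str) -> Tuple[Optional[str], str]:
    """Return the recovery step for an event, or fail loudly if unmapped."""
    try:
        return RECOVERY_TABLE[event]
    except KeyError:
        # An unmapped event must not be dropped silently.
        raise ValueError(f"no recovery path defined for event: {event}") from None
```

Raising on unknown events is a deliberate choice in this sketch: a silently ignored event would be exactly the kind of silent stall the design is meant to rule out.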
Treat quota pressure as recoverable external pressure, not terminal failure.
- Single attempt may fail quickly
- Task waits with explicit reason (`quota_wait`)
- Continue cooldown retry until resources recover
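The quota handling described above might look like the following sketch. The `TaskState` fields, the `"blocked"` and `"queued"` status values, and the exponential cooldown with a cap are assumptions made for illustration; only the `quota_wait` reason comes from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskState:
    task_id: str
    status: str = "queued"
    blocked_reason: Optional[str] = None
    retry_at: Optional[float] = None
    quota_failures: int = 0


def on_quota_failure(task: TaskState, now: float,
                     base_cooldown: float = 30.0,
                     max_cooldown: float = 600.0) -> None:
    """Park the task with an explicit reason and a concrete retry time."""
    task.quota_failures += 1
    # Assumed policy: double the cooldown per consecutive failure, capped.
    cooldown = min(base_cooldown * 2 ** (task.quota_failures - 1), max_cooldown)
    task.status = "blocked"
    task.blocked_reason = "quota_wait"
    task.retry_at = now + cooldown


def requeue_if_cooled_down(task: TaskState, now: float) -> bool:
    """Requeue a quota-blocked task once its cooldown has elapsed."""
    if (task.blocked_reason == "quota_wait"
            and task.retry_at is not None and now >= task.retry_at):
        task.status = "queued"
        task.blocked_reason = None
        task.retry_at = None
        return True
    return False
```

The failure counter is never reset here, so a later quota failure waits longer; a real policy might reset it after a successful run.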
Operators must observe not only trial results but also intended next steps:
- Run-level failure
- Task-level next retry reason/time
- Backlog gate reason returned by preflight
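The three observability items above can be gathered into one record per task. This is a hedged sketch: `TaskStatusView`, its field names, and the JSON shape are hypothetical and not taken from the actual API.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TaskStatusView:
    task_id: str
    last_run_failure: Optional[str]     # run-level failure, if any
    next_retry_reason: Optional[str]    # e.g. "quota_wait"
    next_retry_at: Optional[str]        # ISO 8601 timestamp, if scheduled
    backlog_gate_reason: Optional[str]  # reason returned by preflight, if gated


def to_json(view: TaskStatusView) -> str:
    """Serialize the view so operators see results and the intended next step."""
    return json.dumps(asdict(view), sort_keys=True)
```

Keeping the next retry reason and time alongside the last failure lets an operator tell a task that is waiting on purpose from one that is stuck.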
- Maximizing first-attempt success at cost of recoverability
- Fixed strict sequential processing when safe concurrency exists
- Recovery flows that require manual intervention only
- Recovery design that relies only on fixed-interval watchdog
This page describes design principles; for actual triage, follow state vocabulary -> transition -> owner -> implementation.
- state-model (state vocabulary)
- flow (transitions and recovery paths)
- operations (API procedures and operation shortcuts)
- agent/README (owning agent and implementation tracing)