Skip to content

fix: notify users via Slack when agent hits model call step limit#1204

Open
langsmith-forge[bot] wants to merge 1 commit intomainfrom
auto/agent-fix-recursion-limit-notification
Open

fix: notify users via Slack when agent hits model call step limit#1204
langsmith-forge[bot] wants to merge 1 commit intomainfrom
auto/agent-fix-recursion-limit-notification

Conversation

@langsmith-forge
Copy link
Copy Markdown

Problem

The agent silently terminates after hitting a 1000-step recursion limit without sending any Slack notification to users. Users start a task, the agent runs for 1000 steps and then stops, leaving users with no completion message, no PR, and no indication of what happened.

Traces:

Root cause

When DEFAULT_RECURSION_LIMIT = 1_000 is hit, LangGraph raises GraphRecursionError which bypasses all @after_agent middleware (including open_pr_if_needed). Users get no Slack notification that the agent stopped. In a typical 1000-run trace: ~72 LLM calls + 141 tool calls + 787 chain runs = 1000 total — so ~72 model calls hits the hard limit.

Fix

Two changes in agent/server.py and a new middleware file:

  1. Added ModelCallLimitMiddleware(run_limit=60, exit_behavior="end") to the middleware stack. This intercepts gracefully at 60 model calls — before the 1000-step recursion limit fires — injecting an AI message with "Model call limits exceeded: ..." and routing to end (which DOES run @after_agent middleware).

  2. Added notify_step_limit_reached — a new @after_agent middleware in agent/middleware/notify_step_limit.py — that detects the limit marker in the last AI message and posts a Slack thread reply:

    "I've reached my maximum step limit and had to stop. The task may be incomplete. You can retry with a more focused request, or ask me to continue from where I left off."

The notify_step_limit_reached middleware is placed last in the list so it runs after open_pr_if_needed (after_agent hooks run in reverse list order), ensuring any partial work is committed before the user is notified.

Evidence

No new tests written — this is a behavioral infrastructure change (adding middleware + a config parameter). Tests asserting exact middleware configuration or prompt content would be brittle. The production traces are the evidence.

  • CI checks pass locally
  • Existing tests pass, no regressions (107 tests pass with uv sync --locked --extra dev)
  • Change is minimal and scoped — no drive-by refactors

- Root cause: GraphRecursionError at 1000 steps bypassed all @after_agent
  middleware including open_pr_if_needed, leaving users with no notification
- Change: Added ModelCallLimitMiddleware(run_limit=60) to intercept gracefully
  before the hard recursion limit, and added notify_step_limit_reached
  @after_agent middleware to post a Slack thread reply when the limit fires
- Verified: 107 existing tests pass, no regressions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

0 participants