fix(core): detect tmux server death and clean up orphaned dashboard #696

Open
sreyom31 wants to merge 6 commits into ComposioHQ:main from sreyom31:feat/issue-695
@sreyom31

Summary

Fixes #695 — tmux server dies silently causing orchestrator session crash loop.

  • Fix .catch(() => true) → .catch(() => false) in the lifecycle manager's isAlive check: this was the core bug, which masked tmux server death by treating thrown errors as "session alive"
  • Add batch-death detection in pollAll(): when all active sessions die in a single poll cycle, it fires the onAllSessionsKilled callback and logs a lifecycle.runtime_server_dead event
  • Lifecycle worker kills the dashboard on runtime death: it wires the callback to send SIGTERM to the dashboard parent process (from running.json) and then shuts itself down, preventing orphaned dashboards
  • ao start cleans up orphaned dashboards: before falling back to a new port, it checks running.json for stale entries and kills the process holding the port
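
For readers skimming the first bullet, here is a minimal sketch of why the one-word change matters. The `Runtime` interface, the helper names, and the error message are assumptions made for illustration; they are not taken from the PR's diff.

```typescript
// Hypothetical runtime interface; the real one in the codebase may differ.
interface Runtime {
  isAlive(sessionId: string): Promise<boolean>;
}

// Before: a rejected probe (e.g. the tmux server is gone) is reported as "alive".
function checkAliveBuggy(runtime: Runtime, id: string): Promise<boolean> {
  return runtime.isAlive(id).catch(() => true);
}

// After: a rejected probe is reported as "not alive", so the session can
// transition to killed.
function checkAliveFixed(runtime: Runtime, id: string): Promise<boolean> {
  return runtime.isAlive(id).catch(() => false);
}

// Stand-in runtime whose server has died: every probe rejects.
const deadRuntime: Runtime = {
  isAlive: () => Promise.reject(new Error("no server running")),
};
```

With `deadRuntime`, `checkAliveBuggy` resolves to `true` on every poll, which matches the "sessions appear alive indefinitely" symptom, while `checkAliveFixed` resolves to `false`.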

Test plan

  • Unit test: isAlive rejection → session transitions to killed
  • Unit test: all sessions killed in one cycle → onAllSessionsKilled called
  • Unit test: partial death → onAllSessionsKilled NOT called
  • All 33 lifecycle-manager tests pass
  • Manual: ao start → tmux kill-server → verify sessions are killed, the dashboard shuts down, and ao start recovers on the original port

🤖 Generated with Claude Code

…omposioHQ#695)

The lifecycle manager's isAlive check swallowed runtime server crashes via
.catch(() => true), masking tmux server death. This caused sessions to appear
alive indefinitely, the dashboard to linger as an orphan holding the port,
and ao start to fall back to wrong ports.

- Fix .catch(() => true) → .catch(() => false) so isAlive failures mark
  sessions as killed
- Add batch-death detection in pollAll() with onAllSessionsKilled callback
- Lifecycle worker wires callback to SIGTERM the dashboard parent and
  self-shutdown
- ao start cleans up orphaned dashboards from stale running.json before
  falling back to a new port

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nitialized

Declare lifecycle as let before shutdown handler so signal handlers can
safely call lifecycle?.stop() even if getLifecycleManager hasn't resolved yet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
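
A rough sketch of the ordering this commit describes, under stated assumptions: `LifecycleManager`, `SignalRegistrar`, and `startWorker` are hypothetical names, and signal registration is injected so the example does not depend on Node's `process` object (in a real worker that role would be played by `process.on`).

```typescript
// Hypothetical shapes for illustration only.
interface LifecycleManager {
  stop(): void;
}

type SignalRegistrar = (signal: string, handler: () => void) => void;

async function startWorker(
  onSignal: SignalRegistrar,
  getLifecycleManager: () => Promise<LifecycleManager>,
): Promise<void> {
  // Declared before the handler is registered, so the handler can close over
  // the variable while it is still undefined.
  let lifecycle: LifecycleManager | undefined;

  const shutdown = (): void => {
    // Optional chaining makes this a no-op until the manager has resolved.
    lifecycle?.stop();
  };
  onSignal("SIGTERM", shutdown);
  onSignal("SIGINT", shutdown);

  lifecycle = await getLifecycleManager();
}
```

If a signal arrives while `getLifecycleManager()` is still pending, the handler runs without throwing; once the assignment completes, the same handler stops the manager.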
…tance

Snapshot the running.json PID when the lifecycle worker starts instead of
reading it at callback time. Prevents killing a newer AO instance that
overwrote running.json after the original parent died.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
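
The snapshot idea in this commit can be sketched as follows. `ReadRunningPid`, `Kill`, and `makeRuntimeDeathHandler` are hypothetical names, and both the running.json reader and the kill function are injected rather than taken from the actual code.

```typescript
// Hypothetical reader for running.json; returns the recorded dashboard PID, if any.
type ReadRunningPid = () => number | undefined;
// Hypothetical stand-in for a process-kill function.
type Kill = (pid: number, signal: string) => void;

function makeRuntimeDeathHandler(readRunningPid: ReadRunningPid, kill: Kill): () => void {
  // Snapshot once, when the worker starts.
  const parentPid = readRunningPid();

  return (): void => {
    // Deliberately not re-reading here: by callback time the file may
    // describe a newer instance that overwrote it.
    if (parentPid !== undefined) {
      kill(parentPid, "SIGTERM");
    }
  };
}
```

Because the PID is captured in the closure, a later rewrite of running.json cannot redirect the SIGTERM to a different process.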
- Pass AO_PARENT_PID env var when spawning lifecycle worker so it knows
  the dashboard parent PID without reading running.json (which isn't
  populated yet at worker startup)
- Remove running.json dependency from orphan cleanup in ao start since
  isAlreadyRunning() already prunes stale entries before runStartup();
  use stopDashboard() directly via lsof instead

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
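
One way the AO_PARENT_PID handoff might look, as a sketch only. The variable name comes from the commit message; `workerEnv`, `parseParentPid`, and the validation rules are assumptions, and the environment is passed as a plain record rather than read from `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Spawning side: extend the child's environment with the parent's PID.
function workerEnv(base: Env, parentPid: number): Env {
  return { ...base, AO_PARENT_PID: String(parentPid) };
}

// Worker side: accept only a plain positive integer, otherwise report "unknown".
function parseParentPid(env: Env): number | undefined {
  const raw = env["AO_PARENT_PID"];
  if (raw === undefined || !/^\d+$/.test(raw)) {
    return undefined;
  }
  const pid = Number(raw);
  // Reject 0 and values too large to be represented exactly.
  return Number.isSafeInteger(pid) && pid > 0 ? pid : undefined;
}
```

Rejecting `0` and negative values is a defensive choice in this sketch: on POSIX systems those values address process groups rather than a single process when passed to kill.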
A single session dying (agent crash, user kill, PR closure) should not
trigger onAllSessionsKilled and tear down the dashboard. Require at
least 2 sessions to die in the same poll cycle for the heuristic to fire.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
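
Read together with the PR summary, the heuristic appears to be "every active session died in this cycle, and at least two did". A minimal sketch under that reading follows; `isRuntimeServerDeath` and `MIN_BATCH_DEATHS` are hypothetical names, and the real pollAll() may track state differently.

```typescript
// Threshold stated in the commit message above.
const MIN_BATCH_DEATHS = 2;

// activeBefore: IDs of sessions that were active when the poll cycle began
//               (assumed unique).
// diedThisCycle: IDs of sessions that transitioned to killed during the cycle.
function isRuntimeServerDeath(activeBefore: string[], diedThisCycle: string[]): boolean {
  const died = new Set(diedThisCycle);
  const deaths = activeBefore.filter((id) => died.has(id)).length;
  // Fire only when every active session died and the batch is large enough
  // that a lone agent crash or user kill cannot trigger it.
  return deaths >= MIN_BATCH_DEATHS && deaths === activeBefore.length;
}
```

Under this reading a deployment with exactly one active session never triggers the callback, which is the trade-off the commit accepts to avoid tearing down the dashboard on an ordinary single-session exit.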

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sreyom31
Author

@suraj-markup please review whenever available.

Development

Successfully merging this pull request may close these issues.

tmux server dies silently causing orchestrator session crash loop