Daemon unresponsive after prolonged multi-workspace use; restart fails because zombie holds singleton (v0.1.51, Linux) #227

@manoelneto

Description

Summary

After several hours of use with multiple concurrent workspaces / agent sessions, the Paseo daemon becomes unresponsive: the UI times out with Timeout waiting for message (60000ms), in-flight sessions stop progressing, and clicking Settings → Start daemon fails with Daemon failed to start because the new detached child exits during the grace period — the previous daemon is still running and holding the singleton lock. Recovery requires kill -9 of the supervisor + daemon processes; SIGTERM is ignored.

Reproduces on Linux with v0.1.51. Severity correlates with the number of concurrently active workspaces, which aligns with a log-volume / memory-growth hypothesis described below.

Environment

  • Paseo desktop: v0.1.51 (Electron 41.0.3, Chromium 146.0.7680.80)
  • OS: Linux Mint, kernel 6.14.0-37-generic (x86_64)
  • Active workspaces at time of wedge: several concurrent sessions under ~/.paseo/worktrees/1kw2wg5k/*

Symptoms

  1. Active sessions progressively slow down, then stop responding.

  2. [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms) repeats in ~/.config/Paseo/logs/main.log.

  3. Clicking Settings → Start daemon fails:

    Start daemon failed: Error invoking remote method 'paseo:invoke':
      Error: Daemon failed to start.
    
  4. The attempted spawn immediately exits during the grace period:

    [desktop daemon] starting detached daemon
    [desktop daemon] detached spawn returned
    [desktop daemon] detached child emitted exit during grace period { pid: ..., childPid: ... }
    [desktop daemon] detached startup grace period completed { ..., exitedEarly: true }
    [Settings] Failed to change desktop daemon state Error: ... Daemon failed to start.
    
  5. The original daemon and supervisor processes are still running, holding the singleton. They ignore SIGTERM. Only SIGKILL clears them.

Process state at the time of wedge

PID    %CPU  RSS       CMD
4757   0.0   92 MB     /opt/Paseo/Paseo.bin … node-entrypoint-runner.js … supervisor-entrypoint.js
4770   33.2  1.30 GB   /opt/Paseo/Paseo.bin … node-entrypoint-runner.js … @getpaseo/server/dist/server/server/index.js
  • 1.30 GB RSS on the daemon process is unusually high for a Node server and strongly suggests unbounded buffering or a leak.
  • ~33% of one core sustained while IPC is not being serviced within 60 s — consistent with event-loop saturation from stream handling + logging + GC pauses, not whole-machine CPU starvation. The rest of the laptop was ~90% idle.
  • The symptom worsens as more workspaces are opened, which fits a per-session rate-dependent bug.
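
To support the event-loop-saturation reading over whole-machine CPU starvation, a minimal diagnostic sketch (not Paseo code; `measureEventLoopLag` is a hypothetical helper) that times how late a `setTimeout(0)` callback actually fires; sustained lag in the hundreds of milliseconds while the machine is otherwise idle would fit the hypothesis above:

```typescript
// Measure event-loop lag: schedule a zero-delay timer and see how late it runs.
function measureEventLoopLag(): Promise<number> {
  const start = process.hrtime.bigint();
  return new Promise((resolve) => {
    setTimeout(() => {
      // Convert nanoseconds to milliseconds of observed scheduling delay.
      const lagMs = Number(process.hrtime.bigint() - start) / 1e6;
      resolve(lagMs);
    }, 0);
  });
}

// Example use: warn when lag crosses a threshold (threshold value is illustrative).
async function watchLag(thresholdMs = 200): Promise<void> {
  const lag = await measureEventLoopLag();
  if (lag > thresholdMs) {
    console.warn(`event loop lag ${lag.toFixed(1)} ms`);
  }
}
```

Running something like this inside the wedged daemon would distinguish "busy but alive" from a truly blocked loop.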

Daemon log evidence — runaway trace-level stream logging

main.log is dominated by pino trace ("level":10) entries from a single active Claude session. Every SDK stream event is logged twice:

{"level":10,"time":1775744457431,"pid":4770,"module":"bootstrap","module":"agent","provider":"claude",
 "claudeSessionId":"ac2ffc96-d560-4e2e-a06b-6ebb6a4a0689","messageType":"stream_event",
 "turnId":"foreground-turn-5","msg":"Claude query pump: SDK message"}

{"level":10,"time":1775744457431,"pid":4770,"module":"bootstrap","module":"agent","provider":"claude",
 "claudeSessionId":"ac2ffc96-d560-4e2e-a06b-6ebb6a4a0689","messageType":"stream_event",
 "messageUuid":"a01e9640-ef24-4b69-874d-f81fade2b9b8","msg":"Claude query pump: raw SDK message"}

Observations:

  • Both entries fire for every single streaming token: one tagged turnId, one tagged messageUuid, but otherwise an identical payload.
  • The module binding is duplicated in the same record ("module":"bootstrap","module":"agent"), suggesting the pino.child() chain is stacking rather than merging.
  • These are level-10 (trace) emissions in a shipped release build. There's no user-facing log-level control I can find.
  • With N concurrent active sessions, log volume grows roughly N×, which matches the "gets worse the more workspaces I have" observation.

I strongly suspect this is the primary memory-growth driver behind the GC pressure that eventually stalls the event loop. Demoting these two emissions to debug (or gating them behind PASEO_LOG_LEVEL=trace) would likely eliminate the wedge on its own.
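
For concreteness, a sketch of the proposed gate (names like shouldEmit are hypothetical, not Paseo's actual API; the numeric levels are pino's standard ones, where trace=10 and info=30). The idea is that trace-level stream events only pass when the user opts in via PASEO_LOG_LEVEL=trace:

```typescript
// pino-style numeric severity levels (higher = more severe).
const LEVELS: Record<string, number> = {
  trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60,
};

// Emit only when the entry's severity meets or exceeds the configured floor.
function shouldEmit(entryLevel: string, configuredLevel: string): boolean {
  return LEVELS[entryLevel] >= LEVELS[configuredLevel];
}

// Default to "info" so release builds never emit per-token trace records.
const configured = process.env.PASEO_LOG_LEVEL ?? "info";
shouldEmit("trace", configured); // false unless PASEO_LOG_LEVEL=trace is set
shouldEmit("error", configured); // true at the default level
```

With this floor in place, the two per-token emissions become opt-in diagnostics instead of an always-on memory-growth driver.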

Timeline from ~/.config/Paseo/logs/main.log

All timestamps 2026-04-09, trimmed for length:

11:02:47  [error] [AgentInput] Failed to send message: Error:
          Working directory does not exist: /home/<user>/.paseo/worktrees/1kw2wg5k/radical-deer
11:04:04  [error] [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms)
11:04:26  [info]  [desktop daemon] starting detached daemon
11:04:26  [info]  [desktop daemon] detached child emitted exit during grace period { pid: 812098, childPid: 812990 }
11:04:26  [info]  [desktop daemon] detached startup grace period completed { …, exitedEarly: true }
11:04:34  [info]  [desktop daemon] starting detached daemon
11:04:34  [info]  [desktop daemon] detached child emitted exit during grace period { pid: 813521, childPid: 814426 }
11:04:34  [info]  [desktop daemon] detached startup grace period completed { …, exitedEarly: true }
                                              ↑ both failed because the prior daemon was still holding the singleton
11:05:43  [info]  [desktop daemon] starting detached daemon
11:05:44  [info]  [desktop daemon] detached startup grace period completed { pid: 4026, childPid: 4757, exitedEarly: false }
                                              ↑ succeeded only after I killed the zombie from a terminal
11:22:42  [error] [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms)
11:25:19  [error] [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms)
11:32:44  [info]  [desktop daemon] starting detached daemon
11:32:44  [info]  [desktop daemon] detached child emitted exit during grace period { pid: 4026, childPid: 222309 }
11:32:44  [info]  [desktop daemon] detached startup grace period completed { …, exitedEarly: true }
11:32:44  [error] [Settings] Failed to change desktop daemon state Error: … Daemon failed to start.

Note the sequence at 11:02 → 11:04: a session pointed at a worktree (radical-deer) that no longer existed on disk, immediately followed by 60 s IPC timeouts. This may be a contributing trigger — the daemon possibly got stuck trying to drive a session against an invalid cwd and never released the stream, amplifying the log-spam spiral.
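
If this reading is right, the fix shape would be to validate the session's cwd before driving the pump and to treat a missing worktree as a terminal error that releases held resources. A hedged sketch, assuming hypothetical helpers (TerminalSessionError, runPump are mine, not Paseo's code):

```typescript
import { existsSync } from "node:fs";

// Hypothetical error class: marks failures that should end the session's
// stream pump outright rather than be retried.
class TerminalSessionError extends Error {}

function assertWorkingDirectory(cwd: string): void {
  if (!existsSync(cwd)) {
    throw new TerminalSessionError(`Working directory does not exist: ${cwd}`);
  }
}

// Drive the pump, but guarantee resource release on any exit path, terminal
// errors included, so a dead worktree can't leave a stream in flight.
async function runPump(
  cwd: string,
  pump: () => Promise<void>,
  release: () => void,
): Promise<void> {
  try {
    assertWorkingDirectory(cwd);
    await pump();
  } finally {
    release();
  }
}
```

The essential property is the finally: the session's stream and buffers are freed whether the pump completes, fails terminally, or never starts.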

Reproduction

Hard to trigger deterministically, but correlates with:

  1. Open a project with many worktrees/workspaces (I had 5+ concurrent sessions).
  2. Leave several agents actively streaming for an extended period.
  3. Optionally, have at least one session whose worktree was removed from disk.
  4. After some minutes, message sends start timing out after 60 s.
  5. Start daemon from the UI fails because the existing daemon is still registered as the singleton and the new detached child exits in the grace period.
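
The restart failure in step 5 could be made diagnosable: before reporting "Daemon failed to start", the desktop could check whether the pid recorded in the singleton lock is still alive. A sketch using signal 0, which probes a process without killing it (isPidAlive is a hypothetical helper, not Paseo's actual singleton code):

```typescript
// Probe whether a pid refers to a live process. Signal 0 performs the
// existence/permission check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // throws ESRCH if the process no longer exists
    return true;
  } catch (err: any) {
    // EPERM means the process exists but we lack permission to signal it.
    return err.code === "EPERM";
  }
}
```

A "Start daemon" flow could then distinguish a stale lock file (pid dead, safe to delete and retry) from a live-but-wedged daemon (pid alive, offer kill-and-restart) instead of failing opaquely in the grace period.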

Why I think this isn't covered by existing issues

I couldn't find any existing issue touching the trace-level log emission volume, daemon RSS growth, or the "restart fails because the zombie holds the singleton" path.

Impact

  • Mid-turn work is lost (sessions terminate with turn_failed or silently hang).
  • The UI offers no recovery path — Start daemon fails, and there's no Force restart action that would clear a stuck singleton first.
  • Users without shell access to kill -9 have no way out short of rebooting.

Suggested fixes (rough priority)

  1. Demote Claude query pump: SDK message / raw SDK message below trace in release builds, or gate behind PASEO_LOG_LEVEL=trace. Almost certainly the highest-leverage fix: it cuts the primary memory-growth driver. Worth revisiting whether both emissions need to exist separately at all; they carry identical payloads that differ only in a tag.
  2. Fix the duplicated module binding in the pino child logger chain ("module":"bootstrap","module":"agent" in the same record). Not causal, but suggests pino.child() is being mis-chained and may be allocating fresh loggers per event.
  3. Have "Start daemon" detect a stale singleton and offer to kill it instead of silently failing when the detached child exits in the grace period. A UI prompt like "A previous daemon (pid X, RSS 1.3 GB, unresponsive) is still running — kill and restart?" would be a huge UX win.
  4. Handle "Working directory does not exist" as a terminal error for the session's stream pump and release any held resources, rather than leaving a stream in flight.
  5. Bound stream / log buffer sizes with a hard cap + explicit warning, so pathological growth crashes early and loudly instead of silently stalling.
  6. Add an IPC heartbeat path that bypasses the main event-loop work (worker thread or direct socket) so the UI can distinguish "busy but alive" from "wedged" and offer appropriate recovery.
  7. Respect SIGTERM. A normal terminate should always tear down the supervisor; if something is blocking, a 5 s timeout then self-exit would be reasonable.
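
For fix 7, the shutdown shape I have in mind looks roughly like this (a sketch; installSigtermHandler is a hypothetical helper, and the 5 s deadline is the value suggested above): race graceful teardown against a hard deadline so a blocked teardown can never make the supervisor unkillable by SIGTERM.

```typescript
// On SIGTERM, attempt graceful teardown, but force-exit after timeoutMs
// even if teardown hangs, so kill -9 is never required.
function installSigtermHandler(
  teardown: () => Promise<void>,
  timeoutMs = 5000,
): void {
  process.once("SIGTERM", () => {
    // Fallback: if teardown has not finished by the deadline, exit non-zero.
    setTimeout(() => process.exit(1), timeoutMs);
    teardown()
      .catch(() => undefined) // a failed teardown still exits cleanly below
      .finally(() => process.exit(0));
  });
}
```

Note the caveat that a fully saturated event loop would also delay this timer; pairing it with fix 6 (heartbeat off the main loop) covers that case.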

Happy to grab additional diagnostics (heap snapshot, perf profile, strace of the wedged process) if that'd help — just let me know what format is most useful.
