Daemon unresponsive after prolonged multi-workspace use; restart fails because zombie holds singleton (v0.1.51, Linux) #227

@manoelneto

Description

Summary

After several hours of use with multiple concurrent workspaces / agent sessions, the Paseo daemon becomes unresponsive: the UI times out with Timeout waiting for message (60000ms), in-flight sessions stop progressing, and clicking Settings → Start daemon fails with Daemon failed to start because the new detached child exits during the grace period — the previous daemon is still running and holding the singleton lock. Recovery requires kill -9 of the supervisor + daemon processes; SIGTERM is ignored.

Reproduces on Linux with v0.1.51. Severity correlates with the number of concurrently active workspaces, which aligns with a log-volume / memory-growth hypothesis described below.

Environment

  • Paseo desktop: v0.1.51 (Electron 41.0.3, Chromium 146.0.7680.80)
  • OS: Linux Mint, kernel 6.14.0-37-generic (x86_64)
  • Active workspaces at time of wedge: several concurrent sessions under ~/.paseo/worktrees/1kw2wg5k/*

Symptoms

  1. Active sessions progressively slow down, then stop responding.

  2. [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms) repeats in ~/.config/Paseo/logs/main.log.

  3. Clicking Settings → Start daemon fails:

    Start daemon failed: Error invoking remote method 'paseo:invoke':
      Error: Daemon failed to start.
    
  4. The attempted spawn immediately exits during the grace period:

    [desktop daemon] starting detached daemon
    [desktop daemon] detached spawn returned
    [desktop daemon] detached child emitted exit during grace period { pid: ..., childPid: ... }
    [desktop daemon] detached startup grace period completed { ..., exitedEarly: true }
    [Settings] Failed to change desktop daemon state Error: ... Daemon failed to start.
    
  5. The original daemon and supervisor processes are still running, holding the singleton. They ignore SIGTERM. Only SIGKILL clears them.

Process state at the time of wedge

PID    %CPU  RSS       CMD
4757   0.0   92 MB     /opt/Paseo/Paseo.bin … node-entrypoint-runner.js … supervisor-entrypoint.js
4770   33.2  1.30 GB   /opt/Paseo/Paseo.bin … node-entrypoint-runner.js … @getpaseo/server/dist/server/server/index.js
  • 1.30 GB RSS on the daemon process is unusually high for a Node server and strongly suggests unbounded buffering or a leak.
  • ~33% of one core sustained while IPC is not being serviced within 60 s — consistent with event-loop saturation from stream handling + logging + GC pauses, not whole-machine CPU starvation. The rest of the laptop was ~90% idle.
  • The symptom worsens as more workspaces are opened, which fits a per-session rate-dependent bug.
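
To support the event-loop-saturation reading over whole-machine CPU starvation, a minimal diagnostic sketch (not Paseo code; `measureEventLoopLag` is a hypothetical helper) that times how late a `setTimeout(0)` callback actually fires; sustained lag in the hundreds of milliseconds while the machine is otherwise idle would fit the hypothesis above:

```typescript
// Measure event-loop lag: schedule a zero-delay timer and see how late it runs.
function measureEventLoopLag(): Promise<number> {
  const start = process.hrtime.bigint();
  return new Promise((resolve) => {
    setTimeout(() => {
      // Convert nanoseconds to milliseconds of observed scheduling delay.
      const lagMs = Number(process.hrtime.bigint() - start) / 1e6;
      resolve(lagMs);
    }, 0);
  });
}

// Example use: warn when lag crosses a threshold (threshold value is illustrative).
async function watchLag(thresholdMs = 200): Promise<void> {
  const lag = await measureEventLoopLag();
  if (lag > thresholdMs) {
    console.warn(`event loop lag ${lag.toFixed(1)} ms`);
  }
}
```

Running something like this inside the wedged daemon would distinguish "busy but alive" from a truly blocked loop.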

Daemon log evidence — runaway trace-level stream logging

main.log is dominated by pino trace ("level":10) entries from a single active Claude session. Every SDK stream event is logged twice:

{"level":10,"time":1775744457431,"pid":4770,"module":"bootstrap","module":"agent","provider":"claude",
 "claudeSessionId":"ac2ffc96-d560-4e2e-a06b-6ebb6a4a0689","messageType":"stream_event",
 "turnId":"foreground-turn-5","msg":"Claude query pump: SDK message"}

{"level":10,"time":1775744457431,"pid":4770,"module":"bootstrap","module":"agent","provider":"claude",
 "claudeSessionId":"ac2ffc96-d560-4e2e-a06b-6ebb6a4a0689","messageType":"stream_event",
 "messageUuid":"a01e9640-ef24-4b69-874d-f81fade2b9b8","msg":"Claude query pump: raw SDK message"}

Observations:

  • Both entries fire for every single streaming token: one tagged turnId, one tagged messageUuid, but otherwise an identical payload.
  • The module binding is duplicated in the same record ("module":"bootstrap","module":"agent"), suggesting the pino.child() chain is stacking rather than merging.
  • These are level-10 (trace) emissions in a shipped release build. There's no user-facing log-level control I can find.
  • With N concurrent active sessions, log volume grows roughly N×, which matches the "gets worse the more workspaces I have" observation.

I strongly suspect this is the primary memory-growth driver behind the GC pressure that eventually stalls the event loop. Demoting these two emissions to debug (or gating them behind PASEO_LOG_LEVEL=trace) would likely eliminate the wedge on its own.
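
For concreteness, a sketch of the proposed gate (names like shouldEmit are hypothetical, not Paseo's actual API; the numeric levels are pino's standard ones, where trace=10 and info=30). The idea is that trace-level stream events only pass when the user opts in via PASEO_LOG_LEVEL=trace:

```typescript
// pino-style numeric severity levels (higher = more severe).
const LEVELS: Record<string, number> = {
  trace: 10, debug: 20, info: 30, warn: 40, error: 50, fatal: 60,
};

// Emit only when the entry's severity meets or exceeds the configured floor.
function shouldEmit(entryLevel: string, configuredLevel: string): boolean {
  return LEVELS[entryLevel] >= LEVELS[configuredLevel];
}

// Default to "info" so release builds never emit per-token trace records.
const configured = process.env.PASEO_LOG_LEVEL ?? "info";
shouldEmit("trace", configured); // false unless PASEO_LOG_LEVEL=trace is set
shouldEmit("error", configured); // true at the default level
```

With this floor in place, the two per-token emissions become opt-in diagnostics instead of an always-on memory-growth driver.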

Timeline from ~/.config/Paseo/logs/main.log

All timestamps 2026-04-09, trimmed for length:

11:02:47  [error] [AgentInput] Failed to send message: Error:
          Working directory does not exist: /home/<user>/.paseo/worktrees/1kw2wg5k/radical-deer
11:04:04  [error] [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms)
11:04:26  [info]  [desktop daemon] starting detached daemon
11:04:26  [info]  [desktop daemon] detached child emitted exit during grace period { pid: 812098, childPid: 812990 }
11:04:26  [info]  [desktop daemon] detached startup grace period completed { …, exitedEarly: true }
11:04:34  [info]  [desktop daemon] starting detached daemon
11:04:34  [info]  [desktop daemon] detached child emitted exit during grace period { pid: 813521, childPid: 814426 }
11:04:34  [info]  [desktop daemon] detached startup grace period completed { …, exitedEarly: true }
                                              ↑ both failed because the prior daemon was still holding the singleton
11:05:43  [info]  [desktop daemon] starting detached daemon
11:05:44  [info]  [desktop daemon] detached startup grace period completed { pid: 4026, childPid: 4757, exitedEarly: false }
                                              ↑ succeeded only after I killed the zombie from a terminal
11:22:42  [error] [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms)
11:25:19  [error] [AgentInput] Failed to send message: Error: Timeout waiting for message (60000ms)
11:32:44  [info]  [desktop daemon] starting detached daemon
11:32:44  [info]  [desktop daemon] detached child emitted exit during grace period { pid: 4026, childPid: 222309 }
11:32:44  [info]  [desktop daemon] detached startup grace period completed { …, exitedEarly: true }
11:32:44  [error] [Settings] Failed to change desktop daemon state Error: … Daemon failed to start.

Note the sequence at 11:02 → 11:04: a session pointed at a worktree (radical-deer) that no longer existed on disk, immediately followed by 60 s IPC timeouts. This may be a contributing trigger — the daemon possibly got stuck trying to drive a session against an invalid cwd and never released the stream, amplifying the log-spam spiral.
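
If this reading is right, the fix shape would be to validate the session's cwd before driving the pump and to treat a missing worktree as a terminal error that releases held resources. A hedged sketch, assuming hypothetical helpers (TerminalSessionError, runPump are mine, not Paseo's code):

```typescript
import { existsSync } from "node:fs";

// Hypothetical error class: marks failures that should end the session's
// stream pump outright rather than be retried.
class TerminalSessionError extends Error {}

function assertWorkingDirectory(cwd: string): void {
  if (!existsSync(cwd)) {
    throw new TerminalSessionError(`Working directory does not exist: ${cwd}`);
  }
}

// Drive the pump, but guarantee resource release on any exit path, terminal
// errors included, so a dead worktree can't leave a stream in flight.
async function runPump(
  cwd: string,
  pump: () => Promise<void>,
  release: () => void,
): Promise<void> {
  try {
    assertWorkingDirectory(cwd);
    await pump();
  } finally {
    release();
  }
}
```

The essential property is the finally: the session's stream and buffers are freed whether the pump completes, fails terminally, or never starts.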

Reproduction

Hard to trigger deterministically, but correlates with:

  1. Open a project with many worktrees/workspaces (I had 5+ concurrent sessions).
  2. Leave several agents actively streaming for an extended period.
  3. Optionally, have at least one session whose worktree was removed from disk.
  4. After some minutes, message sends start timing out after 60 s.
  5. Start daemon from the UI fails because the existing daemon is still registered as the singleton and the new detached child exits in the grace period.
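
The restart failure in step 5 could be made diagnosable: before reporting "Daemon failed to start", the desktop could check whether the pid recorded in the singleton lock is still alive. A sketch using signal 0, which probes a process without killing it (isPidAlive is a hypothetical helper, not Paseo's actual singleton code):

```typescript
// Probe whether a pid refers to a live process. Signal 0 performs the
// existence/permission check without delivering a signal.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // throws ESRCH if the process no longer exists
    return true;
  } catch (err: any) {
    // EPERM means the process exists but we lack permission to signal it.
    return err.code === "EPERM";
  }
}
```

A "Start daemon" flow could then distinguish a stale lock file (pid dead, safe to delete and retry) from a live-but-wedged daemon (pid alive, offer kill-and-restart) instead of failing opaquely in the grace period.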

Why I think this isn't covered by existing issues

I couldn't find any existing issue touching the trace-level log emission volume, daemon RSS growth, or the "restart fails because the zombie holds the singleton" path.

Impact

  • Mid-turn work is lost (sessions terminate with turn_failed or silently hang).
  • The UI offers no recovery path — Start daemon fails, and there's no Force restart action that would clear a stuck singleton first.
  • Users without shell access to kill -9 have no way out short of rebooting.

Suggested fixes (rough priority)

  1. Demote Claude query pump: SDK message / raw SDK message below trace in release builds, or gate behind PASEO_LOG_LEVEL=trace. Almost certainly the highest-leverage fix: it cuts the primary memory-growth driver. Worth revisiting whether both emissions need to exist separately at all; they carry identical payloads that differ only in a tag.
  2. Fix the duplicated module binding in the pino child logger chain ("module":"bootstrap","module":"agent" in the same record). Not causal, but suggests pino.child() is being mis-chained and may be allocating fresh loggers per event.
  3. Have "Start daemon" detect a stale singleton and offer to kill it instead of silently failing when the detached child exits in the grace period. A UI prompt like "A previous daemon (pid X, RSS 1.3 GB, unresponsive) is still running — kill and restart?" would be a huge UX win.
  4. Handle "Working directory does not exist" as a terminal error for the session's stream pump and release any held resources, rather than leaving a stream in flight.
  5. Bound stream / log buffer sizes with a hard cap + explicit warning, so pathological growth crashes early and loudly instead of silently stalling.
  6. Add an IPC heartbeat path that bypasses the main event-loop work (worker thread or direct socket) so the UI can distinguish "busy but alive" from "wedged" and offer appropriate recovery.
  7. Respect SIGTERM. A normal terminate should always tear down the supervisor; if something is blocking, a 5 s timeout then self-exit would be reasonable.
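
For fix 7, the shutdown shape I have in mind looks roughly like this (a sketch; installSigtermHandler is a hypothetical helper, and the 5 s deadline is the value suggested above): race graceful teardown against a hard deadline so a blocked teardown can never make the supervisor unkillable by SIGTERM.

```typescript
// On SIGTERM, attempt graceful teardown, but force-exit after timeoutMs
// even if teardown hangs, so kill -9 is never required.
function installSigtermHandler(
  teardown: () => Promise<void>,
  timeoutMs = 5000,
): void {
  process.once("SIGTERM", () => {
    // Fallback: if teardown has not finished by the deadline, exit non-zero.
    setTimeout(() => process.exit(1), timeoutMs);
    teardown()
      .catch(() => undefined) // a failed teardown still exits cleanly below
      .finally(() => process.exit(0));
  });
}
```

Note the caveat that a fully saturated event loop would also delay this timer; pairing it with fix 6 (heartbeat off the main loop) covers that case.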

Happy to grab additional diagnostics (heap snapshot, perf profile, strace of the wedged process) if that'd help — just let me know what format is most useful.
