
servers: release write lock during saveServers to prevent reader starvation #416

Merged: myleshorton merged 3 commits into main from fix/servers-save-lock-contention, Apr 15, 2026
Conversation

@myleshorton
Contributor

myleshorton commented Apr 15, 2026

Summary

servers.(*Manager).SetServers, AddServers, and RemoveServer were holding the write lock on m.access across saveServers(), which does JSON marshalling of all outbounds (CPU-heavy via sing-box reflection) and an atomic file write to disk. This starved concurrent readers of ServersJSON().

Impact

Surfaced in Freshdesk #172640: a config fetch held the write lock for 1+ minute while marshalling 36 outbounds, blocking a cgo-callback goroutine in GetAvailableServers long enough for a GC-timed write-barrier race to crash the app.

Related issues:

  • getlantern/engineering#3175 — the GC write-barrier crash
  • getlantern/engineering#3176 — this lock-contention issue

Changes

Lock handling:

  • Add saveMu sync.Mutex to Manager to serialize concurrent disk writes
  • saveServers now acquires saveMu for the full marshal+write (prevents stale-write reordering), with only a brief RLock around marshalling — readers aren't blocked by either step
  • SetServers, AddServers, RemoveServer release the write lock after their in-memory mutation, before calling saveServers
  • AddServers / RemoveServer scope their locked mutation in a closure with defer Unlock so future early-return edits can't skip the unlock

Telemetry (new):

  • saveServers logs per-phase timing (saveMu wait, RLock+marshal, disk write) at trace level on every call; WARN with the breakdown if the total is ≥ 2s, and WARN with a full goroutine stack dump if the total is ≥ 15s.
  • ServersJSON / GetServerByTagJSON log WARN with goroutine stack dump if the RLock wait exceeds 1s — direct evidence of reader starvation, with a snapshot of what was holding things up.

We still don't have a definitive explanation for the 1-minute hold in #172640. The instrumentation lets us identify which phase was slow next time it happens.

Test Plan

  • go build ./... passes
  • go vet ./servers/ clean
  • Existing go test ./servers/ passes
  • New TestSaveServersConcurrent passes 5× with -race

🤖 Generated with Claude Code

…vation

SetServers, AddServers, and RemoveServer were holding the write lock
across saveServers, which does JSON marshalling of all outbounds (CPU-
heavy via sing-box reflection) and an atomic file write. This starved
readers like ServersJSON for extended periods.

This surfaced in Freshdesk #172640: a config fetch held the write lock
for 1+ minute while marshalling 36 outbounds, blocking a cgo-callback
goroutine in GetAvailableServers long enough for a GC-timed write-
barrier race to crash the app (see getlantern/engineering#3175 for the
crash; getlantern/engineering#3176 for this lock issue).

Changes:
- Add saveMu to serialize concurrent disk writes
- saveServers now acquires RLock for marshalling (not write lock) and
  saveMu for the disk write, so readers aren't blocked by either step
- SetServers/AddServers/RemoveServer release the write lock after their
  in-memory mutation, before calling saveServers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 15, 2026 13:34
Contributor

Copilot AI left a comment


Pull request overview

Reduces lock contention in servers.Manager by avoiding holding the write lock while persisting server configs to disk, addressing reader starvation in ServersJSON() during expensive JSON marshalling and file I/O.

Changes:

  • Add a dedicated saveMu mutex to serialize persistence work.
  • Release m.access write lock immediately after in-memory mutations in SetServers, AddServers, and RemoveServer.
  • Update saveServers() to marshal under RLock and perform file writes under saveMu.


Five review comment threads on servers/manager.go (four now outdated).
…test

- saveServers now holds saveMu across the full marshal+write sequence so
  two concurrent saves can't reorder and leave stale bytes on disk
- AddServers and RemoveServer scope their in-memory mutation in a closure
  with defer Unlock — robust against future early-return edits
- Drop m.servers from the trace log in saveServers (avoid eager
  formatting under RLock); log file path and size instead
- Add TestSaveServersConcurrent covering the lost-update regression

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@myleshorton myleshorton requested a review from garmr-ulfr April 15, 2026 14:11
We still don't know why saveServers held the lock for 1+ minute in the
crash that motivated this PR (Freshdesk #172640). Next time it happens
we want to know which phase was slow — marshal, disk write, lock wait —
and what else was running at the time.

Adds:
- saveServers: per-phase timings (saveMu wait, RLock+marshal, disk write),
  logged at trace always. WARN with breakdown if total >= 2s. WARN with
  full goroutine stack dump if total >= 15s.
- ServersJSON / GetServerByTagJSON: log WARN + goroutine stack dump if
  the RLock wait exceeds 1s — direct evidence of reader starvation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@myleshorton
Contributor Author

I think I'm just going to pull this in for this new build, as @Derekf5 experienced this just yesterday. Going to see if I can add this to the refactor branch as well (if it's even necessary @garmr-ulfr?).

@myleshorton myleshorton merged commit 4241e6c into main Apr 15, 2026
2 checks passed
@myleshorton myleshorton deleted the fix/servers-save-lock-contention branch April 15, 2026 16:19
@garmr-ulfr
Collaborator

@myleshorton sorry, didn't see this! Taking a look now to see if it's still relevant.

garmr-ulfr pushed a commit that referenced this pull request Apr 15, 2026
… starvation

Port of PR #416 to the refactor branch. Splits the write lock so
mutators (SetServers, AddServers, RemoveServers) only hold it for the
in-memory mutation, then release before saveServers. saveServers
acquires a brief RLock for marshalling and a separate saveMu for
serializing disk writes, so readers are never blocked by slow fsync.

Includes per-phase timing instrumentation and reader-starvation
detection to help root-cause any future slow cases.

See getlantern/engineering#3176 and Freshdesk #172640.

Co-Authored-By: garmr <garmr@users.noreply.github.com>