
servers: release write lock during saveServers to prevent reader starvation #416

Merged: myleshorton merged 3 commits into main from fix/servers-save-lock-contention, Apr 15, 2026
Conversation

@myleshorton
Contributor

myleshorton commented Apr 15, 2026

Summary

servers.(*Manager).SetServers, AddServers, and RemoveServer were holding the write lock on m.access across saveServers(), which does JSON marshalling of all outbounds (CPU-heavy via sing-box reflection) and an atomic file write to disk. This starved concurrent readers of ServersJSON().

Impact

Surfaced in Freshdesk #172640: a config fetch held the write lock for 1+ minute while marshalling 36 outbounds, blocking a cgo-callback goroutine in GetAvailableServers long enough for a GC-timed write-barrier race to crash the app.

Related issues:

  • getlantern/engineering#3175 — the GC write-barrier crash
  • getlantern/engineering#3176 — this lock-contention issue

Changes

Lock handling:

  • Add saveMu sync.Mutex to Manager to serialize concurrent disk writes
  • saveServers now acquires saveMu for the full marshal+write (prevents stale-write reordering), with only a brief RLock around marshalling — readers aren't blocked by either step
  • SetServers, AddServers, RemoveServer release the write lock after their in-memory mutation, before calling saveServers
  • AddServers / RemoveServer scope their locked mutation in a closure with defer Unlock so future early-return edits can't skip the unlock

Telemetry (new):

  • saveServers logs per-phase timing (saveMu wait, RLock+marshal, disk write) at trace level on every call; WARN with the breakdown if the total is ≥ 2s, and WARN with a full goroutine stack dump if the total is ≥ 15s.
  • ServersJSON / GetServerByTagJSON log WARN with goroutine stack dump if the RLock wait exceeds 1s — direct evidence of reader starvation, with a snapshot of what was holding things up.

We still don't have a definitive explanation for the 1-minute hold in #172640. The instrumentation lets us identify which phase was slow next time it happens.

Test Plan

  • go build ./... passes
  • go vet ./servers/ clean
  • Existing go test ./servers/ passes
  • New TestSaveServersConcurrent passes 5× with -race

🤖 Generated with Claude Code

…vation

SetServers, AddServers, and RemoveServer were holding the write lock
across saveServers, which does JSON marshalling of all outbounds (CPU-
heavy via sing-box reflection) and an atomic file write. This starved
readers like ServersJSON for extended periods.

This surfaced in Freshdesk #172640: a config fetch held the write lock
for 1+ minute while marshalling 36 outbounds, blocking a cgo-callback
goroutine in GetAvailableServers long enough for a GC-timed write-
barrier race to crash the app (see getlantern/engineering#3175 for the
crash; getlantern/engineering#3176 for this lock issue).

Changes:
- Add saveMu to serialize concurrent disk writes
- saveServers now acquires RLock for marshalling (not write lock) and
  saveMu for the disk write, so readers aren't blocked by either step
- SetServers/AddServers/RemoveServer release the write lock after their
  in-memory mutation, before calling saveServers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 15, 2026 13:34
Contributor

Copilot AI left a comment


Pull request overview

Reduces lock contention in servers.Manager by avoiding holding the write lock while persisting server configs to disk, addressing reader starvation in ServersJSON() during expensive JSON marshalling and file I/O.

Changes:

  • Add a dedicated saveMu mutex to serialize persistence work.
  • Release m.access write lock immediately after in-memory mutations in SetServers, AddServers, and RemoveServer.
  • Update saveServers() to marshal under RLock and perform file writes under saveMu.


Five review comment threads on servers/manager.go (four now outdated).
…test

- saveServers now holds saveMu across the full marshal+write sequence so
  two concurrent saves can't reorder and leave stale bytes on disk
- AddServers and RemoveServer scope their in-memory mutation in a closure
  with defer Unlock — robust against future early-return edits
- Drop m.servers from the trace log in saveServers (avoid eager
  formatting under RLock); log file path and size instead
- Add TestSaveServersConcurrent covering the lost-update regression

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@myleshorton myleshorton requested a review from garmr-ulfr April 15, 2026 14:11
We still don't know why saveServers held the lock for 1+ minute in the
crash that motivated this PR (Freshdesk #172640). Next time it happens
we want to know which phase was slow — marshal, disk write, lock wait —
and what else was running at the time.

Adds:
- saveServers: per-phase timings (saveMu wait, RLock+marshal, disk write),
  logged at trace always. WARN with breakdown if total >= 2s. WARN with
  full goroutine stack dump if total >= 15s.
- ServersJSON / GetServerByTagJSON: log WARN + goroutine stack dump if
  the RLock wait exceeds 1s — direct evidence of reader starvation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@myleshorton
Contributor Author

I think I'm just going to pull this in for this new build, as @Derekf5 experienced this just yesterday. Going to see if I can add this to the refactor branch as well (if it's even necessary @garmr-ulfr?).

@myleshorton myleshorton merged commit 4241e6c into main Apr 15, 2026
2 checks passed
@myleshorton myleshorton deleted the fix/servers-save-lock-contention branch April 15, 2026 16:19
@garmr-ulfr
Collaborator

@myleshorton sorry, didn't see this! Taking a look now to see if it's still relevant.

garmr-ulfr pushed a commit that referenced this pull request Apr 15, 2026
… starvation

Port of PR #416 to the refactor branch. Splits the write lock so
mutators (SetServers, AddServers, RemoveServers) only hold it for the
in-memory mutation, then release before saveServers. saveServers
acquires a brief RLock for marshalling and a separate saveMu for
serializing disk writes, so readers are never blocked by slow fsync.

Includes per-phase timing instrumentation and reader-starvation
detection to help root-cause any future slow cases.

See getlantern/engineering#3176 and Freshdesk #172640.

Co-Authored-By: garmr <garmr@users.noreply.github.com>