servers: release write lock during saveServers to prevent reader starvation #416

Merged — myleshorton merged 3 commits into main on Apr 15, 2026
Conversation
…vation

SetServers, AddServers, and RemoveServer were holding the write lock across saveServers, which does JSON marshalling of all outbounds (CPU-heavy via sing-box reflection) and an atomic file write. This starved readers like ServersJSON for extended periods.

This surfaced in Freshdesk #172640: a config fetch held the write lock for 1+ minute while marshalling 36 outbounds, blocking a cgo-callback goroutine in GetAvailableServers long enough for a GC-timed write-barrier race to crash the app (see getlantern/engineering#3175 for the crash; getlantern/engineering#3176 for this lock issue).

Changes:
- Add saveMu to serialize concurrent disk writes
- saveServers now acquires RLock for marshalling (not the write lock) and saveMu for the disk write, so readers aren't blocked by either step
- SetServers/AddServers/RemoveServer release the write lock after their in-memory mutation, before calling saveServers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Pull request overview
Reduces lock contention in servers.Manager by avoiding holding the write lock while persisting server configs to disk, addressing reader starvation in ServersJSON() during expensive JSON marshalling and file I/O.
Changes:
- Add a dedicated saveMu mutex to serialize persistence work.
- Release the m.access write lock immediately after in-memory mutations in SetServers, AddServers, and RemoveServer.
- Update saveServers() to marshal under RLock and perform file writes under saveMu.
…test

- saveServers now holds saveMu across the full marshal+write sequence so two concurrent saves can't reorder and leave stale bytes on disk
- AddServers and RemoveServer scope their in-memory mutation in a closure with defer Unlock — robust against future early-return edits
- Drop m.servers from the trace log in saveServers (avoid eager formatting under RLock); log file path and size instead
- Add TestSaveServersConcurrent covering the lost-update regression

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
We still don't know why saveServers held the lock for 1+ minute in the crash that motivated this PR (Freshdesk #172640). Next time it happens we want to know which phase was slow — marshal, disk write, lock wait — and what else was running at the time.

Adds:
- saveServers: per-phase timings (saveMu wait, RLock+marshal, disk write), logged at trace always. WARN with breakdown if total >= 2s. WARN with full goroutine stack dump if total >= 15s.
- ServersJSON / GetServerByTagJSON: log WARN + goroutine stack dump if the RLock wait exceeds 1s — direct evidence of reader starvation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Author
I think I'm just going to pull this in for this new build, as @Derekf5 experienced this just yesterday. Going to see if I can add this to the refactor branch as well (if it's even necessary @garmr-ulfr?).
Collaborator
@myleshorton sorry, didn't see this! Taking a look now to see if it's still relevant.
garmr-ulfr pushed a commit that referenced this pull request on Apr 15, 2026
… starvation

Port of PR #416 to the refactor branch. Splits the write lock so mutators (SetServers, AddServers, RemoveServers) only hold it for the in-memory mutation, then release before saveServers. saveServers acquires a brief RLock for marshalling and a separate saveMu for serializing disk writes, so readers are never blocked by slow fsync. Includes per-phase timing instrumentation and reader-starvation detection to help root-cause any future slow cases.

See getlantern/engineering#3176 and Freshdesk #172640.

Co-Authored-By: garmr <garmr@users.noreply.github.com>
Summary
servers.(*Manager).SetServers, AddServers, and RemoveServer were holding the write lock on m.access across saveServers(), which does JSON marshalling of all outbounds (CPU-heavy via sing-box reflection) and an atomic file write to disk. This starved concurrent readers of ServersJSON().

Impact
Surfaced in Freshdesk #172640: a config fetch held the write lock for 1+ minute while marshalling 36 outbounds, blocking a cgo-callback goroutine in GetAvailableServers long enough for a GC-timed write-barrier race to crash the app.

Related issues: getlantern/engineering#3175 (the crash), getlantern/engineering#3176 (this lock issue).
Changes
Lock handling:
- Add saveMu sync.Mutex to Manager to serialize concurrent disk writes
- saveServers now acquires saveMu for the full marshal+write (prevents stale-write reordering), with only a brief RLock around marshalling — readers aren't blocked by either step
- SetServers, AddServers, RemoveServer release the write lock after their in-memory mutation, before calling saveServers
- AddServers/RemoveServer scope their locked mutation in a closure with defer Unlock so future early-return edits can't skip the unlock

Telemetry (new):
- saveServers logs per-phase timing at trace always (saveMu wait, RLock+marshal, disk write). WARN with breakdown if total ≥ 2s. WARN with full goroutine stack dump if total ≥ 15s.
- ServersJSON / GetServerByTagJSON log WARN with goroutine stack dump if the RLock wait exceeds 1s — direct evidence of reader starvation, with a snapshot of what was holding things up.

We still don't have a definitive explanation for the 1-minute hold in #172640. The instrumentation lets us identify which phase was slow next time it happens.
Test Plan
- go build ./... passes
- go vet ./servers/ clean
- go test ./servers/ passes
- TestSaveServersConcurrent passes 5× with -race

🤖 Generated with Claude Code