fix(BucketWatcherManager): release mu before watcher.Start to avoid self-deadlock #3
Conversation
…elf-deadlock

GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call to watcher.Start(), which writes to the OPA inmem store via Commit (acquiring the store's RWMutex.Lock).

When watchBucketBuiltin spawns this function in a goroutine inside a Rego builtin invocation, the parent goroutine still holds the inmem store's RLock for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits HasWatcher(), which needs gwm.mu.RLock() — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. A classic two-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's /health handler) are blocked to prevent writer starvation; the pod stops responding, and within ~180s the kubelet kills it (exit 137).

Fix: hold gwm.mu only for the cache lookup and the final cache insert. Run NewBucketWatcher and watcher.Start outside the lock. ContainsOrAdd already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

Verified against a captured goroutine dump from prod permit-opa where 39+ readers were stuck on inmem.store.NewTransaction's RLock, one writer (LoadBucketDataBulk → Commit) had been waiting on the inmem store's Lock for >2 minutes, and the parent goroutine that spawned the writer was itself blocked on gwm.mu.RLock inside HasWatcher.
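A minimal sketch of the reduced lock scope, with a plain map standing in for the LRU cache and stubbed types — illustrative only, not the actual diff:

```go
package watcher

import "sync"

// Watcher is a stub; the real BucketWatcher's Start commits bucket data
// to the OPA inmem store (which is what needs the store's write lock).
type Watcher struct{ bucket string }

func (w *Watcher) Start() error { return nil }
func (w *Watcher) Stop()        {}

type Manager struct {
	mu       sync.RWMutex
	watchers map[string]*Watcher // plain map standing in for the LRU
}

func (m *Manager) GetOrCreateWatcher(bucket string) (*Watcher, error) {
	// Fast path: cache lookup under the read lock only.
	m.mu.RLock()
	w, ok := m.watchers[bucket]
	m.mu.RUnlock()
	if ok {
		return w, nil
	}

	// Create and Start with NO manager lock held, so Start's Commit can
	// wait on the inmem store's write lock without blocking HasWatcher.
	w = &Watcher{bucket: bucket}
	if err := w.Start(); err != nil {
		return nil, err
	}

	// Re-take the lock only for the insert. If a concurrent caller won
	// the race, stop the loser and return the canonical watcher.
	m.mu.Lock()
	if existing, ok := m.watchers[bucket]; ok {
		m.mu.Unlock()
		w.Stop()
		return existing, nil
	}
	m.watchers[bucket] = w
	m.mu.Unlock()
	return w, nil
}
```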
Pull request overview
This PR updates the NATS bucket watcher manager to reduce the scope of gwm.mu inside GetOrCreateWatcher, aiming to avoid a production self-deadlock between watcher startup and OPA in-memory store transactions.
Changes:
- Moves the initial watcher cache lookup to a read-locked fast path.
- Creates and starts new bucket watchers outside gwm.mu.
- Reconciles concurrent first-time creation with ContainsOrAdd, stopping the loser and returning the cached watcher.
… before docker compose up

CI repeatedly failed with:

Bind for 0.0.0.0:4222 failed: port is already allocated

The integration test starts an example docker-compose stack that publishes host ports 4222, 8222, 8181, and 31311. When a previous run on the same CI host left containers behind (or another concurrent job is using one of those ports), `docker compose up` fails before any test code runs.

Fix: before bringing the stack up, run `docker compose down -v --remove-orphans` and forcibly `docker rm -f` any container still publishing one of our ports. This makes the test self-cleaning and resilient to leaked containers from prior runs.
…ntainers before docker compose up" This reverts commit 0c5f270.
…opping flag

Address review feedback on the previous deadlock fix:

1. Without serialization, two callers could both create+Start a watcher for the same bucket, and stopping the loser would call cleanOPAStore, which RemoveOps everything under /nats/kv/<bucket> — wiping the winner's just-loaded data. Add a dedicated createMu mutex that serializes the create-and-start path so only one watcher is ever created per bucket; the second caller sees the cache via a double-check and returns the canonical watcher.

2. Without a stopping flag, BucketWatcherManager.Stop could complete while a slow GetOrCreateWatcher (blocked on the OPA inmem store's write lock) was still in progress; the in-flight create would then insert a new watcher into the now-purged cache and leave a background watch loop running after shutdown. Add a stopping flag protected by mu, set under createMu by Stop. GetOrCreateWatcher checks it both before Start and again before insert, abandoning a half-built watcher if Stop won the race.

3. createMu is distinct from gwm.mu, so HasWatcher and the fast path are NOT blocked while we are creating. The original self-deadlock fix is preserved: Start runs without gwm.mu held, the parent Rego query can finish and release the inmem store's RLock, and our Commit then proceeds. (The pattern is sketched after the test list below.)

Tests:
- TestGetOrCreateWatcher_FastPath
- TestGetOrCreateWatcher_RefusedAfterStop
- TestGetOrCreateWatcher_FastPathAfterStopReturnsCached
- TestStop_IsIdempotent
- TestGetOrCreateWatcher_ConcurrentFastPath
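A sketch of the createMu/stopping arrangement, extending the earlier sketch's Manager (stub types, not the real implementation; `errors` import assumed). It already reflects the follow-up cleanup in this PR that drops the re-check after Start as dead code — createMu is held for the whole slow path and Stop takes createMu before setting stopping, so one check suffices:

```go
// Slow path and Stop, assuming the earlier Manager gains two fields:
//
//	createMu sync.Mutex // serializes the create-and-start slow path
//	stopping bool       // guarded by mu; set by Stop under createMu
func (m *Manager) GetOrCreateWatcher(bucket string) (*Watcher, error) {
	// Fast path under mu.RLock, exactly as before (elided).

	m.createMu.Lock() // at most one create ever runs at a time
	defer m.createMu.Unlock()

	// Double-check: the previous createMu holder may have created this
	// bucket's watcher, or Stop may have won the race.
	m.mu.RLock()
	w, ok := m.watchers[bucket]
	stopping := m.stopping
	m.mu.RUnlock()
	if ok {
		return w, nil
	}
	if stopping {
		return nil, errors.New("bucket watcher manager is stopping")
	}

	w = &Watcher{bucket: bucket}
	if err := w.Start(); err != nil { // still no gwm.mu held across Start
		return nil, err
	}

	m.mu.Lock()
	m.watchers[bucket] = w
	m.mu.Unlock()
	return w, nil
}

func (m *Manager) Stop() {
	m.createMu.Lock() // drain any in-flight create before tearing down
	m.mu.Lock()
	m.stopping = true
	watchers := m.watchers
	m.watchers = map[string]*Watcher{}
	m.mu.Unlock()
	m.createMu.Unlock()

	for _, w := range watchers {
		w.Stop() // no manager lock held
	}
}
```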
Thanks @copilot-pull-request-reviewer — all three concerns are valid and are now addressed in commit 02f5d4e (force-pushed on top of the original 32cf15e).
…rt bindings

TestIntegration's example docker-compose published host ports for NATS (4222, 8222) and the NATS UI (31311). The integration test only talks to OPA at 8181, so those bindings are not needed for the test to function — and they collide with the GHA workflow's `services: nats:` (which already binds 4222), making CI fail every time with:

Bind for 0.0.0.0:4222 failed: port is already allocated

Add a test-only docker-compose override at cmd/opa-nats/docker-compose.test.override.yaml that resets the host-port mappings on `nats` and `nats-ui` to []. The example file itself is not changed, so end users still get the published ports outside of CI. The test now invokes `docker compose -f docker-compose.yaml -f <override>` and also runs an explicit `down -v --remove-orphans` before `up` to clear any leftover containers from previous runs.
The last CI run got past the port collision, but OPA never became ready and all we saw was "context deadline exceeded", with no indication why. Add a helper that dumps `docker compose logs --tail 200` and `docker compose ps -a` when the test fails or when waitForOPA times out, so the next failure surfaces the actual reason in the GHA log.
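A possible shape for that helper — the name matches the commit message, but the exact arguments and call sites are illustrative:

```go
import (
	"os/exec"
	"strings"
	"testing"
)

// dumpComposeLogs prints the compose stack's recent logs and container
// states into the test log. Diagnostics are best-effort: errors are
// logged, never fatal, so they can run from t.Cleanup on failure.
func dumpComposeLogs(t *testing.T, composeDir string) {
	t.Helper()
	for _, args := range [][]string{
		{"compose", "logs", "--tail", "200"},
		{"compose", "ps", "-a"},
	} {
		cmd := exec.Command("docker", args...)
		cmd.Dir = composeDir
		out, err := cmd.CombinedOutput()
		t.Logf("docker %s:\n%s(err: %v)", strings.Join(args, " "), out, err)
	}
}
```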
The docker-compose example pointed OPA at nats://localhost:4222, but inside the OPA container `localhost` is the OPA container itself — not the NATS container. OPA therefore failed at startup with:

Failed to connect to NATS: ... nats: no servers available for connection

and exited before opening the HTTP server, which is why TestIntegration in CI got "context deadline exceeded" while waiting for /health. Use the docker-compose service name `nats` so OPA can reach NATS via the docker network. A comment notes the standalone (host) value.
…n failure

The previous CI run made it past waitForOPA but then hung in evaluatePolicy's http.Post for 10 minutes, until the Go test framework's default timeout killed the process — before t.Cleanup could run dumpComposeLogs.

- Give evaluatePolicy a 30s per-request HTTP timeout, so a hung OPA surfaces a clear test failure within 30s and t.Cleanup runs.
- Enable OPA's --pprof in the test compose override.
- On failure / OPA-not-ready, dump OPA's full goroutine stack from /debug/pprof/goroutine?debug=2 BEFORE the compose logs. This is the highest-signal diagnostic for the deadlock class of hangs this PR sets out to fix.
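Sketched below: the bounded client and the goroutine dump. The pprof endpoint follows from enabling OPA's --pprof flag as described above; the helper name and the hardcoded URL are illustrative:

```go
import (
	"io"
	"net/http"
	"testing"
	"time"
)

// opaClient bounds every request (including evaluatePolicy's POST), so a
// hung OPA fails the test within 30s instead of hitting the suite timeout.
var opaClient = &http.Client{Timeout: 30 * time.Second}

// dumpGoroutines fetches OPA's full goroutine stacks before the compose
// logs are printed; best-effort, never fatal.
func dumpGoroutines(t *testing.T) {
	t.Helper()
	resp, err := opaClient.Get("http://localhost:8181/debug/pprof/goroutine?debug=2")
	if err != nil {
		t.Logf("goroutine dump unavailable: %v", err)
		return
	}
	defer resp.Body.Close()
	stacks, _ := io.ReadAll(resp.Body)
	t.Logf("OPA goroutine dump:\n%s", stacks)
}
```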
…erting on call N

The previous assertions assumed the second call would already see bucket_watched=true. nats.kv.watch_bucket spawns the watcher registration in a goroutine that completes off the request-handling path — there is no guarantee a follow-up call sees the watcher in the cache yet. Before the deadlock fix this happened to work only because the parent's HasWatcher blocked behind the spawned writer (the very deadlock this PR fixes); after the fix it is a race the test should not be asserting against.

Use require.Eventually to poll for bucket_watched=true within 30s. If the watcher registration never completes within the deadline, the test still fails loudly, but normal async timing is no longer treated as a bug.
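The polling assertion might look like this — evaluatePolicy and input are this test's own helpers, and the decoded result shape is an assumption:

```go
// Poll until the async watcher registration becomes visible; fail loudly
// only if it never does within the deadline.
require.Eventually(t, func() bool {
	result := evaluatePolicy(t, input) // hypothetical helper returning the decoded result map
	watched, _ := result["bucket_watched"].(bool)
	return watched
}, 30*time.Second, 250*time.Millisecond,
	"watcher registration never became visible")
```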
The example rego intentionally returns different shapes for `x` in the
two states:
bucket_watched=false → x = nats.kv.get_data(bucket, "members")
(just the members map)
bucket_watched=true → x = data.nats.kv[bucket]
(the entire bucket: members + metadata + permissions)
The original assertion `x1 == x2` assumed otherwise and was always
wrong; before the deadlock fix the second call deadlocked instead of
returning, so we never got to compare. Now that calls return correctly
in both states, compare the members submap explicitly:
x_unwatched vs x_watched["members"]
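Sketched with illustrative variable names (the decoded-result shapes follow the rego above):

```go
// Unwrap the watched result to its "members" submap before comparing;
// the unwatched result is already just the members map.
xUnwatched := resultUnwatched["x"].(map[string]any)
xWatched := resultWatched["x"].(map[string]any)
require.Equal(t, xUnwatched, xWatched["members"])
```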
zeevmoney
left a comment
Check comments, there is a critical bug here.
zeevmoney pointed out that ContainsOrAdd was called while gwm.mu.Lock
was held, and the LRU was created with NewWithEvict — so an eviction
fired the onEviction callback synchronously under the lock. The
callback ran BucketWatcher.Stop → cleanOPAStore → opaStore.Commit,
which takes the inmem store's write lock and blocks on outstanding
read transactions. Same deadlock as the original bug, just shifted
from Start to eviction:
1. Goroutine A (this call, spawned from watchBucketBuiltin): holds
gwm.mu.Lock, blocked on db.rmu.Lock for the eviction Commit.
2. Goroutine B (the parent Rego query, or any concurrent reader):
holds db.rmu.RLock, calls another watch_bucket → HasWatcher →
blocked on gwm.mu.RLock.
Default MaxBucketsWatchers = 10, so the eviction path is hot in any
deployment with more than 10 buckets — exactly the workload that
produced the original incident.
Fix: drop the LRU's eviction callback (lru.New instead of
lru.NewWithEvict) and perform eviction manually. Peek capacity and
RemoveOldest under gwm.mu (no callbacks fire), drop the lock, then
call Stop on the evicted watcher with no manager lock held.
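Under those constraints, the eviction step of GetOrCreateWatcher might look like the following — the cache API is modeled on hashicorp/golang-lru's Len/RemoveOldest/Add, and the exact shape of the real code may differ:

```go
gwm.mu.Lock()
var evicted *BucketWatcher
if gwm.cache.Len() >= gwm.maxWatchers {
	// At capacity: evict by hand under the lock; with lru.New (no
	// eviction callback) nothing fires while gwm.mu is held.
	if _, v, ok := gwm.cache.RemoveOldest(); ok {
		evicted = v.(*BucketWatcher)
	}
}
gwm.cache.Add(bucket, w)
gwm.mu.Unlock()

if evicted != nil {
	// Stop (and its cleanOPAStore Commit) runs with no manager lock
	// held, so concurrent HasWatcher calls stay responsive.
	evicted.Stop()
}
```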
Also addressed in this commit (related cleanups in the same area):
- Remove the dead `if gwm.stopping` re-check after Start. createMu is
held for the full slow path and Stop() takes createMu BEFORE setting
stopping=true, so stopping cannot transition false→true here.
- Add a comment to Stop()'s Purge() noting that it is now a plain map
clear with no eviction callbacks (each watcher was already stopped
in the loop above).
- Introduce a `newWatcher` function field on BucketWatcherManager so
the regression tests can exercise Start without a real NATS
connection. Production behavior unchanged.
Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
…enarios

The previous concurrency test file only exercised fast-path lookups, the stopping flag set manually, and Stop() idempotency on an empty manager — none of the genuinely concurrent scenarios that motivated this PR. Reviewers correctly pointed out that a future change could re-introduce the deadlock without any test failing.

Add four regression tests, all running under -race (one is sketched after this list):

- TestGetOrCreateWatcher_DoesNotDeadlockOnReader — The original bug. A reader holds the OPA inmem store's RLock while another goroutine enters GetOrCreateWatcher and tries to Commit (which needs the store's WLock). Asserts both that the create eventually completes after the reader releases its RLock AND that concurrent HasWatcher calls remain responsive throughout — proving gwm.mu is not held across watcher.Start.

- TestGetOrCreateWatcher_DoesNotDeadlockOnEviction — The eviction-deadlock variant zeevmoney flagged. Fills the cache to capacity, then triggers an eviction-during-create with concurrent HasWatcher calls. Asserts every HasWatcher call returns within 100ms — proving gwm.mu is not held across the evicted watcher's Stop. Also asserts the cache contents reflect the eviction.

- TestGetOrCreateWatcher_SerializesSameBucket — Asserts that N concurrent callers for the same bucket cause newWatcher to be invoked exactly once and all callers receive the same (canonical) watcher — proving createMu serializes creates and the double-check returns the canonical watcher to the loser(s).

- TestStop_DrainsInflightCreates — Asserts that Stop blocks until any in-flight GetOrCreateWatcher completes, and that subsequent creates are refused. This is the contract that prevents a watcher from being inserted into the cache after Stop has torn the manager down.

Tests use the new newWatcher seam (defaults to NewBucketWatcher in production) to exercise Start without a real NATS connection. The deadlock test substitutes a Start that opens a write transaction on the supplied opa store and Commits it, faithfully exercising the same lock chain the real cleanOPAStore touches.

Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@copilot-pull-request-reviewer)
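A condensed sketch of the serialization test's core, using the newWatcher seam described above; newTestManager is an illustrative constructor, not the real test fixture:

```go
func TestGetOrCreateWatcher_SerializesSameBucket(t *testing.T) {
	m := newTestManager(t) // hypothetical fixture
	var creates atomic.Int32
	m.newWatcher = func(bucket string) (*BucketWatcher, error) {
		creates.Add(1)
		return &BucketWatcher{}, nil
	}

	const n = 16
	watchers := make(chan *BucketWatcher, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w, err := m.GetOrCreateWatcher("bucket-a")
			assert.NoError(t, err) // assert, not require: safe off the test goroutine
			watchers <- w
		}()
	}
	wg.Wait()
	close(watchers)

	require.EqualValues(t, 1, creates.Load(), "newWatcher must run exactly once")
	canonical := <-watchers
	for w := range watchers {
		require.Same(t, canonical, w) // every caller got the canonical watcher
	}
}
```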
…pose.yaml in CI
zeevmoney pointed out (correctly) that an example config should default
to the form most users will be running — OPA on the host, reaching NATS
at localhost:4222 — not the in-compose-network form. The earlier change
that flipped the default to nats://nats:4222 was a workaround for the CI
integration test, which OPA-inside-compose needed.
This commit:
- Reverts examples/opa-nats/config.yaml back to nats://localhost:4222
so `opa run -c examples/opa-nats/config.yaml` works out of the box
on the host.
- Updates examples/opa-nats/config-compose.yaml (which already
existed and shipped with the same broken localhost URL) to actually
be the in-compose config, with server_url=nats://nats:4222 and a
header comment explaining when to use which file.
- Updates cmd/opa-nats/docker-compose.test.override.yaml to mount
examples/opa-nats/config-compose.yaml in place of the default
config.yaml inside the OPA container, so the CI integration test
keeps working with no extra files in cmd/opa-nats/.
Addresses review comment:
- #3 (comment) (@zeevmoney)
```go
// NewBucketWatcher + watcher.Start; tests override it to inject failures
// (e.g. a Start that exercises the OPA store write lock without needing
// a real NATS connection). Set in NewBucketWatcherManager and after
// Reconfigure; never reassigned at runtime, so no synchronization needed.
```
```yaml
# using those ports on the host (in CI, the workflow's
# `services: nats:` already binds 4222). OPA's 8181 is the only port
# the test reaches, so it stays published.
# - Mounts test-config.yaml in place of the example config.yaml so OPA
```
```go
// Give the create a moment to acquire createMu.
time.Sleep(50 * time.Millisecond)

stopReturned := make(chan struct{})
go func() {
	defer close(stopReturned)
	_ = m.Stop()
}()
```
Summary
`GetOrCreateWatcher` held `gwm.mu` (write) for the entire body, including the call to `watcher.Start()`. `Start()` ultimately writes to the OPA inmem store via `Commit`, which acquires the store's `RWMutex.Lock`.

`watchBucketBuiltin` invokes `GetOrCreateWatcher` from a goroutine spawned inside a Rego builtin call. The parent goroutine still holds the inmem store's `RLock` for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holds `gwm.mu` (write). The parent's next `watch_bucket` call hits `HasWatcher()`, which needs `gwm.mu.RLock()` — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. A classic two-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's `/health` handler, which goes through `Server.canEval` → `Rego.Eval` → `getTxn` → `inmem.store.NewTransaction`) are blocked to prevent writer starvation. Within ~180s the kubelet liveness probe fails 36× and kills the pod (exit 137, "Failed to shutdown server gracefully").

Evidence
Captured goroutine dump from a hung `permit-opa` pod in production (433KB, 106 goroutines):

The writer (`opa-nats/natsstore.(*DataTransformer).LoadBucketDataBulk` → `inmem.(*store).Commit`) was spawned by goroutine N, which was itself blocked at `BucketWatcherManager.HasWatcher` waiting on `gwm.mu.RLock`. The "Created by" line of the writer pointed at the same parent N. Pure self-deadlock.

Fix
Hold `gwm.mu` only for the cache lookup and the final `ContainsOrAdd`. Run `NewBucketWatcher` and `watcher.Start()` outside the lock. `ContainsOrAdd` already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

This is purely a lock-scope reduction — no behavior change in the single-caller fast path. The race window for two concurrent first-time callers on the same bucket is handled exactly the same way it was before (`ContainsOrAdd` + stop the loser).

Test plan
- … `permit-opa` and verify the `permit-opa` pods in prod-us-east stop restarting (they currently restart every ~5–8 minutes due to this bug)

🤖 Generated with Claude Code