fix(BucketWatcherManager): release mu before watcher.Start to avoid self-deadlock #3
Conversation
…elf-deadlock

GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call to watcher.Start(), which writes to the OPA inmem store via Commit (acquiring the store's RWMutex.Lock).

When watchBucketBuiltin spawns this function in a goroutine inside a Rego builtin invocation, the parent goroutine still holds the inmem store's RLock for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits HasWatcher(), which needs gwm.mu.RLock() — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. A classic two-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's /health handler) are blocked to prevent writer starvation; the pod stops responding, and within ~180s the kubelet kills it (exit 137).

Fix: hold gwm.mu only for the cache lookup and the final cache insert. Run NewBucketWatcher and watcher.Start outside the lock. ContainsOrAdd already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

Verified against a captured goroutine dump from prod permit-opa where 39+ readers were stuck on inmem.store.NewTransaction's RLock, one writer (LoadBucketDataBulk → Commit) had been waiting on the inmem store's Lock for >2 minutes, and the parent goroutine that spawned the writer was itself blocked on gwm.mu.RLock inside HasWatcher.
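A minimal sketch of the reduced lock scope, with a plain map standing in for the LRU cache and stubbed types — illustrative only, not the actual diff:

```go
package watcher

import "sync"

// Watcher is a stub; the real BucketWatcher's Start commits bucket data
// to the OPA inmem store (which is what needs the store's write lock).
type Watcher struct{ bucket string }

func (w *Watcher) Start() error { return nil }
func (w *Watcher) Stop()        {}

type Manager struct {
	mu       sync.RWMutex
	watchers map[string]*Watcher // plain map standing in for the LRU
}

func (m *Manager) GetOrCreateWatcher(bucket string) (*Watcher, error) {
	// Fast path: cache lookup under the read lock only.
	m.mu.RLock()
	w, ok := m.watchers[bucket]
	m.mu.RUnlock()
	if ok {
		return w, nil
	}

	// Create and Start with NO manager lock held, so Start's Commit can
	// wait on the inmem store's write lock without blocking HasWatcher.
	w = &Watcher{bucket: bucket}
	if err := w.Start(); err != nil {
		return nil, err
	}

	// Re-take the lock only for the insert. If a concurrent caller won
	// the race, stop the loser and return the canonical watcher.
	m.mu.Lock()
	if existing, ok := m.watchers[bucket]; ok {
		m.mu.Unlock()
		w.Stop()
		return existing, nil
	}
	m.watchers[bucket] = w
	m.mu.Unlock()
	return w, nil
}
```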
Pull request overview
This PR updates the NATS bucket watcher manager to reduce the scope of gwm.mu inside GetOrCreateWatcher, aiming to avoid a production self-deadlock between watcher startup and OPA in-memory store transactions.
Changes:
- Moves the initial watcher cache lookup to a read-locked fast path.
- Creates and starts new bucket watchers outside gwm.mu.
- Reconciles concurrent first-time creation with ContainsOrAdd, stopping the loser and returning the cached watcher.
… before docker compose up

CI repeatedly failed with:

Bind for 0.0.0.0:4222 failed: port is already allocated

The integration test starts an example docker-compose stack that publishes host ports 4222, 8222, 8181, and 31311. When a previous run on the same CI host left containers behind (or another concurrent job is using one of those ports), `docker compose up` fails before any test code runs.

Fix: before bringing the stack up, run `docker compose down -v --remove-orphans` and forcibly `docker rm -f` any container still publishing one of our ports. This makes the test self-cleaning and resilient to leaked containers from prior runs.
…ntainers before docker compose up" This reverts commit 0c5f270.
…opping flag

Address review feedback on the previous deadlock fix:

1. Without serialization, two callers could both create+Start a watcher for the same bucket, and stopping the loser would call cleanOPAStore, which RemoveOps everything under /nats/kv/<bucket> — wiping the winner's just-loaded data. Add a dedicated createMu mutex that serializes the create-and-start path so only one watcher is ever created per bucket; the second caller sees the cache via a double-check and returns the canonical watcher.

2. Without a stopping flag, BucketWatcherManager.Stop could complete while a slow GetOrCreateWatcher (blocked on the OPA inmem store's write lock) was still in progress; the in-flight create would then insert a new watcher into the now-purged cache and leave a background watch loop running after shutdown. Add a stopping flag protected by mu, set under createMu by Stop. GetOrCreateWatcher checks it both before Start and again before insert, abandoning a half-built watcher if Stop won the race.

3. createMu is distinct from gwm.mu, so HasWatcher and the fast path are NOT blocked while we are creating. The original self-deadlock fix is preserved: Start runs without gwm.mu held, the parent Rego query can finish and release the inmem store's RLock, and our Commit then proceeds. (The pattern is sketched after the test list below.)

Tests:
- TestGetOrCreateWatcher_FastPath
- TestGetOrCreateWatcher_RefusedAfterStop
- TestGetOrCreateWatcher_FastPathAfterStopReturnsCached
- TestStop_IsIdempotent
- TestGetOrCreateWatcher_ConcurrentFastPath
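A sketch of the createMu/stopping arrangement, extending the earlier sketch's Manager (stub types, not the real implementation; `errors` import assumed). It already reflects the follow-up cleanup in this PR that drops the re-check after Start as dead code — createMu is held for the whole slow path and Stop takes createMu before setting stopping, so one check suffices:

```go
// Slow path and Stop, assuming the earlier Manager gains two fields:
//
//	createMu sync.Mutex // serializes the create-and-start slow path
//	stopping bool       // guarded by mu; set by Stop under createMu
func (m *Manager) GetOrCreateWatcher(bucket string) (*Watcher, error) {
	// Fast path under mu.RLock, exactly as before (elided).

	m.createMu.Lock() // at most one create ever runs at a time
	defer m.createMu.Unlock()

	// Double-check: the previous createMu holder may have created this
	// bucket's watcher, or Stop may have won the race.
	m.mu.RLock()
	w, ok := m.watchers[bucket]
	stopping := m.stopping
	m.mu.RUnlock()
	if ok {
		return w, nil
	}
	if stopping {
		return nil, errors.New("bucket watcher manager is stopping")
	}

	w = &Watcher{bucket: bucket}
	if err := w.Start(); err != nil { // still no gwm.mu held across Start
		return nil, err
	}

	m.mu.Lock()
	m.watchers[bucket] = w
	m.mu.Unlock()
	return w, nil
}

func (m *Manager) Stop() {
	m.createMu.Lock() // drain any in-flight create before tearing down
	m.mu.Lock()
	m.stopping = true
	watchers := m.watchers
	m.watchers = map[string]*Watcher{}
	m.mu.Unlock()
	m.createMu.Unlock()

	for _, w := range watchers {
		w.Stop() // no manager lock held
	}
}
```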
Thanks @copilot-pull-request-reviewer — all three concerns are valid and are now addressed in commit 02f5d4e (force-pushed on top of the original 32cf15e).
…rt bindings

TestIntegration's example docker-compose published host ports for NATS (4222, 8222) and the NATS UI (31311). The integration test only talks to OPA at 8181, so those bindings are not needed for the test to function — and they collide with the GHA workflow's `services: nats:` (which already binds 4222), making CI fail every time with:

Bind for 0.0.0.0:4222 failed: port is already allocated

Add a test-only docker-compose override at cmd/opa-nats/docker-compose.test.override.yaml that resets the host-port mappings on `nats` and `nats-ui` to []. The example file itself is not changed, so end users still get the published ports outside of CI. The test now invokes `docker compose -f docker-compose.yaml -f <override>` and also runs an explicit `down -v --remove-orphans` before `up` to clear any leftover containers from previous runs.
The last CI run got past the port collision, but OPA never became ready and all we saw was "context deadline exceeded", with no indication why. Add a helper that dumps `docker compose logs --tail 200` and `docker compose ps -a` when the test fails or when waitForOPA times out, so the next failure surfaces the actual reason in the GHA log.
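A possible shape for that helper — the name matches the commit message, but the exact arguments and call sites are illustrative:

```go
import (
	"os/exec"
	"strings"
	"testing"
)

// dumpComposeLogs prints the compose stack's recent logs and container
// states into the test log. Diagnostics are best-effort: errors are
// logged, never fatal, so they can run from t.Cleanup on failure.
func dumpComposeLogs(t *testing.T, composeDir string) {
	t.Helper()
	for _, args := range [][]string{
		{"compose", "logs", "--tail", "200"},
		{"compose", "ps", "-a"},
	} {
		cmd := exec.Command("docker", args...)
		cmd.Dir = composeDir
		out, err := cmd.CombinedOutput()
		t.Logf("docker %s:\n%s(err: %v)", strings.Join(args, " "), out, err)
	}
}
```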
The docker-compose example pointed OPA at nats://localhost:4222, but inside the OPA container `localhost` is the OPA container itself — not the NATS container. OPA therefore failed at startup with:

Failed to connect to NATS: ... nats: no servers available for connection

and exited before opening the HTTP server, which is why TestIntegration in CI got "context deadline exceeded" while waiting for /health. Use the docker-compose service name `nats` so OPA can reach NATS via the docker network. A comment notes the standalone (host) value.
…n failure

The previous CI run made it past waitForOPA but then hung in evaluatePolicy's http.Post for 10 minutes, until the Go test framework's default timeout killed the process — before t.Cleanup could run dumpComposeLogs.

- Give evaluatePolicy a 30s per-request HTTP timeout, so a hung OPA surfaces a clear test failure within 30s and t.Cleanup runs.
- Enable OPA's --pprof in the test compose override.
- On failure / OPA-not-ready, dump OPA's full goroutine stack from /debug/pprof/goroutine?debug=2 BEFORE the compose logs. This is the highest-signal diagnostic for the deadlock class of hangs this PR sets out to fix.
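Sketched below: the bounded client and the goroutine dump. The pprof endpoint follows from enabling OPA's --pprof flag as described above; the helper name and the hardcoded URL are illustrative:

```go
import (
	"io"
	"net/http"
	"testing"
	"time"
)

// opaClient bounds every request (including evaluatePolicy's POST), so a
// hung OPA fails the test within 30s instead of hitting the suite timeout.
var opaClient = &http.Client{Timeout: 30 * time.Second}

// dumpGoroutines fetches OPA's full goroutine stacks before the compose
// logs are printed; best-effort, never fatal.
func dumpGoroutines(t *testing.T) {
	t.Helper()
	resp, err := opaClient.Get("http://localhost:8181/debug/pprof/goroutine?debug=2")
	if err != nil {
		t.Logf("goroutine dump unavailable: %v", err)
		return
	}
	defer resp.Body.Close()
	stacks, _ := io.ReadAll(resp.Body)
	t.Logf("OPA goroutine dump:\n%s", stacks)
}
```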
…erting on call N

The previous assertions assumed the second call would already see bucket_watched=true. nats.kv.watch_bucket spawns the watcher registration in a goroutine that completes off the request-handling path — there is no guarantee a follow-up call sees the watcher in the cache yet. Before the deadlock fix this happened to work only because the parent's HasWatcher blocked behind the spawned writer (the very deadlock this PR fixes); after the fix it is a race the test should not be asserting against.

Use require.Eventually to poll for bucket_watched=true within 30s. If the watcher registration never completes within the deadline, the test still fails loudly, but normal async timing is no longer treated as a bug.
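The polling assertion might look like this — evaluatePolicy and input are this test's own helpers, and the decoded result shape is an assumption:

```go
// Poll until the async watcher registration becomes visible; fail loudly
// only if it never does within the deadline.
require.Eventually(t, func() bool {
	result := evaluatePolicy(t, input) // hypothetical helper returning the decoded result map
	watched, _ := result["bucket_watched"].(bool)
	return watched
}, 30*time.Second, 250*time.Millisecond,
	"watcher registration never became visible")
```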
The example rego intentionally returns different shapes for `x` in the
two states:
bucket_watched=false → x = nats.kv.get_data(bucket, "members")
(just the members map)
bucket_watched=true → x = data.nats.kv[bucket]
(the entire bucket: members + metadata + permissions)
The original assertion `x1 == x2` assumed otherwise and was always
wrong; before the deadlock fix the second call deadlocked instead of
returning, so we never got to compare. Now that calls return correctly
in both states, compare the members submap explicitly:
x_unwatched vs x_watched["members"]
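Sketched with illustrative variable names (the decoded-result shapes follow the rego above):

```go
// Unwrap the watched result to its "members" submap before comparing;
// the unwatched result is already just the members map.
xUnwatched := resultUnwatched["x"].(map[string]any)
xWatched := resultWatched["x"].(map[string]any)
require.Equal(t, xUnwatched, xWatched["members"])
```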
zeevmoney
left a comment
Check comments, there is a critical bug here.
zeevmoney pointed out that ContainsOrAdd was called while gwm.mu.Lock
was held, and the LRU was created with NewWithEvict — so an eviction
fired the onEviction callback synchronously under the lock. The
callback ran BucketWatcher.Stop → cleanOPAStore → opaStore.Commit,
which takes the inmem store's write lock and blocks on outstanding
read transactions. Same deadlock as the original bug, just shifted
from Start to eviction:
1. Goroutine A (this call, spawned from watchBucketBuiltin): holds
gwm.mu.Lock, blocked on db.rmu.Lock for the eviction Commit.
2. Goroutine B (the parent Rego query, or any concurrent reader):
holds db.rmu.RLock, calls another watch_bucket → HasWatcher →
blocked on gwm.mu.RLock.
Default MaxBucketsWatchers = 10, so the eviction path is hot in any
deployment with more than 10 buckets — exactly the workload that
produced the original incident.
Fix: drop the LRU's eviction callback (lru.New instead of
lru.NewWithEvict) and perform eviction manually. Peek capacity and
RemoveOldest under gwm.mu (no callbacks fire), drop the lock, then
call Stop on the evicted watcher with no manager lock held.
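Under those constraints, the eviction step of GetOrCreateWatcher might look like the following — the cache API is modeled on hashicorp/golang-lru's Len/RemoveOldest/Add, and the exact shape of the real code may differ:

```go
gwm.mu.Lock()
var evicted *BucketWatcher
if gwm.cache.Len() >= gwm.maxWatchers {
	// At capacity: evict by hand under the lock; with lru.New (no
	// eviction callback) nothing fires while gwm.mu is held.
	if _, v, ok := gwm.cache.RemoveOldest(); ok {
		evicted = v.(*BucketWatcher)
	}
}
gwm.cache.Add(bucket, w)
gwm.mu.Unlock()

if evicted != nil {
	// Stop (and its cleanOPAStore Commit) runs with no manager lock
	// held, so concurrent HasWatcher calls stay responsive.
	evicted.Stop()
}
```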
Also addressed in this commit (related cleanups in the same area):
- Remove the dead `if gwm.stopping` re-check after Start. createMu is
held for the full slow path and Stop() takes createMu BEFORE setting
stopping=true, so stopping cannot transition false→true here.
- Add a comment to Stop()'s Purge() noting that it is now a plain map
clear with no eviction callbacks (each watcher was already stopped
in the loop above).
- Introduce a `newWatcher` function field on BucketWatcherManager so
the regression tests can exercise Start without a real NATS
connection. Production behavior unchanged.
Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
…enarios

The previous concurrency test file only exercised fast-path lookups, the stopping flag set manually, and Stop() idempotency on an empty manager — none of the genuinely concurrent scenarios that motivated this PR. Reviewers correctly pointed out that a future change could re-introduce the deadlock without any test failing.

Add four regression tests, all running under -race (one is sketched after this list):

- TestGetOrCreateWatcher_DoesNotDeadlockOnReader — The original bug. A reader holds the OPA inmem store's RLock while another goroutine enters GetOrCreateWatcher and tries to Commit (which needs the store's WLock). Asserts both that the create eventually completes after the reader releases its RLock AND that concurrent HasWatcher calls remain responsive throughout — proving gwm.mu is not held across watcher.Start.

- TestGetOrCreateWatcher_DoesNotDeadlockOnEviction — The eviction-deadlock variant zeevmoney flagged. Fills the cache to capacity, then triggers an eviction-during-create with concurrent HasWatcher calls. Asserts every HasWatcher call returns within 100ms — proving gwm.mu is not held across the evicted watcher's Stop. Also asserts the cache contents reflect the eviction.

- TestGetOrCreateWatcher_SerializesSameBucket — Asserts that N concurrent callers for the same bucket cause newWatcher to be invoked exactly once and all callers receive the same (canonical) watcher — proving createMu serializes creates and the double-check returns the canonical watcher to the loser(s).

- TestStop_DrainsInflightCreates — Asserts that Stop blocks until any in-flight GetOrCreateWatcher completes, and that subsequent creates are refused. This is the contract that prevents a watcher from being inserted into the cache after Stop has torn the manager down.

Tests use the new newWatcher seam (defaults to NewBucketWatcher in production) to exercise Start without a real NATS connection. The deadlock test substitutes a Start that opens a write transaction on the supplied opa store and Commits it, faithfully exercising the same lock chain the real cleanOPAStore touches.

Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@copilot-pull-request-reviewer)
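A condensed sketch of the serialization test's core, using the newWatcher seam described above; newTestManager is an illustrative constructor, not the real test fixture:

```go
func TestGetOrCreateWatcher_SerializesSameBucket(t *testing.T) {
	m := newTestManager(t) // hypothetical fixture
	var creates atomic.Int32
	m.newWatcher = func(bucket string) (*BucketWatcher, error) {
		creates.Add(1)
		return &BucketWatcher{}, nil
	}

	const n = 16
	watchers := make(chan *BucketWatcher, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			w, err := m.GetOrCreateWatcher("bucket-a")
			assert.NoError(t, err) // assert, not require: safe off the test goroutine
			watchers <- w
		}()
	}
	wg.Wait()
	close(watchers)

	require.EqualValues(t, 1, creates.Load(), "newWatcher must run exactly once")
	canonical := <-watchers
	for w := range watchers {
		require.Same(t, canonical, w) // every caller got the canonical watcher
	}
}
```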
…pose.yaml in CI
zeevmoney pointed out (correctly) that an example config should default
to the form most users will be running — OPA on the host, reaching NATS
at localhost:4222 — not the in-compose-network form. The earlier change
that flipped the default to nats://nats:4222 was a workaround for the CI
integration test, which OPA-inside-compose needed.
This commit:
- Reverts examples/opa-nats/config.yaml back to nats://localhost:4222
so `opa run -c examples/opa-nats/config.yaml` works out of the box
on the host.
- Updates examples/opa-nats/config-compose.yaml (which already
existed and shipped with the same broken localhost URL) to actually
be the in-compose config, with server_url=nats://nats:4222 and a
header comment explaining when to use which file.
- Updates cmd/opa-nats/docker-compose.test.override.yaml to mount
examples/opa-nats/config-compose.yaml in place of the default
config.yaml inside the OPA container, so the CI integration test
keeps working with no extra files in cmd/opa-nats/.
Addresses review comment:
- #3 (comment) (@zeevmoney)
```go
// NewBucketWatcher + watcher.Start; tests override it to inject failures
// (e.g. a Start that exercises the OPA store write lock without needing
// a real NATS connection). Set in NewBucketWatcherManager and after
// Reconfigure; never reassigned at runtime, so no synchronization needed.
```
```yaml
# using those ports on the host (in CI, the workflow's
# `services: nats:` already binds 4222). OPA's 8181 is the only port
# the test reaches, so it stays published.
# - Mounts test-config.yaml in place of the example config.yaml so OPA
```
```go
// Give the create a moment to acquire createMu.
time.Sleep(50 * time.Millisecond)

stopReturned := make(chan struct{})
go func() {
	defer close(stopReturned)
	_ = m.Stop()
}()
```
Summary
`GetOrCreateWatcher` held `gwm.mu` (write) for the entire body, including the call to `watcher.Start()`. `Start()` ultimately writes to the OPA inmem store via `Commit`, which acquires the store's `RWMutex.Lock`.

`watchBucketBuiltin` invokes `GetOrCreateWatcher` from a goroutine spawned inside a Rego builtin call. The parent goroutine still holds the inmem store's `RLock` for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holds `gwm.mu` (write). The parent's next `watch_bucket` call hits `HasWatcher()`, which needs `gwm.mu.RLock()` — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. A classic two-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's `/health` handler, which goes through `Server.canEval` → `Rego.Eval` → `getTxn` → `inmem.store.NewTransaction`) are blocked to prevent writer starvation. Within ~180s the kubelet liveness probe fails 36× and kills the pod (exit 137, "Failed to shutdown server gracefully").

Evidence
Captured goroutine dump from a hung `permit-opa` pod in production (433KB, 106 goroutines):

The writer (`opa-nats/natsstore.(*DataTransformer).LoadBucketDataBulk` → `inmem.(*store).Commit`) was spawned by goroutine N, which was itself blocked at `BucketWatcherManager.HasWatcher` waiting on `gwm.mu.RLock`. The "Created by" line of the writer pointed at the same parent N. Pure self-deadlock.

Fix
Hold `gwm.mu` only for the cache lookup and the final `ContainsOrAdd`. Run `NewBucketWatcher` and `watcher.Start()` outside the lock. `ContainsOrAdd` already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

This is purely a lock-scope reduction — no behavior change in the single-caller fast path. The race window for two concurrent first-time callers on the same bucket is handled exactly the same way it was before (`ContainsOrAdd` + stop the loser).

Test plan
- … `permit-opa` and verify the `permit-opa` pods in prod-us-east stop restarting (they currently restart every ~5–8 minutes due to this bug)

🤖 Generated with Claude Code