
fix(BucketWatcherManager): release mu before watcher.Start to avoid self-deadlock #3

Open
omer9564 wants to merge 13 commits into main from omer/fix-getorcreatewatcher-deadlock

Conversation

@omer9564
Contributor

@omer9564 omer9564 commented May 4, 2026

Summary

GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call to watcher.Start(). Start() ultimately writes to the OPA inmem store via Commit, which acquires the store's RWMutex.Lock.

watchBucketBuiltin invokes GetOrCreateWatcher from a goroutine spawned inside a Rego builtin call. The parent goroutine still holds the inmem store's RLock for the lifetime of its read transaction. The spawned writer therefore cannot acquire the store's write lock until the parent's read transaction closes — and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits HasWatcher(), which needs gwm.mu.RLock() — blocked behind the spawned writer — and the parent never finishes, so the writer never proceeds. Classic 2-resource circular wait.

Once the writer is queued on the inmem store mutex, all new readers (including OPA's /health handler, which goes through Server.canEval → Rego.Eval → getTxn → inmem.store.NewTransaction) are blocked to prevent writer starvation. Within ~180s the kubelet liveness probe fails 36× and kills the pod (exit 137, "Failed to shutdown server gracefully").

Evidence

Captured goroutine dump from a hung permit-opa pod in production (433KB, 106 goroutines):

 41  [sync.RWMutex.RLock]                 ← new readers piling up
 39  [sync.RWMutex.RLock, 2 minutes]      ← readers stuck on inmem.store.NewTransaction
  1  [sync.RWMutex.Lock,  2 minutes]      ← THE WRITER (this code path)

The writer (opa-nats/natsstore.(*DataTransformer).LoadBucketDataBulk → inmem.(*store).Commit) was spawned by goroutine N, which was itself blocked at BucketWatcherManager.HasWatcher waiting on gwm.mu.RLock. The "Created by" line of the writer pointed at the same parent N. Pure self-deadlock.

Fix

Hold gwm.mu only for the cache lookup and final ContainsOrAdd. Run NewBucketWatcher and watcher.Start() outside the lock. ContainsOrAdd already resolves the duplicate-create race; the loser is stopped and the winning watcher is returned to the caller.

This is purely a lock-scope reduction — no behavior change in the single-caller fast path. The race window for two concurrent first-time callers on the same bucket is handled exactly the same way it was before (ContainsOrAdd + stop the loser).
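The reduced lock scope can be sketched as follows. This is a minimal stand-in, not the PR's code: `manager`, `watcher`, and the plain map (standing in for the LRU's ContainsOrAdd) are hypothetical names; only the lock discipline mirrors the fix — the mutex is held for the lookup and the insert, never across Start.

```go
package main

import (
	"fmt"
	"sync"
)

type watcher struct{ bucket string }

func (w *watcher) Start() { /* in the real code: Commit to the OPA inmem store */ }
func (w *watcher) Stop()  { /* release resources */ }

type manager struct {
	mu    sync.RWMutex
	cache map[string]*watcher // stand-in for the LRU cache
}

func (m *manager) GetOrCreateWatcher(bucket string) *watcher {
	// Fast path: read lock only.
	m.mu.RLock()
	w, ok := m.cache[bucket]
	m.mu.RUnlock()
	if ok {
		return w
	}

	// Slow path: create and Start with NO manager lock held, so concurrent
	// HasWatcher callers (gwm.mu.RLock) are never queued behind Start's
	// store Commit.
	nw := &watcher{bucket: bucket}
	nw.Start()

	// Re-check under the write lock (the ContainsOrAdd step): if another
	// caller won the race, stop the loser and return the winner.
	m.mu.Lock()
	if existing, ok := m.cache[bucket]; ok {
		m.mu.Unlock()
		nw.Stop()
		return existing
	}
	m.cache[bucket] = nw
	m.mu.Unlock()
	return nw
}

func main() {
	m := &manager{cache: map[string]*watcher{}}
	a := m.GetOrCreateWatcher("users")
	b := m.GetOrCreateWatcher("users")
	fmt.Println(a == b) // second call hits the fast path
}
```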

Test plan

  • go build ./...
  • go vet ./...
  • golangci-lint (via pre-commit) clean
  • Existing tests pass on a NATS-enabled CI (the local failures are pre-existing — they require a NATS server)
  • After merge & release: bump in permit-opa and verify the permit-opa pods in prod-us-east stop restarting (they currently restart every ~5–8 minutes due to this bug)

🤖 Generated with Claude Code

…elf-deadlock

GetOrCreateWatcher held gwm.mu (write) for the entire body, including the call
to watcher.Start() which writes to the OPA inmem store via Commit (acquiring
the store's RWMutex.Lock).

When watchBucketBuiltin spawns this function in a goroutine inside a Rego
builtin invocation, the parent goroutine still holds the inmem store's RLock
for the lifetime of its read transaction. The spawned writer therefore cannot
acquire the store's write lock until the parent's read transaction closes,
and meanwhile holds gwm.mu (write). The parent's next watch_bucket call hits
HasWatcher(), which needs gwm.mu.RLock() — blocked behind the spawned
writer — and the parent never finishes, so the writer never proceeds.
Classic 2-resource circular wait. Once the writer is queued on the inmem
store mutex, all new readers (including OPA's /health handler) are blocked
to prevent writer starvation; the pod stops responding, and within ~180s
the kubelet kills it (exit 137).

Fix: hold gwm.mu only for the cache lookup and final cache insert. Run
NewBucketWatcher and watcher.Start outside the lock. ContainsOrAdd already
resolves the duplicate-create race; the loser is stopped and the winning
watcher is returned to the caller.

Verified against a captured goroutine dump from prod permit-opa where 39+
readers were stuck on inmem.store.NewTransaction's RLock, 1 writer
(LoadBucketDataBulk -> Commit) on the inmem store's Lock for >2 minutes,
and the parent goroutine that spawned the writer was itself blocked on
gwm.mu.RLock inside HasWatcher.

Copilot AI left a comment


Pull request overview

This PR updates the NATS bucket watcher manager to reduce the scope of gwm.mu inside GetOrCreateWatcher, aiming to avoid a production self-deadlock between watcher startup and OPA in-memory store transactions.

Changes:

  • Moves the initial watcher cache lookup to a read-locked fast path.
  • Creates and starts new bucket watchers outside gwm.mu.
  • Reconciles concurrent first-time creation with ContainsOrAdd, stopping the loser and returning the cached watcher.


Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated
Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated
Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated
omer9564 added 3 commits May 4, 2026 18:16
… before docker compose up

CI repeatedly failed with:
  Bind for 0.0.0.0:4222 failed: port is already allocated

The integration test starts an example docker-compose stack that publishes
host ports 4222, 8222, 8181, and 31311. When a previous run on the same CI
host left containers behind (or another concurrent job is using one of those
ports), `docker compose up` fails before any test code runs.

Fix: before bringing the stack up, run `docker compose down -v --remove-orphans`
and forcibly `docker rm -f` any container still publishing one of our ports.
This makes the test self-cleaning and resilient to leaked containers from
prior runs.
…ntainers before docker compose up"

This reverts commit 0c5f270.
…opping flag

Address review feedback on the previous deadlock fix:

1. Without serialization, two callers could both create+Start a watcher
   for the same bucket, and stopping the loser would call cleanOPAStore
   which RemoveOps everything under /nats/kv/<bucket> — wiping the
   winner's just-loaded data. Add a dedicated createMu mutex that
   serializes the create-and-start path so only one watcher is ever
   created per bucket; the second caller sees the cache via a double-
   check and returns the canonical watcher.

2. Without a stopping flag, BucketWatcherManager.Stop could complete
   while a slow GetOrCreateWatcher (blocked on the OPA inmem store's
   write lock) was still in progress, and the in-flight create would
   then insert a new watcher into the now-purged cache and leave a
   background watch loop running after shutdown. Add a stopping flag
   protected by mu, set under createMu by Stop. GetOrCreateWatcher
   checks it both before Start and again before insert, abandoning a
   half-built watcher if Stop won the race.

3. createMu is distinct from gwm.mu, so HasWatcher and the fast path
   are NOT blocked while we are creating. The original self-deadlock
   fix is preserved: Start runs without gwm.mu held, the parent Rego
   query can finish and release the inmem store's RLock, and our
   Commit then proceeds.

Tests:
- TestGetOrCreateWatcher_FastPath
- TestGetOrCreateWatcher_RefusedAfterStop
- TestGetOrCreateWatcher_FastPathAfterStopReturnsCached
- TestStop_IsIdempotent
- TestGetOrCreateWatcher_ConcurrentFastPath
@omer9564
Contributor Author

omer9564 commented May 4, 2026

Thanks @copilot-pull-request-reviewer — all three concerns are valid and now addressed in commit 02f5d4e (force-pushed on top of the original 32cf15e):

  1. Loser's Stop() wipes winner's data (pkg/natsstore/bucket_watcher_manager.go:307 of the old code) — confirmed: BucketWatcher.Stop calls cleanOPAStore which RemoveOps everything under /nats/kv/<bucket>, so a duplicate-create race would have wiped the winner's just-loaded data. Fix: added createMu sync.Mutex that serializes the create-and-start path, so only one watcher per bucket is ever created. The double-check after createMu.Lock() returns the canonical watcher to the second caller. The duplicate-create branch is gone entirely.

  2. Stop() racing with the slow path (line 300) — confirmed: in-flight creates could complete and insert a watcher into a now-purged cache, leaving a background loop running. Fix: added stopping bool (protected by mu). Stop() acquires createMu first to drain any in-flight create, then sets stopping=true under mu.Lock(). GetOrCreateWatcher checks stopping both right after acquiring createMu (early bail) and again under mu.Lock() right before insert (late bail — tears down the half-built watcher if Stop won the race).

  3. No regression test (line 299) — added pkg/natsstore/bucket_watcher_manager_concurrency_test.go with 5 tests:

    • TestGetOrCreateWatcher_FastPath
    • TestGetOrCreateWatcher_RefusedAfterStop
    • TestGetOrCreateWatcher_FastPathAfterStopReturnsCached
    • TestStop_IsIdempotent
    • TestGetOrCreateWatcher_ConcurrentFastPath

Note that createMu is intentionally distinct from gwm.mu — the original self-deadlock fix is preserved: Start runs without gwm.mu held, HasWatcher callers (using gwm.mu.RLock) are not blocked by an in-flight create, the parent Rego query can finish and release the OPA inmem store's RLock, and the spawned Commit then proceeds.
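A minimal sketch of that layout (hypothetical names and a plain map in place of the LRU; the real code lives in pkg/natsstore/bucket_watcher_manager.go): createMu serializes the whole slow path while gwm.mu stays free for readers, and the stopping flag refuses creates after Stop.

```go
package main

import (
	"fmt"
	"sync"
)

type watcher struct{}

func (w *watcher) Start() {}
func (w *watcher) Stop()  {}

type manager struct {
	mu       sync.RWMutex // protects cache and stopping
	createMu sync.Mutex   // serializes the create-and-start slow path
	stopping bool
	cache    map[string]*watcher
}

func (m *manager) GetOrCreateWatcher(bucket string) (*watcher, error) {
	m.mu.RLock()
	w, ok := m.cache[bucket]
	m.mu.RUnlock()
	if ok {
		return w, nil
	}

	// Only one create at a time; HasWatcher and the fast path keep using
	// gwm.mu and are never blocked behind a Start.
	m.createMu.Lock()
	defer m.createMu.Unlock()

	m.mu.RLock()
	stopping := m.stopping
	w, ok = m.cache[bucket] // double-check: another create may have won
	m.mu.RUnlock()
	if stopping {
		return nil, fmt.Errorf("manager is stopping")
	}
	if ok {
		return w, nil
	}

	nw := &watcher{}
	nw.Start() // no manager lock held here

	m.mu.Lock()
	m.cache[bucket] = nw
	m.mu.Unlock()
	return nw, nil
}

func (m *manager) Stop() {
	m.createMu.Lock() // drain any in-flight create first
	m.mu.Lock()
	m.stopping = true
	for _, w := range m.cache {
		w.Stop()
	}
	m.cache = map[string]*watcher{}
	m.mu.Unlock()
	m.createMu.Unlock()
}

func main() {
	m := &manager{cache: map[string]*watcher{}}
	_, _ = m.GetOrCreateWatcher("a")
	m.Stop()
	_, err := m.GetOrCreateWatcher("b")
	fmt.Println(err != nil) // creates are refused after Stop
}
```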

omer9564 added 6 commits May 4, 2026 20:02
…rt bindings

TestIntegration's example docker-compose published host ports for NATS
(4222, 8222) and the NATS UI (31311). The integration test only talks to
OPA at 8181, so those bindings are not needed for the test to function —
and they collide with the GHA workflow's `services: nats:` (which already
binds 4222), making CI fail every time with:

  Bind for 0.0.0.0:4222 failed: port is already allocated

Add a test-only docker-compose override at
cmd/opa-nats/docker-compose.test.override.yaml that resets the host-port
mappings on `nats` and `nats-ui` to []. The example file itself is not
changed, so end users still get the published ports outside of CI.

The test now invokes `docker compose -f docker-compose.yaml -f <override>`
and also runs an explicit `down -v --remove-orphans` before `up` to clear
any leftover containers from previous runs.
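Based on the description above, the override plausibly looks like the fragment below. This is a sketch only — the actual contents of cmd/opa-nats/docker-compose.test.override.yaml are not shown in this PR page; the service names `nats` and `nats-ui` are taken from the commit message.

```yaml
# cmd/opa-nats/docker-compose.test.override.yaml (sketch)
services:
  nats:
    ports: []   # drop the 4222/8222 host bindings; the test never dials NATS directly
  nats-ui:
    ports: []   # drop 31311
# OPA's 8181 stays published via the base docker-compose.yaml
```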
The last CI run got past the port collision, but OPA never became ready and
all we saw was "context deadline exceeded", with no indication why. Add a
helper that dumps `docker compose logs --tail 200` and `docker compose
ps -a` when the test fails or when waitForOPA times out, so the next
failure surfaces the actual reason in the GHA log.
The docker-compose example pointed OPA at nats://localhost:4222, but
inside the OPA container `localhost` is the OPA container itself — not
the NATS container. OPA therefore failed at startup with:

  Failed to connect to NATS: ... nats: no servers available for connection

and exited before opening the HTTP server, which is why TestIntegration
in CI got "context deadline exceeded" while waiting for /health.

Use the docker-compose service name `nats` so OPA can reach NATS via
the docker network. A comment notes the standalone (host) value.
…n failure

The previous CI run made it past waitForOPA but then hung in evaluatePolicy's
http.Post for 10 minutes, until the Go test framework's default timeout
killed the process — before t.Cleanup could run dumpComposeLogs.

- Give evaluatePolicy a 30s per-request HTTP timeout, so a hung OPA
  surfaces a clear test failure within 30s and t.Cleanup runs.
- Enable OPA's --pprof in the test compose override.
- On failure / OPA-not-ready, dump OPA's full goroutine stack from
  /debug/pprof/goroutine?debug=2 BEFORE the compose logs. This is the
  highest-signal diagnostic for the deadlock-class of hangs we are
  trying to verify the fix for.
…erting on call N

The previous assertions assumed the second call would already see
bucket_watched=true. nats.kv.watch_bucket spawns the watcher
registration in a goroutine that completes off the request-handling
path — there is no guarantee a follow-up call sees the watcher in the
cache yet. Before the deadlock fix this happened to work because the
parent's HasWatcher blocked behind the spawned writer (the very deadlock
this PR is fixing); after the fix it is a race the test should not be
asserting against.

Use require.Eventually to poll for bucket_watched=true within 30s. If
the watcher registration never completes within the deadline the test
still fails loudly, but normal async timing is no longer treated as a
bug.
The example rego intentionally returns different shapes for `x` in the
two states:
  bucket_watched=false → x = nats.kv.get_data(bucket, "members")
                              (just the members map)
  bucket_watched=true  → x = data.nats.kv[bucket]
                              (the entire bucket: members + metadata + permissions)

The original assertion `x1 == x2` assumed otherwise and was always
wrong; before the deadlock fix the second call deadlocked instead of
returning, so we never got to compare. Now that calls return correctly
in both states, compare the members submap explicitly:

  x_unwatched         vs  x_watched["members"]
@omer9564 omer9564 requested review from Zivxx and zeevmoney May 4, 2026 18:13

@zeevmoney zeevmoney left a comment


Check comments, there is a critical bug here.

Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated
Comment thread pkg/natsstore/bucket_watcher_manager.go Outdated
Comment thread pkg/natsstore/bucket_watcher_manager.go
Comment thread pkg/natsstore/bucket_watcher_manager_concurrency_test.go
Comment thread examples/opa-nats/config.yaml Outdated
omer9564 added 3 commits May 6, 2026 14:48
zeevmoney pointed out that ContainsOrAdd was called while gwm.mu.Lock
was held, and the LRU was created with NewWithEvict — so an eviction
fired the onEviction callback synchronously under the lock. The
callback ran BucketWatcher.Stop → cleanOPAStore → opaStore.Commit,
which takes the inmem store's write lock and blocks on outstanding
read transactions. Same deadlock as the original bug, just shifted
from Start to eviction:

  1. Goroutine A (this call, spawned from watchBucketBuiltin): holds
     gwm.mu.Lock, blocked on db.rmu.Lock for the eviction Commit.
  2. Goroutine B (the parent Rego query, or any concurrent reader):
     holds db.rmu.RLock, calls another watch_bucket → HasWatcher →
     blocked on gwm.mu.RLock.

Default MaxBucketsWatchers = 10, so the eviction path is hot in any
deployment with more than 10 buckets — exactly the workload that
produced the original incident.

Fix: drop the LRU's eviction callback (lru.New instead of
lru.NewWithEvict) and perform eviction manually. Peek capacity and
RemoveOldest under gwm.mu (no callbacks fire), drop the lock, then
call Stop on the evicted watcher with no manager lock held.

Also addressed in this commit (related cleanups in the same area):
- Remove the dead `if gwm.stopping` re-check after Start. createMu is
  held for the full slow path and Stop() takes createMu BEFORE setting
  stopping=true, so stopping cannot transition false→true here.
- Add a comment to Stop()'s Purge() noting that it is now a plain map
  clear with no eviction callbacks (each watcher was already stopped
  in the loop above).
- Introduce a `newWatcher` function field on BucketWatcherManager so
  the regression tests can exercise Start without a real NATS
  connection. Production behavior unchanged.

Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
- #3 (comment) (@zeevmoney)
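The manual-eviction pattern can be sketched as below. The names and the slice-backed LRU stand-in are hypothetical (the real code keeps hashicorp's LRU, just created without NewWithEvict); the point is only the ordering — remove under the lock, Stop after dropping it.

```go
package main

import (
	"fmt"
	"sync"
)

type watcher struct{ stopped bool }

func (w *watcher) Stop() { w.stopped = true } // real Stop runs cleanOPAStore → Commit

type manager struct {
	mu    sync.Mutex
	max   int
	order []string // oldest first; stand-in for the LRU's internal order
	cache map[string]*watcher
}

// add inserts a watcher, evicting the oldest entry manually instead of via
// an LRU eviction callback. Stop runs with NO manager lock held, so the
// evicted watcher's store Commit cannot deadlock against readers queued
// on the manager mutex.
func (m *manager) add(bucket string, w *watcher) {
	var evicted *watcher
	m.mu.Lock()
	if len(m.order) >= m.max {
		oldest := m.order[0]
		m.order = m.order[1:]
		evicted = m.cache[oldest]
		delete(m.cache, oldest)
	}
	m.order = append(m.order, bucket)
	m.cache[bucket] = w
	m.mu.Unlock() // drop the lock BEFORE stopping the evicted watcher
	if evicted != nil {
		evicted.Stop()
	}
}

func main() {
	m := &manager{max: 2, cache: map[string]*watcher{}}
	w1, w2, w3 := &watcher{}, &watcher{}, &watcher{}
	m.add("a", w1)
	m.add("b", w2)
	m.add("c", w3) // evicts "a"
	fmt.Println(w1.stopped, len(m.cache))
}
```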
…enarios

The previous concurrency test file only exercised fast-path lookups,
the stopping flag set manually, and Stop() idempotency on an empty
manager — none of the genuinely concurrent scenarios that motivated
this PR. Reviewers correctly pointed out that a future change could
re-introduce the deadlock without any test failing.

Add four regression tests, all running under -race:

- TestGetOrCreateWatcher_DoesNotDeadlockOnReader
  The original bug. A reader holds the OPA inmem store's RLock while
  another goroutine enters GetOrCreateWatcher and tries to Commit
  (which needs the store's WLock). Asserts both that the create
  eventually completes after the reader releases its RLock AND that
  concurrent HasWatcher calls remain responsive throughout — proving
  gwm.mu is not held across watcher.Start.

- TestGetOrCreateWatcher_DoesNotDeadlockOnEviction
  The eviction-deadlock variant zeevmoney flagged. Fills the cache to
  capacity, then triggers an eviction-during-create with concurrent
  HasWatcher calls. Asserts every HasWatcher call returns within
  100ms — proving gwm.mu is not held across the evicted watcher's
  Stop. Also asserts the cache contents reflect the eviction.

- TestGetOrCreateWatcher_SerializesSameBucket
  Asserts that N concurrent callers for the same bucket cause
  newWatcher to be invoked exactly once and all callers receive the
  same (canonical) watcher — proving createMu serializes creates and
  the double-check returns the canonical watcher to the loser(s).

- TestStop_DrainsInflightCreates
  Asserts that Stop blocks until any in-flight GetOrCreateWatcher
  completes, and that subsequent creates are refused. This is the
  contract that prevents a watcher from being inserted into the cache
  after Stop has torn the manager down.

Tests use the new newWatcher seam (defaults to NewBucketWatcher in
production) to exercise Start without a real NATS connection. The
deadlock test substitutes a Start that opens a write transaction on
the supplied opa store and Commits it, faithfully exercising the same
lock chain the real cleanOPAStore touches.

Addresses review comments:
- #3 (comment) (@zeevmoney)
- #3 (comment) (@copilot-pull-request-reviewer)
…pose.yaml in CI

zeevmoney pointed out (correctly) that an example config should default
to the form most users will be running — OPA on the host, reaching NATS
at localhost:4222 — not the in-compose-network form. The earlier change
that flipped the default to nats://nats:4222 was a workaround for the CI
integration test, which OPA-inside-compose needed.

This commit:
  - Reverts examples/opa-nats/config.yaml back to nats://localhost:4222
    so `opa run -c examples/opa-nats/config.yaml` works out of the box
    on the host.
  - Updates examples/opa-nats/config-compose.yaml (which already
    existed and shipped with the same broken localhost URL) to actually
    be the in-compose config, with server_url=nats://nats:4222 and a
    header comment explaining when to use which file.
  - Updates cmd/opa-nats/docker-compose.test.override.yaml to mount
    examples/opa-nats/config-compose.yaml in place of the default
    config.yaml inside the OPA container, so the CI integration test
    keeps working with no extra files in cmd/opa-nats/.

Addresses review comment:
- #3 (comment) (@zeevmoney)

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

