
Fix pool timeout overrides, defaults, and retain_connections_max quota bug #139

Merged

vadv merged 19 commits into master from fix/pool-timeout-overrides-and-retain-bypass on Mar 1, 2026
Conversation

Collaborator

@vadv vadv commented Mar 1, 2026

Summary

  • Fix: Pool-level server_lifetime and idle_timeout overrides were silently ignored — general values were always used instead. Fixed in 6 places across 3 pool creation contexts (static pools, auth_query shared pools, dynamic pools).
  • Fix: idle_timeout default was 300,000,000ms (83 hours) instead of 600,000ms (10 minutes), effectively disabling idle connection cleanup.
  • Fix: server_lifetime default changed from 5 minutes to 20 minutes. The old value (5 min) was shorter than idle_timeout (10 min), making idle_timeout permanently inert — server_lifetime killed every connection before it could be idle for 10 minutes.
  • Fix: retain_connections_max quota exhaustion caused remaining=0 which was treated as "unlimited" by retain_oldest_first(), causing pools processed after quota exhaustion to lose ALL idle connections in a single retain cycle.
  • Fix: idle_timeout = 0 closed connections after ~1ms instead of disabling idle timeout. Now 0 means disabled, consistent with server_lifetime = 0 and PgBouncer semantics.
  • Fix: idle_timeout had no per-connection jitter — all connections in a pool expired simultaneously after a traffic burst, causing synchronized mass closures. Now applies the same ±20% jitter as server_lifetime.
  • Fix: retain_connections_max quota was unfairly distributed — HashMap iteration order is deterministic within a process, so the same pool always consumed the entire quota while other pools starved. Fixed by shuffling pool iteration order each cycle.
  • Fix: Retain and replenish phases used separate get_all_pools() snapshots. If POOLS was atomically updated between them (config reload, dynamic pool GC), they operated on different pool sets. Fixed by using a single snapshot.
  • Fix: Doc comment for retain_connections_max incorrectly stated default as 0 (unlimited); actual default is 3.
  • Feature: default_min_pool_size for dynamic auth_query passthrough pools — minimum backend connections maintained per dynamic user pool. Prewarmed on first pool creation, replenished by retain cycle. GC skips pools with min_pool_size > 0 to prevent race condition with retain.
  • Infra: make generate target for regenerating reference configs and docs. Balanced CI test partitions (@rust-4).
  • Version: Bumped to 3.3.2 with changelog.

How server_lifetime and idle_timeout interact

| Parameter | Default | Jitter | What it measures | When checked |
|---|---|---|---|---|
| server_lifetime | 20 min | ±20% per connection | Time since creation | At recycle (→ new connection for next client) AND at retain (→ close idle) |
| idle_timeout | 10 min | ±20% per connection | Time since last use | At retain only (→ close idle). 0 = disabled. |

With the new defaults (20m / 10m), each serves its purpose:

  • idle_timeout cleans up connections sitting idle after load drops (10 min ±20%)
  • server_lifetime rotates long-lived connections even under constant use (20 min ±20%)

Note: idle_timeout only applies to connections that have been used at least once. Prewarmed/replenished connections that were never checked out by a client have recycled=None and are not subject to idle_timeout — they live until server_lifetime expires.
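The interaction above can be condensed into a small decision function. The following is a sketch with assumed type and field names (Metrics, should_close, lifetime_ms, idle_timeout_ms), not the actual pg_doorman code; it only mirrors the semantics described here: lifetime counts from creation, idle counts from last use, recycled=None exempts never-used connections, and 0 disables either check.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for per-connection metrics; field names are
// assumptions based on the description above.
struct Metrics {
    created_at: Instant,
    recycled: Option<Instant>, // last return to the pool; None = never used
    lifetime_ms: u128,         // per-connection server_lifetime after jitter
    idle_timeout_ms: u128,     // per-connection idle_timeout after jitter
}

fn should_close(m: &Metrics, now: Instant) -> bool {
    // server_lifetime: time since creation, 0 = disabled.
    let lifetime_expired = m.lifetime_ms > 0
        && now.duration_since(m.created_at).as_millis() > m.lifetime_ms;
    // idle_timeout: time since last use, and only for connections that were
    // checked out at least once (recycled = Some); 0 = disabled.
    let idle_expired = match m.recycled {
        Some(last_use) => m.idle_timeout_ms > 0
            && now.duration_since(last_use).as_millis() > m.idle_timeout_ms,
        None => false, // never used: lives until server_lifetime
    };
    lifetime_expired || idle_expired
}
```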

default_min_pool_size for dynamic pools

New auth_query.default_min_pool_size (default: 0) controls minimum backend connections per dynamic user pool in passthrough mode:

  • Prewarm: When a dynamic pool is first created, a background task spawns default_min_pool_size connections without blocking the client.
  • Retain cycle: Already iterates all pools via get_all_pools() — dynamic pools with min_pool_size are automatically replenished. No changes needed in retain.rs.
  • GC protection: Dynamic pools with min_pool_size > 0 are skipped by GC. This prevents a race condition where GC could delete a pool between retain's close and replenish phases.
  • Lifecycle: Prewarmed connections live until server_lifetime, then retain closes and replenishes them. Pools persist until config RELOAD or default_min_pool_size is set to 0.

| Edge case | Behavior |
|---|---|
| default_min_pool_size = 0 (default) | Current behavior preserved: no prewarm, GC removes empty pools |
| Backend unreachable during prewarm | replenish() returns 0, warn logged. GC won't delete (min > 0). Retain retries |
| N users × min = many connections | 100 users × min=2 = 200 backend connections. Documented in fields.yaml |
| RELOAD with changed value | All dynamic pools destroyed, new ones created fresh (existing logic) |
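As a sketch, the configuration could look like this (the exact [auth_query] section layout is an assumption based on the field name above; the generated reference configs are authoritative):

```toml
[auth_query]
# New in 3.3.2: minimum backend connections maintained per dynamic user
# pool in passthrough mode. 0 (default) disables prewarm and lets GC
# remove empty pools, preserving previous behavior.
default_min_pool_size = 2
# Validation enforces default_min_pool_size <= default_pool_size.
default_pool_size = 5
```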

Details

Pool-level timeout overrides

In src/pool/mod.rs, both ServerPool::new() (controls recycle lifetime check) and PoolSettings (controls retain idle_timeout/lifetime) always used config.general.server_lifetime / config.general.idle_timeout, completely ignoring pool-level overrides.

The fix uses pool_config.server_lifetime.unwrap_or(config.general.server_lifetime.as_millis()) — the same pattern already used in config logging.
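The pattern is a one-line Option fallback. A simplified sketch with stand-in struct names (not the real config types):

```rust
// Stand-ins for the real config types; the point is the
// override-with-fallback pattern the fix applies in all six places.
struct GeneralConfig {
    server_lifetime_ms: u64,
    idle_timeout_ms: u64,
}

struct PoolConfig {
    server_lifetime: Option<u64>, // pool-level override, if set
    idle_timeout: Option<u64>,
}

fn effective_server_lifetime(pool: &PoolConfig, general: &GeneralConfig) -> u64 {
    // The buggy version always read general.server_lifetime_ms,
    // silently ignoring the pool-level override.
    pool.server_lifetime.unwrap_or(general.server_lifetime_ms)
}

fn effective_idle_timeout(pool: &PoolConfig, general: &GeneralConfig) -> u64 {
    pool.idle_timeout.unwrap_or(general.idle_timeout_ms)
}
```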

retain_connections_max quota exhaustion

When retain_connections_max > 0 and the global counter reached the limit, saturating_sub returned 0. Since 0 means "unlimited" in retain_oldest_first(), subsequent pools got unlimited closure. Because HashMap iteration order is deterministic within a process, the same pools landed after quota exhaustion every cycle and repeatedly lost all their idle connections.

Fix: early return when current_count >= max, preventing the 0-as-unlimited ambiguity.
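A minimal sketch of the fix, with hypothetical names — the idea is to signal quota exhaustion out-of-band instead of passing an ambiguous 0 downstream:

```rust
// Hypothetical helper illustrating the fix. The buggy version computed
// max.saturating_sub(closed_so_far) and handed the result to a close
// routine where 0 meant "unlimited"; once the quota was exhausted that
// 0 granted unlimited closures. Returning None makes "quota exhausted"
// unambiguous.
fn close_budget(retain_connections_max: u64, closed_so_far: u64) -> Option<u64> {
    if retain_connections_max == 0 {
        return Some(u64::MAX); // configured as unlimited
    }
    if closed_so_far >= retain_connections_max {
        return None; // quota exhausted: skip remaining pools entirely
    }
    Some(retain_connections_max - closed_so_far)
}
```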

idle_timeout jitter and 0 = disabled

idle_timeout was stored as a single pool-wide value in PoolSettings. After a traffic burst, all returned connections had the same idle start time and expired simultaneously. Now idle_timeout_ms is stored per-connection in Metrics with ±20% jitter applied at creation time — the same approach as server_lifetime.

Additionally, idle_timeout = 0 now means "disabled" (consistent with server_lifetime = 0 and PgBouncer's server_idle_timeout = 0). Previously it closed connections after ~1ms of idle time because the check was elapsed > 0ms.
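The jitter itself is simple to sketch. The real code uses the rand crate; this dependency-free stand-in mixer only illustrates the ±20% band and the 0 = disabled guard:

```rust
// Sketch: apply per-connection +/-20% jitter to a base timeout at
// creation time. splitmix64-style bit mixing stands in for the rand
// crate; 0 stays 0, so "disabled" is never jittered into a tiny value.
fn jittered_timeout_ms(base_ms: u64, seed: u64) -> u64 {
    if base_ms == 0 {
        return 0; // disabled, consistent with server_lifetime = 0
    }
    // Mix the seed into a pseudo-random fraction in [0, 1).
    let mut z = seed.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    let frac = ((z >> 11) as f64) / ((1u64 << 53) as f64);
    // Scale into the +/-20% band: factor in [0.8, 1.2).
    let factor = 0.8 + 0.4 * frac;
    (base_ms as f64 * factor) as u64
}
```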

Retain cycle fairness

The retain cycle iterated pools via HashMap::iter(), whose order is deterministic within a process (fixed RandomState seed). With retain_connections_max = 3, the same pool always consumed the entire quota while other pools' expired connections were never cleaned up. Now pools are shuffled each cycle via rand::seq::SliceRandom::shuffle.

Both retain and replenish phases now use a single get_all_pools() snapshot to avoid inconsistency if POOLS is atomically updated between them.
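A dependency-free sketch of the shuffle (the real code uses rand::seq::SliceRandom::shuffle; a Fisher-Yates pass with a toy linear-congruential generator illustrates the idea):

```rust
// Sketch of the fairness fix: randomize pool order each retain cycle so
// no single pool monopolizes the retain_connections_max quota. The LCG
// here merely stands in for the rand crate.
fn shuffle_pools<T>(pools: &mut [T], mut seed: u64) {
    for i in (1..pools.len()).rev() {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((seed >> 33) as usize) % (i + 1);
        pools.swap(i, j);
    }
}
```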

Test plan

Pool timeout overrides:

  • @pool-override-server-lifetime — pool-level server_lifetime triggers recycle
  • @pool-override-idle-timeout — pool-level idle_timeout closes idle connections
  • @pool-override-two-pools-different-lifetime — two pools with different lifetimes behave independently
  • @general-server-lifetime-baseline — general server_lifetime still works
  • @general-idle-timeout-baseline — general idle_timeout still works

Dynamic pool overrides:

  • @dynamic-pool-idle-timeout — auth_query dynamic pool inherits pool-level idle_timeout
  • @dynamic-pool-server-lifetime — auth_query dynamic pool inherits pool-level server_lifetime
  • @min-pool-size-with-pool-lifetime — min_pool_size replenishment works with pool-level server_lifetime

retain_connections_max quota:

  • @retain-max-quota-two-pools — retain_connections_max quota enforced across multiple pools

End-to-end pool lifecycle:

  • @e2e-dedicated-lifecycle — dedicated pool: prewarm (min_pool_size=2) → scale up (pool_size=5) → shrink to min_pool_size=2
  • @e2e-auth-query-lifecycle — auth_query pool: scale up (default_pool_size=5) → shrink to 0

default_min_pool_size:

  • @auth-query-min-pool-size — dynamic pool prewarm with default_min_pool_size (2 scenarios: prewarm + lifetime replenish)

Regression:

  • @min-pool-size — existing min_pool_size tests (3 scenarios)
  • @auth-query-passthrough — passthrough auth tests (4 scenarios)
  • cargo test --lib — 259 passed (includes reference config matching)

Run:

DEBUG=1 make test-bdd TAGS=@pool-lifecycle-e2e
DEBUG=1 make test-bdd TAGS=@pool-timeout-override
DEBUG=1 make test-bdd TAGS=@pool-timeout-override-dynamic
DEBUG=1 make test-bdd TAGS=@retain-max-quota
DEBUG=1 make test-bdd TAGS=@min-pool-size
DEBUG=1 make test-bdd TAGS=@auth-query-min-pool-size

🤖 Generated with Claude Code

dmitrivasilyev added 8 commits March 1, 2026 11:40
…_timeout)

These tests expose the bug where pool-level overrides for server_lifetime
and idle_timeout are ignored — general values are always used instead
(src/pool/mod.rs:401-402). Scenarios 1-3 (pool overrides) are expected to
FAIL, confirming the bug. Scenarios 4-5 (general baseline) should PASS.
Pool-level overrides for server_lifetime and idle_timeout were silently
ignored in three places: static pool creation, auth_query shared pool
creation, and dynamic pool creation. Both ServerPool::new() (which
controls recycle lifetime) and PoolSettings (which controls retain
idle_timeout) always used config.general values. Now they use
pool_config.server_lifetime/idle_timeout with general as fallback.
Verify that pool-level idle_timeout and server_lifetime overrides work
for auth_query passthrough dynamic pools, and that min_pool_size is
correctly maintained when connections expire with pool-level lifetime.

All 3 scenarios pass, confirming the fix applies to dynamic pools too.
The previous test only checked SHOW SERVERS >= 2, which didn't prove
connections were actually replaced. Now also verifies backend_pid
changes after server_lifetime expiry, confirming the full cycle:
pool-level expire → retain close → replenish with correct lifetime.
Open 5 concurrent transactions (BEGIN) to force pool to scale to
pool_size=5, record all backend PIDs, COMMIT all, wait for pool-level
server_lifetime=1000ms to expire, verify pool shrank to exactly
min_pool_size=2, and confirm all backends were replaced (new PIDs).
The default idle_timeout was set to 300,000,000ms (83 hours / 5000 minutes),
effectively disabling idle connection cleanup. This caused idle server
connections to accumulate indefinitely in production.

Changed default to 600,000ms (10 minutes) — a reasonable value that ensures
idle connections are cleaned up in a timely manner while not being too
aggressive for normal workloads.

Updated all documentation and example configs accordingly.
When retain_connections_max > 0 and the global counter reached the limit,
remaining became 0 via saturating_sub. Since 0 means "unlimited" in
retain_oldest_first(), subsequent pools got unlimited closure instead of
zero. This caused pools processed after quota exhaustion to lose ALL idle
connections in a single retain cycle.

Fix: early return when current_count >= max, preventing the 0-as-unlimited
ambiguity. Also fix doc comment that said default was 0 (unlimited) when
the actual default is 3.
Add changelog entries for all bug fixes in this branch:
- Pool-level server_lifetime/idle_timeout overrides ignored
- idle_timeout default was 83 hours instead of 10 minutes
- retain_connections_max quota exhaustion causing unlimited closure
- retain_connections_max doc comment incorrect default
@vadv vadv changed the title Fix pool-level server_lifetime and idle_timeout overrides being ignored Fix pool timeout overrides, idle_timeout default, and retain_connections_max quota bug Mar 1, 2026
dmitrivasilyev added 6 commits March 1, 2026 13:17
Two end-to-end scenarios covering the full connection pool lifecycle:

1. Dedicated pool: prewarm (verify min_pool_size=2 at startup) → scale up
   to pool_size=5 under load → shrink back to min_pool_size=2 after
   server_lifetime expiry and retain cycles.

2. Auth_query pool: scale up to default_pool_size=5 under load → shrink
   to 0 after server_lifetime expiry (min_pool_size not supported for
   dynamic pools).
The reference TOML/YAML configs were manually edited in the idle_timeout
fix but diverged from what `generate --reference` produces. Regenerated
to match the code-generated format.
Move pool lifecycle and timeout tests with significant sleep times from
@rust-3 to new @rust-4 partition:
- pool-lifecycle-e2e.feature (9.5s sleep)
- pool-timeout-overrides.feature (8.5s sleep)
- min-pool-size.feature (7.5s sleep)
- retain-max-quota.feature (4s sleep)

Before: @rust-3 had 63 scenarios with 33s total sleep.
After:  @rust-3 has 52 scenarios with 3s sleep,
        @rust-4 has 11 scenarios with 29s sleep.

Added corresponding CI job in bdd-tests.yml workflow.
Regenerates pg_doorman.toml, pg_doorman.yaml (reference configs) and
documentation/en/src/reference/ (Markdown docs) from source code.

Must be run after any changes to src/config/, fields.yaml, or
annotated.rs to keep generated files in sync with code.
The previous default of 5 minutes was shorter than idle_timeout (10 min),
which meant idle_timeout could never trigger — connections were always
killed by server_lifetime first. With server_lifetime=20m and
idle_timeout=10m, each parameter serves its purpose:
- idle_timeout (10m): cleans up connections sitting idle after load drops
- server_lifetime (20m): rotates long-lived connections for freshness
Prewarmed/replenished connections that were never used by a client have
recycled=None, so idle_timeout does not apply to them — they live until
server_lifetime expires.
@vadv vadv changed the title Fix pool timeout overrides, idle_timeout default, and retain_connections_max quota bug Fix pool timeout overrides, defaults, and retain_connections_max quota bug Mar 1, 2026
dmitrivasilyev added 5 commits March 1, 2026 15:36
…ntics

What was needed: idle_timeout had two bugs: (1) setting it to 0 closed
connections after ~1ms instead of disabling the timeout (inconsistent with
server_lifetime=0 and PgBouncer semantics), (2) it used a single pool-wide
value without jitter, causing synchronized mass closures when many connections
became idle simultaneously.

Implementation: added idle_timeout_ms to Metrics (per-connection, with ±20%
jitter) mirroring how server_lifetime already works. Base idle_timeout_ms is
stored in ServerPool and passed through to each connection's Metrics at creation
time. The retain closure now uses metrics.idle_timeout_ms with a >0 guard
(0 = disabled).
Two issues in the retain cycle:

1. retain_connections_max was unfairly distributed — HashMap iteration
order is deterministic within a process (fixed RandomState seed), so the
same pool always consumed the entire quota while other pools' expired
connections were never cleaned up. Fixed by shuffling pool order each cycle.

2. Retain and replenish phases called get_all_pools() separately. If POOLS
was atomically updated between them (config reload, dynamic pool GC), they
operated on different snapshots. Fixed by using a single snapshot for both.
…te_backend

Two scenarios verify that dynamic passthrough pools (MD5 and SCRAM)
can successfully reconnect after the backend is terminated via
pg_terminate_backend, using cached credentials from auth_query.

Also adds SCRAM-SHA-256 client support to the test PgConnection,
enabling session-based SCRAM authentication in BDD tests.
Adds default_min_pool_size to AuthQueryConfig that controls minimum backend
connections per dynamic user pool in passthrough mode. Connections are
prewarmed on first pool creation and replenished by the retain cycle.

Key changes:
- New field default_min_pool_size (default: 0 = no prewarm, backward compatible)
- Prewarm spawn in create_dynamic_pool() after POOLS.store()
- GC skips pools with min_pool_size > 0 (prevents race condition with retain)
- Validation: default_min_pool_size <= default_pool_size
- BDD tests for prewarm and lifetime replenish scenarios
@vadv vadv merged commit 4dbece6 into master Mar 1, 2026
42 checks passed
@vadv vadv deleted the fix/pool-timeout-overrides-and-retain-bypass branch March 1, 2026 16:33