
Fix pool timeout overrides, defaults, and retain_connections_max quota bug #139

Merged

vadv merged 19 commits into master from fix/pool-timeout-overrides-and-retain-bypass on Mar 1, 2026
Conversation

Collaborator

@vadv vadv commented Mar 1, 2026

Summary

  • Fix: Pool-level server_lifetime and idle_timeout overrides were silently ignored — general values were always used instead. Fixed in 6 places across 3 pool creation contexts (static pools, auth_query shared pools, dynamic pools).
  • Fix: idle_timeout default was 300,000,000ms (83 hours) instead of 600,000ms (10 minutes), effectively disabling idle connection cleanup.
  • Fix: server_lifetime default changed from 5 minutes to 20 minutes. The old value (5 min) was shorter than idle_timeout (10 min), making idle_timeout permanently inert — server_lifetime killed every connection before it could be idle for 10 minutes.
  • Fix: retain_connections_max quota exhaustion caused remaining=0 which was treated as "unlimited" by retain_oldest_first(), causing pools processed after quota exhaustion to lose ALL idle connections in a single retain cycle.
  • Fix: idle_timeout = 0 closed connections after ~1ms instead of disabling idle timeout. Now 0 means disabled, consistent with server_lifetime = 0 and PgBouncer semantics.
  • Fix: idle_timeout had no per-connection jitter — all connections in a pool expired simultaneously after a traffic burst, causing synchronized mass closures. Now applies the same ±20% jitter as server_lifetime.
  • Fix: retain_connections_max quota was unfairly distributed — HashMap iteration order is deterministic within a process, so the same pool always consumed the entire quota while other pools starved. Fixed by shuffling pool iteration order each cycle.
  • Fix: Retain and replenish phases used separate get_all_pools() snapshots. If POOLS was atomically updated between them (config reload, dynamic pool GC), they operated on different pool sets. Fixed by using a single snapshot.
  • Fix: Doc comment for retain_connections_max incorrectly stated default as 0 (unlimited); actual default is 3.
  • Feature: default_min_pool_size for dynamic auth_query passthrough pools — minimum backend connections maintained per dynamic user pool. Prewarmed on first pool creation, replenished by retain cycle. GC skips pools with min_pool_size > 0 to prevent race condition with retain.
  • Infra: make generate target for regenerating reference configs and docs. Balanced CI test partitions (@rust-4).
  • Version: Bumped to 3.3.2 with changelog.

How server_lifetime and idle_timeout interact

| Parameter | Default | Jitter | What it measures | When checked |
|---|---|---|---|---|
| server_lifetime | 20 min | ±20% per connection | Time since creation | At recycle (→ new connection for next client) AND at retain (→ close idle) |
| idle_timeout | 10 min | ±20% per connection | Time since last use | At retain only (→ close idle). 0 = disabled. |

With the new defaults (20m / 10m), each serves its purpose:

  • idle_timeout cleans up connections sitting idle after load drops (10 min ±20%)
  • server_lifetime rotates long-lived connections even under constant use (20 min ±20%)

Note: idle_timeout only applies to connections that have been used at least once. Prewarmed/replenished connections that were never checked out by a client have recycled=None and are not subject to idle_timeout — they live until server_lifetime expires.
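The interaction above can be condensed into a small decision function. The following is a sketch with assumed type and field names (Metrics, should_close, lifetime_ms, idle_timeout_ms), not the actual pg_doorman code; it only mirrors the semantics described here: lifetime counts from creation, idle counts from last use, recycled=None exempts never-used connections, and 0 disables either check.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for per-connection metrics; field names are
// assumptions based on the description above.
struct Metrics {
    created_at: Instant,
    recycled: Option<Instant>, // last return to the pool; None = never used
    lifetime_ms: u128,         // per-connection server_lifetime after jitter
    idle_timeout_ms: u128,     // per-connection idle_timeout after jitter
}

fn should_close(m: &Metrics, now: Instant) -> bool {
    // server_lifetime: time since creation, 0 = disabled.
    let lifetime_expired = m.lifetime_ms > 0
        && now.duration_since(m.created_at).as_millis() > m.lifetime_ms;
    // idle_timeout: time since last use, and only for connections that were
    // checked out at least once (recycled = Some); 0 = disabled.
    let idle_expired = match m.recycled {
        Some(last_use) => m.idle_timeout_ms > 0
            && now.duration_since(last_use).as_millis() > m.idle_timeout_ms,
        None => false, // never used: lives until server_lifetime
    };
    lifetime_expired || idle_expired
}
```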

default_min_pool_size for dynamic pools

New auth_query.default_min_pool_size (default: 0) controls minimum backend connections per dynamic user pool in passthrough mode:

  • Prewarm: When a dynamic pool is first created, a background task spawns default_min_pool_size connections without blocking the client.
  • Retain cycle: Already iterates all pools via get_all_pools() — dynamic pools with min_pool_size are automatically replenished. No changes needed in retain.rs.
  • GC protection: Dynamic pools with min_pool_size > 0 are skipped by GC. This prevents a race condition where GC could delete a pool between retain's close and replenish phases.
  • Lifecycle: Prewarmed connections live until server_lifetime, then retain closes and replenishes them. Pools persist until config RELOAD or default_min_pool_size is set to 0.

| Edge case | Behavior |
|---|---|
| default_min_pool_size = 0 (default) | Current behavior preserved: no prewarm, GC removes empty pools |
| Backend unreachable during prewarm | replenish() returns 0, warn logged. GC won't delete (min > 0). Retain retries |
| N users × min = many connections | 100 users × min=2 = 200 backend connections. Documented in fields.yaml |
| RELOAD with changed value | All dynamic pools destroyed, new ones created fresh (existing logic) |
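As a sketch, the configuration could look like this (the exact [auth_query] section layout is an assumption based on the field name above; the generated reference configs are authoritative):

```toml
[auth_query]
# New in 3.3.2: minimum backend connections maintained per dynamic user
# pool in passthrough mode. 0 (default) disables prewarm and lets GC
# remove empty pools, preserving previous behavior.
default_min_pool_size = 2
# Validation enforces default_min_pool_size <= default_pool_size.
default_pool_size = 5
```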

Details

Pool-level timeout overrides

In src/pool/mod.rs, both ServerPool::new() (controls recycle lifetime check) and PoolSettings (controls retain idle_timeout/lifetime) always used config.general.server_lifetime / config.general.idle_timeout, completely ignoring pool-level overrides.

The fix uses pool_config.server_lifetime.unwrap_or(config.general.server_lifetime.as_millis()) — the same pattern already used in config logging.
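The pattern is a one-line Option fallback. A simplified sketch with stand-in struct names (not the real config types):

```rust
// Stand-ins for the real config types; the point is the
// override-with-fallback pattern the fix applies in all six places.
struct GeneralConfig {
    server_lifetime_ms: u64,
    idle_timeout_ms: u64,
}

struct PoolConfig {
    server_lifetime: Option<u64>, // pool-level override, if set
    idle_timeout: Option<u64>,
}

fn effective_server_lifetime(pool: &PoolConfig, general: &GeneralConfig) -> u64 {
    // The buggy version always read general.server_lifetime_ms,
    // silently ignoring the pool-level override.
    pool.server_lifetime.unwrap_or(general.server_lifetime_ms)
}

fn effective_idle_timeout(pool: &PoolConfig, general: &GeneralConfig) -> u64 {
    pool.idle_timeout.unwrap_or(general.idle_timeout_ms)
}
```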

retain_connections_max quota exhaustion

When retain_connections_max > 0 and the global counter reached the limit, saturating_sub returned 0. Since 0 means "unlimited" in retain_oldest_first(), subsequent pools got unlimited closure. Because HashMap iteration order is deterministic within a process, the same pools landed after quota exhaustion every cycle and repeatedly lost all their idle connections.

Fix: early return when current_count >= max, preventing the 0-as-unlimited ambiguity.
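A minimal sketch of the fix, with hypothetical names — the idea is to signal quota exhaustion out-of-band instead of passing an ambiguous 0 downstream:

```rust
// Hypothetical helper illustrating the fix. The buggy version computed
// max.saturating_sub(closed_so_far) and handed the result to a close
// routine where 0 meant "unlimited"; once the quota was exhausted that
// 0 granted unlimited closures. Returning None makes "quota exhausted"
// unambiguous.
fn close_budget(retain_connections_max: u64, closed_so_far: u64) -> Option<u64> {
    if retain_connections_max == 0 {
        return Some(u64::MAX); // configured as unlimited
    }
    if closed_so_far >= retain_connections_max {
        return None; // quota exhausted: skip remaining pools entirely
    }
    Some(retain_connections_max - closed_so_far)
}
```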

idle_timeout jitter and 0 = disabled

idle_timeout was stored as a single pool-wide value in PoolSettings. After a traffic burst, all returned connections had the same idle start time and expired simultaneously. Now idle_timeout_ms is stored per-connection in Metrics with ±20% jitter applied at creation time — the same approach as server_lifetime.

Additionally, idle_timeout = 0 now means "disabled" (consistent with server_lifetime = 0 and PgBouncer's server_idle_timeout = 0). Previously it closed connections after ~1ms of idle time because the check was elapsed > 0ms.
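The jitter itself is simple to sketch. The real code uses the rand crate; this dependency-free stand-in mixer only illustrates the ±20% band and the 0 = disabled guard:

```rust
// Sketch: apply per-connection +/-20% jitter to a base timeout at
// creation time. splitmix64-style bit mixing stands in for the rand
// crate; 0 stays 0, so "disabled" is never jittered into a tiny value.
fn jittered_timeout_ms(base_ms: u64, seed: u64) -> u64 {
    if base_ms == 0 {
        return 0; // disabled, consistent with server_lifetime = 0
    }
    // Mix the seed into a pseudo-random fraction in [0, 1).
    let mut z = seed.wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    let frac = ((z >> 11) as f64) / ((1u64 << 53) as f64);
    // Scale into the +/-20% band: factor in [0.8, 1.2).
    let factor = 0.8 + 0.4 * frac;
    (base_ms as f64 * factor) as u64
}
```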

Retain cycle fairness

The retain cycle iterated pools via HashMap::iter(), whose order is deterministic within a process (fixed RandomState seed). With retain_connections_max = 3, the same pool always consumed the entire quota while other pools' expired connections were never cleaned up. Now pools are shuffled each cycle via rand::seq::SliceRandom::shuffle.

Both retain and replenish phases now use a single get_all_pools() snapshot to avoid inconsistency if POOLS is atomically updated between them.
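A dependency-free sketch of the shuffle (the real code uses rand::seq::SliceRandom::shuffle; a Fisher-Yates pass with a toy linear-congruential generator illustrates the idea):

```rust
// Sketch of the fairness fix: randomize pool order each retain cycle so
// no single pool monopolizes the retain_connections_max quota. The LCG
// here merely stands in for the rand crate.
fn shuffle_pools<T>(pools: &mut [T], mut seed: u64) {
    for i in (1..pools.len()).rev() {
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((seed >> 33) as usize) % (i + 1);
        pools.swap(i, j);
    }
}
```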

Test plan

Pool timeout overrides:

  • @pool-override-server-lifetime — pool-level server_lifetime triggers recycle
  • @pool-override-idle-timeout — pool-level idle_timeout closes idle connections
  • @pool-override-two-pools-different-lifetime — two pools with different lifetimes behave independently
  • @general-server-lifetime-baseline — general server_lifetime still works
  • @general-idle-timeout-baseline — general idle_timeout still works

Dynamic pool overrides:

  • @dynamic-pool-idle-timeout — auth_query dynamic pool inherits pool-level idle_timeout
  • @dynamic-pool-server-lifetime — auth_query dynamic pool inherits pool-level server_lifetime
  • @min-pool-size-with-pool-lifetime — min_pool_size replenishment works with pool-level server_lifetime

retain_connections_max quota:

  • @retain-max-quota-two-pools — retain_connections_max quota enforced across multiple pools

End-to-end pool lifecycle:

  • @e2e-dedicated-lifecycle — dedicated pool: prewarm (min_pool_size=2) → scale up (pool_size=5) → shrink to min_pool_size=2
  • @e2e-auth-query-lifecycle — auth_query pool: scale up (default_pool_size=5) → shrink to 0

default_min_pool_size:

  • @auth-query-min-pool-size — dynamic pool prewarm with default_min_pool_size (2 scenarios: prewarm + lifetime replenish)

Regression:

  • @min-pool-size — existing min_pool_size tests (3 scenarios)
  • @auth-query-passthrough — passthrough auth tests (4 scenarios)
  • cargo test --lib — 259 passed (includes reference config matching)

Run:

DEBUG=1 make test-bdd TAGS=@pool-lifecycle-e2e
DEBUG=1 make test-bdd TAGS=@pool-timeout-override
DEBUG=1 make test-bdd TAGS=@pool-timeout-override-dynamic
DEBUG=1 make test-bdd TAGS=@retain-max-quota
DEBUG=1 make test-bdd TAGS=@min-pool-size
DEBUG=1 make test-bdd TAGS=@auth-query-min-pool-size

🤖 Generated with Claude Code

dmitrivasilyev added 8 commits March 1, 2026 11:40
…_timeout)

These tests expose the bug where pool-level overrides for server_lifetime
and idle_timeout are ignored — general values are always used instead
(src/pool/mod.rs:401-402). Scenarios 1-3 (pool overrides) are expected to
FAIL, confirming the bug. Scenarios 4-5 (general baseline) should PASS.
Pool-level overrides for server_lifetime and idle_timeout were silently
ignored in three places: static pool creation, auth_query shared pool
creation, and dynamic pool creation. Both ServerPool::new() (which
controls recycle lifetime) and PoolSettings (which controls retain
idle_timeout) always used config.general values. Now they use
pool_config.server_lifetime/idle_timeout with general as fallback.
Verify that pool-level idle_timeout and server_lifetime overrides work
for auth_query passthrough dynamic pools, and that min_pool_size is
correctly maintained when connections expire with pool-level lifetime.

All 3 scenarios pass, confirming the fix applies to dynamic pools too.
The previous test only checked SHOW SERVERS >= 2, which didn't prove
connections were actually replaced. Now also verifies backend_pid
changes after server_lifetime expiry, confirming the full cycle:
pool-level expire → retain close → replenish with correct lifetime.
Open 5 concurrent transactions (BEGIN) to force pool to scale to
pool_size=5, record all backend PIDs, COMMIT all, wait for pool-level
server_lifetime=1000ms to expire, verify pool shrank to exactly
min_pool_size=2, and confirm all backends were replaced (new PIDs).
The default idle_timeout was set to 300,000,000ms (83 hours / 5000 minutes),
effectively disabling idle connection cleanup. This caused idle server
connections to accumulate indefinitely in production.

Changed default to 600,000ms (10 minutes) — a reasonable value that ensures
idle connections are cleaned up in a timely manner while not being too
aggressive for normal workloads.

Updated all documentation and example configs accordingly.
When retain_connections_max > 0 and the global counter reached the limit,
remaining became 0 via saturating_sub. Since 0 means "unlimited" in
retain_oldest_first(), subsequent pools got unlimited closure instead of
zero. This caused pools processed after quota exhaustion to lose ALL idle
connections in a single retain cycle.

Fix: early return when current_count >= max, preventing the 0-as-unlimited
ambiguity. Also fix doc comment that said default was 0 (unlimited) when
the actual default is 3.
Add changelog entries for all bug fixes in this branch:
- Pool-level server_lifetime/idle_timeout overrides ignored
- idle_timeout default was 83 hours instead of 10 minutes
- retain_connections_max quota exhaustion causing unlimited closure
- retain_connections_max doc comment incorrect default
@vadv vadv changed the title Fix pool-level server_lifetime and idle_timeout overrides being ignored Fix pool timeout overrides, idle_timeout default, and retain_connections_max quota bug Mar 1, 2026
dmitrivasilyev added 6 commits March 1, 2026 13:17
Two end-to-end scenarios covering the full connection pool lifecycle:

1. Dedicated pool: prewarm (verify min_pool_size=2 at startup) → scale up
   to pool_size=5 under load → shrink back to min_pool_size=2 after
   server_lifetime expiry and retain cycles.

2. Auth_query pool: scale up to default_pool_size=5 under load → shrink
   to 0 after server_lifetime expiry (min_pool_size not supported for
   dynamic pools).
The reference TOML/YAML configs were manually edited in the idle_timeout
fix but diverged from what `generate --reference` produces. Regenerated
to match the code-generated format.
Move pool lifecycle and timeout tests with significant sleep times from
@rust-3 to new @rust-4 partition:
- pool-lifecycle-e2e.feature (9.5s sleep)
- pool-timeout-overrides.feature (8.5s sleep)
- min-pool-size.feature (7.5s sleep)
- retain-max-quota.feature (4s sleep)

Before: @rust-3 had 63 scenarios with 33s total sleep.
After:  @rust-3 has 52 scenarios with 3s sleep,
        @rust-4 has 11 scenarios with 29s sleep.

Added corresponding CI job in bdd-tests.yml workflow.
Regenerates pg_doorman.toml, pg_doorman.yaml (reference configs) and
documentation/en/src/reference/ (Markdown docs) from source code.

Must be run after any changes to src/config/, fields.yaml, or
annotated.rs to keep generated files in sync with code.
The previous default of 5 minutes was shorter than idle_timeout (10 min),
which meant idle_timeout could never trigger — connections were always
killed by server_lifetime first. With server_lifetime=20m and
idle_timeout=10m, each parameter serves its purpose:
- idle_timeout (10m): cleans up connections sitting idle after load drops
- server_lifetime (20m): rotates long-lived connections for freshness
Prewarmed/replenished connections that were never used by a client have
recycled=None, so idle_timeout does not apply to them — they live until
server_lifetime expires.
@vadv vadv changed the title Fix pool timeout overrides, idle_timeout default, and retain_connections_max quota bug Fix pool timeout overrides, defaults, and retain_connections_max quota bug Mar 1, 2026
dmitrivasilyev added 5 commits March 1, 2026 15:36
…ntics

What was needed: idle_timeout had two bugs: (1) setting it to 0 closed
connections after ~1ms instead of disabling the timeout (inconsistent with
server_lifetime=0 and PgBouncer semantics), (2) it used a single pool-wide
value without jitter, causing synchronized mass closures when many connections
became idle simultaneously.

Implementation: added idle_timeout_ms to Metrics (per-connection, with ±20%
jitter) mirroring how server_lifetime already works. Base idle_timeout_ms is
stored in ServerPool and passed through to each connection's Metrics at creation
time. The retain closure now uses metrics.idle_timeout_ms with a >0 guard
(0 = disabled).
Two issues in the retain cycle:

1. retain_connections_max was unfairly distributed — HashMap iteration
order is deterministic within a process (fixed RandomState seed), so the
same pool always consumed the entire quota while other pools' expired
connections were never cleaned up. Fixed by shuffling pool order each cycle.

2. Retain and replenish phases called get_all_pools() separately. If POOLS
was atomically updated between them (config reload, dynamic pool GC), they
operated on different snapshots. Fixed by using a single snapshot for both.
…te_backend

Two scenarios verify that dynamic passthrough pools (MD5 and SCRAM)
can successfully reconnect after the backend is terminated via
pg_terminate_backend, using cached credentials from auth_query.

Also adds SCRAM-SHA-256 client support to the test PgConnection,
enabling session-based SCRAM authentication in BDD tests.
Adds default_min_pool_size to AuthQueryConfig that controls minimum backend
connections per dynamic user pool in passthrough mode. Connections are
prewarmed on first pool creation and replenished by the retain cycle.

Key changes:
- New field default_min_pool_size (default: 0 = no prewarm, backward compatible)
- Prewarm spawn in create_dynamic_pool() after POOLS.store()
- GC skips pools with min_pool_size > 0 (prevents race condition with retain)
- Validation: default_min_pool_size <= default_pool_size
- BDD tests for prewarm and lifetime replenish scenarios
@vadv vadv merged commit 4dbece6 into master Mar 1, 2026
42 checks passed
@vadv vadv deleted the fix/pool-timeout-overrides-and-retain-bypass branch March 1, 2026 16:33