Fix pool timeout overrides, defaults, and retain_connections_max quota bug#139
Merged
Conversation
added 8 commits
March 1, 2026 11:40
…_timeout) These tests expose the bug where pool-level overrides for server_lifetime and idle_timeout are ignored — general values are always used instead (src/pool/mod.rs:401-402). Scenarios 1-3 (pool overrides) are expected to FAIL, confirming the bug. Scenarios 4-5 (general baseline) should PASS.
Pool-level overrides for server_lifetime and idle_timeout were silently ignored in three places: static pool creation, auth_query shared pool creation, and dynamic pool creation. Both ServerPool::new() (which controls recycle lifetime) and PoolSettings (which controls retain idle_timeout) always used config.general values. Now they use pool_config.server_lifetime/idle_timeout with general as fallback.
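The override-with-fallback pattern can be sketched minimally (struct and field names here are hypothetical simplifications of the actual config types):

```rust
// Hypothetical simplified config types: a pool-level Option override
// falls back to the general value when unset.
struct GeneralConfig {
    server_lifetime_ms: u64,
}

struct PoolConfig {
    server_lifetime: Option<u64>,
}

// Pool-level override wins; otherwise use the general default.
fn effective_server_lifetime(pool: &PoolConfig, general: &GeneralConfig) -> u64 {
    pool.server_lifetime.unwrap_or(general.server_lifetime_ms)
}

fn main() {
    let general = GeneralConfig { server_lifetime_ms: 1_200_000 };
    let overridden = PoolConfig { server_lifetime: Some(1_000) };
    let inherited = PoolConfig { server_lifetime: None };
    assert_eq!(effective_server_lifetime(&overridden, &general), 1_000);
    assert_eq!(effective_server_lifetime(&inherited, &general), 1_200_000);
}
```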
Verify that pool-level idle_timeout and server_lifetime overrides work for auth_query passthrough dynamic pools, and that min_pool_size is correctly maintained when connections expire with pool-level lifetime. All 3 scenarios pass, confirming the fix applies to dynamic pools too.
The previous test only checked SHOW SERVERS >= 2, which didn't prove connections were actually replaced. Now also verifies backend_pid changes after server_lifetime expiry, confirming the full cycle: pool-level expire → retain close → replenish with correct lifetime.
Open 5 concurrent transactions (BEGIN) to force pool to scale to pool_size=5, record all backend PIDs, COMMIT all, wait for pool-level server_lifetime=1000ms to expire, verify pool shrank to exactly min_pool_size=2, and confirm all backends were replaced (new PIDs).
The default idle_timeout was set to 300,000,000ms (83 hours / 5000 minutes), effectively disabling idle connection cleanup. This caused idle server connections to accumulate indefinitely in production. Changed default to 600,000ms (10 minutes) — a reasonable value that ensures idle connections are cleaned up in a timely manner while not being too aggressive for normal workloads. Updated all documentation and example configs accordingly.
When retain_connections_max > 0 and the global counter reached the limit, remaining became 0 via saturating_sub. Since 0 means "unlimited" in retain_oldest_first(), subsequent pools got unlimited closure instead of zero. This caused pools processed after quota exhaustion to lose ALL idle connections in a single retain cycle. Fix: early return when current_count >= max, preventing the 0-as-unlimited ambiguity. Also fix doc comment that said default was 0 (unlimited) when the actual default is 3.
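A minimal sketch of the quota logic described above (function name and shape are assumptions; `None` models the fixed early return):

```rust
// Remaining closure budget for a pool during one retain cycle.
// Semantics assumed from the description:
//   max == 0       -> quota disabled (unlimited closures allowed)
//   current >= max -> quota exhausted: close nothing (the fixed early return)
// The bug: max.saturating_sub(current) at exhaustion yields 0, which
// downstream retain_oldest_first() interprets as "unlimited".
fn remaining_quota(current: u64, max: u64) -> Option<u64> {
    if max == 0 {
        return Some(0); // 0 configured = unlimited, intentional
    }
    if current >= max {
        return None; // exhausted: skip this pool entirely
    }
    Some(max - current) // positive budget left
}

fn main() {
    assert_eq!(remaining_quota(0, 0), Some(0)); // unlimited
    assert_eq!(remaining_quota(3, 3), None);    // exhausted, not "unlimited"
    assert_eq!(remaining_quota(1, 3), Some(2)); // budget left
}
```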
Add changelog entries for all bug fixes in this branch:
- Pool-level server_lifetime/idle_timeout overrides ignored
- idle_timeout default was 83 hours instead of 10 minutes
- retain_connections_max quota exhaustion causing unlimited closure
- retain_connections_max doc comment incorrect default
added 6 commits
March 1, 2026 13:17
Two end-to-end scenarios covering the full connection pool lifecycle: 1. Dedicated pool: prewarm (verify min_pool_size=2 at startup) → scale up to pool_size=5 under load → shrink back to min_pool_size=2 after server_lifetime expiry and retain cycles. 2. Auth_query pool: scale up to default_pool_size=5 under load → shrink to 0 after server_lifetime expiry (min_pool_size not supported for dynamic pools).
The reference TOML/YAML configs were manually edited in the idle_timeout fix but diverged from what `generate --reference` produces. Regenerated to match the code-generated format.
Move pool lifecycle and timeout tests with significant sleep times from
@rust-3 to new @rust-4 partition:
- pool-lifecycle-e2e.feature (9.5s sleep)
- pool-timeout-overrides.feature (8.5s sleep)
- min-pool-size.feature (7.5s sleep)
- retain-max-quota.feature (4s sleep)
Before: @rust-3 had 63 scenarios with 33s total sleep.
After: @rust-3 has 52 scenarios with 3s sleep,
@rust-4 has 11 scenarios with 29s sleep.
Added corresponding CI job in bdd-tests.yml workflow.
Regenerates pg_doorman.toml, pg_doorman.yaml (reference configs) and documentation/en/src/reference/ (Markdown docs) from source code. Must be run after any changes to src/config/, fields.yaml, or annotated.rs to keep generated files in sync with code.
The previous default of 5 minutes was shorter than idle_timeout (10 min), which meant idle_timeout could never trigger — connections were always killed by server_lifetime first. With server_lifetime=20m and idle_timeout=10m, each parameter serves its purpose:
- idle_timeout (10m): cleans up connections sitting idle after load drops
- server_lifetime (20m): rotates long-lived connections for freshness
Prewarmed/replenished connections that were never used by a client have recycled=None, so idle_timeout does not apply to them — they live until server_lifetime expires.
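Assuming millisecond units as used elsewhere in this PR, the new defaults would correspond to a config like the following (section and field placement are illustrative, not taken from the reference config):

```toml
[general]
server_lifetime = 1200000  # 20 minutes: rotates long-lived connections for freshness
idle_timeout = 600000      # 10 minutes: must be shorter than server_lifetime to ever fire
```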
added 5 commits
March 1, 2026 15:36
…ntics What was needed: idle_timeout had two bugs: (1) setting it to 0 closed connections after ~1ms instead of disabling the timeout (inconsistent with server_lifetime=0 and PgBouncer semantics), (2) it used a single pool-wide value without jitter, causing synchronized mass closures when many connections became idle simultaneously. Implementation: added idle_timeout_ms to Metrics (per-connection, with ±20% jitter) mirroring how server_lifetime already works. Base idle_timeout_ms is stored in ServerPool and passed through to each connection's Metrics at creation time. The retain closure now uses metrics.idle_timeout_ms with a >0 guard (0 = disabled).
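The per-connection jitter with the `0 = disabled` guard could look like this (a sketch; the real code presumably draws the jitter from an RNG at connection creation — here it is a parameter so the sketch stays deterministic):

```rust
// Apply ±20% jitter to a base idle timeout. `jitter_unit` is a random
// value in [0.0, 1.0); base_ms == 0 disables the timeout entirely,
// matching server_lifetime = 0 and PgBouncer semantics.
fn jittered_idle_timeout_ms(base_ms: u64, jitter_unit: f64) -> u64 {
    if base_ms == 0 {
        return 0; // 0 = disabled, never expires via idle_timeout
    }
    let factor = 0.8 + 0.4 * jitter_unit; // maps [0.0, 1.0) -> [0.8, 1.2)
    (base_ms as f64 * factor) as u64
}

fn main() {
    assert_eq!(jittered_idle_timeout_ms(0, 0.5), 0); // disabled stays disabled
    assert_eq!(jittered_idle_timeout_ms(600_000, 0.0), 480_000); // -20%
}
```

Spreading each connection's deadline across a ±20% window breaks up the synchronized mass closures described above.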
Two issues in the retain cycle: 1. retain_connections_max was unfairly distributed — HashMap iteration order is deterministic within a process (fixed RandomState seed), so the same pool always consumed the entire quota while other pools' expired connections were never cleaned up. Fixed by shuffling pool order each cycle. 2. Retain and replenish phases called get_all_pools() separately. If POOLS was atomically updated between them (config reload, dynamic pool GC), they operated on different snapshots. Fixed by using a single snapshot for both.
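The fairness fix (shuffle pool order before distributing the quota) can be sketched without the `rand` crate by plugging a stand-in RNG into Fisher–Yates; the actual fix uses `rand::seq::SliceRandom::shuffle`:

```rust
// Fisher–Yates shuffle driven by a simple LCG stand-in for a real RNG,
// so each retain cycle visits pools in a different order and no single
// pool monopolizes the retain_connections_max quota.
fn shuffle_pools<T>(pools: &mut [T], mut seed: u64) {
    for i in (1..pools.len()).rev() {
        // LCG step (Knuth's MMIX constants) standing in for a proper RNG.
        seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((seed >> 33) as usize) % (i + 1);
        pools.swap(i, j);
    }
}

fn main() {
    let mut pools = vec!["pool_a", "pool_b", "pool_c", "pool_d"];
    shuffle_pools(&mut pools, 42);
    // Still the same set of pools, just reordered.
    let mut sorted = pools.clone();
    sorted.sort();
    assert_eq!(sorted, vec!["pool_a", "pool_b", "pool_c", "pool_d"]);
}
```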
…te_backend Two scenarios verify that dynamic passthrough pools (MD5 and SCRAM) can successfully reconnect after the backend is terminated via pg_terminate_backend, using cached credentials from auth_query. Also adds SCRAM-SHA-256 client support to the test PgConnection, enabling session-based SCRAM authentication in BDD tests.
Adds default_min_pool_size to AuthQueryConfig that controls minimum backend connections per dynamic user pool in passthrough mode. Connections are prewarmed on first pool creation and replenished by the retain cycle. Key changes:
- New field default_min_pool_size (default: 0 = no prewarm, backward compatible)
- Prewarm spawn in create_dynamic_pool() after POOLS.store()
- GC skips pools with min_pool_size > 0 (prevents race condition with retain)
- Validation: default_min_pool_size <= default_pool_size
- BDD tests for prewarm and lifetime replenish scenarios
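The validation rule can be sketched as follows (field names from the commit message; the function signature is hypothetical):

```rust
// default_min_pool_size must not exceed default_pool_size, otherwise
// prewarm/replenish would try to hold more connections than the pool allows.
fn validate_min_pool_size(min: u32, pool_size: u32) -> Result<(), String> {
    if min > pool_size {
        return Err(format!(
            "default_min_pool_size ({min}) must be <= default_pool_size ({pool_size})"
        ));
    }
    Ok(())
}

fn main() {
    assert!(validate_min_pool_size(0, 5).is_ok()); // default: no prewarm
    assert!(validate_min_pool_size(2, 5).is_ok());
    assert!(validate_min_pool_size(6, 5).is_err());
}
```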
Summary
- Pool-level `server_lifetime` and `idle_timeout` overrides were silently ignored — general values were always used instead. Fixed in 6 places across 3 pool creation contexts (static pools, auth_query shared pools, dynamic pools).
- `idle_timeout` default was 300,000,000 ms (83 hours) instead of 600,000 ms (10 minutes), effectively disabling idle connection cleanup.
- `server_lifetime` default changed from 5 minutes to 20 minutes. The old value (5 min) was shorter than `idle_timeout` (10 min), making `idle_timeout` permanently inert — `server_lifetime` killed every connection before it could be idle for 10 minutes.
- `retain_connections_max` quota exhaustion produced `remaining = 0`, which `retain_oldest_first()` treated as "unlimited", so pools processed after quota exhaustion lost ALL idle connections in a single retain cycle.
- `idle_timeout = 0` closed connections after ~1 ms instead of disabling the idle timeout. Now `0` means disabled, consistent with `server_lifetime = 0` and PgBouncer semantics.
- `idle_timeout` had no per-connection jitter — all connections in a pool expired simultaneously after a traffic burst, causing synchronized mass closures. Now the same ±20% jitter as `server_lifetime` applies.
- `retain_connections_max` quota was unfairly distributed — HashMap iteration order is deterministic within a process, so the same pool always consumed the entire quota while other pools starved. Fixed by shuffling pool iteration order each cycle.
- Retain and replenish phases used separate `get_all_pools()` snapshots. If POOLS was atomically updated between them (config reload, dynamic pool GC), they operated on different pool sets. Fixed by using a single snapshot.
- The doc comment for `retain_connections_max` incorrectly stated the default as `0` (unlimited); the actual default is `3`.
- Added `default_min_pool_size` for dynamic auth_query passthrough pools — minimum backend connections maintained per dynamic user pool, prewarmed on first pool creation and replenished by the retain cycle. GC skips pools with `min_pool_size > 0` to prevent a race condition with retain.
- Added a `make generate` target for regenerating reference configs and docs.
- Balanced CI test partitions (@rust-4).

How `server_lifetime` and `idle_timeout` interact

With the new defaults (20m / 10m), each serves its purpose:
- `idle_timeout` cleans up connections sitting idle after load drops (10 min ±20%)
- `server_lifetime` rotates long-lived connections even under constant use (20 min ±20%)

Note: `idle_timeout` only applies to connections that have been used at least once. Prewarmed/replenished connections that were never checked out by a client have `recycled=None` and are not subject to `idle_timeout` — they live until `server_lifetime` expires.

New `default_min_pool_size` for dynamic pools

New `auth_query.default_min_pool_size` (default: `0`) controls minimum backend connections per dynamic user pool in passthrough mode:
- Prewarm: on first pool creation, `default_min_pool_size` connections are prewarmed without blocking the client.
- Replenish: the retain cycle already iterates `get_all_pools()` — dynamic pools with `min_pool_size` are automatically replenished. No changes needed in `retain.rs`.
- GC: pools with `min_pool_size > 0` are skipped by GC. This prevents a race condition where GC could delete a pool between retain's close and replenish phases.
- Lifecycle: connections expire via `server_lifetime`, then retain closes and replenishes them. Pools persist until config RELOAD or `default_min_pool_size` is set to `0`.
- Backward compatibility: `default_min_pool_size = 0` (default) means no prewarm.
- Failure handling: if `replenish()` returns 0, a warning is logged; GC won't delete the pool (min > 0) and retain retries on the next cycle.

Details
Pool-level timeout overrides
In `src/pool/mod.rs`, both `ServerPool::new()` (controls the recycle lifetime check) and `PoolSettings` (controls retain idle_timeout/lifetime) always used `config.general.server_lifetime` / `config.general.idle_timeout`, completely ignoring pool-level overrides.

The fix uses `pool_config.server_lifetime.unwrap_or(config.general.server_lifetime.as_millis())` — the same pattern already used in config logging.

retain_connections_max quota exhaustion
When `retain_connections_max > 0` and the global counter reached the limit, `saturating_sub` returned `0`. Since `0` means "unlimited" in `retain_oldest_first()`, subsequent pools got unlimited closure. HashMap iteration order varies across process restarts, so the bug manifested non-deterministically between runs.

Fix: early return when `current_count >= max`, preventing the 0-as-unlimited ambiguity.

idle_timeout jitter and 0 = disabled

`idle_timeout` was stored as a single pool-wide value in `PoolSettings`. After a traffic burst, all returned connections had the same idle start time and expired simultaneously. Now `idle_timeout_ms` is stored per-connection in `Metrics` with ±20% jitter applied at creation time — the same approach as `server_lifetime`.

Additionally, `idle_timeout = 0` now means "disabled" (consistent with `server_lifetime = 0` and PgBouncer's `server_idle_timeout = 0`). Previously it closed connections after ~1 ms of idle time because the check was `elapsed > 0ms`.

Retain cycle fairness
The retain cycle iterated pools via `HashMap::iter()`, whose order is deterministic within a process (fixed `RandomState` seed). With `retain_connections_max = 3`, the same pool always consumed the entire quota while other pools' expired connections were never cleaned up. Now pools are shuffled each cycle via `rand::seq::SliceRandom::shuffle`.

Both retain and replenish phases now use a single `get_all_pools()` snapshot to avoid inconsistency if `POOLS` is atomically updated between them.

Test plan
Pool timeout overrides:
- `@pool-override-server-lifetime` — pool-level server_lifetime triggers recycle
- `@pool-override-idle-timeout` — pool-level idle_timeout closes idle connections
- `@pool-override-two-pools-different-lifetime` — two pools with different lifetimes behave independently
- `@general-server-lifetime-baseline` — general server_lifetime still works
- `@general-idle-timeout-baseline` — general idle_timeout still works

Dynamic pool overrides:
- `@dynamic-pool-idle-timeout` — auth_query dynamic pool inherits pool-level idle_timeout
- `@dynamic-pool-server-lifetime` — auth_query dynamic pool inherits pool-level server_lifetime
- `@min-pool-size-with-pool-lifetime` — min_pool_size replenishment works with pool-level server_lifetime

retain_connections_max quota:
- `@retain-max-quota-two-pools` — retain_connections_max quota enforced across multiple pools

End-to-end pool lifecycle:
- `@e2e-dedicated-lifecycle` — dedicated pool: prewarm (min_pool_size=2) → scale up (pool_size=5) → shrink to min_pool_size=2
- `@e2e-auth-query-lifecycle` — auth_query pool: scale up (default_pool_size=5) → shrink to 0

default_min_pool_size:
- `@auth-query-min-pool-size` — dynamic pool prewarm with default_min_pool_size (2 scenarios: prewarm + lifetime replenish)

Regression:
- `@min-pool-size` — existing min_pool_size tests (3 scenarios)
- `@auth-query-passthrough` — passthrough auth tests (4 scenarios)
- `cargo test --lib` — 259 passed (includes reference config matching)

Run:
🤖 Generated with Claude Code