PRD: PodDisruptionBudgets require HA (2+ replicas) for stateful services #2497


Context

During the dome-prod capacity stabilization work (PRs #2491, #2492), we identified that none of the 16 single-replica StatefulSets in production have PodDisruptionBudgets (PDBs). This means any voluntary disruption — node drain, cluster upgrade, autoscaler scale-down — can kill critical databases and stateful services without warning.

However, adding PDBs without first addressing the single-replica problem creates a different issue. This issue documents the full picture.

What is a PDB and why we need one

A PDB tells Kubernetes: "during voluntary disruptions, keep at least N pods running." Without PDBs:

  • kubectl drain (node maintenance) evicts pods immediately
  • Cluster upgrades kill pods one node at a time with no availability guarantee
  • Autoscaler can remove nodes hosting critical databases

With PDBs, the eviction API respects availability constraints. But PDBs only work meaningfully with 2+ replicas — you can't keep "at least 1 of 1" running while also evicting it.
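For reference, a minimal PDB manifest looks like this (the name, namespace, and selector below are illustrative, not taken from dome-prod):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # illustrative name
  namespace: example-ns      # illustrative namespace
spec:
  minAvailable: 1            # keep at least 1 pod running during voluntary disruptions
  selector:
    matchLabels:
      app: example-service   # must match the labels of the pods the budget protects
```

Note that `minAvailable` and `maxUnavailable` are mutually exclusive within a single PDB, and either can be given as an integer or a percentage.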

Current state (dome-prod)

Existing PDBs (3)

| PDB | Namespace | MaxUnavailable |
|---|---|---|
| dome-dss-vault-server | in2 | 1 |
| wallet-vault-server | in2 | 1 |
| elasticsearch-master-pdb | search-engine | 1 |

StatefulSets without PDBs (16, all single-replica)

Critical (data loss / outage risk):

| StatefulSet | Namespace | Replicas | Impact if evicted |
|---|---|---|---|
| mysql-til | marketplace | 1 | TIL, CCS, trusted-issuers-list all go down |
| bae-marketplace-biz-ecosystem-logic-proxy | marketplace | 1 | Marketplace frontend unavailable |
| cs-identity-keycloak | cs-identity | 1 | Authentication broken for all users |
| cs-identity-postgresql | cs-identity | 1 | Keycloak loses its database |

Medium (service degradation):

| StatefulSet | Namespace | Replicas | Impact if evicted |
|---|---|---|---|
| elasticsearch-master (zammad) | zammad | 1 | Ticketing search broken |
| zammad-postgresql | zammad | 1 | Ticketing data unavailable |
| zammad2-postgresql | zammad2 | 1 | Second ticketing instance down |
| zammad2-elasticsearch-master | zammad2 | 1 | Second ticketing search broken |
| loki-distributed-ingester | loki-distributed | 1 | Log ingestion stops |

Low (non-critical):

| StatefulSet | Namespace | Replicas | Impact if evicted |
|---|---|---|---|
| argocd-application-controller | argocd | 1 | GitOps paused (no new deploys) |
| mysql-knowledgebase | knowledgebase | 1 | Bookstack unavailable |
| dekra-postgres | dome-certification | 1 | Certification service down |
| zammad-redis-master | zammad | 1 | Ticketing sessions lost |
| zammad2-redis-master | zammad2 | 1 | Second ticketing sessions lost |
| wallet-vault-server | in2 | 1 | Wallet vault (has a PDB already) |
| prometheus/alertmanager | kube-prometheus-stack | 1 | Monitoring gap |

Multi-replica deployments without PDBs

| Deployment | Namespace | Replicas | Notes |
|---|---|---|---|
| zammad-railsserver | zammad | 2 | PDB would work here, but non-critical |
| zammad2-railsserver | zammad2 | 2 | Same |
| coredns | kube-system | 2 | Cluster DNS; should have a PDB |

The single-replica PDB dilemma

With 1 replica, there are only two PDB options, both problematic:

Option A: maxUnavailable: 0

  • Effect: Pod can never be voluntarily evicted
  • Problem: Blocks all node drains, cluster upgrades, and autoscaler operations. Operations team must manually delete the PDB before any maintenance, then recreate it after.
  • When it makes sense: Never, for routine operations. Only as a temporary "do not touch" signal.

Option B: maxUnavailable: 1

  • Effect: Allows eviction of the single pod (same as having no PDB)
  • Problem: Provides no actual protection. The only benefit is that the eviction is still checked against the PDB by the Eviction API, so the disruption shows up in the PDB status and in audit logs.
  • When it makes sense: Compliance/audit requirements only.
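Expressed as a manifest, the two options differ in a single field. A sketch using mysql-til as the example (the label selector is an assumption; check the actual pod labels before applying):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-til-pdb              # illustrative name
  namespace: marketplace
spec:
  maxUnavailable: 0                # Option A: eviction of the single pod is always denied
  # maxUnavailable: 1              # Option B: eviction is always allowed (audit visibility only)
  selector:
    matchLabels:
      app.kubernetes.io/instance: mysql-til   # assumed label; verify against the StatefulSet's pod template
```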

Conclusion: PDBs on single-replica services are not a solution. The prerequisite is scaling to 2+ replicas.

What each critical service needs for proper HA + PDB

1. mysql-til (marketplace)

  • Current: Bitnami MySQL 8.0.31, standalone, 1 replica
  • Required changes:
    • Enable MySQL replication: architecture: replication in values.yaml
    • Configure secondary.replicaCount: 1
    • Set primary.pdb.create: true, primary.pdb.maxUnavailable: 0
    • Set secondary.pdb.create: true, secondary.pdb.maxUnavailable: 0
    • Verify applications (TIL, CCS) handle failover (connect to MySQL service, not pod directly)
  • Risk: Replication setup requires data migration or a maintenance window. Read-after-write consistency must be verified.
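A sketch of the corresponding values.yaml for the Bitnami MySQL chart, using the keys listed above (exact paths should be verified against the chart version in use):

```yaml
architecture: replication      # switch from standalone to primary/secondary replication
primary:
  pdb:
    create: true
    maxUnavailable: 0          # block voluntary eviction of the primary
secondary:
  replicaCount: 1              # one replica alongside the primary
  pdb:
    create: true
    maxUnavailable: 0          # block voluntary eviction of the secondary
```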

2. cs-identity-keycloak

  • Current: Bitnami Keycloak 24.0.4, 1 replica
  • Required changes:
    • Set replicaCount: 2
    • Set pdb.create: true, pdb.minAvailable: 1
    • Keycloak natively supports clustering via Infinispan — should work with multiple replicas out of the box
    • Verify session replication works (distributed caches)
  • Risk: Low — Keycloak is designed for HA. May need cache.stack: kubernetes or similar JGroups/DNS_PING config.
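A sketch of the Bitnami Keycloak values, using the keys listed above (the cache/JGroups setting is chart-version dependent and is therefore only noted in a comment):

```yaml
replicaCount: 2        # two Keycloak pods so one stays available during a drain
pdb:
  create: true
  minAvailable: 1      # keep at least one pod serving authentication
# Infinispan session replication should work out of the box; if it does not,
# the JGroups discovery stack (e.g. a Kubernetes/DNS_PING stack) may need
# explicit configuration; the exact key depends on the chart version.
```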

3. cs-identity-postgresql

  • Current: Bitnami PostgreSQL 16.3.0, primary only
  • Required changes:
    • Enable replication: architecture: replication in values.yaml
    • Configure readReplicas.replicaCount: 1
    • Set primary.pdb.create: true, primary.pdb.maxUnavailable: 0
    • Keycloak connects to the primary service — no app changes needed
  • Risk: Streaming replication is straightforward for PostgreSQL. Needs a short maintenance window for initial replica sync.
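The equivalent values.yaml sketch for the Bitnami PostgreSQL chart, using the keys listed above:

```yaml
architecture: replication    # primary plus streaming read replica
readReplicas:
  replicaCount: 1
primary:
  pdb:
    create: true
    maxUnavailable: 0        # block voluntary eviction of the primary
```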

4. bae-marketplace-biz-ecosystem-logic-proxy

  • Current: StatefulSet, 1 replica, managed by bae-marketplace Helm chart
  • Required changes:
    • Increase replicas in BAE chart values
    • Verify the logic-proxy supports horizontal scaling (stateless request handling with shared MongoDB backend)
    • Add PDB via chart values or standalone manifest
  • Risk: Need to verify the app doesn't use local state/sessions. MongoDB is the shared backend, so this should scale horizontally.
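If the BAE chart does not expose PDB values, a standalone manifest along these lines would work once the replica count is increased (the selector labels are assumptions; match them to the StatefulSet's pod template):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bae-logic-proxy-pdb        # illustrative name
  namespace: marketplace
spec:
  minAvailable: 1                  # only meaningful once replicas >= 2
  selector:
    matchLabels:
      app.kubernetes.io/name: biz-ecosystem-logic-proxy   # assumed label; verify before applying
```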

Recommended approach

Phase 1 — Quick wins (no replication needed)

  • Keycloak → scale to 2 replicas + PDB (natively supports clustering)
  • coredns → add PDB minAvailable: 1 (already 2 replicas, just needs PDB)
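A sketch of the coredns PDB (the `k8s-app: kube-dns` selector is the conventional label for coredns pods; verify it on this cluster):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb          # illustrative name
  namespace: kube-system
spec:
  minAvailable: 1            # always keep one DNS pod while the other node is drained
  selector:
    matchLabels:
      k8s-app: kube-dns      # common coredns pod label; confirm against the Deployment
```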

Phase 2 — Database HA (requires planning + maintenance window)

  • cs-identity-postgresql → enable streaming replication + PDB
  • mysql-til → enable MySQL replication + PDB

Phase 3 — Remaining services

  • logic-proxy → scale + PDB (verify statelessness first)
  • Loki ingester → scale to 2 + PDB (Loki supports this natively)
  • Zammad/Zammad2 databases — only if ticketing becomes critical

Related PRs

  • #2491, #2492 (dome-prod capacity stabilization)
