PRD: PodDisruptionBudgets require HA (2+ replicas) for stateful services #2497


Context

During the dome-prod capacity stabilization work (PRs #2491, #2492), we identified that none of the 16 single-replica StatefulSets in production have PodDisruptionBudgets (PDBs). This means any voluntary disruption — node drain, cluster upgrade, autoscaler scale-down — can kill critical databases and stateful services without warning.

However, adding PDBs without first addressing the single-replica problem creates a different issue. This issue documents the full picture.

What is a PDB and why we need one

A PDB tells Kubernetes: "during voluntary disruptions, keep at least N pods running." Without PDBs:

  • kubectl drain (node maintenance) evicts pods immediately
  • Cluster upgrades kill pods one node at a time with no availability guarantee
  • Autoscaler can remove nodes hosting critical databases

With PDBs, the eviction API respects availability constraints. But PDBs only work meaningfully with 2+ replicas — you can't keep "at least 1 of 1" running while also evicting it.
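For reference, a minimal PDB manifest looks like this (the name, namespace, and selector below are illustrative, not taken from dome-prod):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb          # illustrative name
  namespace: example-ns      # illustrative namespace
spec:
  minAvailable: 1            # keep at least 1 pod running during voluntary disruptions
  selector:
    matchLabels:
      app: example-service   # must match the labels of the pods the budget protects
```

Note that `minAvailable` and `maxUnavailable` are mutually exclusive within a single PDB, and either can be given as an integer or a percentage.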

Current state (dome-prod)

Existing PDBs (3)

| PDB | Namespace | MaxUnavailable |
|---|---|---|
| dome-dss-vault-server | in2 | 1 |
| wallet-vault-server | in2 | 1 |
| elasticsearch-master-pdb | search-engine | 1 |

StatefulSets without PDBs (16, all single-replica)

Critical (data loss / outage risk):

| StatefulSet | Namespace | Replicas | Impact if evicted |
|---|---|---|---|
| mysql-til | marketplace | 1 | TIL, CCS, trusted-issuers-list all go down |
| bae-marketplace-biz-ecosystem-logic-proxy | marketplace | 1 | Marketplace frontend unavailable |
| cs-identity-keycloak | cs-identity | 1 | Authentication broken for all users |
| cs-identity-postgresql | cs-identity | 1 | Keycloak loses its database |

Medium (service degradation):

| StatefulSet | Namespace | Replicas | Impact if evicted |
|---|---|---|---|
| elasticsearch-master (zammad) | zammad | 1 | Ticketing search broken |
| zammad-postgresql | zammad | 1 | Ticketing data unavailable |
| zammad2-postgresql | zammad2 | 1 | Second ticketing instance down |
| zammad2-elasticsearch-master | zammad2 | 1 | Second ticketing search broken |
| loki-distributed-ingester | loki-distributed | 1 | Log ingestion stops |

Low (non-critical):

| StatefulSet | Namespace | Replicas | Impact if evicted |
|---|---|---|---|
| argocd-application-controller | argocd | 1 | GitOps paused (no new deploys) |
| mysql-knowledgebase | knowledgebase | 1 | Bookstack unavailable |
| dekra-postgres | dome-certification | 1 | Certification service down |
| zammad-redis-master | zammad | 1 | Ticketing sessions lost |
| zammad2-redis-master | zammad2 | 1 | Second ticketing sessions lost |
| wallet-vault-server | in2 | 1 | Wallet vault (has a PDB already) |
| prometheus/alertmanager | kube-prometheus-stack | 1 | Monitoring gap |

Multi-replica deployments without PDBs

| Deployment | Namespace | Replicas | Notes |
|---|---|---|---|
| zammad-railsserver | zammad | 2 | PDB would work here, but non-critical |
| zammad2-railsserver | zammad2 | 2 | Same |
| coredns | kube-system | 2 | Cluster DNS; should have a PDB |

The single-replica PDB dilemma

With 1 replica, there are only two PDB options, both problematic:

Option A: maxUnavailable: 0

  • Effect: Pod can never be voluntarily evicted
  • Problem: Blocks all node drains, cluster upgrades, and autoscaler operations. Operations team must manually delete the PDB before any maintenance, then recreate it after.
  • When it makes sense: Never, for routine operations. Only as a temporary "do not touch" signal.

Option B: maxUnavailable: 1

  • Effect: Allows eviction of the single pod (same as having no PDB)
  • Problem: Provides no actual protection. The only benefit is that the eviction is still checked against the PDB by the Eviction API, so the disruption shows up in the PDB status and in audit logs.
  • When it makes sense: Compliance/audit requirements only.
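Expressed as a manifest, the two options differ in a single field. A sketch using mysql-til as the example (the label selector is an assumption; check the actual pod labels before applying):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mysql-til-pdb              # illustrative name
  namespace: marketplace
spec:
  maxUnavailable: 0                # Option A: eviction of the single pod is always denied
  # maxUnavailable: 1              # Option B: eviction is always allowed (audit visibility only)
  selector:
    matchLabels:
      app.kubernetes.io/instance: mysql-til   # assumed label; verify against the StatefulSet's pod template
```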

Conclusion: PDBs on single-replica services are not a solution. The prerequisite is scaling to 2+ replicas.

What each critical service needs for proper HA + PDB

1. mysql-til (marketplace)

  • Current: Bitnami MySQL 8.0.31, standalone, 1 replica
  • Required changes:
    • Enable MySQL replication: architecture: replication in values.yaml
    • Configure secondary.replicaCount: 1
    • Set primary.pdb.create: true, primary.pdb.maxUnavailable: 0
    • Set secondary.pdb.create: true, secondary.pdb.maxUnavailable: 0
    • Verify applications (TIL, CCS) handle failover (connect to MySQL service, not pod directly)
  • Risk: Replication setup requires data migration or a maintenance window. Read-after-write consistency must be verified.
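A sketch of the corresponding values.yaml for the Bitnami MySQL chart, using the keys listed above (exact paths should be verified against the chart version in use):

```yaml
architecture: replication      # switch from standalone to primary/secondary replication
primary:
  pdb:
    create: true
    maxUnavailable: 0          # block voluntary eviction of the primary
secondary:
  replicaCount: 1              # one replica alongside the primary
  pdb:
    create: true
    maxUnavailable: 0          # block voluntary eviction of the secondary
```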

2. cs-identity-keycloak

  • Current: Bitnami Keycloak 24.0.4, 1 replica
  • Required changes:
    • Set replicaCount: 2
    • Set pdb.create: true, pdb.minAvailable: 1
    • Keycloak natively supports clustering via Infinispan — should work with multiple replicas out of the box
    • Verify session replication works (distributed caches)
  • Risk: Low — Keycloak is designed for HA. May need cache.stack: kubernetes or similar JGroups/DNS_PING config.
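A sketch of the Bitnami Keycloak values, using the keys listed above (the cache/JGroups setting is chart-version dependent and is therefore only noted in a comment):

```yaml
replicaCount: 2        # two Keycloak pods so one stays available during a drain
pdb:
  create: true
  minAvailable: 1      # keep at least one pod serving authentication
# Infinispan session replication should work out of the box; if it does not,
# the JGroups discovery stack (e.g. a Kubernetes/DNS_PING stack) may need
# explicit configuration; the exact key depends on the chart version.
```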

3. cs-identity-postgresql

  • Current: Bitnami PostgreSQL 16.3.0, primary only
  • Required changes:
    • Enable replication: architecture: replication in values.yaml
    • Configure readReplicas.replicaCount: 1
    • Set primary.pdb.create: true, primary.pdb.maxUnavailable: 0
    • Keycloak connects to the primary service — no app changes needed
  • Risk: Streaming replication is straightforward for PostgreSQL. Needs a short maintenance window for initial replica sync.
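The equivalent values.yaml sketch for the Bitnami PostgreSQL chart, using the keys listed above:

```yaml
architecture: replication    # primary plus streaming read replica
readReplicas:
  replicaCount: 1
primary:
  pdb:
    create: true
    maxUnavailable: 0        # block voluntary eviction of the primary
```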

4. bae-marketplace-biz-ecosystem-logic-proxy

  • Current: StatefulSet, 1 replica, managed by bae-marketplace Helm chart
  • Required changes:
    • Increase replicas in BAE chart values
    • Verify the logic-proxy supports horizontal scaling (stateless request handling with shared MongoDB backend)
    • Add PDB via chart values or standalone manifest
  • Risk: Need to verify the app doesn't use local state/sessions. MongoDB is the shared backend, so this should scale horizontally.
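If the BAE chart does not expose PDB values, a standalone manifest along these lines would work once the replica count is increased (the selector labels are assumptions; match them to the StatefulSet's pod template):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: bae-logic-proxy-pdb        # illustrative name
  namespace: marketplace
spec:
  minAvailable: 1                  # only meaningful once replicas >= 2
  selector:
    matchLabels:
      app.kubernetes.io/name: biz-ecosystem-logic-proxy   # assumed label; verify before applying
```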

Recommended approach

Phase 1 — Quick wins (no replication needed)

  • Keycloak → scale to 2 replicas + PDB (natively supports clustering)
  • coredns → add PDB minAvailable: 1 (already 2 replicas, just needs PDB)
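A sketch of the coredns PDB (the `k8s-app: kube-dns` selector is the conventional label for coredns pods; verify it on this cluster):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coredns-pdb          # illustrative name
  namespace: kube-system
spec:
  minAvailable: 1            # always keep one DNS pod while the other node is drained
  selector:
    matchLabels:
      k8s-app: kube-dns      # common coredns pod label; confirm against the Deployment
```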

Phase 2 — Database HA (requires planning + maintenance window)

  • cs-identity-postgresql → enable streaming replication + PDB
  • mysql-til → enable MySQL replication + PDB

Phase 3 — Remaining services

  • logic-proxy → scale + PDB (verify statelessness first)
  • Loki ingester → scale to 2 + PDB (Loki supports this natively)
  • Zammad/Zammad2 databases — only if ticketing becomes critical

Related PRs

  • #2491, #2492 (dome-prod capacity stabilization)
