Self-signed CA certificate expires after 3 years and is not properly renewed, causing PXC cluster TLS failure #2411

@larainema

Description

When cert-manager is used for TLS certificate management (the default path when cert-manager is present in the cluster), the operator creates a self-signed CA certificate with a 3-year validity (DefaultCAValidity = time.Hour * 24 * 365 * 3). While cert-manager does renew this CA certificate before expiry (via RenewBefore: 730h), the renewal generates a new CA key pair, which invalidates all leaf certificates (<cluster>-ssl, <cluster>-ssl-internal) that were signed by the old CA. The operator does not detect or handle this CA rotation, causing the PXC cluster's TLS to break.

Steps to Reproduce

  1. Deploy a PerconaXtraDBCluster in a Kubernetes cluster with cert-manager installed
  2. Do not specify a custom spec.tls.issuerConf (use the default self-signed CA path)
  3. Wait for the CA certificate to approach its renewBefore threshold (3 years minus 730 hours ≈ 2 years and 11 months), or simulate by manually shortening the CA certificate's Duration
  4. cert-manager renews the <cluster>-ca-cert Certificate, generating a new self-signed CA with a new key pair
  5. The <cluster>-pxc-issuer (a CA-type Issuer) now references the new CA secret
  6. The leaf certificates (<cluster>-ssl, <cluster>-ssl-internal) are still signed by the old CA and the ca.crt field in their Secrets still contains the old CA certificate

Result: TLS verification fails. PXC nodes and HAProxy cannot establish trusted connections, and the cluster breaks.

Root Cause Analysis

Three problems interact. The first two are in pkg/controller/pxc/tls.go; the third is how cert-manager renews a certificate issued by a SelfSigned issuer:

Problem 1: reconcileSSL short-circuits when secrets already exist

// tls.go, reconcileSSL function
if errSecret == nil && errInternalSecret == nil {
    return nil  // ← Returns immediately if both SSL secrets exist
}

Once the <cluster>-ssl and <cluster>-ssl-internal secrets exist, reconcileSSL returns immediately without ever checking whether the CA has been rotated or whether the leaf certificates are still valid against the current CA. This means after cert-manager renews the CA certificate (generating a new key pair), the operator never re-issues the leaf certificates.

Problem 2: createSSLByCertManager uses create-only semantics

// tls.go, createSSLByCertManager function
err := r.client.Create(ctx, caCert)
if err != nil && !k8serr.IsAlreadyExists(err) {
    return fmt.Errorf("create CA certificate: %v", err)
}

All cert-manager Certificate and Issuer resources are created with r.client.Create() and AlreadyExists errors are silently ignored. There is no reconciliation or update of existing resources. If the CA certificate's properties need to change (or if the relationship between the CA and leaf certs needs to be re-validated), nothing happens.
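
The difference between the two semantics can be shown with a toy in-memory model. This is only a sketch: the store type, errAlreadyExists, and both helper functions are hypothetical stand-ins, not the controller-runtime client or the operator's code.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// store is a toy stand-in for the Kubernetes API: object name -> spec.
type store map[string]string

func (s store) Create(name, spec string) error {
	if _, ok := s[name]; ok {
		return errAlreadyExists
	}
	s[name] = spec
	return nil
}

func (s store) Update(name, spec string) error {
	if _, ok := s[name]; !ok {
		return errors.New("not found")
	}
	s[name] = spec
	return nil
}

// createOnly mirrors the pattern quoted above: AlreadyExists is swallowed,
// so an object that already exists keeps whatever spec it had.
func createOnly(s store, name, spec string) error {
	if err := s.Create(name, spec); err != nil && !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("create %s: %w", name, err)
	}
	return nil
}

// createOrUpdate converges the stored spec to the desired one on every call.
func createOrUpdate(s store, name, spec string) error {
	err := s.Create(name, spec)
	if errors.Is(err, errAlreadyExists) {
		return s.Update(name, spec)
	}
	return err
}

func main() {
	s := store{}
	_ = createOnly(s, "ca-cert", "duration=26280h")
	_ = createOnly(s, "ca-cert", "duration=8760h") // silently a no-op
	fmt.Println("after createOnly:    ", s["ca-cert"])

	_ = createOrUpdate(s, "ca-cert", "duration=8760h")
	fmt.Println("after createOrUpdate:", s["ca-cert"])
}
```

In an operator built on controller-runtime, controllerutil.CreateOrUpdate offers this kind of converge-on-every-loop behavior; whether it fits the operator's reconcile flow is a separate question.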

Problem 3: Self-signed CA renewal generates a new key pair

When cert-manager renews a Certificate issued by a SelfSigned issuer, it generates a completely new private key and self-signed certificate. This means:

  • The <cluster>-ca-cert secret gets a new CA certificate and key
  • The <cluster>-pxc-issuer (a CA-type issuer referencing <cluster>-ca-cert) now uses the new CA
  • But the existing leaf certificates' ca.crt field still contains the old CA
  • cert-manager does NOT automatically re-issue downstream leaf certificates when a CA issuer's signing material changes
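
The effect of the bullets above can be reproduced outside Kubernetes with Go's standard library. This is a minimal sketch, not operator code: it creates two self-signed CAs with the same subject but different keys, signs a leaf with the first, and verifies the leaf against each. All names (sign, caTemplate, leafTemplate, verifiesAgainst, "example-ca") are made up for the example.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// sign issues a certificate from tmpl with a fresh P-256 key. When parent is
// nil the certificate is self-signed with its own new key.
func sign(tmpl, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

func caTemplate(serial int64) *x509.Certificate {
	return &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: "example-ca"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
}

func leafTemplate() *x509.Certificate {
	return &x509.Certificate{
		SerialNumber: big.NewInt(100),
		Subject:      pkix.Name{CommonName: "example-leaf"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
}

// verifiesAgainst reports whether leaf chains to ca as the only trusted root.
func verifiesAgainst(leaf, ca *x509.Certificate) bool {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: roots})
	return err == nil
}

func main() {
	oldCA, oldKey, err := sign(caTemplate(1), nil, nil)
	if err != nil {
		panic(err)
	}
	newCA, _, err := sign(caTemplate(2), nil, nil) // "renewed" CA: same subject, new key
	if err != nil {
		panic(err)
	}
	leaf, _, err := sign(leafTemplate(), oldCA, oldKey)
	if err != nil {
		panic(err)
	}
	fmt.Println("leaf verifies against old CA:", verifiesAgainst(leaf, oldCA))
	fmt.Println("leaf verifies against new CA:", verifiesAgainst(leaf, newCA))
}
```

Keeping the subject identical across both CAs shows that the name alone is not enough: verification fails because the signing key changed.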

Expected Behavior

When the CA certificate is renewed:

  1. The operator should detect that the CA has changed
  2. All leaf certificates (<cluster>-ssl, <cluster>-ssl-internal) should be re-issued against the new CA
  3. The ca.crt field in the leaf certificate secrets should be updated to contain the new CA certificate
  4. PXC pods should be rolling-restarted to pick up the new certificates (MySQL reads TLS certs at startup and does not watch for file changes)

Affected Versions

  • All versions using the cert-manager TLS path (this is the default when cert-manager is installed)
  • Confirmed on operator version 1.18.0 (current latest)
  • The issue has existed since the cert-manager integration was introduced

Relevant Code

| File | Function | Issue |
|------|----------|-------|
| pkg/controller/pxc/tls.go | reconcileSSL | Early return when secrets exist; never re-checks CA validity |
| pkg/controller/pxc/tls.go | createSSLByCertManager | Create-only semantics; never updates existing cert-manager resources |
| pkg/pxctls/pxctls.go | Constants | DefaultCAValidity = 3 years, DefaultRenewBefore = 730h |

Impact

  • Severity: High. This is a time bomb that silently breaks every affected PXC cluster when the CA is renewed, roughly 2 years and 11 months after it was issued
  • Blast radius: Every PXC deployment using the default TLS configuration with cert-manager
  • Detection difficulty: No alerts or logs indicate the CA is about to expire or has rotated; the failure manifests as sudden MySQL connection failures across all OpenStack services (or any other PXC consumers)

Suggested Fix

The reconcileSSL function should be modified to:

  1. Check CA certificate validity — Even when leaf secrets exist, verify that ca.crt in the leaf secrets matches the current CA in <cluster>-ca-cert. If there is a mismatch, trigger re-issuance of the leaf certificates.

  2. Use server-side apply or update semantics — Instead of Create + IsAlreadyExists, use Apply or Update so that cert-manager Certificate resources can be reconciled on every loop.

  3. Trigger certificate renewal via cert-manager — When a CA rotation is detected, delete and recreate (or trigger renewal of) the leaf Certificate resources so cert-manager re-issues them with the new CA.

  4. Rolling restart PXC pods — After certificates are renewed, the operator should trigger a rolling restart of PXC and HAProxy pods since MySQL does not hot-reload TLS certificates.

A simpler alternative would be to use cert-manager's renewBefore on leaf certificates with appropriate overlap, combined with watching the CA secret for changes and triggering leaf certificate re-issuance.

Current Workaround

Manual recovery requires deleting all TLS artifacts and letting the operator recreate them:

# Delete cert-manager Certificate resources
kubectl -n <namespace> delete certificate <cluster>-ca-cert <cluster>-ssl <cluster>-ssl-internal

# Delete the secrets
kubectl -n <namespace> delete secret <cluster>-ca-cert <cluster>-ssl <cluster>-ssl-internal

# Delete the issuers
kubectl -n <namespace> delete issuer <cluster>-pxc-ca-issuer <cluster>-pxc-issuer

# Restart the PXC operator to trigger reconciliation
kubectl -n <namespace> rollout restart deployment pxc-operator

# After new certs are created, rolling restart PXC and HAProxy
kubectl -n <namespace> rollout restart statefulset <cluster>-pxc
kubectl -n <namespace> rollout restart statefulset <cluster>-haproxy

Environment

  • Operator version: 1.18.0
  • Kubernetes: 1.28+
  • cert-manager: v1.13+
  • PXC: 8.0
