Description
When cert-manager is used for TLS certificate management (the default path when cert-manager is present in the cluster), the operator creates a self-signed CA certificate with a 3-year validity (DefaultCAValidity = time.Hour * 24 * 365 * 3). While cert-manager does renew this CA certificate before expiry (via RenewBefore: 730h), the renewal generates a new CA key pair, which invalidates all leaf certificates (<cluster>-ssl, <cluster>-ssl-internal) that were signed by the old CA. The operator does not detect or handle this CA rotation, causing the PXC cluster's TLS to break.
Steps to Reproduce
- Deploy a PerconaXtraDBCluster in a Kubernetes cluster with cert-manager installed
- Do not specify a custom spec.tls.issuerConf (use the default self-signed CA path)
- Wait for the CA certificate to approach its renewBefore threshold (3 years minus 730 hours ≈ 2 years and 11 months), or simulate by manually shortening the CA certificate's Duration
- cert-manager renews the <cluster>-ca-cert Certificate, generating a new self-signed CA with a new key pair
- The <cluster>-pxc-issuer (a CA-type Issuer) now references the new CA secret
- The leaf certificates (<cluster>-ssl, <cluster>-ssl-internal) are still signed by the old CA and the ca.crt field in their Secrets still contains the old CA certificate
Result: TLS verification fails. PXC nodes and HAProxy cannot establish trusted connections. Cluster breaks.
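The timing in the third step can be checked with a few lines of Go. This is a minimal sketch that only uses the two values quoted in this report; renewalOffset is a hypothetical helper, not an operator or cert-manager function.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Values quoted in this report.
	caValidity  = time.Hour * 24 * 365 * 3 // DefaultCAValidity
	renewBefore = 730 * time.Hour          // RenewBefore
)

// renewalOffset returns how long after issuance a certificate with the given
// validity and renewBefore is due for renewal.
func renewalOffset(validity, renewBefore time.Duration) time.Duration {
	return validity - renewBefore
}

func main() {
	offset := renewalOffset(caValidity, renewBefore)
	fmt.Printf("CA renewal is due about %.0f days after issuance\n", offset.Hours()/24)
}
```

The offset is 25,550 hours, about 1,065 days, which matches the "2 years and 11 months" estimate above.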
Root Cause Analysis
There are three interacting problems in pkg/controller/pxc/tls.go:
Problem 1: reconcileSSL short-circuits when secrets already exist
// tls.go, reconcileSSL function
if errSecret == nil && errInternalSecret == nil {
	return nil // ← Returns immediately if both SSL secrets exist
}
Once the <cluster>-ssl and <cluster>-ssl-internal secrets exist, reconcileSSL returns immediately without ever checking whether the CA has been rotated or whether the leaf certificates are still valid against the current CA. This means after cert-manager renews the CA certificate (generating a new key pair), the operator never re-issues the leaf certificates.
Problem 2: createSSLByCertManager uses create-only semantics
// tls.go, createSSLByCertManager function
err := r.client.Create(ctx, caCert)
if err != nil && !k8serr.IsAlreadyExists(err) {
	return fmt.Errorf("create CA certificate: %v", err)
}
All cert-manager Certificate and Issuer resources are created with r.client.Create() and AlreadyExists errors are silently ignored. There is no reconciliation or update of existing resources. If the CA certificate's properties need to change (or if the relationship between the CA and leaf certs needs to be re-validated), nothing happens.
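The difference between create-only and create-or-update semantics can be shown without a cluster. The sketch below is an illustration under stated assumptions: store is a hypothetical in-memory stand-in for the API server, errAlreadyExists stands in for the Kubernetes AlreadyExists error, and the object name and spec strings are made up. In a real controller the same idea is usually expressed with a helper such as controller-runtime's controllerutil.CreateOrUpdate.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists stands in for the Kubernetes AlreadyExists API error.
var errAlreadyExists = errors.New("already exists")

// store is a hypothetical in-memory stand-in for the API server,
// mapping an object name to a serialized spec.
type store map[string]string

func (s store) Create(name, spec string) error {
	if _, ok := s[name]; ok {
		return errAlreadyExists
	}
	s[name] = spec
	return nil
}

func (s store) Update(name, spec string) { s[name] = spec }

// createOnly mirrors the pattern described above: an existing object is
// left untouched, even when the desired spec differs.
func createOnly(s store, name, spec string) error {
	if err := s.Create(name, spec); err != nil && !errors.Is(err, errAlreadyExists) {
		return fmt.Errorf("create %s: %w", name, err)
	}
	return nil
}

// createOrUpdate converges the stored object to the desired spec.
func createOrUpdate(s store, name, spec string) error {
	err := s.Create(name, spec)
	if errors.Is(err, errAlreadyExists) {
		s.Update(name, spec)
		return nil
	}
	if err != nil {
		return fmt.Errorf("create %s: %w", name, err)
	}
	return nil
}

func main() {
	s := store{"cluster1-ssl": "issuer=old"}

	_ = createOnly(s, "cluster1-ssl", "issuer=new")
	fmt.Println(s["cluster1-ssl"]) // still "issuer=old"

	_ = createOrUpdate(s, "cluster1-ssl", "issuer=new")
	fmt.Println(s["cluster1-ssl"]) // now "issuer=new"
}
```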
Problem 3: Self-signed CA renewal generates a new key pair
When cert-manager renews a Certificate issued by a SelfSigned issuer, it generates a completely new private key and self-signed certificate. This means:
- The <cluster>-ca-cert secret gets a new CA certificate and key
- The <cluster>-pxc-issuer (a CA-type issuer referencing <cluster>-ca-cert) now uses the new CA
- But the existing leaf certificates' ca.crt field still contains the old CA
- cert-manager does NOT automatically re-issue downstream leaf certificates when a CA issuer's signing material changes
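The effect of a new CA key pair on existing leaf certificates can be reproduced outside the cluster with the Go standard library. The following is a minimal sketch, not operator code: newCert, template and signedBy are hypothetical helpers, and the certificate names are placeholders. It issues a leaf from a self-signed CA, then creates a second self-signed CA with the same subject but a fresh key, which is how this report describes a SelfSigned renewal.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// template builds a short-lived certificate template for this demo.
func template(cn string, serial int64, isCA bool) *x509.Certificate {
	t := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		BasicConstraintsValid: true,
		IsCA:                  isCA,
		KeyUsage:              x509.KeyUsageDigitalSignature,
	}
	if isCA {
		t.KeyUsage = x509.KeyUsageCertSign
	}
	return t
}

// newCert generates a fresh key and issues a certificate for tmpl.
// With a nil parent the certificate is self-signed.
func newCert(tmpl, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	signerKey := parentKey
	if parent == nil {
		parent, signerKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, signerKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// signedBy reports whether leaf verifies against ca as the only trusted root.
func signedBy(leaf, ca *x509.Certificate) bool {
	roots := x509.NewCertPool()
	roots.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     roots,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageAny},
	})
	return err == nil
}

func main() {
	oldCA, oldKey, err := newCert(template("cluster1-ca", 1, true), nil, nil)
	if err != nil {
		panic(err)
	}
	leaf, _, err := newCert(template("cluster1-ssl", 2, false), oldCA, oldKey)
	if err != nil {
		panic(err)
	}
	// Simulated renewal: same subject, new key pair.
	newCA, _, err := newCert(template("cluster1-ca", 3, true), nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("leaf verifies against old CA:", signedBy(leaf, oldCA))
	fmt.Println("leaf verifies against new CA:", signedBy(leaf, newCA))
}
```

Keeping the subject identical in both CAs is deliberate: it shows that the name alone does not carry trust, only the key that produced the signature does.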
Expected Behavior
When the CA certificate is renewed:
- The operator should detect that the CA has changed
- All leaf certificates (<cluster>-ssl, <cluster>-ssl-internal) should be re-issued against the new CA
- The ca.crt field in the leaf certificate secrets should be updated to contain the new CA certificate
- PXC pods should be rolling-restarted to pick up the new certificates (MySQL reads TLS certs at startup and does not watch for file changes)
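One common way to get the rolling restart in the last point is to store a digest of the TLS secret in a pod-template annotation, so that any change to the secret changes the template and makes the StatefulSet controller roll the pods. The sketch below only shows the digest part and is an assumption about how this could be done, not a description of what the operator does today; tlsHash is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// tlsHash returns a stable digest of a Secret's data (key -> bytes).
// Keys are sorted so map iteration order does not affect the result.
func tlsHash(data map[string][]byte) string {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := sha256.New()
	for _, k := range keys {
		// Length-prefix each field so adjacent fields cannot be confused.
		fmt.Fprintf(h, "%d:%s=%d:", len(k), k, len(data[k]))
		h.Write(data[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	before := map[string][]byte{"ca.crt": []byte("old CA"), "tls.crt": []byte("leaf v1")}
	after := map[string][]byte{"ca.crt": []byte("new CA"), "tls.crt": []byte("leaf v2")}

	// If the digest stored in the pod-template annotation differs from the
	// digest of the current secret, the pods need a restart.
	fmt.Println("restart needed:", tlsHash(before) != tlsHash(after))
}
```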
Affected Versions
- All versions using the cert-manager TLS path (this is the default when cert-manager is installed)
- Confirmed on operator version 1.18.0 (current latest)
- The issue has existed since the cert-manager integration was introduced
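Relevant Code
- pkg/controller/pxc/tls.go: reconcileSSL
- pkg/controller/pxc/tls.go: createSSLByCertManager
- pkg/pxctls/pxctls.go: DefaultCAValidity = 3 years, DefaultRenewBefore = 730h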
Impact
- Severity: High. This is a time bomb: every affected PXC cluster breaks silently once cert-manager renews the CA, roughly 2 years and 11 months after the CA was issued
- Blast radius: Every PXC deployment using the default TLS configuration with cert-manager
- Detection difficulty: No alerts or logs indicate the CA is about to expire or has rotated; the failure manifests as sudden MySQL connection failures across all OpenStack services (or any other PXC consumers)
Suggested Fix
The reconcileSSL function should be modified to:
- Check CA certificate validity: Even when leaf secrets exist, verify that ca.crt in the leaf secrets matches the current CA in <cluster>-ca-cert. If there is a mismatch, trigger re-issuance of the leaf certificates.
- Use server-side apply or update semantics: Instead of Create + IsAlreadyExists, use Apply or Update so that cert-manager Certificate resources can be reconciled on every loop.
- Trigger certificate renewal via cert-manager: When a CA rotation is detected, delete and recreate (or trigger renewal of) the leaf Certificate resources so cert-manager re-issues them with the new CA.
- Rolling restart PXC pods: After certificates are renewed, the operator should trigger a rolling restart of PXC and HAProxy pods since MySQL does not hot-reload TLS certificates.
A simpler alternative would be to use cert-manager's renewBefore on leaf certificates with appropriate overlap, combined with watching the CA secret for changes and triggering leaf certificate re-issuance.
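The CA mismatch check from the first suggestion could look roughly like the sketch below. It is a minimal illustration, not a patch: firstCertDER and caRotated are hypothetical helpers, and which key of the <cluster>-ca-cert secret holds the current CA certificate (tls.crt is assumed here) would need to be confirmed against the operator code. Comparing the decoded DER bytes rather than the PEM text avoids false positives from whitespace differences.

```go
package main

import (
	"bytes"
	"encoding/pem"
	"errors"
	"fmt"
)

// firstCertDER returns the DER bytes of the first CERTIFICATE block in pemData.
func firstCertDER(pemData []byte) ([]byte, error) {
	for {
		var block *pem.Block
		block, pemData = pem.Decode(pemData)
		if block == nil {
			return nil, errors.New("no CERTIFICATE block found")
		}
		if block.Type == "CERTIFICATE" {
			return block.Bytes, nil
		}
	}
}

// caRotated reports whether the CA embedded in a leaf Secret (its ca.crt)
// differs from the certificate currently held in the CA Secret.
func caRotated(leafCACrt, currentCACrt []byte) (bool, error) {
	leafDER, err := firstCertDER(leafCACrt)
	if err != nil {
		return false, fmt.Errorf("leaf ca.crt: %w", err)
	}
	curDER, err := firstCertDER(currentCACrt)
	if err != nil {
		return false, fmt.Errorf("current CA: %w", err)
	}
	return !bytes.Equal(leafDER, curDER), nil
}

func main() {
	// Placeholder PEM payloads, not real certificates; only the bytes matter here.
	oldCA := []byte("-----BEGIN CERTIFICATE-----\nAQID\n-----END CERTIFICATE-----\n")
	newCA := []byte("-----BEGIN CERTIFICATE-----\nBAUG\n-----END CERTIFICATE-----\n")

	rotated, err := caRotated(oldCA, newCA)
	if err != nil {
		panic(err)
	}
	fmt.Println("CA rotated:", rotated)
}
```

When caRotated returns true, the reconcile loop would continue to the re-issuance and restart steps instead of returning early.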
Current Workaround
Manual recovery requires deleting all TLS artifacts and letting the operator recreate them:
# Delete cert-manager Certificate resources
kubectl -n <namespace> delete certificate <cluster>-ca-cert <cluster>-ssl <cluster>-ssl-internal
# Delete the secrets
kubectl -n <namespace> delete secret <cluster>-ca-cert <cluster>-ssl <cluster>-ssl-internal
# Delete the issuers
kubectl -n <namespace> delete issuer <cluster>-pxc-ca-issuer <cluster>-pxc-issuer
# Restart the PXC operator to trigger reconciliation
kubectl -n <namespace> rollout restart deployment pxc-operator
# After new certs are created, rolling restart PXC and HAProxy
kubectl -n <namespace> rollout restart statefulset <cluster>-pxc
kubectl -n <namespace> rollout restart statefulset <cluster>-haproxy
Environment
- Operator version: 1.18.0
- Kubernetes: 1.28+
- cert-manager: v1.13+
- PXC: 8.0