K8SPSMDB-1457: add certManagementPolicy option #2266

Open
myJamong wants to merge 12 commits into percona:main from myJamong:K8SPSMDB-1457-add-certManagementPolicy-option

@myJamong myJamong commented Mar 5, 2026

CHANGE DESCRIPTION

Problem:
The Percona Server for MongoDB Operator automatically generates a new SSL certificate when it cannot find a user-provided secret. In scenarios where the user-provided secret is temporarily lost (e.g., EKS upgrade, External Secrets controller failure, operational mistake), the operator creates a new self-signed certificate with a different CA. This triggers a rolling restart of all MongoDB pods, causing client applications that rely on the original CA certificate to lose connectivity — leading to a severe and unexpected service outage.

Cause:
In reconcileSSL(), the operator cannot distinguish between "the secret was never created" and "the secret existed but was lost." When a user-provided secret is missing, the operator falls through to the automatic certificate creation logic (createSSLManually or createSSLByCertManager), regardless of whether the user intended to manage certificates externally.

Solution:
Added a new configurable field spec.tls.certManagementPolicy to the CRD with two possible values:

  • auto (default): Existing behavior — operator creates certificates automatically if none are found.
  • userProvidedOnly: Operator skips automatic certificate creation entirely and returns nil, leaving the certificate lifecycle fully to the user (e.g., External Secrets, manual management). A log message is emitted when this policy is active.

This puts control back into the hands of the user, preventing unintended certificate regeneration and pod restarts in production environments.
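
The decision described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the operator's actual code: `secretStore`, `errNotFound`, and `reconcileTLS` are hypothetical stand-ins for the Kubernetes client, a NotFound error, and `reconcileSSL()`, and only the two policy values come from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// CertManagementPolicy mirrors the two values proposed for
// spec.tls.certManagementPolicy.
type CertManagementPolicy string

const (
	PolicyAuto             CertManagementPolicy = "auto"
	PolicyUserProvidedOnly CertManagementPolicy = "userProvidedOnly"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound error.
var errNotFound = errors.New("secret not found")

// secretStore is a hypothetical in-memory replacement for the API server.
type secretStore map[string][]byte

func (s secretStore) get(name string) ([]byte, error) {
	v, ok := s[name]
	if !ok {
		return nil, errNotFound
	}
	return v, nil
}

// reconcileTLS reports whether a new certificate was created.
func reconcileTLS(store secretStore, name string, policy CertManagementPolicy) (bool, error) {
	_, err := store.get(name)
	if err == nil {
		return false, nil // secret is present, nothing to do
	}
	if !errors.Is(err, errNotFound) {
		return false, err // unexpected lookup failure
	}
	if policy == PolicyUserProvidedOnly {
		// The user owns the certificate lifecycle: log and return without creating anything.
		fmt.Printf("TLS secret %s is missing; certManagementPolicy is userProvidedOnly, skipping creation\n", name)
		return false, nil
	}
	// auto (or unset) keeps the existing behavior: create a certificate.
	store[name] = []byte("operator-generated certificate")
	return true, nil
}

func main() {
	store := secretStore{}

	created, _ := reconcileTLS(store, "my-cluster-ssl", PolicyUserProvidedOnly)
	fmt.Println("userProvidedOnly created a certificate:", created)

	created, _ = reconcileTLS(store, "my-cluster-ssl", PolicyAuto)
	fmt.Println("auto created a certificate:", created)
}
```

In this sketch an empty policy value behaves like auto, which matches the stated default; how the real operator applies the default is not shown here.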

Relates to: #1758

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

CLAassistant commented Mar 5, 2026

CLA assistant check
All committers have signed the CLA.

Author

myJamong commented Mar 5, 2026

One consideration is whether to add a guardrail that blocks the userProvidedOnly → auto switch when SSL secrets are missing. However, this might not be the right approach because:

  • The auto policy by definition means "create certificates if not found" — blocking that would contradict its purpose
  • There are legitimate cases where a user intentionally wants to discard old certificates and let the operator generate new ones

What do you think? Would it be better to add a validation/warning at the operator level, or is documenting this behavior sufficient?

Contributor

@egegunes egegunes left a comment

@myJamong please add cert-management-policy into run-pr.csv and run-release.csv in e2e-tests/

Contributor

egegunes commented Mar 6, 2026

> One consideration is whether to add a guardrail that blocks the userProvidedOnly → auto switch when SSL secrets are missing. However, this might not be the right approach because:
>
>   • The auto policy by definition means "create certificates if not found"; blocking that would contradict its purpose
>   • There are legitimate cases where a user intentionally wants to discard old certificates and let the operator generate new ones
>
> What do you think? Would it be better to add a validation/warning at the operator level, or is documenting this behavior sufficient?

I think this is auto doing its job. I don't think we need to add logic into operator for additional guardrails.

@egegunes egegunes added this to the v1.23.0 milestone Mar 6, 2026
Author

myJamong commented Mar 6, 2026

> @myJamong please add cert-management-policy into run-pr.csv and run-release.csv in e2e-tests/

@egegunes thanks for the guide. I added it and committed.
27e70a1

@egegunes egegunes changed the title from "K8sPSMDB-1457 add certManagementPolicy option" to "K8SPSMDB-1457: add certManagementPolicy option" Mar 6, 2026
@myJamong myJamong force-pushed the K8SPSMDB-1457-add-certManagementPolicy-option branch from 306d678 to 65bc7a8 Compare March 6, 2026 09:16
Author

myJamong commented Mar 6, 2026

Just changed the commit author in the force push.

Comment on lines +1639 to +1644
				cr.Status.AddCondition(api.ClusterCondition{
					Status:  api.ConditionTrue,
					Type:    api.ConditionTypeTLSSecretMissing,
					Reason:  "TLSSecretNotFound",
					Message: fmt.Sprintf("TLS secret %s is missing, certManagementPolicy is userProvidedOnly", api.SSLSecretName(cr)),
				})
Contributor

@myJamong I think that rather than having a negative condition type with a positive status, we should have a positive type with a negative status. For example:

				cr.Status.AddCondition(api.ClusterCondition{
					Status:  api.ConditionFalse,
					Type:    api.ConditionTypeTLSSecretsReady,
					Reason:  "TLSSecretNotFound",
					Message: fmt.Sprintf("TLS secret %s is missing, certManagementPolicy is userProvidedOnly", api.SSLSecretName(cr)),
				})

Author

Changed - 4968ab6

Renamed TLSSecretMissing to TLSSecretsReady with positive type and negative status and updated all references including unit tests and e2e tests.
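
A minimal sketch of why the positive-type form is convenient for consumers, under stated assumptions: the `Condition` struct and the `isReady` helper below are simplified, hypothetical stand-ins for the operator's `api.ClusterCondition` handling, and the `TLSSecretFound` reason is invented for illustration. Only `TLSSecretsReady`, `TLSSecretNotFound`, and the message text come from this thread.

```go
package main

import "fmt"

// ConditionStatus and Condition are simplified stand-ins for the operator's API types.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type Condition struct {
	Type    string
	Status  ConditionStatus
	Reason  string
	Message string
}

const ConditionTypeTLSSecretsReady = "TLSSecretsReady"

// tlsSecretsCondition builds a positive-type condition: True means healthy,
// False means the secret is missing.
func tlsSecretsCondition(secretName string, found bool) Condition {
	if found {
		return Condition{
			Type:    ConditionTypeTLSSecretsReady,
			Status:  ConditionTrue,
			Reason:  "TLSSecretFound", // hypothetical reason, not from the PR
			Message: fmt.Sprintf("TLS secret %s is present", secretName),
		}
	}
	return Condition{
		Type:    ConditionTypeTLSSecretsReady,
		Status:  ConditionFalse,
		Reason:  "TLSSecretNotFound",
		Message: fmt.Sprintf("TLS secret %s is missing, certManagementPolicy is userProvidedOnly", secretName),
	}
}

// isReady treats a missing condition the same as an unhealthy one, so callers
// can apply a single rule: Status == True means OK.
func isReady(conds []Condition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == ConditionTrue
		}
	}
	return false
}

func main() {
	missing := tlsSecretsCondition("my-cluster-ssl", false)
	fmt.Println(missing.Type, missing.Status, missing.Reason)
	fmt.Println("ready:", isReady([]Condition{missing}, ConditionTypeTLSSecretsReady))
}
```

With a negative type such as TLSSecretMissing, the same helper would need an inverted check for that one condition, which is the inconsistency the review comment avoids.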

@JNKPercona
Collaborator

| Test Name | Result | Time |
|---|---|---|
| arbiter | passed | 00:11:55 |
| balancer | passed | 00:18:41 |
| cert-management-policy | passed | 00:08:44 |
| cross-site-sharded | passed | 00:18:25 |
| custom-replset-name | failure | 00:14:22 |
| custom-tls | passed | 00:14:19 |
| custom-users-roles | passed | 00:11:18 |
| custom-users-roles-sharded | passed | 00:14:56 |
| data-at-rest-encryption | passed | 00:12:53 |
| data-sharded | passed | 00:23:42 |
| demand-backup | passed | 00:15:49 |
| demand-backup-eks-credentials-irsa | passed | 00:00:07 |
| demand-backup-fs | passed | 00:28:38 |
| demand-backup-if-unhealthy | passed | 00:10:35 |
| demand-backup-incremental-aws | passed | 00:16:09 |
| demand-backup-incremental-azure | passed | 00:15:56 |
| demand-backup-incremental-gcp-native | passed | 00:11:10 |
| demand-backup-incremental-gcp-s3 | passed | 00:11:21 |
| demand-backup-incremental-minio | passed | 00:25:36 |
| demand-backup-incremental-sharded-aws | passed | 00:26:19 |
| demand-backup-incremental-sharded-azure | passed | 00:18:24 |
| demand-backup-incremental-sharded-gcp-native | passed | 00:17:57 |
| demand-backup-incremental-sharded-gcp-s3 | passed | 00:17:36 |
| demand-backup-incremental-sharded-minio | passed | 00:27:24 |
| demand-backup-physical-parallel | passed | 00:08:34 |
| demand-backup-physical-aws | passed | 00:11:55 |
| demand-backup-physical-azure | passed | 00:12:23 |
| demand-backup-physical-gcp-s3 | passed | 00:12:01 |
| demand-backup-physical-gcp-native | passed | 00:13:33 |
| demand-backup-physical-minio | passed | 00:20:20 |
| demand-backup-physical-minio-native | passed | 00:25:56 |
| demand-backup-physical-minio-native-tls | passed | 00:20:06 |
| demand-backup-physical-sharded-parallel | passed | 00:11:12 |
| demand-backup-physical-sharded-aws | passed | 00:18:29 |
| demand-backup-physical-sharded-azure | passed | 00:17:52 |
| demand-backup-physical-sharded-gcp-native | failure | 00:09:00 |
| demand-backup-physical-sharded-minio | passed | 00:18:18 |
| demand-backup-physical-sharded-minio-native | passed | 00:17:44 |
| demand-backup-sharded | passed | 00:26:37 |
| disabled-auth | passed | 00:16:34 |
| expose-sharded | passed | 00:34:04 |
| finalizer | passed | 00:10:26 |
| ignore-labels-annotations | passed | 00:07:54 |
| init-deploy | passed | 00:13:08 |
| ldap | passed | 00:09:09 |
| ldap-tls | passed | 00:12:54 |
| limits | passed | 00:06:21 |
| liveness | passed | 00:10:27 |
| mongod-major-upgrade | passed | 00:13:22 |
| mongod-major-upgrade-sharded | passed | 00:20:56 |
| monitoring-2-0 | passed | 00:25:07 |
| monitoring-pmm3 | passed | 00:26:25 |
| multi-cluster-service | passed | 00:14:01 |
| multi-storage | passed | 00:19:12 |
| non-voting-and-hidden | passed | 00:17:01 |
| one-pod | failure | 00:04:58 |
| operator-self-healing-chaos | passed | 00:13:50 |
| pitr | failure | 00:05:14 |
| pitr-physical | passed | 01:01:55 |
| pitr-sharded | passed | 00:21:59 |
| pitr-to-new-cluster | passed | 00:26:23 |
| pitr-physical-backup-source | passed | 00:56:40 |
| preinit-updates | passed | 00:05:06 |
| pvc-auto-resize | passed | 00:14:00 |
| pvc-resize | passed | 00:17:38 |
| recover-no-primary | passed | 00:27:01 |
| replset-overrides | passed | 00:17:57 |
| replset-remapping | passed | 00:17:08 |
| replset-remapping-sharded | passed | 00:17:13 |
| rs-shard-migration | passed | 00:14:30 |
| scaling | passed | 00:11:09 |
| scheduled-backup | passed | 00:17:15 |
| security-context | passed | 00:07:07 |
| self-healing-chaos | passed | 00:14:59 |
| service-per-pod | passed | 00:19:41 |
| serviceless-external-nodes | passed | 00:07:27 |
| smart-update | passed | 00:08:31 |
| split-horizon | passed | 00:14:59 |
| stable-resource-version | passed | 00:04:39 |
| storage | passed | 00:07:33 |
| tls-issue-cert-manager | passed | 00:30:09 |
| unsafe-psa | passed | 00:07:52 |
| upgrade | passed | 00:09:23 |
| upgrade-consistency | passed | 00:06:23 |
| upgrade-consistency-sharded-tls | passed | 00:54:04 |
| upgrade-sharded | passed | 00:19:54 |
| upgrade-partial-backup | passed | 00:16:04 |
| users | passed | 00:17:26 |
| users-vault | passed | 00:13:31 |
| version-service | passed | 00:24:30 |

| Summary | Value |
|---|---|
| Tests Run | 90/90 |
| Job Duration | 04:12:13 |
| Total Test Time | 25:28:10 |

commit: f9fc556
image: perconalab/percona-server-mongodb-operator:PR-2266-f9fc55604

Labels

community size/L 100-499 lines
