
CLOUDP-389867: add delay to backupConfig sharded cluster#935

Draft
filipcirtog wants to merge 4 commits into master from
CLOUDP-389867/add-delay-to-backupConfig-sharded-cluster

Conversation

@filipcirtog
Collaborator

@filipcirtog filipcirtog commented Mar 24, 2026

HELP

HELP-87476 - This Jira ticket addresses a race condition that occurs when enabling backups for a sharded cluster deployed with the Kubernetes operator. The issue arises because individual shards may never receive 'addShard' events, leaving them inactive indefinitely. Investigation is focused on identifying the race condition in Ops Manager and finding a solution that ensures all shards are included in backups without delay.

Summary

When enabling backup on a sharded cluster, Ops Manager needs time to complete its internal topology discovery before it can successfully accept a backup request. Without a delay, the operator races against OM's discovery, causing backup enablement to fail and triggering reconciliation loops until a retry eventually succeeds.

This race is specific to sharded clusters due to their multi-process topology (mongos + config servers + shards), which takes longer for OM to fully register compared to replica sets.

Proof of Work

A configurable sleep is inserted in updateOmDeploymentShardedCluster immediately before calling ensureBackupConfigurationAndUpdateStatus, but only when a backup spec is present. The delay defaults to 60 seconds and is controlled by the MDB_BACKUP_START_DELAY_SECONDS environment variable on the operator deployment, allowing users to tune or disable it per environment.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?

@github-actions

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.1 Release Notes

Bug Fixes

  • MongoDBOpsManager: Correctly handle the edge case where -admin-key was created by the user and is malformed. Previously, the error was only surfaced in a DEBUG log entry.
  • MongoDBOpsManager: Improved readiness probe error handling and appDB agent status logging

Other Changes

  • Container images: Merged the init-database and init-appdb init container images into a single init-database image. The init-appdb image will no longer be published; this change does not affect existing deployments.
    • The following Helm chart values have been removed: initAppDb.name, initAppDb.version, and registry.initAppDb. Use initDatabase.name, initDatabase.version, and registry.initDatabase instead.
    • The following environment variables have been removed: INIT_APPDB_IMAGE_REPOSITORY and INIT_APPDB_VERSION. Use INIT_DATABASE_IMAGE_REPOSITORY and INIT_DATABASE_VERSION instead.
  • Helm Chart: Removed operator.baseName Helm value. This value was never intended to be consumed by operator users and was never documented. The value controls the prefix for workload RBAC resource names (mongodb-kubernetes default), but changing it could break the operator and workloads because the operator is not aware of custom prefixes. With this change, the Helm chart will no longer allow customisation and the relevant resources will be deployed with predefined names (ServiceAccount with names mongodb-kubernetes-appdb, mongodb-kubernetes-database-pods, mongodb-kubernetes-ops-manager, Role with name mongodb-kubernetes-appdb and RoleBinding with name mongodb-kubernetes-appdb).

@filipcirtog filipcirtog added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Mar 24, 2026
@filipcirtog filipcirtog requested a review from Copilot March 25, 2026 08:54

Copilot AI left a comment


Pull request overview

Adds an operator-controlled delay before enabling backup for sharded clusters to mitigate a race between Ops Manager topology discovery and backup enablement (HELP-87476 / CLOUDP-389867).

Changes:

  • Introduces MDB_BACKUP_START_DELAY_SECONDS and a default delay value.
  • Adds a delay in updateOmDeploymentShardedCluster when a backup spec is present, before calling ensureBackupConfigurationAndUpdateStatus.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/util/constants.go Adds env var name and default delay constant for backup start delay.
controllers/operator/mongodbshardedcluster_controller.go Implements the delay before enabling backup for sharded clusters (env-configurable).


Comment on lines 3 to 9
import (
	"context"
	"fmt"
	"time"
	"github.com/google/go-cmp/cmp"
	"github.com/hashicorp/go-multierror"
	"go.uber.org/zap"

Copilot AI Mar 25, 2026


The import block is no longer in the expected sorted/grouped order after adding time (standard library imports should be grouped and ordered consistently, typically enforced by gofmt/gofumpt/gci). Please rerun the repo’s Go formatters (or reorder imports) so CI/pre-commit formatting checks don’t fail.

Comment on lines +1913 to +1918
log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
select {
case <-ctx.Done():
	return workflow.Failed(ctx.Err())
case <-time.After(time.Duration(delaySeconds) * time.Second):
}

Copilot AI Mar 25, 2026


Sleeping inside reconciliation will block the sharded-cluster controller worker for up to delaySeconds on every reconcile whenever a backup spec is present (including steady-state reconciles), which can severely slow processing and can also make existing unit tests take 60s+. Consider implementing this as a requeue (e.g., return a Pending status with RequeueAfter) and/or gating it so the delay is applied only once (e.g., only when backup is transitioning to enabled / status is not yet Started).

Suggested change
-log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
-select {
-case <-ctx.Done():
-	return workflow.Failed(ctx.Err())
-case <-time.After(time.Duration(delaySeconds) * time.Second):
-}
+log.Infof("Requeueing reconciliation to wait approximately %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
+return workflow.Pending("waiting before enabling backup; reconciliation will be requeued")

Collaborator


I second what Copilot mentioned. Instead of blocking the reconcile loop, we should requeue and make sure the configured time has passed before calling the backup config endpoint.

Comment on lines +1913 to +1917
log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
select {
case <-ctx.Done():
	return workflow.Failed(ctx.Err())
case <-time.After(time.Duration(delaySeconds) * time.Second):

Copilot AI Mar 25, 2026


delaySeconds is taken directly from an env var and could be negative/very large; combined with time.After this can lead to confusing logs (negative wait) and, when ctx.Done() wins, a timer that can’t be stopped early. Consider clamping to >= 0 (and optionally a sane max) and using a time.NewTimer that you Stop() on cancellation.

Suggested change
-log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
-select {
-case <-ctx.Done():
-	return workflow.Failed(ctx.Err())
-case <-time.After(time.Duration(delaySeconds) * time.Second):
+if delaySeconds < 0 {
+	delaySeconds = 0
+}
+log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
+timer := time.NewTimer(time.Duration(delaySeconds) * time.Second)
+defer timer.Stop()
+select {
+case <-ctx.Done():
+	return workflow.Failed(ctx.Err())
+case <-timer.C:

// The delay mitigates this by giving OM time to finish processing monitoring events before the
// /backupConfigs call is made. It applies to every enablement (including re-enables after
// termination, since OM resets configs back to Inactive after termination).
func isBackupBeingEnabled(sc *mdbv1.MongoDB, conn om.Connection, log *zap.SugaredLogger) bool {
Collaborator


I don't think you need to read the backup config again. You can add your check just after:

if desiredConfig.Status == config.Status {
}

and check if desiredConfig.Status == STARTED and config.Status == STOPPED/INACTIVE
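The status-transition check the reviewer describes could look roughly like this. A minimal sketch under stated assumptions: the status constants and the standalone `isBackupBeingEnabled` signature here are stand-ins for illustration, not the operator's actual backup config types:

```go
package main

import "fmt"

// Stand-in status values; the real operator compares Ops Manager backup config statuses.
const (
	statusStarted  = "STARTED"
	statusStopped  = "STOPPED"
	statusInactive = "INACTIVE"
)

// isBackupBeingEnabled reports whether the desired state turns backup on while the
// current config is still off — i.e. the enable transition the delay should gate on.
// Re-enables after termination are covered too, since OM resets configs to Inactive.
func isBackupBeingEnabled(desiredStatus, currentStatus string) bool {
	return desiredStatus == statusStarted &&
		(currentStatus == statusStopped || currentStatus == statusInactive)
}

func main() {
	fmt.Println(isBackupBeingEnabled(statusStarted, statusInactive)) // true: fresh enable
	fmt.Println(isBackupBeingEnabled(statusStarted, statusStarted))  // false: steady state
}
```

Placing this check next to the existing `desiredConfig.Status == config.Status` comparison avoids a second read of the backup config, as the reviewer suggests.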


Labels

skip-changelog Use this label in Pull Request to not require new changelog entry file


3 participants