
CLOUDP-389867: add delay to backupConfig sharded cluster#935

Draft
filipcirtog wants to merge 4 commits into master from
CLOUDP-389867/add-delay-to-backupConfig-sharded-cluster

Conversation

@filipcirtog
Collaborator

@filipcirtog filipcirtog commented Mar 24, 2026

HELP

HELP-87476 - This Jira ticket addresses a race condition that occurs when enabling backups for a sharded cluster deployed with the Kubernetes operator. The issue arises because individual shards may never receive 'addShard' events, leaving them inactive indefinitely. Investigation is focused on identifying the race condition in Ops Manager and finding a solution that ensures all shards are included in backups without delay.

Summary

When enabling backup on a sharded cluster, Ops Manager needs time to complete its internal topology discovery before it can successfully accept a backup request. Without a delay, the operator races against OM's discovery, causing backup enablement to fail and triggering reconciliation loops until a retry eventually succeeds.

This race is specific to sharded clusters due to their multi-process topology (mongos + config servers + shards), which takes longer for OM to fully register compared to replica sets.

Proof of Work

A configurable sleep is inserted in updateOmDeploymentShardedCluster immediately before calling ensureBackupConfigurationAndUpdateStatus, but only when a backup spec is present. The delay defaults to 60 seconds and is controlled by the MDB_BACKUP_START_DELAY_SECONDS environment variable on the operator deployment, allowing users to tune or disable it per environment.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?

@github-actions

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.7.1 Release Notes

Bug Fixes

  • MongoDBOpsManager: Correctly handle the edge case where -admin-key was created by the user and is malformed. Previously, the error was only surfaced in a DEBUG log entry.
  • MongoDBOpsManager: Improved readiness probe error handling and appDB agent status logging

Other Changes

  • Container images: Merged the init-database and init-appdb init container images into a single init-database image. The init-appdb image will no longer be published; this change does not affect existing deployments.
    • The following Helm chart values have been removed: initAppDb.name, initAppDb.version, and registry.initAppDb. Use initDatabase.name, initDatabase.version, and registry.initDatabase instead.
    • The following environment variables have been removed: INIT_APPDB_IMAGE_REPOSITORY and INIT_APPDB_VERSION. Use INIT_DATABASE_IMAGE_REPOSITORY and INIT_DATABASE_VERSION instead.
  • Helm Chart: Removed operator.baseName Helm value. This value was never intended to be consumed by operator users and was never documented. The value controls the prefix for workload RBAC resource names (mongodb-kubernetes default), but changing it could break the operator and workloads because the operator is not aware of custom prefixes. With this change, the Helm chart will no longer allow customisation and the relevant resources will be deployed with predefined names (ServiceAccount with names mongodb-kubernetes-appdb, mongodb-kubernetes-database-pods, mongodb-kubernetes-ops-manager, Role with name mongodb-kubernetes-appdb and RoleBinding with name mongodb-kubernetes-appdb).

@filipcirtog filipcirtog added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Mar 24, 2026
@filipcirtog filipcirtog requested a review from Copilot March 25, 2026 08:54

Copilot AI left a comment


Pull request overview

Adds an operator-controlled delay before enabling backup for sharded clusters to mitigate a race between Ops Manager topology discovery and backup enablement (HELP-87476 / CLOUDP-389867).

Changes:

  • Introduces MDB_BACKUP_START_DELAY_SECONDS and a default delay value.
  • Adds a delay in updateOmDeploymentShardedCluster when a backup spec is present, before calling ensureBackupConfigurationAndUpdateStatus.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/util/constants.go Adds env var name and default delay constant for backup start delay.
controllers/operator/mongodbshardedcluster_controller.go Implements the delay before enabling backup for sharded clusters (env-configurable).


Comment on lines 3 to 9
import (
	"context"
	"fmt"
	"time"
	"github.com/google/go-cmp/cmp"
	"github.com/hashicorp/go-multierror"
	"go.uber.org/zap"

Copilot AI Mar 25, 2026


The import block is no longer in the expected sorted/grouped order after adding time (standard library imports should be grouped and ordered consistently, typically enforced by gofmt/gofumpt/gci). Please rerun the repo’s Go formatters (or reorder imports) so CI/pre-commit formatting checks don’t fail.

Comment on lines +1913 to +1918
log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
select {
case <-ctx.Done():
	return workflow.Failed(ctx.Err())
case <-time.After(time.Duration(delaySeconds) * time.Second):
}

Copilot AI Mar 25, 2026


Sleeping inside reconciliation will block the sharded-cluster controller worker for up to delaySeconds on every reconcile whenever a backup spec is present (including steady-state reconciles), which can severely slow processing and can also make existing unit tests take 60s+. Consider implementing this as a requeue (e.g., return a Pending status with RequeueAfter) and/or gating it so the delay is applied only once (e.g., only when backup is transitioning to enabled / status is not yet Started).

Suggested change
-log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
-select {
-case <-ctx.Done():
-	return workflow.Failed(ctx.Err())
-case <-time.After(time.Duration(delaySeconds) * time.Second):
-}
+log.Infof("Requeueing reconciliation to wait approximately %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
+return workflow.Pending("waiting before enabling backup; reconciliation will be requeued")

Collaborator


I second what Copilot mentioned. Instead of blocking the reconcile loop, we should requeue and make sure the configured time has passed before calling the backup config endpoint.

Comment on lines +1913 to +1917
log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
select {
case <-ctx.Done():
	return workflow.Failed(ctx.Err())
case <-time.After(time.Duration(delaySeconds) * time.Second):

Copilot AI Mar 25, 2026


delaySeconds is taken directly from an env var and could be negative/very large; combined with time.After this can lead to confusing logs (negative wait) and, when ctx.Done() wins, a timer that can’t be stopped early. Consider clamping to >= 0 (and optionally a sane max) and using a time.NewTimer that you Stop() on cancellation.

Suggested change
-log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
-select {
-case <-ctx.Done():
-	return workflow.Failed(ctx.Err())
-case <-time.After(time.Duration(delaySeconds) * time.Second):
+if delaySeconds < 0 {
+	delaySeconds = 0
+}
+log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
+timer := time.NewTimer(time.Duration(delaySeconds) * time.Second)
+defer timer.Stop()
+select {
+case <-ctx.Done():
+	return workflow.Failed(ctx.Err())
+case <-timer.C:

// The delay mitigates this by giving OM time to finish processing monitoring events before the
// /backupConfigs call is made. It applies to every enablement (including re-enables after
// termination, since OM resets configs back to Inactive after termination).
func isBackupBeingEnabled(sc *mdbv1.MongoDB, conn om.Connection, log *zap.SugaredLogger) bool {
Collaborator


I don't think you need to read the backup config again. You can add your check just after:

if desiredConfig.Status == config.Status {
}

and check if desiredConfig.Status == STARTED and config.Status == STOPPED/INACTIVE
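The status-transition check the reviewer describes could look roughly like this. A minimal sketch under stated assumptions: the status constants and the standalone `isBackupBeingEnabled` signature here are stand-ins for illustration, not the operator's actual backup config types:

```go
package main

import "fmt"

// Stand-in status values; the real operator compares Ops Manager backup config statuses.
const (
	statusStarted  = "STARTED"
	statusStopped  = "STOPPED"
	statusInactive = "INACTIVE"
)

// isBackupBeingEnabled reports whether the desired state turns backup on while the
// current config is still off — i.e. the enable transition the delay should gate on.
// Re-enables after termination are covered too, since OM resets configs to Inactive.
func isBackupBeingEnabled(desiredStatus, currentStatus string) bool {
	return desiredStatus == statusStarted &&
		(currentStatus == statusStopped || currentStatus == statusInactive)
}

func main() {
	fmt.Println(isBackupBeingEnabled(statusStarted, statusInactive)) // true: fresh enable
	fmt.Println(isBackupBeingEnabled(statusStarted, statusStarted))  // false: steady state
}
```

Placing this check next to the existing `desiredConfig.Status == config.Status` comparison avoids a second read of the backup config, as the reviewer suggests.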


Labels

skip-changelog Use this label in Pull Request to not require new changelog entry file


3 participants