CLOUDP-389867: add delay to backupConfig sharded cluster #935
filipcirtog wants to merge 4 commits into master
Conversation
MCK 1.7.1 Release Notes: Bug Fixes, Other Changes
Pull request overview
Adds an operator-controlled delay before enabling backup for sharded clusters to mitigate a race between Ops Manager topology discovery and backup enablement (HELP-87476 / CLOUDP-389867).
Changes:
- Introduces `MDB_BACKUP_START_DELAY_SECONDS` and a default delay value.
- Adds a delay in `updateOmDeploymentShardedCluster` when a backup spec is present, before calling `ensureBackupConfigurationAndUpdateStatus`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pkg/util/constants.go` | Adds env var name and default delay constant for backup start delay. |
| `controllers/operator/mongodbshardedcluster_controller.go` | Implements the delay before enabling backup for sharded clusters (env-configurable). |
```go
import (
	"context"
	"fmt"
	"time"
	"github.com/google/go-cmp/cmp"
	"github.com/hashicorp/go-multierror"
	"go.uber.org/zap"
```
The import block is no longer in the expected sorted/grouped order after adding time (standard library imports should be grouped and ordered consistently, typically enforced by gofmt/gofumpt/gci). Please rerun the repo’s Go formatters (or reorder imports) so CI/pre-commit formatting checks don’t fail.
```go
log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
select {
case <-ctx.Done():
	return workflow.Failed(ctx.Err())
case <-time.After(time.Duration(delaySeconds) * time.Second):
}
```
Sleeping inside reconciliation will block the sharded-cluster controller worker for up to delaySeconds on every reconcile whenever a backup spec is present (including steady-state reconciles), which can severely slow processing and can also make existing unit tests take 60s+. Consider implementing this as a requeue (e.g., return a Pending status with RequeueAfter) and/or gating it so the delay is applied only once (e.g., only when backup is transitioning to enabled / status is not yet Started).
Suggested change:
```diff
-log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
-select {
-case <-ctx.Done():
-	return workflow.Failed(ctx.Err())
-case <-time.After(time.Duration(delaySeconds) * time.Second):
-}
+log.Infof("Requeueing reconciliation to wait approximately %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
+return workflow.Pending("waiting before enabling backup; reconciliation will be requeued")
```
I second what Copilot mentioned. Instead of blocking the reconcile loop, we should requeue and make sure the configured time has passed before calling the backup config endpoint.
```go
log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
select {
case <-ctx.Done():
	return workflow.Failed(ctx.Err())
case <-time.After(time.Duration(delaySeconds) * time.Second):
```
`delaySeconds` is taken directly from an env var and could be negative/very large; combined with `time.After` this can lead to confusing logs (negative wait) and, when `ctx.Done()` wins, a timer that can't be stopped early. Consider clamping to >= 0 (and optionally a sane max) and using a `time.NewTimer` that you `Stop()` on cancellation.
Suggested change:
```diff
-log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
-select {
-case <-ctx.Done():
-	return workflow.Failed(ctx.Err())
-case <-time.After(time.Duration(delaySeconds) * time.Second):
+if delaySeconds < 0 {
+	delaySeconds = 0
+}
+log.Infof("Waiting %d seconds before enabling backup to avoid race condition between OM topology discovery and backup enablement", delaySeconds)
+timer := time.NewTimer(time.Duration(delaySeconds) * time.Second)
+defer timer.Stop()
+select {
+case <-ctx.Done():
+	return workflow.Failed(ctx.Err())
+case <-timer.C:
```
```go
// The delay mitigates this by giving OM time to finish processing monitoring events before the
// /backupConfigs call is made. It applies to every enablement (including re-enables after
// termination, since OM resets configs back to Inactive after termination).
func isBackupBeingEnabled(sc *mdbv1.MongoDB, conn om.Connection, log *zap.SugaredLogger) bool {
```
I don't think you need to read the backup config again. You can add your check just after:

```go
if desiredConfig.Status == config.Status {
}
```

and check if `desiredConfig.Status == STARTED` and `config.Status == STOPPED/INACTIVE`.
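A minimal sketch of the status comparison this comment describes, operating on the two statuses already in hand rather than re-reading the config from OM. The status string constants are assumptions modeled on Ops Manager backup config states; the real types in the repo likely differ:

```go
package main

import "fmt"

// Status values mirror Ops Manager backup config statuses; names are assumptions.
const (
	statusStarted  = "STARTED"
	statusStopped  = "STOPPED"
	statusInactive = "INACTIVE"
)

// isBackupBeingEnabled reports whether the desired config turns backup on
// while the current config still has it off, i.e. the one transition the
// start delay should gate. Sketch of the reviewer's suggestion, not the PR's code.
func isBackupBeingEnabled(desiredStatus, currentStatus string) bool {
	return desiredStatus == statusStarted &&
		(currentStatus == statusStopped || currentStatus == statusInactive)
}

func main() {
	fmt.Println(isBackupBeingEnabled(statusStarted, statusInactive)) // prints true
	fmt.Println(isBackupBeingEnabled(statusStarted, statusStarted))  // prints false
}
```

Gating the delay on this predicate keeps steady-state reconciles (where desired and current status already match) fast, since the delay then applies only while backup is actually transitioning to enabled.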
HELP
HELP-87476 - This Jira ticket addresses a race condition occurring when enabling backups for a sharded cluster deployed with the Kubernetes operator. The issue arises as individual shards may not receive 'addShard' events, leading to their indefinite inactivity. Investigations are focused on identifying the race condition in Ops Manager and finding a solution to ensure all shards are included in backups without delay.
Summary
When enabling backup on a sharded cluster, Ops Manager needs time to complete its internal topology discovery before it can successfully accept a backup request. Without a delay, the operator races against OM's discovery, causing backup enablement to fail and triggering reconciliation loops until a retry eventually succeeds.
This race is specific to sharded clusters due to their multi-process topology (mongos + config servers + shards), which takes longer for OM to fully register compared to replica sets.
Proof of Work
A configurable sleep is inserted in updateOmDeploymentShardedCluster immediately before calling ensureBackupConfigurationAndUpdateStatus, but only when a backup spec is present. The delay defaults to 60 seconds and is controlled by the MDB_BACKUP_START_DELAY_SECONDS environment variable on the operator deployment, allowing users to tune or disable it per environment.
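Since the delay is tuned via an env var on the operator deployment, overriding it would look something like the following Deployment fragment; the container name is illustrative and the rest of the spec is elided:

```yaml
# Illustrative operator Deployment fragment; container name varies by install.
spec:
  template:
    spec:
      containers:
        - name: mongodb-kubernetes-operator
          env:
            - name: MDB_BACKUP_START_DELAY_SECONDS
              value: "0"   # set to 0 to disable the delay; the default is 60
```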
Checklist
- `skip-changelog` label if not needed