PBM-1345 Fail of PITR slicer on one RS doesn't stop the whole PITR process #1276

jcechace wants to merge 1 commit into percona:dev from
Conversation
…start loop

When a slicer fails with a permanent error (e.g. insufficient oplog range, missing base backup), PITR now stops cluster-wide and stays stopped until a successful backup clears the error state. Previously, leadNomination unconditionally called InitMeta, which cleared the error, causing an infinite fail-stop-restart loop in which healthy replica sets accumulated useless oplog chunks.
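The core of the fix can be sketched as a guard in the leader's nomination path: don't re-initialize the PITR metadata while the cluster is in the error state, so the error stays sticky. This is a minimal sketch with illustrative names (`Status`, `shouldInitMeta`); the real status values and the `InitMeta` call live in PBM's oplog/agent code.

```go
package main

import "fmt"

// Status mirrors the PITR cluster status; values are illustrative.
type Status string

const (
	StatusUnset   Status = "unset"
	StatusRunning Status = "running"
	StatusError   Status = "error"
)

// shouldInitMeta sketches the guard added to leadNomination: the leader
// must not replace the pbmPITR document (which would wipe the sticky
// error) while the cluster is in the error state.
func shouldInitMeta(current Status) bool {
	return current != StatusError
}

func main() {
	fmt.Println(shouldInitMeta(StatusRunning)) // true: safe to init meta
	fmt.Println(shouldInitMeta(StatusError))   // false: keep error sticky
}
```

With this guard in place, the fail → stop → restart → fail cycle is broken: the supervisor sees the error state and leaves the slicers stopped.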
boris-ilijic left a comment:
We can work on this approach, but the current problem is that when PITR fails on the first start, we cannot start it again in any way except by creating a new backup. Changing the config, stopping/starting PITR, --force-resync... none of it works, and everything related to PITR is blocked and locked. Please fix that first.
I also proposed a preferable approach in an inline comment.
```go
if nodeInfo.IsLeader() {
	status, serr := oplog.GetClusterStatus(ctx, a.leadConn)
	if serr == nil && status == oplog.StatusError {
		if ierr := oplog.SetClusterStatus(ctx, a.leadConn, oplog.StatusUnset); ierr != nil {
			l.Warning("clear PITR error status: %v", ierr)
		} else {
			l.Info("PITR error state cleared after successful backup")
		}
	}
}
```
Maybe remove this from the backup logic, because backup and PITR should be orthogonal features.
I'd prefer to have e.g. a pitrBackupMonitor for that detection.
Very good idea, I will look into it
Can you be more specific about the type of failure? What you are describing should happen only if it is impossible for PITR to start precisely because it requires a new backup. If that is not the case then I indeed need to fix that -- if you know how to reproduce this "lockout", please share.
There are a few ways to reproduce it, as I explained in the comment above, but this can serve as an example:
Thank you, I think I understand the scenario now.
It also nicely points to moving the sticky-error clearing into the main PITR loop. Sweet!
Ticket:
https://perconadev.atlassian.net/browse/PBM-1345
Root cause:
When a PITR slicer fails on one replica set (e.g. insufficient oplog range, missing base backup), leadNomination() unconditionally calls InitMeta() which replaces the entire pbmPITR document — clearing the error state. This causes an infinite fail → stop → restart → fail loop where healthy replica sets accumulate useless oplog chunks.
Fix:
Classify slicer errors as fatal vs retriable. Fatal errors (no base backup, insufficient oplog range) set StatusError which stops all slicers and prevents the supervisor from restarting them until a successful backup clears the state. Retriable errors (network blips, transient failures) are logged without writing error status, allowing the supervisor to restart the slicer without disrupting other replica sets. OpMovedError (normal lock handoff between nodes) is no longer treated as an error.