
PBM-1345 Fail of PITR slicer on one RS doesn't stop the whole PITR process #1276

Open

jcechace wants to merge 1 commit into percona:dev from jcechace:PBM-1345-fail-pitr-on-rs-fail

Conversation

@jcechace
Collaborator

Ticket:
https://perconadev.atlassian.net/browse/PBM-1345

Root cause:
When a PITR slicer fails on one replica set (e.g. insufficient oplog range, missing base backup), leadNomination() unconditionally calls InitMeta() which replaces the entire pbmPITR document — clearing the error state. This causes an infinite fail → stop → restart → fail loop where healthy replica sets accumulate useless oplog chunks.

Fix:
Classify slicer errors as fatal vs retriable. Fatal errors (no base backup, insufficient oplog range) set StatusError which stops all slicers and prevents the supervisor from restarting them until a successful backup clears the state. Retriable errors (network blips, transient failures) are logged without writing error status, allowing the supervisor to restart the slicer without disrupting other replica sets. OpMovedError (normal lock handoff between nodes) is no longer treated as an error.

…start loop

When a slicer fails with a permanent error (e.g. insufficient oplog range,
missing base backup), PITR now stops cluster-wide and stays stopped until
a successful backup clears the error state. Previously, leadNomination
unconditionally called InitMeta which cleared the error, causing an
infinite fail-stop-restart loop where healthy replica sets accumulated
useless oplog chunks.
@jcechace jcechace force-pushed the PBM-1345-fail-pitr-on-rs-fail branch from 38bb5c2 to b83ba60 Compare March 4, 2026 12:30
@jcechace jcechace marked this pull request as ready for review March 4, 2026 16:44
@jcechace jcechace requested a review from boris-ilijic as a code owner March 4, 2026 16:44
Member

@boris-ilijic boris-ilijic left a comment

We can work on this approach, but the current problem is that when PITR fails on the first start, we cannot start it again in any way except by creating a new backup. Changing the config, stopping/starting PITR, --force-resync... none of it works, and everything related to PITR is blocked and locked. Please fix that first.

I also proposed a preferable approach in inline comment.

Comment on lines +265 to +274
if nodeInfo.IsLeader() {
	status, serr := oplog.GetClusterStatus(ctx, a.leadConn)
	if serr == nil && status == oplog.StatusError {
		if ierr := oplog.SetClusterStatus(ctx, a.leadConn, oplog.StatusUnset); ierr != nil {
			l.Warning("clear PITR error status: %v", ierr)
		} else {
			l.Info("PITR error state cleared after successful backup")
		}
	}
}
Member

Maybe remove this from the backup logic, because backup and PITR should be orthogonal features.
I'd prefer to have e.g. a pitrBackupMonitor for that detection.

Collaborator Author

Very good idea, I will look into it

@jcechace
Collaborator Author

jcechace commented Mar 9, 2026

We can work on this approach, but the current problem is that when PITR fails on the first start, we cannot start it again in any way except by creating a new backup. Changing the config, stopping/starting PITR, --force-resync... none of it works, and everything related to PITR is blocked and locked. Please fix that first.

I also proposed a preferable approach in inline comment.

Can you be more specific about the type of failure? What you are describing should happen only if it is impossible for PITR to start precisely because it requires a new backup.

If that is not the case then I indeed need to fix that. If you know how to reproduce this "lockout", please share.

@boris-ilijic
Member

Can you be more specific about the type of failure?

There are a few ways to reproduce it, as I explained in the comment above, but this can be an example:

  • start PITR without a base backup --> it fails, and PBM's PITR main loop is blocked from then on
  • change the storage: pbm config --file
  • the PBM main loop is still blocked, and it's not possible to start PITR

@jcechace
Collaborator Author

jcechace commented Mar 9, 2026

Can you be more specific about the type of failure?

There are a few ways to reproduce it, as I explained in the comment above, but this can be an example:

  • start PITR without a base backup --> it fails, and PBM's PITR main loop is blocked from then on
  • change the storage: pbm config --file
  • the PBM main loop is still blocked, and it's not possible to start PITR

Thank you, I think I understand the scenario now:

  • doing a resync of the main storage syncs pbmBackup and pbmPITRChunks
  • it doesn't touch pbmPITR, which persists the sticky error
  • that's an issue, since after the resync we might be in a healthy state where PITR should be able to start

It also nicely points to moving the sticky-error cleanup into the main PITR loop. Sweet!
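The resulting behavior could be sketched as one iteration of the main PITR loop: stay stopped while the sticky error holds and no base backup exists, and clear the error here (not in backup code) once a usable backup appears, e.g. after a resync. All names (pitrState, step, the field names) are illustrative, not the actual PBM code.

```go
package main

import "fmt"

// pitrState is a toy model of the relevant pbmPITR fields.
type pitrState struct {
	errSticky     bool // StatusError persisted in pbmPITR
	hasBaseBackup bool // becomes true after a backup or a resync finds one
}

// step models one iteration of the main PITR loop: the sticky error is
// cleared in the PITR loop itself once a usable base backup exists.
func step(s *pitrState) string {
	if s.errSticky {
		if !s.hasBaseBackup {
			return "stay stopped"
		}
		s.errSticky = false
		return "error cleared, start slicers"
	}
	return "run slicers"
}

func main() {
	s := &pitrState{errSticky: true}
	fmt.Println(step(s)) // stay stopped: no base backup yet
	s.hasBaseBackup = true // e.g. resync found a backup on the new storage
	fmt.Println(step(s)) // error cleared, start slicers
	fmt.Println(step(s)) // run slicers
}
```

With this shape, changing the storage and resyncing is enough to unblock PITR, because the next loop iteration observes the healthy state and clears the error on its own.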
