PBM-1345 Fail of PITR slicer on one RS doesn't stop the whole PITR process #1276

jcechace wants to merge 1 commit into percona:dev from
Conversation
…start loop

When a slicer fails with a permanent error (e.g. insufficient oplog range, missing base backup), PITR now stops cluster-wide and stays stopped until a successful backup clears the error state. Previously, leadNomination unconditionally called InitMeta, which cleared the error, causing an infinite fail-stop-restart loop in which healthy replica sets accumulated useless oplog chunks.
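The core of the fix can be sketched as a guard in the leader's nomination path: don't re-initialize the PITR metadata while the cluster is in the error state, so the error stays sticky. This is a minimal sketch with illustrative names (`Status`, `shouldInitMeta`); the real status values and the `InitMeta` call live in PBM's oplog/agent code.

```go
package main

import "fmt"

// Status mirrors the PITR cluster status; values are illustrative.
type Status string

const (
	StatusUnset   Status = "unset"
	StatusRunning Status = "running"
	StatusError   Status = "error"
)

// shouldInitMeta sketches the guard added to leadNomination: the leader
// must not replace the pbmPITR document (which would wipe the sticky
// error) while the cluster is in the error state.
func shouldInitMeta(current Status) bool {
	return current != StatusError
}

func main() {
	fmt.Println(shouldInitMeta(StatusRunning)) // true: safe to init meta
	fmt.Println(shouldInitMeta(StatusError))   // false: keep error sticky
}
```

With this guard in place, the fail → stop → restart → fail cycle is broken: the supervisor sees the error state and leaves the slicers stopped.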
boris-ilijic left a comment:
We can work on this approach, but the current problem is that when PITR fails on the first start, we cannot start it again in any way except by creating a new backup. Changing the config, stopping/starting PITR, --force-resync... none of it works, and everything related to PITR is blocked and locked. Please fix that first.
I also proposed a preferable approach in an inline comment.
```go
if nodeInfo.IsLeader() {
	status, serr := oplog.GetClusterStatus(ctx, a.leadConn)
	if serr == nil && status == oplog.StatusError {
		if ierr := oplog.SetClusterStatus(ctx, a.leadConn, oplog.StatusUnset); ierr != nil {
			l.Warning("clear PITR error status: %v", ierr)
		} else {
			l.Info("PITR error state cleared after successful backup")
		}
	}
}
```
Maybe remove this from the backup logic, because backup and PITR should be orthogonal features.
I'd prefer to have e.g. a pitrBackupMonitor for that detection.
Very good idea, I will look into it
Can you be more specific about the type of failure? What you are describing should happen only if it is impossible for PITR to start precisely because it requires a new backup. If that is not the case then I indeed need to fix that -- if you know how to reproduce this "lockout", please share.
There are a few ways to reproduce it, as I explained in the comment above, but this can serve as an example:
Thank you, I think I understand the scenario now.
It also nicely points to moving the sticky-error clearing into the main PITR loop. Sweet!
Ticket:
https://perconadev.atlassian.net/browse/PBM-1345
Root cause:
When a PITR slicer fails on one replica set (e.g. insufficient oplog range, missing base backup), leadNomination() unconditionally calls InitMeta() which replaces the entire pbmPITR document — clearing the error state. This causes an infinite fail → stop → restart → fail loop where healthy replica sets accumulate useless oplog chunks.
Fix:
Classify slicer errors as fatal vs retriable. Fatal errors (no base backup, insufficient oplog range) set StatusError which stops all slicers and prevents the supervisor from restarting them until a successful backup clears the state. Retriable errors (network blips, transient failures) are logged without writing error status, allowing the supervisor to restart the slicer without disrupting other replica sets. OpMovedError (normal lock handoff between nodes) is no longer treated as an error.