PBM-1609: Prevent PITR supervisor from restarting slicer during restore #1273
Conversation
b49a97f to aaaf4e0
@boris-ilijic Would appreciate it if you could have a look today/early tomorrow
boris-ilijic
left a comment
The issue here is that the restore procedure and the PITR procedure work in parallel, and I am still getting the same issue:
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] oplog slicer disabled <-------pitr should be disabled after this
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] backup: 2026-03-03T15:31:39Z
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] recovery started
2026-03-03T15:34:17.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] waiting for 'starting' status
2026-03-03T15:34:18.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state cleanupCluster
2026-03-03T15:34:20.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state running
2026-03-03T15:34:22.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] restoring up to 3 collections in parallel
2026-03-03T15:34:22.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] restoring users and roles
2026-03-03T15:34:22.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state dumpDone
2026-03-03T15:34:24.000+0000 I [pitr] got done signal, stopping
2026-03-03T15:34:24.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] starting oplog replay
2026-03-03T15:34:24.000+0000 I [pitr] created chunk 2026-03-03T15:34:09 - 2026-03-03T15:34:24 <--------pitr is still creating chunks
2026-03-03T15:34:24.000+0000 I [pitr] pausing/stopping with last_ts 2026-03-03 15:34:24 +0000 UTC
2026-03-03T15:34:24.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] + applying {rs1 2026-03-03T15:31:39Z/rs1/oplog/20260303153140-13.20260303153144-2 {1772551900 13} {1772551904 2} 18316}
PBM should wait for the slicer to be completely disabled, so the restore procedure should block until PITR is disabled (around the "oplog slicer disabled" point in the code). Moreover, this solution will not work when PITR and restores run in different processes (agents), which is usually the case.
@boris-ilijic Ah... I see. Thanks. I originally thought that this problem happens exactly when the restore runs on the agent running PITR, but you are right that that doesn't have to be the case (and likely won't be).
5230aa7 to c9a32a2
@boris-ilijic hopefully good now. See the log snippets in the summary.
cmd/pbm-agent/pitr.go
Outdated
	l.Info("PITR slicer stopped")
	return nil
}
l.Debug("waiting for PITR slicer to release OpLock")
Let's move this in front of the for {} loop; there's no benefit to logging every try.
cmd/pbm-agent/pitr.go
Outdated
for {
	select {
	case <-ctx.Done():
		return ctx.Err()
Do we? It's either nil or a context timeout. I would expect downstream to wrap it. Seems like a pointless wrapping level. But if you think it would be more correct, I don't feel strongly about it.
Typically it's like that, but in this case it's good error info, so it's something like:
waiting for PITR slicer to stop: context deadline exceeded
vs
waiting for PITR slicer to stop: timeout during waiting oplog slicer to stop: context deadline exceeded
Actually the best would be to return a new error: timeout during waiting oplog slicer to stop. But whatever, I also don't feel strongly about it.
cmd/pbm-agent/restore.go
Outdated
}
a.removePitr()
if err := a.waitForPITRSlicerStop(ctx, nodeInfo.SetName, l); err != nil {
	l.Warning("waiting for PITR slicer to stop: %v", err)
PBM tried to stop PITR and failed; do we want to continue or mark the restore as failed?
I would say continue; worst case, PITR will record invalid chunks?
That means corrupted PITR. I am not sure how realistic that is, but there is a chance. I'd say if we can fail, let's do it.
c9a32a2 to 391f715
After disabling PITR via config and cancelling the local slicer context, poll the PITR OpLock until it is released (or stale). This ensures the oplog slicer has finished its last upload before the restore proceeds, preventing oplog chunks from being created during the restore. The OpLock check works for both same-process and cross-process slicers, which is the common deployment where PITR and restore run on different agents.
391f715 to 383e193
@jcechace Now, when using the timeout error flow (primary node failed), the secondary nodes proceed in the case of a physical restore:
- pitr is running
- physical restore is started
- agent failed due to timeout
- other agents proceed with physical restore
Move removePitr() and waitForPITRSlicerStop() outside the primary-only block so every node stops its local slicer. Add a config check (pitr.enabled == false) before the OpLock check to prevent the race where a slicer restarts on another node before the config is disabled.
// Config is disabled
cfg, err := config.GetConfig(ctx, a.leadConn)
if err != nil {
	return errors.Wrap(err, "get config")
}
if cfg.PITR.Enabled {
	continue
}
this is not necessary, but ok...
It is... Otherwise you have a race condition: the local slicer is stopped, so the wait passes, but in parallel the main PITR loop / supervisor restarted the slicer because it wasn't yet disabled by the leader.
Ticket
https://perconadev.atlassian.net/browse/PBM-1609
Before:
After: