
PBM-1609: Prevent PITR supervisor from restarting slicer during restore#1273

Merged
jcechace merged 4 commits into percona:dev from jcechace:PBM-1609-pitr-stop-during-restore
Mar 10, 2026

Conversation

@jcechace
Collaborator

@jcechace jcechace commented Feb 23, 2026

Ticket
https://perconadev.atlassian.net/browse/PBM-1609

Before:

10:09:04  [restore] oplog slicer disabled        ← restore proceeds immediately
10:09:04  [restore] recovery started              ← restore already running
10:09:05  [restore] moving to state cleanupCluster
10:09:07  [restore] moving to state running
10:09:07  [pitr] created chunk 10:08:10 - 10:09:06  ← CHUNK CREATED DURING RESTORE
10:09:07  [pitr] got done signal, stopping
10:09:08  [pitr] created chunk 10:09:06 - 10:09:07  ← ANOTHER CHUNK DURING RESTORE
10:09:08  [pitr] pausing/stopping

After:

09:55:54  [restore] oplog slicer disabled
09:55:59  [restore] waiting for PITR slicer to release OpLock  ← BLOCKS
09:56:04  [restore] waiting for PITR slicer to release OpLock  ← still waiting
09:56:08  [pitr] created chunk ...                              ← final chunk, restore hasn't started
09:56:09  [restore] PITR slicer stopped                         ← poll detects release
09:56:09  [restore] recovery started                            ← NOW restore proceeds

@jcechace jcechace marked this pull request as ready for review February 24, 2026 09:35
@jcechace jcechace force-pushed the PBM-1609-pitr-stop-during-restore branch from b49a97f to aaaf4e0 on February 25, 2026 09:48
@jcechace
Collaborator Author

jcechace commented Mar 3, 2026

@boris-ilijic Would appreciate if you could have a look today/early tomorrow

Member

@boris-ilijic boris-ilijic left a comment


The issue here is that the restore procedure and the PITR procedure work in parallel, and I'm still getting the same issue:

2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] oplog slicer disabled <-------pitr should be disabled after this
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] backup: 2026-03-03T15:31:39Z
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] recovery started
2026-03-03T15:34:17.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] waiting for 'starting' status
2026-03-03T15:34:18.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state cleanupCluster
2026-03-03T15:34:20.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state running
2026-03-03T15:34:22.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] restoring up to 3 collections in parallel
2026-03-03T15:34:22.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] restoring users and roles
2026-03-03T15:34:22.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state dumpDone
2026-03-03T15:34:24.000+0000 I [pitr] got done signal, stopping
2026-03-03T15:34:24.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] starting oplog replay
2026-03-03T15:34:24.000+0000 I [pitr] created chunk 2026-03-03T15:34:09 - 2026-03-03T15:34:24 <--------pitr is still creating chunks
2026-03-03T15:34:24.000+0000 I [pitr] pausing/stopping with last_ts 2026-03-03 15:34:24 +0000 UTC
2026-03-03T15:34:24.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] + applying {rs1 2026-03-03T15:31:39Z/rs1/oplog/20260303153140-13.20260303153144-2  {1772551900 13} {1772551904 2} 18316}

PBM should wait for the slicer to be completely disabled: the restore procedure should block until PITR is disabled (around the "oplog slicer disabled" place in the code). Moreover, this solution will not work when PITR and the restore run in different processes (agents), which is usually the case.

@jcechace
Collaborator Author

jcechace commented Mar 3, 2026

@boris-ilijic Ah... I see. Thanks. I originally thought this problem happens exactly when the restore runs on the agent running PITR, but you are right that it doesn't have to be the case (and likely won't be).

@jcechace jcechace force-pushed the PBM-1609-pitr-stop-during-restore branch 2 times, most recently from 5230aa7 to c9a32a2 on March 4, 2026 10:28
@jcechace jcechace requested a review from boris-ilijic March 4, 2026 12:30
@jcechace
Collaborator Author

jcechace commented Mar 4, 2026

@boris-ilijic hopefully good now. See the log snippets in the summary

l.Info("PITR slicer stopped")
return nil
}
l.Debug("waiting for PITR slicer to release OpLock")
Member

Let's move this in front of the for{} loop; there's no benefit to logging every try.

for {
select {
case <-ctx.Done():
return ctx.Err()
Member

We want to Wrap this one.

Collaborator Author

Do we? It's either nil or a context timeout. I would expect downstream to wrap it; this seems like a pointless wrapping level. But if you think it would be more correct, I don't feel strongly about it.

Member

Typically it's like that, but in this case it's good error info, so it's something like:
waiting for PITR slicer to stop: context deadline exceeded
vs
waiting for PITR slicer to stop: timeout during waiting oplog slicer to stop: context deadline exceeded

Actually the best would be to return a new error: "timeout waiting for the oplog slicer to stop". But I also don't feel strongly about it.

}
a.removePitr()
if err := a.waitForPITRSlicerStop(ctx, nodeInfo.SetName, l); err != nil {
l.Warning("waiting for PITR slicer to stop: %v", err)
Member

PBM tried to stop PITR and failed; do we want to continue or mark the restore as failed?

Collaborator Author

I would say continue; worst case, PITR will record invalid chunks?

Member

That means corrupted PITR. I'm not sure how realistic that is, but there is a chance. I'd say if we can fail, let's do it.

@jcechace jcechace force-pushed the PBM-1609-pitr-stop-during-restore branch from c9a32a2 to 391f715 on March 5, 2026 10:35
After disabling PITR via config and cancelling the local slicer context,
poll the PITR OpLock until it is released (or stale). This ensures the
oplog slicer has finished its last upload before the restore proceeds,
preventing oplog chunks from being created during the restore.

The OpLock check works for both same-process and cross-process slicers,
which is the common deployment where PITR and restore run on different
agents.
@jcechace jcechace force-pushed the PBM-1609-pitr-stop-during-restore branch from 391f715 to 383e193 on March 5, 2026 10:37
@jcechace jcechace requested a review from boris-ilijic March 5, 2026 10:37
Member

@boris-ilijic boris-ilijic left a comment


@jcechace Now, with the timeout error flow, if the primary node fails, secondary nodes still proceed in the case of a physical restore:

  • pitr is running
  • physical restore is started
  • agent failed due to timeout
  • other agents proceed with physical restore

jcechace added 3 commits March 9, 2026 12:36
Move removePitr() and waitForPITRSlicerStop() outside the primary-only
block so every node stops its local slicer. Add config check (pitr.enabled
== false) before the OpLock check to prevent the race where a slicer
restarts on another node before config is disabled.
Comment on lines +74 to +81
// Config is disabled
cfg, err := config.GetConfig(ctx, a.leadConn)
if err != nil {
return errors.Wrap(err, "get config")
}
if cfg.PITR.Enabled {
continue
}
Member

this is not necessary, but ok...

Collaborator Author

It is... Otherwise you have a race condition: the local slicer is stopped, so the wait passes, but in parallel the main PITR loop/supervisor restarts the slicer because it wasn't yet disabled by the leader.

@jcechace jcechace merged commit 17abae6 into percona:dev Mar 10, 2026
24 checks passed