PBM-1609: Prevent PITR supervisor from restarting slicer during restore #1273
Conversation
b49a97f to aaaf4e0
@boris-ilijic Would appreciate it if you could have a look today/early tomorrow
boris-ilijic
left a comment
The issue here is that the restore procedure and the PITR procedure work in parallel, and I am still getting the same issue:
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] oplog slicer disabled <-------pitr should be disabled after this
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] backup: 2026-03-03T15:31:39Z
2026-03-03T15:34:17.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] recovery started
2026-03-03T15:34:17.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] waiting for 'starting' status
2026-03-03T15:34:18.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state cleanupCluster
2026-03-03T15:34:20.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state running
2026-03-03T15:34:22.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] restoring up to 3 collections in parallel
2026-03-03T15:34:22.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] restoring users and roles
2026-03-03T15:34:22.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] moving to state dumpDone
2026-03-03T15:34:24.000+0000 I [pitr] got done signal, stopping
2026-03-03T15:34:24.000+0000 I [restore/2026-03-03T15:34:17.215422964Z] starting oplog replay
2026-03-03T15:34:24.000+0000 I [pitr] created chunk 2026-03-03T15:34:09 - 2026-03-03T15:34:24 <--------pitr is still creating chunks
2026-03-03T15:34:24.000+0000 I [pitr] pausing/stopping with last_ts 2026-03-03 15:34:24 +0000 UTC
2026-03-03T15:34:24.000+0000 D [restore/2026-03-03T15:34:17.215422964Z] + applying {rs1 2026-03-03T15:31:39Z/rs1/oplog/20260303153140-13.20260303153144-2 {1772551900 13} {1772551904 2} 18316}
PBM should wait for the slicer to be completely disabled, so the restore procedure should block until PITR is disabled (around the "oplog slicer disabled" point in the code). Moreover, this solution will not work when PITR and restores run in different processes (agents), which is usually the case.
@boris-ilijic Ah... I see. Thanks. I originally thought that this problem happens exactly when the restore runs on the agent running PITR, but you are right that that doesn't have to be the case (and likely won't be).
5230aa7 to c9a32a2
@boris-ilijic hopefully good now. See the log snippets in the summary.
cmd/pbm-agent/pitr.go
Outdated
	l.Info("PITR slicer stopped")
	return nil
}
l.Debug("waiting for PITR slicer to release OpLock")
Let's move this in front of the for {} loop; there's no benefit to logging every try.
cmd/pbm-agent/pitr.go
Outdated
for {
	select {
	case <-ctx.Done():
		return ctx.Err()
Do we? It's either nil or a context timeout. I would expect downstream to wrap it. Seems like a pointless wrapping level. But if you think it would be more correct, I don't feel strongly about it.
Typically it's like that, but in this case it's good error info, so it's something like:
waiting for PITR slicer to stop: context deadline exceeded
vs
waiting for PITR slicer to stop: timeout during waiting oplog slicer to stop: context deadline exceeded
Actually the best would be to return a new error: timeout during waiting oplog slicer to stop. But whatever, I also don't feel strongly about it.
cmd/pbm-agent/restore.go
Outdated
}
a.removePitr()
if err := a.waitForPITRSlicerStop(ctx, nodeInfo.SetName, l); err != nil {
	l.Warning("waiting for PITR slicer to stop: %v", err)
PBM tried to stop PITR and failed; do we want to continue or mark the restore as failed?
I would say continue; worst case, PITR will record invalid chunks?
That means corrupted PITR. I am not sure how realistic that is, but there is a chance. I'd say if we can fail, let's do it.
c9a32a2 to 391f715
After disabling PITR via config and cancelling the local slicer context, poll the PITR OpLock until it is released (or stale). This ensures the oplog slicer has finished its last upload before the restore proceeds, preventing oplog chunks from being created during the restore. The OpLock check works for both same-process and cross-process slicers, which is the common deployment where PITR and restore run on different agents.
391f715 to 383e193
@jcechace Now, when using the timeout error flow (primary node failed), the secondary nodes proceed in the case of a physical restore:
- pitr is running
- physical restore is started
- agent failed due to timeout
- other agents proceed with physical restore
Move removePitr() and waitForPITRSlicerStop() outside the primary-only block so every node stops its local slicer. Add a config check (pitr.enabled == false) before the OpLock check to prevent the race where a slicer restarts on another node before the config is disabled.
// Config is disabled
cfg, err := config.GetConfig(ctx, a.leadConn)
if err != nil {
	return errors.Wrap(err, "get config")
}
if cfg.PITR.Enabled {
	continue
}
this is not necessary, but ok...
It is... Otherwise you have a race condition: the local slicer is stopped, so the wait passes, but in parallel the main PITR loop / supervisor restarted the slicer because it wasn't yet disabled by the leader.
Ticket
https://perconadev.atlassian.net/browse/PBM-1609
Before:
After: