**Relay-log-replay technique for PITR:** Traditional PITR pipes binary logs through `mysqlbinlog`, which is single-threaded and slow. Instead, this implementation downloads binlogs from S3, renames them as relay logs, and replays them using MySQL's replication SQL thread via `CHANGE REPLICATION SOURCE TO RELAY_LOG_FILE=..., SOURCE_HOST='dummy'` followed by `START REPLICA SQL_THREAD UNTIL ...`. This technique is described in detail by [lefred's blog post on faster MySQL PITR](https://lefred.be/content/howto-make-mysql-point-in-time-recovery-faster/). The key advantage is that MySQL's SQL thread can leverage parallel applier workers for concurrent transaction replay.

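The download-and-rename step above can be sketched as follows. The `<host>-relay-bin.NNNNNN` naming and the index file follow MySQL's relay-log conventions; the helper name and its arguments are illustrative, not the actual implementation.

```python
import os

def stage_relay_logs(binlog_files, datadir, host="restore-host"):
    """Rename downloaded binlogs to relay-log names and write the
    relay log index file, so MySQL's SQL thread can replay them."""
    relay_logs = []
    for i, src in enumerate(sorted(binlog_files), start=1):
        dst = os.path.join(datadir, f"{host}-relay-bin.{i:06d}")
        os.rename(src, dst)
        relay_logs.append(dst)
    # MySQL discovers the relay logs through the index file.
    index_path = os.path.join(datadir, f"{host}-relay-bin.index")
    with open(index_path, "w") as f:
        f.write("\n".join(relay_logs) + "\n")
    return relay_logs
```

After staging, `CHANGE REPLICATION SOURCE TO RELAY_LOG_FILE='<host>-relay-bin.000001', RELAY_LOG_POS=4, SOURCE_HOST='dummy'` points the SQL thread at the first staged file.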
**GTID-based vs date-based recovery targets:**
- *GTID-based:* User specifies a GTID set (e.g., `uuid:N`). The SQL thread replays until just before that GTID using `START REPLICA SQL_THREAD UNTIL SQL_BEFORE_GTIDS`.
- *Date-based:* User specifies a datetime. The last relay log is parsed with `mysqlbinlog --stop-datetime` to find the exact binary log position, then `START REPLICA SQL_THREAD UNTIL RELAY_LOG_FILE, RELAY_LOG_POS` is used.

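The two `UNTIL` variants above can be made concrete with a small helper that builds the statement from the chosen recovery target. The helper and its argument shape are illustrative; the SQL syntax follows the MySQL statements quoted above.

```python
def build_until_statement(target_type, gtid_set=None,
                          relay_log_file=None, relay_log_pos=None):
    """Build the START REPLICA statement for the chosen recovery target."""
    if target_type == "gtid":
        # Replay stops just before the given GTID set (exclusive).
        return f"START REPLICA SQL_THREAD UNTIL SQL_BEFORE_GTIDS = '{gtid_set}'"
    if target_type == "date":
        # Position is resolved beforehand with `mysqlbinlog --stop-datetime`.
        return (f"START REPLICA SQL_THREAD UNTIL RELAY_LOG_FILE = '{relay_log_file}', "
                f"RELAY_LOG_POS = {relay_log_pos}")
    raise ValueError(f"unknown recovery target type: {target_type}")
```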
**Binlog search:** The `binlog_server` binary provides `search_by_gtid_set` and `search_by_timestamp` subcommands that query S3 metadata to return the list of binlog files covering the range from the backup's GTID position to the target recovery point. The operator runs these by exec'ing into the binlog server pod via the Kubernetes API (equivalent to `kubectl exec`).
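Assuming the search subcommands print one binlog file name per line (the actual output format is not specified here), the operator-side parsing could be as small as:

```python
def parse_binlog_list(search_output: str) -> list[str]:
    """Parse the binlog file list printed by a search subcommand.
    Assumes one name per line; the real output format may differ."""
    return [line.strip() for line in search_output.splitlines() if line.strip()]
```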

Specifies the recovery target type. Determines which search function is called on the binlog server and which `START REPLICA UNTIL` variant is used.

- **`spec.pitr.gtid`** *(required when type is `"gtid"`)*:
  The GTID set to recover to. Replay stops just *before* this GTID (using `SQL_BEFORE_GTIDS`), so the transaction identified by this GTID is NOT applied. This is the "exclude" semantic — the user specifies the bad transaction they want to skip.

### 4.4 User-Facing Behavior Changes

- When `spec.pause` is set to `true` on a cluster, the main reconciler now also disables PITR (`spec.backup.pitr.enabled = false`) to prevent the binlog server from running while the cluster is paused.
- The binlog server StatefulSet is now cleaned up (deleted) when PITR is disabled, via a new `cleanupBinlogServer` function in the main reconciler.

**Chosen approach:** Set `spec.backup.pitr.enabled = false` when `spec.pause = true` in `CheckNSetDefaults`.

**Why:** A paused cluster has no MySQL pods running. The binlog server would fail to connect and enter a crash loop. Disabling PITR cleanly prevents this.

### 8.1 No Binlogs Found for Recovery Target

**Scenario:** The user specifies a PITR target (GTID or date) for which no matching binlogs exist in S3. This can happen if the binlog server was not running during the time period, or if the target is before the backup.

- `getStopPosition` returns error: `"no end_log_pos found in mysqlbinlog output"`.
- The PITR job fails with a clear error message.
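A sketch of how `getStopPosition` might detect this case, assuming it scans `mysqlbinlog` text output for the last `end_log_pos` marker (the real implementation presumably lives in the operator's Go code; this Python version is illustrative only):

```python
import re

def get_stop_position(mysqlbinlog_output: str) -> int:
    """Return the last end_log_pos seen in mysqlbinlog output,
    or raise the error described above when none is found."""
    positions = re.findall(r"end_log_pos (\d+)", mysqlbinlog_output)
    if not positions:
        raise ValueError("no end_log_pos found in mysqlbinlog output")
    return int(positions[-1])
```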

### 8.5 S3 Download Failure During Setup

### 8.6 mysqld Fails to Start in PITR Job

**Scenario:** The temporary mysqld started by the entrypoint script fails (e.g., corrupted data directory from a failed restore).

## 11. Open Questions [REQUIRED]

1. **Binlog server connects to pod-0 only:** The binlog server always connects to `mysql-0` via FQDN. In async replication, if the primary fails over to another pod, the binlog server may connect to a replica and miss transactions (or fail if `log_replica_updates` is off). Should the binlog server connect to the primary service instead?
   - *Option A:* Connect to pod-0 (current). Simple, but may miss transactions after failover.

### 9.3 Operator Version Skew

- **Operator upgraded, CR unchanged:** The binlog server gets the expanded configuration on the next reconciliation. If the binlog server binary doesn't support the new config fields, it may fail.
- **Restore CR with `pitr` field on old operator:** If CRDs are updated, the old operator ignores the unknown `pitr` field and performs a regular restore without PITR. No error, but the user doesn't get PITR. If CRDs are not updated, `kubectl apply` fails with `unknown field: spec.pitr`.

| `storage.fs_buffer_directory` | Local buffer directory for in-flight binlog data (`/var/lib/binlogsrv`) |
| `storage.checkpoint_size` / `checkpoint_interval` | Controls how frequently data is flushed to S3 |

**S3 URI format change:** The S3 URI was changed from `s3://accessKey:secretKey@bucket.region` to `protocol://accessKey:secretKey@host/bucket` to correctly support custom endpoints (e.g., MinIO). The protocol is extracted from the `endpointURL` field.
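Constructing the new-format URI from an `endpointURL` can be sketched as follows; the helper name is illustrative, and only the URI shape described above is assumed.

```python
from urllib.parse import urlparse

def build_binlog_server_uri(endpoint_url: str, access_key: str,
                            secret_key: str, bucket: str) -> str:
    """Build the binlog server S3 URI in the
    protocol://accessKey:secretKey@host/bucket format described above."""
    parsed = urlparse(endpoint_url)
    return f"{parsed.scheme}://{access_key}:{secret_key}@{parsed.netloc}/{bucket}"
```

Since these URIs embed credentials, they should be kept out of logs, events, and error messages.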
)
```

> [!NOTE]

- *Recommendation:* Document the exclusion semantic clearly. This is consistent with the behavior in K8SPXC.
- *Resolution:* Pending.

5. **Parallel replay workers:** Should the operator let users configure `replica_parallel_workers`?

6. **Downloading binary logs:** The operator downloads all binary logs collected by PBS without considering whether they are needed. Should we implement smarter logic to decide which slice of binary logs needs to be downloaded?
   - *Option A:* Relay-log-based recovery already handles transactions that are already applied to the database. No need to overcomplicate.
   - *Option B:* If PBS has been collecting binary logs for a long time, the `pitr` binary might need to download a lot of files, which could delay the recovery. The operator needs to be smarter: understand the last write and/or last GTID of the backup and decide from where to start downloading files.
   - *Resolution:* Deferred.
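Option B above could be sketched as follows, assuming per-file metadata that records each binlog's first GTID sequence number (the metadata shape is hypothetical; the real binlog server metadata may differ):

```python
def select_binlogs_to_download(binlogs, backup_last_seqno):
    """Pick the slice of binlogs needed for replay: skip files whose
    events all precede the backup's last applied GTID sequence number.
    `binlogs` is a list of (name, first_seqno) tuples sorted by name."""
    needed = []
    for i, (name, first_seqno) in enumerate(binlogs):
        # A file can be skipped only if the next file starts at or before
        # the first transaction after the backup position.
        next_first = binlogs[i + 1][1] if i + 1 < len(binlogs) else None
        ends_before_backup = next_first is not None and next_first <= backup_last_seqno + 1
        if not ends_before_backup:
            needed.append(name)
    return needed
```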

5. **`SQL_BEFORE_GTIDS` semantics clarity:** The GTID target uses `SQL_BEFORE_GTIDS`, meaning the specified transaction is *excluded*. Should the CRD documentation make this explicit, or should we offer both "before" and "at" semantics?
   - *Recommendation:* Document the exclusion semantic clearly. This is consistent with the behavior in K8SPXC.

- **`spec.pitr.date`** *(required when type is `"date"`)*:
  Datetime string in `YYYY-MM-DD HH:MM:SS` format (e.g., `"2026-03-20 10:30:00"`). Replay stops at the last event before this timestamp. The last relay log is parsed with `mysqlbinlog --stop-datetime` to find the exact position.
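Validating the user-supplied datetime before handing it to `mysqlbinlog --stop-datetime` might look like the sketch below. The format string matches the example above; whether the value is interpreted as UTC or server-local time is a detail this design still needs to pin down.

```python
from datetime import datetime

def to_stop_datetime(value: str) -> str:
    """Parse the CR's date string and render it in the
    YYYY-MM-DD HH:MM:SS form accepted by mysqlbinlog --stop-datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")
```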

### 2.2 Key Constraints

1. **S3 only for binlog storage:** The `binlog_server` binary currently supports S3 as its storage backend. The binlog server configuration uses an S3 URI with embedded credentials.
   - *Resolution:* Pending.

2. **Replicating from primary:** Even if we resolve the question above, we also need to decide whether PBS will replicate from a secondary or the primary.
   - *Option A:* Use a headless service that matches all MySQL pods. It's okay if the primary is used as the replication source.
commit: a7b88ac |

---

## 7. User Experience [REQUIRED]

> Will we also support restoring from backupSource?
>
> I think we should, yes.
CHANGE DESCRIPTION
Add enhancement proposal for point-in-time recovery.