PITR restore stuck: binlog collector not recreated after restore job completion causing operator deadlock #2348
Description
Report
Summary
When performing a Point-In-Time Recovery (PITR) on a Percona XtraDB Cluster managed by the Operator, the restore process remains stuck in the "Restoring" status indefinitely, even though the restore job completes successfully and the cluster reports as ready. The operator appears to wait for a binlog collector pod that is not recreated during/after the restore, leading to a deadlock. Deleting the PerconaXtraDBClusterRestore object immediately unblocks the operator and triggers binlog collector creation.
More about the problem
Configuration (restore CR)
```yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: restore-artur
  namespace: xtradb
spec:
  pxcCluster: mysql
  backupName: cron-mysql-s3-bckp-2026120703-6oioo
  pitr:
    type: date
    date: "2026-01-20T07:40:23Z"
    backupSource:
      storageName: s3-bckp
```

Expected Behavior
- Restore job completes.
- PXC/ProxySQL pods restart and the cluster becomes `ready`.
- Operator recreates the binlog collector deployment.
- `PerconaXtraDBClusterRestore.status` transitions from `Restoring` to `Succeeded`/`Ready`.
- PITR resumes (binlog collector running and uploading).
Actual Behavior
- Restore job completes successfully (e.g., `restore-job-restore-artur-mysql-6mc5v` finished after ~17 minutes).
- Cluster shows `STATUS=ready` with all database and ProxySQL pods running.
- Binlog collector deployment (`mysql-pitr`) is not recreated.
- `PerconaXtraDBClusterRestore` remains in `Restoring` indefinitely.
- Operator logs loop with messages indicating it is waiting for the cluster and cannot find binlog collector pods.
Observations and Evidence
Cluster ready:

```shell
kubectl -n xtradb get perconaxtradbcluster mysql
# NAME    ENDPOINT                STATUS   PXC   PROXYSQL   AGE
# mysql   mysql-proxysql.xtradb   ready    3     3          45d
```

Restore status stuck:

```shell
kubectl -n xtradb get perconaxtradbclusterrestore restore-artur
# NAME            CLUSTER   STATUS      AGE
# restore-artur   mysql     Restoring   20m+
```

Restore job completed:

```shell
kubectl -n xtradb get jobs | grep restore
# restore-job-restore-artur-mysql-6mc5v   1/1   17m   20m
```

Binlog collector missing:

```shell
kubectl -n xtradb get deploy,po | grep -i pitr || echo "no binlog collector"
# no binlog collector
```

Operator logs (repeating):

```
I0120 09:05:51.234567 restore_controller.go:789] Waiting for cluster to start
E0120 09:05:51.345678 restore_controller.go:456] get binlog collector pod: no binlog collector pods
```
Diagnosis / Suspected Root Cause
- The restore controller waits for the binlog collector pod as part of restore completion.
- The cluster controller does not reconcile and recreate the binlog collector while the restore is in `Restoring`.
- This creates a deadlock: the restore waits for the binlog collector, and the cluster controller will not create it until the restore is done.
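The suspected circular wait can be modeled abstractly. This is a hypothetical sketch, not the operator's actual code — all names (`reconcileRestore`, `reconcileCluster`, the `state` fields) are invented for illustration. Each reconcile step gates on a condition that only the other can change, so neither makes progress until the restore object is removed:

```go
package main

import "fmt"

// Hypothetical model of the suspected deadlock; names do not
// correspond to actual operator identifiers.
type state struct {
	restoreInProgress bool // PerconaXtraDBClusterRestore exists, status=Restoring
	collectorRunning  bool // mysql-pitr deployment has a running pod
}

// reconcileRestore: restore completion waits for the binlog collector.
func reconcileRestore(s *state) (progressed bool) {
	if s.restoreInProgress && s.collectorRunning {
		s.restoreInProgress = false // would flip status to Succeeded
		return true
	}
	return false // "get binlog collector pod: no binlog collector pods"
}

// reconcileCluster: collector is only (re)created once no restore is running.
func reconcileCluster(s *state) (progressed bool) {
	if !s.restoreInProgress && !s.collectorRunning {
		s.collectorRunning = true // would create the mysql-pitr deployment
		return true
	}
	return false
}

func main() {
	s := &state{restoreInProgress: true, collectorRunning: false}

	// Reconcile loop: neither controller can make progress.
	for i := 0; i < 3; i++ {
		if !reconcileRestore(s) && !reconcileCluster(s) {
			fmt.Println("deadlock: no controller progressed")
		}
	}

	// Workaround: deleting the restore CR clears restoreInProgress,
	// after which the cluster controller recreates the collector.
	s.restoreInProgress = false
	reconcileCluster(s)
	fmt.Println("collectorRunning:", s.collectorRunning)
}
```

This also matches the observed behavior: the loop never terminates on its own, and deleting the restore object unblocks it immediately.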
Workaround
Delete the stuck restore object:
```shell
kubectl -n xtradb delete perconaxtradbclusterrestore restore-artur
```

Observed result: the operator immediately creates the `mysql-pitr` deployment; the binlog collector pod starts; PITR resumes; the cluster remains healthy.
Impact
- Severity: High (PITR restore completion blocked; binlog uploads halted)
- Reproducibility: Always (seen on repeated attempts)
- Data risk: Low (data restored; status and PITR resumption blocked until manual intervention)
- Operational: Requires manual deletion of restore object to recover PITR
Reproducibility
- Occurs consistently across attempts with the above steps.
What We Tried
- Restarting the operator pod: did not help (same loop).
- Deleting the `PerconaXtraDBClusterRestore` object: unblocks the operator and fixes the issue.
Suggested Fix / Proposal
- Ensure the operator explicitly reconciles and recreates the binlog collector as part of restore completion flow before waiting on it.
- Or, decouple binlog collector reconciliation from restore status so the collector is created once the cluster is `ready`.
- Improve restore/PITR status transitions and log messages to avoid ambiguous "Waiting for cluster to start" loops when the cluster is already ready.
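The second proposal can be sketched as a change to the collector-creation predicate. This is a hypothetical illustration with invented names (`clusterView`, `shouldCreateCollector*`) — it does not reflect actual operator code, only the suggested decision logic: gate collector creation on cluster readiness alone, so a lingering `Restoring` status cannot block it.

```go
package main

import "fmt"

// Hypothetical inputs to the collector-reconcile decision; the real
// operator derives equivalents from the PXC and restore CRs.
type clusterView struct {
	clusterReady    bool   // PerconaXtraDBCluster status is "ready"
	pitrEnabled     bool   // spec.backup.pitr.enabled
	restoreStatus   string // "", "Restoring", "Succeeded", ...
	collectorExists bool   // mysql-pitr deployment present
}

// Suspected current behavior: any in-flight restore blocks creation.
func shouldCreateCollectorCurrent(v clusterView) bool {
	return v.pitrEnabled && !v.collectorExists &&
		v.clusterReady && v.restoreStatus != "Restoring"
}

// Proposed behavior: gate only on cluster readiness, so the restore
// controller's wait on the collector can always be satisfied.
func shouldCreateCollectorProposed(v clusterView) bool {
	return v.pitrEnabled && !v.collectorExists && v.clusterReady
}

func main() {
	// The stuck state from this report: cluster ready, restore still "Restoring".
	stuck := clusterView{clusterReady: true, pitrEnabled: true,
		restoreStatus: "Restoring"}
	fmt.Println("current: ", shouldCreateCollectorCurrent(stuck))
	fmt.Println("proposed:", shouldCreateCollectorProposed(stuck))
}
```

Under the current predicate the stuck state yields `false` (no collector, deadlock); under the proposed one it yields `true`, which would let the restore controller's wait complete.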
pxc definition
```yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '2'
    argocd.argoproj.io/tracking-id: >-
      percona-mysql-tenant-xk1229kx3:pxc.percona.com/PerconaXtraDBCluster:xtradb/mysql
  creationTimestamp: '2026-01-15T08:39:18Z'
  finalizers:
    - percona.com/delete-pxc-pods-in-order
  generation: 1187
  name: mysql
  namespace: xtradb
  resourceVersion: '85036709'
  uid: 942180db-9dfe-4063-b97b-23108fe5b336
spec:
  backup:
    activeDeadlineSeconds: 3600
    allowParallel: true
    backoffLimit: 6
    image: percona/percona-xtrabackup:8.0.35-34.1
    pitr:
      enabled: true
      resources:
        limits:
          cpu: 700m
          memory: 8G
        requests:
          cpu: 100m
          memory: 0.1G
      storageName: s3-bckp
      timeBetweenUploads: 60
      timeoutSeconds: 60
    schedule:
      - name: daily-bckp
        schedule: 0 * * * *
        storageName: s3-bckp
    serviceAccountName: percona-xtradb-cluster-operator
    startingDeadlineSeconds: 300
    storages:
      s3-bckp:
        resources: {}
        s3:
          bucket: REDACTED/backups2
          credentialsSecret: s3-creds
          endpointUrl: https://REDACTED-object-storage.example.com
          region: default
        type: s3
        verifyTLS: false
    suspendedDeadlineSeconds: 1200
  crVersion: 1.18.0
  enableVolumeExpansion: true
  haproxy:
    enabled: false
    exposePrimary: {}
    lifecycle: {}
    livenessProbes: {}
    readinessProbes: {}
    resources: {}
    sidecarResources: {}
  initContainer:
    image: percona/percona-xtradb-cluster-operator:1.18.0
    resources:
      limits:
        cpu: 50m
        memory: 50M
      requests:
        cpu: 50m
        memory: 50M
  logcollector:
    enabled: true
    image: percona/fluentbit:4.0.1
    resources:
      limits:
        cpu: 250m
        memory: 200M
      requests:
        cpu: 200m
        memory: 100M
  pmm:
    enabled: false
    resources: {}
  proxysql:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    enabled: true
    expose: {}
    gracePeriod: 30
    image: percona/proxysql2:2.7.3
    lifecycle: {}
    livenessProbes: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessProbes: {}
    resources:
      limits:
        cpu: 700m
        memory: 1G
      requests:
        cpu: 600m
        memory: 1G
    sidecarResources:
      limits:
        cpu: 100m
        memory: 100M
      requests:
        cpu: 50m
        memory: 50M
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2G
  pxc:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    autoRecovery: true
    configuration: >-
      [mysqld]

      # Fixes MLMD locking: "prohibits use of GET_LOCK with pxc_strict_mode =
      ENFORCING"

      pxc_strict_mode=MASTER
    expose: {}
    gracePeriod: 600
    image: percona/percona-xtradb-cluster:8.0.42-33.1
    lifecycle: {}
    livenessProbes: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessProbes: {}
    resources:
      limits:
        cpu: '1'
        memory: 1G
      requests:
        cpu: 600m
        memory: 1G
    sidecarResources: {}
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 6G
  secretsName: mysql-system-users
  sslInternalSecretName: mysql-ssl-internal
  sslSecretName: mysql-ssl
  tls:
    enabled: true
  unsafeFlags:
    proxySize: true
    pxcSize: true
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 4 * * *
    versionServiceEndpoint: https://check.percona.com
status:
  backup: {}
  conditions:
    - lastTransitionTime: '2026-01-20T07:52:03Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T07:52:53Z'
      status: 'True'
      type: initializing
    - lastTransitionTime: '2026-01-20T07:54:52Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:08:19Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to
        execute command in pod: unable to upgrade connection: container not
        found ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:08:29Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:09:59Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to
        execute command in pod: unable to upgrade connection: container not
        found ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:10:15Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:10:17Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to
        execute command in pod: unable to upgrade connection: container not
        found ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:10:18Z'
      status: 'True'
      type: ready
```

Additional Context
- After the workaround, the binlog collector deployment (`mysql-pitr`) is created and running; PITR metrics and uploads resume as expected.
- No errors were observed in PXC pod logs during or after the restore.
Steps to reproduce
- Deploy a PXC cluster with PITR enabled and S3 storage configured.
- Ensure binlog collector is running and full backups are present.
- Create a `PerconaXtraDBClusterRestore` with a PITR date (as above).
- Observe the restore job, operator logs, and cluster status.
Versions
- Operator: Percona Operator for MySQL based on Percona XtraDB Cluster
- Namespace: `xtradb`
- Cluster name: `mysql`
- Cloud: OpenStack (S3-compatible storage)
- PITR: enabled with `timeBetweenUploads: 60`
- Backup storage: S3
- Kubernetes: 1.32.6
- Operator version: 1.18.0
- PXC version: 8.0.42-33.1
Anything else?
No response