PITR restore stuck: binlog collector not recreated after restore job completion causing operator deadlock #2348

@xartii

Description

Report

Summary

When performing a Point-In-Time Recovery (PITR) on a Percona XtraDB Cluster managed by the Operator, the restore process remains stuck in the "Restoring" status indefinitely, even though the restore job completes successfully and the cluster reports as ready. The operator appears to wait for a binlog collector pod that is not recreated during/after the restore, leading to a deadlock. Deleting the PerconaXtraDBClusterRestore object immediately unblocks the operator and triggers binlog collector creation.

More about the problem

Configuration (restore CR)

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: restore-artur
  namespace: xtradb
spec:
  pxcCluster: mysql
  backupName: cron-mysql-s3-bckp-2026120703-6oioo
  pitr:
    type: date
    date: "2026-01-20T07:40:23Z"
    backupSource:
      storageName: s3-bckp
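Assuming the manifest above is saved locally (restore-artur.yaml is a hypothetical filename), starting the restore and reading back its phase can be sketched as below. KUBECTL is parameterized so the sketch can be exercised without a live cluster, and the .status.state JSONPath is an assumption about the restore CR's status layout:

```shell
# Sketch: apply the restore CR above and read back its phase.
# Assumptions: the manifest is saved as restore-artur.yaml, and the restore
# object reports its phase in .status.state (e.g. "Restoring").
KUBECTL="${KUBECTL:-kubectl}"   # substitute a mock for offline testing

start_restore() {
  "$KUBECTL" -n xtradb apply -f restore-artur.yaml
}

restore_phase() {
  "$KUBECTL" -n xtradb get perconaxtradbclusterrestore restore-artur \
    -o jsonpath='{.status.state}'
}
```

In the failure described below, restore_phase keeps returning Restoring long after the restore job itself has completed.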

Expected Behavior

  • Restore job completes.
  • PXC/ProxySQL pods restart and cluster becomes ready.
  • Operator recreates the binlog collector deployment.
  • PerconaXtraDBClusterRestore.status transitions from Restoring to Succeeded/Ready.
  • PITR resumes (binlog collector running and uploading).

Actual Behavior

  • Restore job completes successfully (e.g., restore-job-restore-artur-mysql-6mc5v finished after ~17 minutes).
  • Cluster shows STATUS=ready with all database and ProxySQL pods running.
  • Binlog collector deployment (mysql-pitr) is not recreated.
  • PerconaXtraDBClusterRestore remains in Restoring indefinitely.
  • Operator logs loop with messages indicating it is waiting for the cluster and cannot find binlog collector pods.

Observations and Evidence

Cluster ready:

kubectl -n xtradb get perconaxtradbcluster mysql
# NAME   ENDPOINT                    STATUS   PXC   PROXYSQL   AGE
# mysql  mysql-proxysql.xtradb       ready    3     3          45d

Restore status stuck:

kubectl -n xtradb get perconaxtradbclusterrestore restore-artur
# NAME            CLUSTER   STATUS       AGE
# restore-artur   mysql     Restoring    20m+

Restore job completed:

kubectl -n xtradb get jobs | grep restore
# restore-job-restore-artur-mysql-6mc5v   1/1   17m   20m

Binlog collector missing:

kubectl -n xtradb get deploy,po | grep -i pitr || echo "no binlog collector"
# no binlog collector

Operator logs (repeating):

I0120 09:05:51.234567 restore_controller.go:789] Waiting for cluster to start
E0120 09:05:51.345678 restore_controller.go:456] get binlog collector pod: no binlog collector pods

Diagnosis / Suspected Root Cause

  • Restore controller waits for the binlog collector pod as part of restore completion.
  • Cluster controller does not reconcile and recreate the binlog collector while the restore is in Restoring.
  • This creates a deadlock: restore waits for binlog collector; cluster won’t create it until restore is done.
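The deadlock condition above can be expressed as a small check (a sketch; the phase string and the idea of counting mysql-pitr deployments are taken from the outputs in this report):

```shell
# Returns "deadlocked" when a restore sits in Restoring while the binlog
# collector deployment is absent -- exactly the combination described above.
is_deadlocked() {
  restore_phase="$1"      # e.g. "Restoring" or "Succeeded"
  pitr_deployments="$2"   # number of <cluster>-pitr deployments found
  if [ "$restore_phase" = "Restoring" ] && [ "$pitr_deployments" -eq 0 ]; then
    echo deadlocked
  else
    echo ok
  fi
}
```

With the evidence below, is_deadlocked Restoring 0 describes the observed state: the restore never leaves Restoring and no collector deployment exists for it to wait on.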

Workaround

Delete the stuck restore object:

kubectl -n xtradb delete perconaxtradbclusterrestore restore-artur

Observed result: operator immediately creates mysql-pitr deployment; binlog collector pod starts; PITR resumes; cluster remains healthy.
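The workaround can be wrapped in a small script (a sketch; the `<cluster>-pitr` deployment naming is taken from the mysql-pitr name observed above, and KUBECTL is parameterized so the logic can be exercised against a mock):

```shell
# Sketch of the workaround: delete the stuck restore object, then wait for the
# operator to recreate the binlog collector deployment.
KUBECTL="${KUBECTL:-kubectl}"   # substitute a mock for offline testing

unstick_pitr() {
  ns="$1" restore="$2" cluster="$3"
  "$KUBECTL" -n "$ns" delete perconaxtradbclusterrestore "$restore" || return 1
  # The collector deployment is named <cluster>-pitr (mysql-pitr here).
  "$KUBECTL" -n "$ns" rollout status "deploy/${cluster}-pitr" --timeout=300s
}
```

Usage for this report's names would be `unstick_pitr xtradb restore-artur mysql`.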

Impact

  • Severity: High (PITR restore completion blocked; binlog uploads halted)
  • Reproducibility: Always (seen on repeated attempts)
  • Data risk: Low (data restored; status and PITR resumption blocked until manual intervention)
  • Operational: Requires manual deletion of restore object to recover PITR

Reproducibility

  • Occurs consistently across attempts with the above steps.

What We Tried

  • Restarting operator pod: did not help (same loop).
  • Deleting the PerconaXtraDBClusterRestore object: unblocks and fixes.

Suggested Fix / Proposal

  • Ensure the operator explicitly reconciles and recreates the binlog collector as part of restore completion flow before waiting on it.
  • Or, decouple the binlog collector reconciliation from restore status so it is created once the cluster is ready.
  • Improve restore/PITR status transitions and log messages to avoid ambiguous “Waiting for cluster to start” loops when the cluster is already ready.

pxc definition

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '2'
    argocd.argoproj.io/tracking-id: >-
      percona-mysql-tenant-xk1229kx3:pxc.percona.com/PerconaXtraDBCluster:xtradb/mysql
  creationTimestamp: '2026-01-15T08:39:18Z'
  finalizers:
    - percona.com/delete-pxc-pods-in-order
  generation: 1187
  name: mysql
  namespace: xtradb
  resourceVersion: '85036709'
  uid: 942180db-9dfe-4063-b97b-23108fe5b336
spec:
  backup:
    activeDeadlineSeconds: 3600
    allowParallel: true
    backoffLimit: 6
    image: percona/percona-xtrabackup:8.0.35-34.1
    pitr:
      enabled: true
      resources:
        limits:
          cpu: 700m
          memory: 8G
        requests:
          cpu: 100m
          memory: 0.1G
      storageName: s3-bckp
      timeBetweenUploads: 60
      timeoutSeconds: 60
    schedule:
      - name: daily-bckp
        schedule: 0 * * * *
        storageName: s3-bckp
    serviceAccountName: percona-xtradb-cluster-operator
    startingDeadlineSeconds: 300
    storages:
      s3-bckp:
        resources: {}
        s3:
          bucket: REDACTED/backups2
          credentialsSecret: s3-creds
          endpointUrl: https://REDACTED-object-storage.example.com
          region: default
        type: s3
        verifyTLS: false
    suspendedDeadlineSeconds: 1200
  crVersion: 1.18.0
  enableVolumeExpansion: true
  haproxy:
    enabled: false
    exposePrimary: {}
    lifecycle: {}
    livenessProbes: {}
    readinessProbes: {}
    resources: {}
    sidecarResources: {}
  initContainer:
    image: percona/percona-xtradb-cluster-operator:1.18.0
    resources:
      limits:
        cpu: 50m
        memory: 50M
      requests:
        cpu: 50m
        memory: 50M
  logcollector:
    enabled: true
    image: percona/fluentbit:4.0.1
    resources:
      limits:
        cpu: 250m
        memory: 200M
      requests:
        cpu: 200m
        memory: 100M
  pmm:
    enabled: false
    resources: {}
  proxysql:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    enabled: true
    expose: {}
    gracePeriod: 30
    image: percona/proxysql2:2.7.3
    lifecycle: {}
    livenessProbes: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessProbes: {}
    resources:
      limits:
        cpu: 700m
        memory: 1G
      requests:
        cpu: 600m
        memory: 1G
    sidecarResources:
      limits:
        cpu: 100m
        memory: 100M
      requests:
        cpu: 50m
        memory: 50M
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2G
  pxc:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    autoRecovery: true
    configuration: >-
      [mysqld]

      # Fixes MLMD locking: "prohibits use of GET_LOCK with pxc_strict_mode = ENFORCING"

      pxc_strict_mode=MASTER
    expose: {}
    gracePeriod: 600
    image: percona/percona-xtradb-cluster:8.0.42-33.1
    lifecycle: {}
    livenessProbes: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessProbes: {}
    resources:
      limits:
        cpu: '1'
        memory: 1G
      requests:
        cpu: 600m
        memory: 1G
    sidecarResources: {}
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 6G
  secretsName: mysql-system-users
  sslInternalSecretName: mysql-ssl-internal
  sslSecretName: mysql-ssl
  tls:
    enabled: true
  unsafeFlags:
    proxySize: true
    pxcSize: true
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 4 * * *
    versionServiceEndpoint: https://check.percona.com
status:
  backup: {}
  conditions:
    - lastTransitionTime: '2026-01-20T07:52:03Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T07:52:53Z'
      status: 'True'
      type: initializing
    - lastTransitionTime: '2026-01-20T07:54:52Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:08:19Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to execute
        command in pod: unable to upgrade connection: container not found
        ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:08:29Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:09:59Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to execute
        command in pod: unable to upgrade connection: container not found
        ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:10:15Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:10:17Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to execute
        command in pod: unable to upgrade connection: container not found
        ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:10:18Z'
      status: 'True'
      type: ready

Additional Context

  • After the workaround, binlog collector deployment (mysql-pitr) is created and running; PITR metrics and uploads resume as expected.
  • No errors were observed in PXC pod logs during or after restore.

Steps to reproduce

  1. Deploy a PXC cluster with PITR enabled and S3 storage configured.
  2. Ensure binlog collector is running and full backups are present.
  3. Create a PerconaXtraDBClusterRestore with PITR date (as above).
  4. Observe restore job, operator logs, and cluster status.

Versions

  • Operator: Percona Operator for MySQL based on Percona XtraDB Cluster
  • Namespace: xtradb
  • Cluster name: mysql
  • Cloud: OpenStack (S3-compatible storage)
  • PITR: enabled with timeBetweenUploads: 60
  • Backup storage: S3
  • Kubernetes: 1.32.6
  • Operator version: 1.18.0
  • PXC version: 8.0.42-33.1

Anything else?

No response
