PITR restore stuck: binlog collector not recreated after restore job completion causing operator deadlock #2348

@xartii

Description

Report

Summary

When performing a Point-In-Time Recovery (PITR) on a Percona XtraDB Cluster managed by the Operator, the restore process remains stuck in the "Restoring" status indefinitely, even though the restore job completes successfully and the cluster reports as ready. The operator appears to wait for a binlog collector pod that is not recreated during/after the restore, leading to a deadlock. Deleting the PerconaXtraDBClusterRestore object immediately unblocks the operator and triggers binlog collector creation.

More about the problem

Configuration (restore CR)

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: restore-artur
  namespace: xtradb
spec:
  pxcCluster: mysql
  backupName: cron-mysql-s3-bckp-2026120703-6oioo
  pitr:
    type: date
    date: "2026-01-20T07:40:23Z"
    backupSource:
      storageName: s3-bckp
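Assuming the manifest above is saved locally (restore-artur.yaml is a hypothetical filename), starting the restore and reading back its phase can be sketched as below. KUBECTL is parameterized so the sketch can be exercised without a live cluster, and the .status.state JSONPath is an assumption about the restore CR's status layout:

```shell
# Sketch: apply the restore CR above and read back its phase.
# Assumptions: the manifest is saved as restore-artur.yaml, and the restore
# object reports its phase in .status.state (e.g. "Restoring").
KUBECTL="${KUBECTL:-kubectl}"   # substitute a mock for offline testing

start_restore() {
  "$KUBECTL" -n xtradb apply -f restore-artur.yaml
}

restore_phase() {
  "$KUBECTL" -n xtradb get perconaxtradbclusterrestore restore-artur \
    -o jsonpath='{.status.state}'
}
```

In the failure described below, restore_phase keeps returning Restoring long after the restore job itself has completed.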

Expected Behavior

  • Restore job completes.
  • PXC/ProxySQL pods restart and cluster becomes ready.
  • Operator recreates the binlog collector deployment.
  • PerconaXtraDBClusterRestore.status transitions from Restoring to Succeeded/Ready.
  • PITR resumes (binlog collector running and uploading).

Actual Behavior

  • Restore job completes successfully (e.g., restore-job-restore-artur-mysql-6mc5v finished after ~17 minutes).
  • Cluster shows STATUS=ready with all database and ProxySQL pods running.
  • Binlog collector deployment (mysql-pitr) is not recreated.
  • PerconaXtraDBClusterRestore remains in Restoring indefinitely.
  • Operator logs loop with messages indicating it is waiting for the cluster and cannot find binlog collector pods.

Observations and Evidence

Cluster ready:

kubectl -n xtradb get perconaxtradbcluster mysql
# NAME   ENDPOINT                    STATUS   PXC   PROXYSQL   AGE
# mysql  mysql-proxysql.xtradb       ready    3     3          45d

Restore status stuck:

kubectl -n xtradb get perconaxtradbclusterrestore restore-artur
# NAME            CLUSTER   STATUS       AGE
# restore-artur   mysql     Restoring    20m+

Restore job completed:

kubectl -n xtradb get jobs | grep restore
# restore-job-restore-artur-mysql-6mc5v   1/1   17m   20m

Binlog collector missing:

kubectl -n xtradb get deploy,po | grep -i pitr || echo "no binlog collector"
# no binlog collector

Operator logs (repeating):

I0120 09:05:51.234567 restore_controller.go:789] Waiting for cluster to start
E0120 09:05:51.345678 restore_controller.go:456] get binlog collector pod: no binlog collector pods

Diagnosis / Suspected Root Cause

  • Restore controller waits for the binlog collector pod as part of restore completion.
  • Cluster controller does not reconcile and recreate the binlog collector while the restore is in Restoring.
  • This creates a deadlock: restore waits for binlog collector; cluster won’t create it until restore is done.
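The deadlock condition above can be expressed as a small check (a sketch; the phase string and the idea of counting mysql-pitr deployments are taken from the outputs in this report):

```shell
# Returns "deadlocked" when a restore sits in Restoring while the binlog
# collector deployment is absent -- exactly the combination described above.
is_deadlocked() {
  restore_phase="$1"      # e.g. "Restoring" or "Succeeded"
  pitr_deployments="$2"   # number of <cluster>-pitr deployments found
  if [ "$restore_phase" = "Restoring" ] && [ "$pitr_deployments" -eq 0 ]; then
    echo deadlocked
  else
    echo ok
  fi
}
```

With the evidence below, is_deadlocked Restoring 0 describes the observed state: the restore never leaves Restoring and no collector deployment exists for it to wait on.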

Workaround

Delete the stuck restore object:

kubectl -n xtradb delete perconaxtradbclusterrestore restore-artur

Observed result: operator immediately creates mysql-pitr deployment; binlog collector pod starts; PITR resumes; cluster remains healthy.
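The workaround can be wrapped in a small script (a sketch; the `<cluster>-pitr` deployment naming is taken from the mysql-pitr name observed above, and KUBECTL is parameterized so the logic can be exercised against a mock):

```shell
# Sketch of the workaround: delete the stuck restore object, then wait for the
# operator to recreate the binlog collector deployment.
KUBECTL="${KUBECTL:-kubectl}"   # substitute a mock for offline testing

unstick_pitr() {
  ns="$1" restore="$2" cluster="$3"
  "$KUBECTL" -n "$ns" delete perconaxtradbclusterrestore "$restore" || return 1
  # The collector deployment is named <cluster>-pitr (mysql-pitr here).
  "$KUBECTL" -n "$ns" rollout status "deploy/${cluster}-pitr" --timeout=300s
}
```

Usage for this report's names would be `unstick_pitr xtradb restore-artur mysql`.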

Impact

  • Severity: High (PITR restore completion blocked; binlog uploads halted)
  • Reproducibility: Always (seen on repeated attempts)
  • Data risk: Low (data restored; status and PITR resumption blocked until manual intervention)
  • Operational: Requires manual deletion of restore object to recover PITR

Reproducibility

  • Occurs consistently across attempts with the above steps.

What We Tried

  • Restarting operator pod: did not help (same loop).
  • Deleting the PerconaXtraDBClusterRestore object: unblocks and fixes.

Suggested Fix / Proposal

  • Ensure the operator explicitly reconciles and recreates the binlog collector as part of restore completion flow before waiting on it.
  • Or, decouple the binlog collector reconciliation from restore status so it is created once the cluster is ready.
  • Improve restore/PITR status transitions and log messages to avoid ambiguous “Waiting for cluster to start” loops when the cluster is already ready.

pxc definition

apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '2'
    argocd.argoproj.io/tracking-id: >-
      percona-mysql-tenant-xk1229kx3:pxc.percona.com/PerconaXtraDBCluster:xtradb/mysql
  creationTimestamp: '2026-01-15T08:39:18Z'
  finalizers:
    - percona.com/delete-pxc-pods-in-order
  generation: 1187
  name: mysql
  namespace: xtradb
  resourceVersion: '85036709'
  uid: 942180db-9dfe-4063-b97b-23108fe5b336
spec:
  backup:
    activeDeadlineSeconds: 3600
    allowParallel: true
    backoffLimit: 6
    image: percona/percona-xtrabackup:8.0.35-34.1
    pitr:
      enabled: true
      resources:
        limits:
          cpu: 700m
          memory: 8G
        requests:
          cpu: 100m
          memory: 0.1G
      storageName: s3-bckp
      timeBetweenUploads: 60
      timeoutSeconds: 60
    schedule:
      - name: daily-bckp
        schedule: 0 * * * *
        storageName: s3-bckp
    serviceAccountName: percona-xtradb-cluster-operator
    startingDeadlineSeconds: 300
    storages:
      s3-bckp:
        resources: {}
        s3:
          bucket: REDACTED/backups2
          credentialsSecret: s3-creds
          endpointUrl: https://REDACTED-object-storage.example.com
          region: default
        type: s3
        verifyTLS: false
    suspendedDeadlineSeconds: 1200
  crVersion: 1.18.0
  enableVolumeExpansion: true
  haproxy:
    enabled: false
    exposePrimary: {}
    lifecycle: {}
    livenessProbes: {}
    readinessProbes: {}
    resources: {}
    sidecarResources: {}
  initContainer:
    image: percona/percona-xtradb-cluster-operator:1.18.0
    resources:
      limits:
        cpu: 50m
        memory: 50M
      requests:
        cpu: 50m
        memory: 50M
  logcollector:
    enabled: true
    image: percona/fluentbit:4.0.1
    resources:
      limits:
        cpu: 250m
        memory: 200M
      requests:
        cpu: 200m
        memory: 100M
  pmm:
    enabled: false
    resources: {}
  proxysql:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    enabled: true
    expose: {}
    gracePeriod: 30
    image: percona/proxysql2:2.7.3
    lifecycle: {}
    livenessProbes: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessProbes: {}
    resources:
      limits:
        cpu: 700m
        memory: 1G
      requests:
        cpu: 600m
        memory: 1G
    sidecarResources:
      limits:
        cpu: 100m
        memory: 100M
      requests:
        cpu: 50m
        memory: 50M
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 2G
  pxc:
    affinity:
      antiAffinityTopologyKey: kubernetes.io/hostname
    autoRecovery: true
    configuration: >-
      [mysqld]

      # Fixes MLMD locking: "prohibits use of GET_LOCK with pxc_strict_mode = ENFORCING"

      pxc_strict_mode=MASTER
    expose: {}
    gracePeriod: 600
    image: percona/percona-xtradb-cluster:8.0.42-33.1
    lifecycle: {}
    livenessProbes: {}
    podDisruptionBudget:
      maxUnavailable: 1
    readinessProbes: {}
    resources:
      limits:
        cpu: '1'
        memory: 1G
      requests:
        cpu: 600m
        memory: 1G
    sidecarResources: {}
    size: 3
    volumeSpec:
      persistentVolumeClaim:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 6G
  secretsName: mysql-system-users
  sslInternalSecretName: mysql-ssl-internal
  sslSecretName: mysql-ssl
  tls:
    enabled: true
  unsafeFlags:
    proxySize: true
    pxcSize: true
  updateStrategy: SmartUpdate
  upgradeOptions:
    apply: disabled
    schedule: 0 4 * * *
    versionServiceEndpoint: https://check.percona.com
status:
  backup: {}
  conditions:
    - lastTransitionTime: '2026-01-20T07:52:03Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T07:52:53Z'
      status: 'True'
      type: initializing
    - lastTransitionTime: '2026-01-20T07:54:52Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:08:19Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to execute
        command in pod: unable to upgrade connection: container not found
        ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:08:29Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:09:59Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to execute
        command in pod: unable to upgrade connection: container not found
        ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:10:15Z'
      status: 'True'
      type: ready
    - lastTransitionTime: '2026-01-20T08:10:17Z'
      message: >-
        exec binlog collector pod mysql-pitr-769b8868fc-gfmnd: failed to execute
        command in pod: unable to upgrade connection: container not found
        ("pitr")
      reason: ErrorReconcile
      status: 'True'
      type: error
    - lastTransitionTime: '2026-01-20T08:10:18Z'
      status: 'True'
      type: ready

Additional Context

  • After the workaround, binlog collector deployment (mysql-pitr) is created and running; PITR metrics and uploads resume as expected.
  • No errors were observed in PXC pod logs during or after restore.

Steps to reproduce

  1. Deploy a PXC cluster with PITR enabled and S3 storage configured.
  2. Ensure binlog collector is running and full backups are present.
  3. Create a PerconaXtraDBClusterRestore with PITR date (as above).
  4. Observe restore job, operator logs, and cluster status.

Versions

  • Operator: Percona Operator for MySQL based on Percona XtraDB Cluster
  • Namespace: xtradb
  • Cluster name: mysql
  • Cloud: OpenStack (S3-compatible storage)
  • PITR: enabled with timeBetweenUploads: 60
  • Backup storage: S3
  • Kubernetes: 1.32.6
  • Operator version: 1.18.0
  • PXC version: 8.0.42-33.1

Anything else?

No response
