Operator panics with "Malformed version" in reconcileReplication when creating a new PXC cluster with replicationChannels #2401

@davidcharbonnier

Report

When creating a new PerconaXtraDBCluster CR that includes spec.pxc.replicationChannels (with isSource: false), the operator enters an infinite panic-recover loop during reconcileReplication. The panic occurs because CompareMySQLVersion() is called before .status.pxc.version has been populated, and go-version.Must() panics on the empty string. This permanently blocks the cluster from ever reaching ready state, creating a deadlock where the operator can't populate the version because it panics before completing reconciliation.

More about the problem

The operator panics every ~40-80 seconds with:

ERROR  Observed a panic  {
  "controller": "pxc-controller",
  "name": "rsi-heap-platform-wr-0",
  "panic": "Malformed version: ",
  "panicGoValue": "&errors.errorString{s:\"Malformed version: \"}",
  "stacktrace": "...
    github.com/hashicorp/go-version.Must
        /go/pkg/mod/github.com/hashicorp/go-version@v1.7.0/version.go:105
    github.com/percona/percona-xtradb-cluster-operator/pkg/apis/pxc/v1.(*PerconaXtraDBCluster).CompareMySQLVersion
        /go/src/.../pxc_types.go:1401
    github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).reconcileReplication
        /go/src/.../replication.go:236
    github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile
        /go/src/.../controller.go:467
  ..."
}
ERROR  Reconciler error  {"error": "panic: Malformed version:  [recovered]"}

The cluster is stuck in initializing with an empty version, while the PXC pod itself is healthy:

$ kubectl get pxc rsi-heap-platform-wr-0 -o jsonpath='{.status.state}'
initializing
 
$ kubectl get pxc rsi-heap-platform-wr-0 -o jsonpath='{.status.pxc.version}'
# (empty)
 
$ kubectl get pods -l app.kubernetes.io/instance=rsi-heap-platform-wr-0
NAME                              READY   STATUS    RESTARTS   AGE
rsi-heap-platform-wr-0-pxc-0      3/3     Running   0          24m

Expected behavior: The operator should ensure the cluster is ready and reporting a MySQL version before attempting to reconcile replication. reconcileReplication should skip execution when .status.pxc.version is empty or when the cluster has not reached ready state. This is consistent with how scheduled backups are already deferred until the cluster is healthy (K8SPXC-1597).

Root cause: In pxc_types.go:1401, CompareMySQLVersion unconditionally calls v.Must() on cr.Status.PXC.Version:

func (cr *PerconaXtraDBCluster) CompareMySQLVersion(ver string) int {
    return v.Must(v.NewVersion(cr.Status.PXC.Version)).Compare(v.Must(v.NewVersion(ver)))
}

When a cluster is freshly created, .status.pxc.version is empty. go-version.NewVersion("") returns an error, and v.Must() panics. This is called from reconcileReplication at replication.go:236 on the very first reconcile loop — before the operator has populated the status.
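The failure mode can be reproduced in isolation. The sketch below is a standard-library-only stand-in, not the operator's code: parseVersion and must are hypothetical helpers that mimic the behaviour of go-version's NewVersion and Must as described above (error on an empty string, panic on any error), and the parser is deliberately simplified to dotted numeric versions with an optional "-suffix".

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseVersion is a hypothetical, simplified stand-in for go-version's
// NewVersion. It only understands "X.Y.Z" with an optional "-suffix".
func parseVersion(s string) ([]int, error) {
	if s == "" {
		// Same message the operator logs for an empty status version.
		return nil, errors.New("Malformed version: ")
	}
	core, _, _ := strings.Cut(s, "-") // drop a suffix such as "-33.1"
	parts := strings.Split(core, ".")
	out := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("Malformed version: %s", s)
		}
		out = append(out, n)
	}
	return out, nil
}

// must mirrors the Must pattern: it turns any error into a panic.
func must(v []int, err error) []int {
	if err != nil {
		panic(err)
	}
	return v
}

// panics runs f and returns the recovered panic message, or "" if f
// returned normally. It plays the role of the controller's panic recovery.
func panics(f func()) (msg string) {
	defer func() {
		if r := recover(); r != nil {
			msg = fmt.Sprint(r)
		}
	}()
	f()
	return ""
}

func main() {
	// Empty status version, as on the first reconcile of a new cluster.
	fmt.Printf("empty:     %q\n", panics(func() { must(parseVersion("")) }))
	// Populated status version, as on an already-initialized cluster.
	fmt.Printf("populated: %q\n", panics(func() { must(parseVersion("8.0.42-33.1")) }))
}
```

Only the empty input reaches the panic path, which matches the observation that clusters with a populated status are unaffected.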

Suggested fix: Add an early return at the top of reconcileReplication when the cluster is not yet ready:

if cr.Status.PXC.Version == "" || cr.Status.State != api.AppStateReady {
    log.Info("Skipping replication reconciliation, cluster not ready yet",
        "cluster", cr.Name, "state", cr.Status.State)
    return nil
}

Additionally, CompareMySQLVersion itself should be hardened to not panic on empty input.
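One possible shape for that hardening is to return an error instead of panicking, so callers can skip work until the version is known. The following is a minimal sketch under stated assumptions, not a proposed patch: compareMySQLVersion, parse, and errVersionUnknown are hypothetical names, the function takes the status version as a plain string rather than being a method on the CR, and the version parsing is a simplified standard-library stand-in for go-version.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// errVersionUnknown (hypothetical) signals that the status version has not
// been populated yet, so callers can defer work instead of failing.
var errVersionUnknown = errors.New("MySQL version not reported yet")

// parse is a simplified stand-in for a real version parser: it handles
// "X.Y.Z" with an optional "-suffix" and rejects anything else.
func parse(s string) ([]int, error) {
	core, _, _ := strings.Cut(s, "-")
	parts := strings.Split(core, ".")
	out := make([]int, 0, len(parts))
	for _, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("malformed version %q", s)
		}
		out = append(out, n)
	}
	return out, nil
}

// compareMySQLVersion returns -1, 0, or 1 like Compare, but reports an
// empty or unparsable version as an error rather than panicking.
func compareMySQLVersion(current, target string) (int, error) {
	if current == "" {
		return 0, errVersionUnknown
	}
	a, err := parse(current)
	if err != nil {
		return 0, err
	}
	b, err := parse(target)
	if err != nil {
		return 0, err
	}
	// Compare segment by segment, treating missing segments as zero.
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	// Fresh cluster: the caller can log and return instead of crashing.
	if _, err := compareMySQLVersion("", "8.0.23"); err != nil {
		fmt.Println("skipping replication reconcile:", err)
	}
	// Initialized cluster: the comparison proceeds as before.
	cmp, err := compareMySQLVersion("8.0.42-33.1", "8.0.23")
	fmt.Println(cmp, err)
}
```

Changing the signature to return an error would require touching every caller, so a sentinel error that callers can test for keeps the "not ready yet" case distinguishable from a genuinely malformed version.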

Steps to reproduce

  1. Deploy the PXC operator v1.18.0 (namespace-scoped).
  2. Create a new PerconaXtraDBCluster CR with replicationChannels defined from the start:
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: my-reader
spec:
  crVersion: 1.18.0
  secretsName: my-cluster-secrets
  unsafeFlags:
    pxcSize: true
    tls: true
  tls:
    enabled: false
  pxc:
    size: 1
    image: percona/percona-xtradb-cluster:8.0.42-33.1
    replicationChannels:
      - name: platform
        isSource: false
        configuration:
          sourceConnectRetry: 5
          sourceRetryCount: 0
        sourcesList:
          - host: my-source-pxc-0.my-source-pxc-unready
            port: 3306
            weight: 100
    resources:
      limits:
        memory: 8Gi
      requests:
        cpu: 1000m
        memory: 8Gi
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 10Gi
  3. Observe the operator logs: the operator enters an infinite panic loop with "Malformed version: " and the cluster never reaches ready.

Versions

  1. Kubernetes: GKE 1.31
  2. Operator: 1.18.0 (likely also affects 1.19.0 — the code path appears unchanged on main)
  3. Database: Percona XtraDB Cluster 8.0.42-33.1

Anything else?

  • This bug does not affect clusters that already have .status.pxc.version populated (e.g., clusters where replication channels are added after initial setup). Only fresh creation with replicationChannels in the spec triggers it.
  • Workaround: Temporarily remove replicationChannels from the CR, wait for the cluster to reach ready, then re-add them. This is not viable in GitOps workflows where the full desired state must be applied atomically.
  • This is a blocker for GitOps deployments (ArgoCD, Flux) where the CR is declared once in Git with the complete spec including replication channels.
