Report
When creating a new PerconaXtraDBCluster CR that includes spec.pxc.replicationChannels (with isSource: false), the operator enters an infinite panic-recover loop during reconcileReplication. The panic occurs because CompareMySQLVersion() is called before .status.pxc.version has been populated, and go-version.Must() panics on the empty string. This permanently blocks the cluster from ever reaching ready state, creating a deadlock where the operator can't populate the version because it panics before completing reconciliation.
More about the problem
The operator panics every ~40-80 seconds with:
ERROR Observed a panic {
  "controller": "pxc-controller",
  "name": "rsi-heap-platform-wr-0",
  "panic": "Malformed version: ",
  "panicGoValue": "&errors.errorString{s:\"Malformed version: \"}",
  "stacktrace": "...
    github.com/hashicorp/go-version.Must
      /go/pkg/mod/github.com/hashicorp/go-version@v1.7.0/version.go:105
    github.com/percona/percona-xtradb-cluster-operator/pkg/apis/pxc/v1.(*PerconaXtraDBCluster).CompareMySQLVersion
      /go/src/.../pxc_types.go:1401
    github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).reconcileReplication
      /go/src/.../replication.go:236
    github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile
      /go/src/.../controller.go:467
    ..."
}
ERROR Reconciler error {"error": "panic: Malformed version: [recovered]"}
The cluster is stuck in initializing with an empty version, while the PXC pod itself is healthy:
$ kubectl get pxc rsi-heap-platform-wr-0 -o jsonpath='{.status.state}'
initializing
$ kubectl get pxc rsi-heap-platform-wr-0 -o jsonpath='{.status.pxc.version}'
# (empty)
$ kubectl get pods -l app.kubernetes.io/instance=rsi-heap-platform-wr-0
NAME                           READY   STATUS    RESTARTS   AGE
rsi-heap-platform-wr-0-pxc-0   3/3     Running   0          24m
Expected behavior: The operator should ensure the cluster is ready and reporting a MySQL version before attempting to reconcile replication. reconcileReplication should skip execution when .status.pxc.version is empty or when the cluster has not reached ready state. This is consistent with how scheduled backups are already deferred until the cluster is healthy (K8SPXC-1597).
Root cause: In pxc_types.go:1401, CompareMySQLVersion unconditionally calls v.Must() on cr.Status.PXC.Version:
func (cr *PerconaXtraDBCluster) CompareMySQLVersion(ver string) int {
	return v.Must(v.NewVersion(cr.Status.PXC.Version)).Compare(v.Must(v.NewVersion(ver)))
}
When a cluster is freshly created, .status.pxc.version is empty. go-version.NewVersion("") returns an error, and v.Must() panics. CompareMySQLVersion is called from reconcileReplication at replication.go:236 on the very first reconcile loop, before the operator has populated the status.
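To illustrate the mechanism in isolation, here is a minimal sketch using only the standard library. The names newVersion, must, and panicsOn are hypothetical stand-ins written for this example; they only mimic the "return an error on empty input, then panic in Must" behavior described above and are not the go-version implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// newVersion is a stand-in (assumption) for go-version's NewVersion:
// it rejects empty input with an error, as described in the report.
func newVersion(s string) (string, error) {
	if s == "" {
		return "", errors.New("Malformed version: ")
	}
	return s, nil
}

// must mirrors the Must pattern: any error becomes a panic.
func must(v string, err error) string {
	if err != nil {
		panic(err)
	}
	return v
}

// panicsOn reports whether must(newVersion(s)) panics for the given input.
func panicsOn(s string) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	must(newVersion(s))
	return false
}

func main() {
	fmt.Println(panicsOn(""))            // empty status version: panics
	fmt.Println(panicsOn("8.0.42-33.1")) // populated status version: no panic
}
```

Because the controller recovers the panic and requeues, the same empty value is read again on the next attempt, which matches the repeating panic in the logs.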
Suggested fix: Add an early return at the top of reconcileReplication when the cluster is not yet ready:
if cr.Status.PXC.Version == "" || cr.Status.State != api.AppStateReady {
	log.Info("Skipping replication reconciliation, cluster not ready yet",
		"cluster", cr.Name, "state", cr.Status.State)
	return nil
}
Additionally, CompareMySQLVersion itself should be hardened to not panic on empty input.
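One possible shape for such hardening is a comparison that returns an error instead of panicking, so callers can skip or requeue. The sketch below is standard-library only and makes several assumptions: compareMySQLVersionSafe, parseVersion, and ErrVersionUnknown are hypothetical names, the parser is a simplified stand-in for go-version (it compares dotted numeric parts and ignores anything after the first "-"), and "8.0.23" is just an illustrative threshold.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// ErrVersionUnknown (hypothetical) signals that no version has been reported yet.
var ErrVersionUnknown = errors.New("mysql version not yet reported")

// parseVersion is a simplified stand-in for a real version parser.
// It keeps only the dotted numeric core, e.g. "8.0.42-33.1" -> [8 0 42].
func parseVersion(s string) ([]int, error) {
	if s == "" {
		return nil, ErrVersionUnknown
	}
	core, _, _ := strings.Cut(s, "-")
	parts := strings.Split(core, ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, fmt.Errorf("malformed version %q: %w", s, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// compareMySQLVersionSafe returns -1, 0, or 1 like a Compare method,
// but reports bad or empty input as an error instead of panicking.
func compareMySQLVersionSafe(current, want string) (int, error) {
	a, err := parseVersion(current)
	if err != nil {
		return 0, err
	}
	b, err := parseVersion(want)
	if err != nil {
		return 0, err
	}
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int // missing components are treated as 0
		if i < len(a) {
			x = a[i]
		}
		if i < len(b) {
			y = b[i]
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	_, err := compareMySQLVersionSafe("", "8.0.23")
	fmt.Println(errors.Is(err, ErrVersionUnknown))

	cmp, err := compareMySQLVersionSafe("8.0.42-33.1", "8.0.23")
	fmt.Println(cmp, err)
}
```

Changing the return signature would require touching every caller, so a smaller alternative is to keep the int return and have callers check for an empty status version first; which trade-off fits the operator codebase is for the maintainers to decide.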
Steps to reproduce
- Deploy the PXC operator v1.18.0 (namespace-scoped).
- Create a new PerconaXtraDBCluster CR with replicationChannels defined from the start:
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBCluster
metadata:
  name: my-reader
spec:
  crVersion: 1.18.0
  secretsName: my-cluster-secrets
  unsafeFlags:
    pxcSize: true
    tls: true
  tls:
    enabled: false
  pxc:
    size: 1
    image: percona/percona-xtradb-cluster:8.0.42-33.1
    replicationChannels:
    - name: platform
      isSource: false
      configuration:
        sourceConnectRetry: 5
        sourceRetryCount: 0
      sourcesList:
      - host: my-source-pxc-0.my-source-pxc-unready
        port: 3306
        weight: 100
    resources:
      limits:
        memory: 8Gi
      requests:
        cpu: 1000m
        memory: 8Gi
    volumeSpec:
      persistentVolumeClaim:
        resources:
          requests:
            storage: 10Gi
- Observe the operator logs: the operator enters an infinite panic loop with "Malformed version: " and the cluster never reaches ready.
Versions
- Kubernetes: GKE 1.31
- Operator: 1.18.0 (likely also affects 1.19.0 — the code path appears unchanged on main)
- Database: Percona XtraDB Cluster 8.0.42-33.1
Anything else?
- This bug does not affect clusters that already have .status.pxc.version populated (e.g., clusters where replication channels are added after initial setup). Only fresh creation with replicationChannels in the spec triggers it.
- Workaround: Temporarily remove replicationChannels from the CR, wait for the cluster to reach ready, then re-add them. This is not viable in GitOps workflows where the full desired state must be applied atomically.
- This is a blocker for GitOps deployments (ArgoCD, Flux) where the CR is declared once in Git with the complete spec including replication channels.