Async replication: primary retains stale replica config after node restart, HAProxy rejects it #1223
Description
After a Kubernetes node restart, the MySQL primary pod retains stale replication configuration from its PVC. MySQL auto-starts the SQL applier thread, which causes the HAProxy haproxy_check_primary.sh health check to reject the primary as NOT OK, resulting in <NOSRV> for the mysql-primary backend.
Environment
- Operator version: 1.0.0
- MySQL version: Percona Server 8.4.7-7.1
- Cluster type: async (2 replicas)
- Kubernetes: K3s v1.35.1
- HAProxy: percona/haproxy:2.8.18
Steps to Reproduce
- Deploy a 2-node async replication cluster with HAProxy
- Restart the Kubernetes node hosting the primary MySQL pod
- After pod recreation, the primary starts with stale replica configuration
Observed Behavior
After node restart, the primary (mysql-0) has a stale SHOW REPLICA STATUS pointing to the old replica:
```
Source_Host: main-mysql-mysql-1.main-mysql-mysql.databases
Replica_IO_Running: No
Replica_SQL_Running: Yes
Seconds_Behind_Source: NULL
```
The HAProxy haproxy_check_primary.sh script (check_async function) requires both REP_IO_STATUS != 'ON' and REP_SQL_STATUS != 'ON' for a healthy primary:

```shell
if [[ ${SUPER_RO} == '0' ]] && [[ ${READ_ONLY} == '0' ]] && [[ ${REP_IO_STATUS} != 'ON' ]] && [[ ${REP_SQL_STATUS} != 'ON' ]]; then
    # OK
```

Since Replica_SQL_Running: Yes (stale thread), the check fails → <NOSRV> → all MySQL clients get "Connection lost: The server closed the connection."
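The gating logic of that check can be paraphrased as a standalone function (variable names are taken from the script; the `is_primary_ok` wrapper itself is a hypothetical sketch, not code from the operator):

```shell
#!/bin/bash
# Sketch of the check_async gate: the primary is only reported healthy when it
# is writable AND neither replication thread is running. Arguments mirror the
# values the real script derives from MySQL (SERVICE_STATE is ON/OFF).
is_primary_ok() {
    local SUPER_RO=$1 READ_ONLY=$2 REP_IO_STATUS=$3 REP_SQL_STATUS=$4
    if [[ ${SUPER_RO} == '0' ]] && [[ ${READ_ONLY} == '0' ]] \
        && [[ ${REP_IO_STATUS} != 'ON' ]] && [[ ${REP_SQL_STATUS} != 'ON' ]]; then
        echo "OK"
    else
        echo "NOT OK"
    fi
}

# Healthy primary: writable, no replication threads running.
is_primary_ok 0 0 OFF OFF   # prints "OK"
# The state from this issue: stale SQL applier thread still alive.
is_primary_ok 0 0 OFF ON    # prints "NOT OK" -> HAProxy marks <NOSRV>
```

This is why a writable primary with only a leftover applier thread is indistinguishable, from HAProxy's point of view, from a misconfigured replica.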
HAProxy logs show:
```
{"frontend_name": "mysql-primary-in", "backend_name": "mysql-primary", "server_name": "<NOSRV>", "termination_state": "SC"}
```

The Percona CR stays in state: initializing indefinitely. Orchestrator sees the primary with Problems: ["not_replicating"].
Expected Behavior
After a node restart, the operator or Orchestrator should either:
- Detect and clean up stale replica configuration on the primary (`STOP REPLICA; RESET REPLICA ALL;`)
- Or configure `skip-replica-start` in the MySQL config to prevent auto-starting stale replication threads
Workaround
Adding skip-replica-start to the MySQL configuration via the CR prevents the issue:
```yaml
mysql:
  configuration: |
    [mysqld]
    skip-replica-start
```

This prevents MySQL from auto-starting replication threads on startup. Orchestrator then properly handles START REPLICA only on the actual replica node.
Manual fix without config change: connect to the primary and run STOP REPLICA; RESET REPLICA ALL;
Root Cause Analysis
The MySQL data directory (on PVC) persists relay log info and replica configuration. When the primary pod restarts, MySQL reads this persisted state and starts the SQL applier thread. The Percona operator and Orchestrator do not detect or remediate this stale state on the primary.
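The stale state is detectable from the SHOW REPLICA STATUS output alone: an SQL applier thread running while the IO thread is stopped is the signature seen above. A minimal detection sketch (the `needs_replica_reset` helper and its parsing are hypothetical, not part of the operator or Orchestrator):

```shell
#!/bin/bash
# Decide whether a node that should be primary is carrying stale replica
# state. Reads "SHOW REPLICA STATUS\G"-style text on stdin and returns
# success (0) when the applier thread is alive but the IO thread is dead.
needs_replica_reset() {
    local status io sql
    status=$(cat)   # full SHOW REPLICA STATUS\G text from stdin
    io=$(awk -F': *' '/Replica_IO_Running:/ {print $2}' <<<"${status}")
    sql=$(awk -F': *' '/Replica_SQL_Running:/ {print $2}' <<<"${status}")
    # Stale signature from this issue: SQL thread Yes, IO thread No.
    [[ ${sql} == 'Yes' && ${io} == 'No' ]]
}

# The status reported in this issue would trigger a cleanup:
if needs_replica_reset <<'EOF'
Source_Host: main-mysql-mysql-1.main-mysql-mysql.databases
Replica_IO_Running: No
Replica_SQL_Running: Yes
Seconds_Behind_Source: NULL
EOF
then
    echo "would run: STOP REPLICA; RESET REPLICA ALL;"
fi
```

A check of this shape could run in a reconcile loop or init container; on a healthy replica (both threads Yes) or a clean primary (no replica status at all) it stays quiet.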
Suggestion
Consider one of:
- Operator-level fix: Add `skip-replica-start` to the default MySQL config for async clusters, since Orchestrator manages the replication lifecycle
- Orchestrator-level fix: Detect when the primary has an active SQL applier thread with no IO thread and run `RESET REPLICA ALL`
- Init container fix: Add a step to the `mysql-init` container that cleans up stale replica config before MySQL starts