
Async replication: primary retains stale replica config after node restart, HAProxy rejects it #1223

@AKhozya

Description

After a Kubernetes node restart, the MySQL primary pod retains stale replication configuration from its PVC. MySQL auto-starts the SQL applier thread, which causes the HAProxy haproxy_check_primary.sh health check to reject the primary as NOT OK, resulting in <NOSRV> for the mysql-primary backend.

Environment

  • Operator version: 1.0.0
  • MySQL version: Percona Server 8.4.7-7.1
  • Cluster type: async (2 replicas)
  • Kubernetes: K3s v1.35.1
  • HAProxy: percona/haproxy:2.8.18

Steps to Reproduce

  1. Deploy a 2-node async replication cluster with HAProxy
  2. Restart the Kubernetes node hosting the primary MySQL pod
  3. After pod recreation, the primary starts with stale replica configuration

Observed Behavior

After the node restart, the primary (mysql-0) reports stale SHOW REPLICA STATUS output pointing at the old replica:

Source_Host: main-mysql-mysql-1.main-mysql-mysql.databases
Replica_IO_Running: No
Replica_SQL_Running: Yes
Seconds_Behind_Source: NULL

The HAProxy haproxy_check_primary.sh script (check_async function) requires both REP_IO_STATUS != 'ON' and REP_SQL_STATUS != 'ON' for a healthy primary:

if [[ ${SUPER_RO} == '0' ]] && [[ ${READ_ONLY} == '0' ]] && [[ ${REP_IO_STATUS} != 'ON' ]] && [[ ${REP_SQL_STATUS} != 'ON' ]]; then
    # OK

Since Replica_SQL_Running is Yes (a stale applier thread), the check fails, HAProxy marks the backend <NOSRV>, and all MySQL clients get "Connection lost: The server closed the connection."
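The rejection can be reproduced outside HAProxy with a minimal standalone sketch of the check_async condition (the variable values below simulate the stale state from this report; they are hard-coded here rather than read from a live server):

```shell
#!/bin/bash
# Simulated inputs: what the health check would derive from the stale
# primary in this report (Replica_IO_Running: No, Replica_SQL_Running: Yes).
SUPER_RO='0'         # super_read_only is off: this node is writable
READ_ONLY='0'        # read_only is off
REP_IO_STATUS='OFF'  # IO thread not running
REP_SQL_STATUS='ON'  # stale SQL applier thread still running

# Same condition as check_async: a healthy primary must be writable AND
# have no replication threads running at all.
if [[ ${SUPER_RO} == '0' ]] && [[ ${READ_ONLY} == '0' ]] && [[ ${REP_IO_STATUS} != 'ON' ]] && [[ ${REP_SQL_STATUS} != 'ON' ]]; then
    RESULT='OK'
else
    RESULT='NOT OK'
fi
echo "primary check: ${RESULT}"
```

This prints `primary check: NOT OK`, which is exactly why HAProxy drops its only mysql-primary backend.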

HAProxy logs show:

{"frontend_name": "mysql-primary-in", "backend_name": "mysql-primary", "server_name":"<NOSRV>", "termination_state": "SC"}

The Percona CR stays in state: initializing indefinitely. Orchestrator sees the primary with Problems: ["not_replicating"].

Expected Behavior

After a node restart, the operator or Orchestrator should either:

  1. Detect and clean up stale replica configuration on the primary (STOP REPLICA; RESET REPLICA ALL;)
  2. Or configure skip-replica-start in the MySQL config to prevent auto-starting stale replication threads

Workaround

Adding skip-replica-start to the MySQL configuration via the CR prevents the issue:

mysql:
  configuration: |
    [mysqld]
    skip-replica-start

This prevents MySQL from auto-starting replication threads on startup; Orchestrator then issues START REPLICA only on the actual replica node.

Manual fix without config change: connect to the primary and run STOP REPLICA; RESET REPLICA ALL;

Root Cause Analysis

The MySQL data directory (on PVC) persists relay log info and replica configuration. When the primary pod restarts, MySQL reads this persisted state and starts the SQL applier thread. The Percona operator and Orchestrator do not detect or remediate this stale state on the primary.

Suggestion

Consider one of:

  1. Operator-level fix: Add skip-replica-start to the default MySQL config for async clusters, since Orchestrator manages replication lifecycle
  2. Orchestrator-level fix: Detect when the primary has active replication threads with no IO thread and run RESET REPLICA ALL
  3. Init container fix: Add a step to the mysql-init container that cleans up stale replica config before MySQL starts
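For option 3, a hedged sketch of what such a cleanup step might look like (the needs_cleanup heuristic, variable names, and credentials handling are all assumptions for illustration, not existing operator code):

```shell
#!/bin/bash
# Hypothetical init-container step: if the datadir carries persisted replica
# metadata but this pod is meant to be the primary, clear it before mysqld
# starts serving traffic.

needs_cleanup() {
    # $1: output of SHOW REPLICA STATUS. Non-empty output means persisted
    # replica configuration exists on this node.
    [[ -n "$1" ]]
}

# In a real container this would come from:
#   mysql -uroot -p"$ROOT_PASSWORD" -e 'SHOW REPLICA STATUS\G'
REPLICA_STATUS='Replica_SQL_Running: Yes'

if needs_cleanup "$REPLICA_STATUS"; then
    CLEANUP_SQL='STOP REPLICA; RESET REPLICA ALL;'
    # Real step (commented out in this sketch):
    #   mysql -uroot -p"$ROOT_PASSWORD" -e "$CLEANUP_SQL"
    echo "stale replica config detected, would run: ${CLEANUP_SQL}"
fi
```

The same decision logic could equally live in the operator or in Orchestrator's remediation path; the init container is just the earliest point where the stale state is visible and mysqld is not yet accepting connections.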
