
Async replication: primary retains stale replica config after node restart, HAProxy rejects it #1223

@AKhozya

Description

After a Kubernetes node restart, the MySQL primary pod retains stale replication configuration from its PVC. MySQL auto-starts the SQL applier thread, which causes the HAProxy haproxy_check_primary.sh health check to reject the primary as NOT OK, resulting in <NOSRV> for the mysql-primary backend.

Environment

  • Operator version: 1.0.0
  • MySQL version: Percona Server 8.4.7-7.1
  • Cluster type: async (2 replicas)
  • Kubernetes: K3s v1.35.1
  • HAProxy: percona/haproxy:2.8.18

Steps to Reproduce

  1. Deploy a 2-node async replication cluster with HAProxy
  2. Restart the Kubernetes node hosting the primary MySQL pod
  3. After pod recreation, the primary starts with stale replica configuration

Observed Behavior

After the node restart, the primary (mysql-0) reports stale SHOW REPLICA STATUS output pointing at the old replica:

Source_Host: main-mysql-mysql-1.main-mysql-mysql.databases
Replica_IO_Running: No
Replica_SQL_Running: Yes
Seconds_Behind_Source: NULL

The HAProxy haproxy_check_primary.sh script (check_async function) requires both REP_IO_STATUS != 'ON' and REP_SQL_STATUS != 'ON' for a healthy primary:

if [[ ${SUPER_RO} == '0' ]] && [[ ${READ_ONLY} == '0' ]] && [[ ${REP_IO_STATUS} != 'ON' ]] && [[ ${REP_SQL_STATUS} != 'ON' ]]; then
    # OK

Since Replica_SQL_Running is Yes (a stale applier thread), the check fails, HAProxy marks the backend <NOSRV>, and all MySQL clients get "Connection lost: The server closed the connection."
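The rejection can be reproduced outside HAProxy with a minimal standalone sketch of the check_async condition (the variable values below simulate the stale state from this report; they are hard-coded here rather than read from a live server):

```shell
#!/bin/bash
# Simulated inputs: what the health check would derive from the stale
# primary in this report (Replica_IO_Running: No, Replica_SQL_Running: Yes).
SUPER_RO='0'         # super_read_only is off: this node is writable
READ_ONLY='0'        # read_only is off
REP_IO_STATUS='OFF'  # IO thread not running
REP_SQL_STATUS='ON'  # stale SQL applier thread still running

# Same condition as check_async: a healthy primary must be writable AND
# have no replication threads running at all.
if [[ ${SUPER_RO} == '0' ]] && [[ ${READ_ONLY} == '0' ]] && [[ ${REP_IO_STATUS} != 'ON' ]] && [[ ${REP_SQL_STATUS} != 'ON' ]]; then
    RESULT='OK'
else
    RESULT='NOT OK'
fi
echo "primary check: ${RESULT}"
```

This prints `primary check: NOT OK`, which is exactly why HAProxy drops its only mysql-primary backend.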

HAProxy logs show:

{"frontend_name": "mysql-primary-in", "backend_name": "mysql-primary", "server_name":"<NOSRV>", "termination_state": "SC"}

The Percona CR stays in state: initializing indefinitely. Orchestrator sees the primary with Problems: ["not_replicating"].

Expected Behavior

After a node restart, the operator or Orchestrator should either:

  1. Detect and clean up stale replica configuration on the primary (STOP REPLICA; RESET REPLICA ALL;)
  2. Or configure skip-replica-start in the MySQL config to prevent auto-starting stale replication threads

Workaround

Adding skip-replica-start to the MySQL configuration via the CR prevents the issue:

mysql:
  configuration: |
    [mysqld]
    skip-replica-start

This prevents MySQL from auto-starting replication threads on startup; Orchestrator then issues START REPLICA only on the actual replica node.

Manual fix without config change: connect to the primary and run STOP REPLICA; RESET REPLICA ALL;

Root Cause Analysis

The MySQL data directory (on PVC) persists relay log info and replica configuration. When the primary pod restarts, MySQL reads this persisted state and starts the SQL applier thread. The Percona operator and Orchestrator do not detect or remediate this stale state on the primary.

Suggestion

Consider one of:

  1. Operator-level fix: Add skip-replica-start to the default MySQL config for async clusters, since Orchestrator manages replication lifecycle
  2. Orchestrator-level fix: Detect when the primary has active replication threads with no IO thread and run RESET REPLICA ALL
  3. Init container fix: Add a step to the mysql-init container that cleans up stale replica config before MySQL starts
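For option 3, a hedged sketch of what such a cleanup step might look like (the needs_cleanup heuristic, variable names, and credentials handling are all assumptions for illustration, not existing operator code):

```shell
#!/bin/bash
# Hypothetical init-container step: if the datadir carries persisted replica
# metadata but this pod is meant to be the primary, clear it before mysqld
# starts serving traffic.

needs_cleanup() {
    # $1: output of SHOW REPLICA STATUS. Non-empty output means persisted
    # replica configuration exists on this node.
    [[ -n "$1" ]]
}

# In a real container this would come from:
#   mysql -uroot -p"$ROOT_PASSWORD" -e 'SHOW REPLICA STATUS\G'
REPLICA_STATUS='Replica_SQL_Running: Yes'

if needs_cleanup "$REPLICA_STATUS"; then
    CLEANUP_SQL='STOP REPLICA; RESET REPLICA ALL;'
    # Real step (commented out in this sketch):
    #   mysql -uroot -p"$ROOT_PASSWORD" -e "$CLEANUP_SQL"
    echo "stale replica config detected, would run: ${CLEANUP_SQL}"
fi
```

The same decision logic could equally live in the operator or in Orchestrator's remediation path; the init container is just the earliest point where the stale state is visible and mysqld is not yet accepting connections.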
