Intermittent network error causes permanent server outage #1224

@gugu

Report

I started the cluster in Alibaba Cloud and began getting "Connection refused" errors when trying to connect to it.

More about the problem

What happens:

  1. The cluster runs in MySQL group replication mode:
     k exec -n mysql pods/ps-db-mysql-2 -it mysql -- cat /var/lib/mysql/mysql.state
     ready
  2. Something happens, probably a network error. Nothing shows up in the logs and there is no restart.
  3. The server switches to the startup state, but stays fully functional when connected to directly:
     k exec -n mysql pods/ps-db-mysql-2 -it mysql -- cat /var/lib/mysql/mysql.state
     startup
  4. The readiness probe fails for the pod, so the Kubernetes service stops routing traffic to it.
  5. If this happens on the master, a switchover never happens; the master just becomes inaccessible.
  6. Running echo -n ready > /var/lib/mysql/mysql.state inside the pod fixes the problem.
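
To make the stuck state easier to spot across pods, here is a rough sketch of a check that can be run from a workstation. It only reads the state file mentioned above and does not change anything. The function names (classify_state, read_state, check_pods) are made up for illustration, and it assumes kubectl is configured for the cluster and that the default container of the pod can read /var/lib/mysql/mysql.state. It also treats "ready" and "startup" as the only known values because those are the two seen in this report; the operator may use others.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of the operator.

STATE_FILE=/var/lib/mysql/mysql.state   # path taken from the report above

# Map the contents of the state file to a short verdict.
# Only "ready" and "startup" were observed; anything else is "unknown".
classify_state() {
  case "$1" in
    ready)   echo "ok" ;;
    startup) echo "stuck" ;;
    *)       echo "unknown" ;;
  esac
}

# Read the state file from one pod.
# $1 = namespace, $2 = pod name. Assumes the default container has the file.
read_state() {
  kubectl exec -n "$1" "pods/$2" -- cat "$STATE_FILE"
}

# Print "<pod> <state> <verdict>" for every pod name given after the namespace.
check_pods() {
  ns=$1
  shift
  for pod in "$@"; do
    state=$(read_state "$ns" "$pod" 2>/dev/null)
    echo "$pod ${state:-<unreadable>} $(classify_state "$state")"
  done
}
```

After sourcing the file, check_pods mysql ps-db-mysql-2 should print "ps-db-mysql-2 startup stuck" for a pod in the state described in step 3. A pod that reports "stuck" while still accepting direct connections matches the symptom in this issue.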

Steps to reproduce

I'll add more details when I have them, but for now I don't know how to reproduce it.

Versions

  1. Kubernetes 1.35
  2. Operator 1.0
  3. Database: percona-server:8.4.6-6.1

Anything else?

No response
