Return accepted if a new server is already in cluster #634
greensky00 merged 1 commit into eBay:master
Conversation
src/handle_join_leave.cxx
```cpp
ptr<cluster_config> cur_config = get_config();
if (cur_config->get_servers().size() > 1) {
    p_in("this server is already in a cluster, ignore the request");
    resp->accept( quick_commit_index_.load() + 1 );
```
We can't always accept the request. The main purpose of this if condition is to reject requests coming from a different cluster; it should never accept such requests.
Please change the logic so that the carried cluster_config is validated, and the request is accepted only if the cluster_config in the request exactly matches the cluster configuration of this server.
Actually the cluster_config carried in the request cannot match exactly. The request should be accepted only if:
req->cluster_config == this->cluster_config - this_server
In other words, please update the logic to accept the request only when the request’s cluster configuration matches the server’s cluster configuration excluding the server itself.
Makes sense, will update it.
We found a corner case where the new member rejected the join_cluster_req while it was being added. Here is what happened:
T1. The first time, the leader invited the new member to join the cluster. The follower received the request, saved its state, and called reconfigure to apply the cluster config. However, the leader didn't receive the response (timeout).
leader's logs
follower's logs
T2. We retried the add-member operation and the leader sent join_cluster_req again, but the follower thought it was already in the cluster and returned a response with accept=false. The leader received accept=false and considered that the follower had rejected the request, so we were trapped in an endless retry loop.
follower:
leader:
Since the follower also saved is_catching_up=true, it skipped voting in handle_election_timeout; as a result, it never gets a chance to realize that it is not in the cluster at all.