Fix flaky multi-node cluster tests (member removal, SBR, sharding) #8025
Conversation
Changes:
- LeaderDowningNodeThatIsUnreachableSpec: fix a bug where the test tried to run on the Second node after it had already been exited (line 143)
- NodeDowningAndBeingRemovedSpec: convert to async, increase outer timeout from 30s to 45s, add explicit timeouts to AwaitConditionAsync/AwaitAssertAsync
- NodeLeavingAndExitingAndBeingRemovedSpec: convert to async, increase outer timeout from 15s to 45s for CI variability, add explicit timeouts

These tests are likely affected by PR akkadotnet#8011's MergeSeen filter fix, which changes gossip convergence timing.
Aaronontheweb
left a comment
Reviewed some of the changes Claude made here and the PR could use some clean-up
    {
        await EnterBarrierAsync("down-second-node");
        await AwaitMembersUpAsync(2, ImmutableHashSet.Create(secondAddress), 30.Seconds());
    }, _config.Second, _config.Third);
Yep, looks like this was a bug
    public async Task Node_that_is_downed_must_eventually_be_removed_from_membership()
    {
-       AwaitClusterUp(_config.First, _config.Second, _config.Third);
+       await AwaitClusterUpAsync(CancellationToken.None, _config.First, _config.Second, _config.Third);
Don't need the CancellationToken.None here
    {
        // verify that the node is shut down
-       AwaitCondition(() => Cluster.IsTerminated);
+       await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated), max: 30.Seconds());
If this is running inside a Within / WithinAsync block, we don't need to also specify a max timeout here either. That should just get inherited.
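A minimal sketch of the inheritance behavior described here (assuming Akka.TestKit's WithinAsync/AwaitConditionAsync; the 45-second figure is illustrative):

```csharp
// Inside WithinAsync, test-kit waits that omit an explicit timeout
// fall back to the remaining time of the enclosing block.
await WithinAsync(TimeSpan.FromSeconds(45), async () =>
{
    // No `max:` argument needed; the condition check inherits
    // the time left in the surrounding WithinAsync window.
    await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated));
});
```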
-       ClusterView.Members.Select(c => c.Address).Should().NotContain(GetAddress(_config.Second));
-       ClusterView.Members.Select(c => c.Address).Should().NotContain(GetAddress(_config.Third));
+       ClusterView.Members.Select(c => c.Address).Should().NotContain(secondAddress);
    });
Address caching optimization - just re-using the ones we resolved earlier.
    public async Task Node_that_is_leaving_non_singleton_cluster_eventually_set_to_removed_and_removed_from_membership_ring_and_seen_table()
    {
-       AwaitClusterUp(_config.First, _config.Second, _config.Third);
+       await AwaitClusterUpAsync(CancellationToken.None, _config.First, _config.Second, _config.Third);
Don't need the CancellationToken.None here
    });
    AwaitAssert(() =>
    {
        ClusterView.Members.Select(c => c.Address).Should().NotContain(secondAddress);
    }, 30.Seconds());
Don't need the explicit 30 seconds since we're already inside a long-ish WithinAsync block
    // verify that the second node is shut down
-   AwaitCondition(() => Cluster.IsTerminated);
+   await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated), max: 30.Seconds());
    EnterBarrier("second-shutdown");
Don't need the explicit 30 seconds since we're already inside a long-ish WithinAsync block
Addressed the review feedback:

- Removed explicit timeout args from AwaitAssertAsync/AwaitConditionAsync calls inside WithinAsync blocks (timeouts are inherited from the outer block)
- Moved address caching outside the WithinAsync block for cleaner code
- Kept CancellationToken.None as it's required by the API signature
SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec, DownAllIndirectlyConnected5NodeSpec):
- Replace polling via AwaitConditionAsync with an event-driven callback
- Use cluster.RegisterOnMemberRemoved() for immediate notification
- The callback fires as soon as the member is removed or the cluster daemon stops
- Eliminates the race between the polling interval and the actual state change
- Convert remaining sync methods to the async pattern

ClusterShardingRolePartitioningSpec:
- Wrap the first message send in AwaitAssert to handle coordinator readiness
- The coordinator may not respond to GetShardHome until HasAllRegionsRegistered()
- GetShardHome requests are silently ignored until _aliveRegions.Count >= _minMembers
- The retry pattern ensures we wait for coordinator readiness without timeout jiggling
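The polling-to-callback change can be sketched roughly as follows (a sketch only; assumes Akka.Cluster's RegisterOnMemberRemoved callback and .NET 6+'s Task.WaitAsync, and the 30-second bound is illustrative):

```csharp
// Event-driven wait: complete a TaskCompletionSource from the
// RegisterOnMemberRemoved callback instead of polling cluster state.
var removed = new TaskCompletionSource<bool>(
    TaskCreationOptions.RunContinuationsAsynchronously);

// Fires as soon as this member reaches Removed (or the cluster
// extension shuts down), eliminating the polling-interval race.
cluster.RegisterOnMemberRemoved(() => removed.TrySetResult(true));

// Await the notification directly, with an overall safety bound.
await removed.Task.WaitAsync(TimeSpan.FromSeconds(30));
```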
- Convert test methods to return Task and use await
- Use AwaitClusterUpAsync, RunOnAsync, EnterBarrierAsync
- Use AwaitAssertAsync and ExpectMsgAsync patterns
- Maintains the coordinator readiness fix from the previous commit
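The coordinator-readiness retry could look roughly like this (a sketch only; `region`, `probe`, and the message are illustrative, and AwaitAssertAsync/ExpectMsgAsync are assumed from the async TestKit):

```csharp
// Retry the first send until the coordinator is ready to answer
// GetShardHome, instead of relying on a single short ExpectMsg.
await AwaitAssertAsync(async () =>
{
    region.Tell("hello-1", probe.Ref);               // illustrative message
    await probe.ExpectMsgAsync<string>(2.Seconds()); // short per-attempt wait
}, 30.Seconds());                                    // overall retry budget
```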
Summary
Fixes multiple flaky multi-node cluster tests by addressing race conditions and timing issues. Uses proper synchronization patterns instead of timeout increases.
Changes
LeaderDowningNodeThatIsUnreachableSpec
Fixed a bug where the test tried to run on _config.Second after that node had already been exited via TestConductor.ExitAsync(). Changed to only run on _config.Third.

NodeDowningAndBeingRemovedSpec
NodeLeavingAndExitingAndBeingRemovedSpec
SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec, DownAllIndirectlyConnected5NodeSpec)
- RunOnAsync blocks after the partition: node1's verification could consume most of the timeout, leaving insufficient time for nodes 2-5 to terminate
- Replaced polling via AwaitConditionAsync with an event-driven callback using cluster.RegisterOnMemberRemoved()

ClusterShardingRolePartitioningSpec
- Race between region registration (GetCurrentRegions) and coordinator readiness (HasAllRegionsRegistered()). The coordinator silently ignores GetShardHome requests until _aliveRegions.Count >= _minMembers, causing messages to buffer and retry on a 2-10s interval, exceeding the 5s ExpectMsg timeout.
- Wrapped the first message send in AwaitAssert to handle coordinator readiness, following the established pattern from ClusterShardingQueriesSpec

Design Principles
Test Plan