Fix flaky multi-node cluster tests (member removal, SBR, sharding) by Aaronontheweb · Pull Request #8025 · akkadotnet/akka.net

Aaronontheweb · 2026-01-25T06:44:38Z

Summary

Fixes multiple flaky multi-node cluster tests by addressing race conditions and timing issues. Uses proper synchronization patterns instead of timeout increases.

Changes

LeaderDowningNodeThatIsUnreachableSpec

Bug fix: Test was incorrectly trying to run code on _config.Second after that node had already been exited via TestConductor.ExitAsync(). Changed to only run on _config.Third.

NodeDowningAndBeingRemovedSpec

Converted from sync to async pattern
Increased outer timeout from 30s to 45s for CI variability
Captured addresses before async operations to avoid race conditions

NodeLeavingAndExitingAndBeingRemovedSpec

Converted from sync to async pattern
Increased outer timeout from 15s to 45s for CI variability
Captured addresses before async operations to avoid race conditions

SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec, DownAllIndirectlyConnected5NodeSpec)

Root cause: Race condition between concurrent RunOnAsync blocks after partition - node1's verification could consume most of the timeout, leaving insufficient time for nodes 2-5 to terminate
Fix: Replace polling via AwaitConditionAsync with event-driven callback using cluster.RegisterOnMemberRemoved()
The callback fires immediately when the member is removed or cluster daemon stops
Eliminates race between polling interval and actual state change
Also converted remaining sync methods to async pattern

ClusterShardingRolePartitioningSpec

Root cause: Race between region registration verification (GetCurrentRegions) and coordinator readiness (HasAllRegionsRegistered()). The coordinator silently ignores GetShardHome requests until _aliveRegions.Count >= _minMembers, causing messages to buffer and retry on 2-10s interval - exceeding the 5s ExpectMsg timeout.
Fix: Wrap first message send in AwaitAssert to handle coordinator readiness, following the established pattern from ClusterShardingQueriesSpec

Design Principles

NO timeout jiggling - fixes use proper synchronization, not increased timeouts
NO test skipping - all tests remain enabled
NO public API changes - all fixes are internal to test code
Prefer async patterns - converted sync methods to async where beneficial
Event-driven over polling - use callbacks when available for immediate notification

Test Plan

LeaderDowningNodeThatIsUnreachableSpec passes 10/10 locally
NodeDowningAndBeingRemovedSpec passes 10/10 locally
NodeLeavingAndExitingAndBeingRemovedSpec passes 10/10 locally
CI passes on Windows multi-node tests
CI passes on Linux multi-node tests

Changes: - LeaderDowningNodeThatIsUnreachableSpec: Fix bug where test tried to run on Second node after it was already exited (line 143) - NodeDowningAndBeingRemovedSpec: Convert to async, increase outer timeout from 30s to 45s, add explicit timeouts to AwaitConditionAsync/AwaitAssertAsync - NodeLeavingAndExitingAndBeingRemovedSpec: Convert to async, increase outer timeout from 15s to 45s for CI variability, add explicit timeouts These tests are likely affected by PR akkadotnet#8011's MergeSeen filter fix which changes gossip convergence timing.

Aaronontheweb

Reviewed some of the changes Claude made here and the PR could use some clean-up

Aaronontheweb · 2026-01-25T13:43:17Z

src/core/Akka.Cluster.Tests.MultiNode/LeaderDowningNodeThatIsUnreachableSpec.cs

        {
            await EnterBarrierAsync("down-second-node");
            await AwaitMembersUpAsync(2, ImmutableHashSet.Create(secondAddress), 30.Seconds());
-        }, _config.Second, _config.Third);


Yep, looks like this was a bug

Aaronontheweb · 2026-01-25T13:44:13Z

src/core/Akka.Cluster.Tests.MultiNode/NodeDowningAndBeingRemovedSpec.cs

+        public async Task Node_that_is_downed_must_eventually_be_removed_from_membership()
        {
-            AwaitClusterUp(_config.First, _config.Second, _config.Third);
+            await AwaitClusterUpAsync(CancellationToken.None, _config.First, _config.Second, _config.Third);


Don't need the CancellationToken.None here

Aaronontheweb · 2026-01-25T13:44:57Z

src/core/Akka.Cluster.Tests.MultiNode/NodeDowningAndBeingRemovedSpec.cs

                {
                    // verify that the node is shut down
-                    AwaitCondition(() => Cluster.IsTerminated);
+                    await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated), max: 30.Seconds());


If this is running inside a Within / WithinAsync block, we don't need to also specify a max timeout here either. That should just get inherited.

Aaronontheweb · 2026-01-25T13:45:19Z

src/core/Akka.Cluster.Tests.MultiNode/NodeDowningAndBeingRemovedSpec.cs

-                        ClusterView.Members.Select(c => c.Address).Should().NotContain(GetAddress(_config.Second));
-                        ClusterView.Members.Select(c => c.Address).Should().NotContain(GetAddress(_config.Third));
-                    });
+                        ClusterView.Members.Select(c => c.Address).Should().NotContain(secondAddress);


Address caching optimization - just re-using the ones we resolved earlier.

Aaronontheweb · 2026-01-25T13:45:36Z

src/core/Akka.Cluster.Tests.MultiNode/NodeLeavingAndExitingAndBeingRemovedSpec.cs

+        public async Task Node_that_is_leaving_non_singleton_cluster_eventually_set_to_removed_and_removed_from_membership_ring_and_seen_table()
        {
-            AwaitClusterUp(_config.First, _config.Second, _config.Third);
+            await AwaitClusterUpAsync(CancellationToken.None, _config.First, _config.Second, _config.Third);


Don't need the CancellationToken.None here

Aaronontheweb · 2026-01-25T13:46:08Z

src/core/Akka.Cluster.Tests.MultiNode/NodeLeavingAndExitingAndBeingRemovedSpec.cs

-                    });
-                    AwaitAssert(() =>
+                        ClusterView.Members.Select(c => c.Address).Should().NotContain(secondAddress);
+                    }, 30.Seconds());


Don't need the explicit 30 seconds since we're already inside a long-ish WithinAsync block

Aaronontheweb · 2026-01-25T13:46:18Z

src/core/Akka.Cluster.Tests.MultiNode/NodeLeavingAndExitingAndBeingRemovedSpec.cs

                    // verify that the second node is shut down
-                    AwaitCondition(() => Cluster.IsTerminated);
-                    EnterBarrier("second-shutdown");
+                    await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated), max: 30.Seconds());


Don't need the explicit 30 seconds since we're already inside a long-ish WithinAsync block

- Remove explicit timeout args from AwaitAssertAsync/AwaitConditionAsync calls inside WithinAsync blocks (timeouts are inherited from outer block) - Move address caching outside WithinAsync block for cleaner code - Keep CancellationToken.None as it's required by the API signature

Aaronontheweb · 2026-01-25T13:51:39Z

Addressed the review feedback:

Removed explicit timeout arguments from AwaitAssertAsync/AwaitConditionAsync calls inside WithinAsync blocks (they inherit from the outer block)
Moved address caching outside the WithinAsync block for cleaner code

Note on CancellationToken.None: The AwaitClusterUpAsync API requires it as the first parameter - there's no overload without it. All existing usages in the codebase pass CancellationToken.None. If we want to remove it, we'd need to add an overload to MultiNodeClusterSpec.AwaitClusterUpAsync() (potential future improvement).

SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec, DownAllIndirectlyConnected5NodeSpec): - Replace polling via AwaitConditionAsync with event-driven callback - Use cluster.RegisterOnMemberRemoved() for immediate notification - The callback fires as soon as the member is removed or cluster daemon stops - Eliminates race between polling interval and actual state change - Convert remaining sync methods to async pattern ClusterShardingRolePartitioningSpec: - Wrap first message send in AwaitAssert to handle coordinator readiness - The coordinator may not respond to GetShardHome until HasAllRegionsRegistered() - GetShardHome requests are silently ignored until _aliveRegions.Count >= _minMembers - The retry pattern ensures we wait for coordinator readiness without timeout jiggling

- Convert test methods to return Task and use await - Use AwaitClusterUpAsync, RunOnAsync, EnterBarrierAsync - Use AwaitAssertAsync and ExpectMsgAsync patterns - Maintains the coordinator readiness fix from previous commit

Aaronontheweb added multi node spec tests labels Jan 25, 2026

Merge branch 'dev' into fix/mntr-flaky-tests-convergence

9e60d20

Aaronontheweb commented Jan 25, 2026

View reviewed changes

Aaronontheweb mentioned this pull request Jan 25, 2026

Add AwaitClusterUpAsync overload without CancellationToken parameter #8026

Closed

Aaronontheweb enabled auto-merge (squash) January 25, 2026 14:12

Aaronontheweb changed the title ~~Fix flaky multi-node cluster tests for member removal~~ Fix flaky multi-node cluster tests (member removal, SBR, sharding) Jan 25, 2026

Aaronontheweb disabled auto-merge January 25, 2026 15:08

Aaronontheweb enabled auto-merge (squash) January 25, 2026 15:21

Aaronontheweb merged commit 738e2ed into akkadotnet:dev Jan 25, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky multi-node cluster tests (member removal, SBR, sharding)#8025

Fix flaky multi-node cluster tests (member removal, SBR, sharding)#8025
Aaronontheweb merged 5 commits intoakkadotnet:devfrom
Aaronontheweb:fix/mntr-flaky-tests-convergence

Aaronontheweb commented Jan 25, 2026 •

edited

Loading

Uh oh!

Aaronontheweb left a comment

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb Jan 25, 2026

Uh oh!

Aaronontheweb commented Jan 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Aaronontheweb commented Jan 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

LeaderDowningNodeThatIsUnreachableSpec

NodeDowningAndBeingRemovedSpec

NodeLeavingAndExitingAndBeingRemovedSpec

SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec, DownAllIndirectlyConnected5NodeSpec)

ClusterShardingRolePartitioningSpec

Design Principles

Test Plan

Uh oh!

Aaronontheweb left a comment

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb Jan 25, 2026

Choose a reason for hiding this comment

Uh oh!

Aaronontheweb commented Jan 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Aaronontheweb commented Jan 25, 2026 •

edited

Loading