Skip to content

Fix flaky multi-node cluster tests (member removal, SBR, sharding)#8025

Merged
Aaronontheweb merged 5 commits intoakkadotnet:devfrom
Aaronontheweb:fix/mntr-flaky-tests-convergence
Jan 25, 2026
Merged

Fix flaky multi-node cluster tests (member removal, SBR, sharding)#8025
Aaronontheweb merged 5 commits intoakkadotnet:devfrom
Aaronontheweb:fix/mntr-flaky-tests-convergence

Conversation

@Aaronontheweb
Copy link
Member

@Aaronontheweb Aaronontheweb commented Jan 25, 2026

Summary

Fixes multiple flaky multi-node cluster tests by addressing race conditions and timing issues. Uses proper synchronization patterns instead of timeout increases.

Changes

LeaderDowningNodeThatIsUnreachableSpec

  • Bug fix: Test was incorrectly trying to run code on _config.Second after that node had already been exited via TestConductor.ExitAsync(). Changed to only run on _config.Third.

NodeDowningAndBeingRemovedSpec

  • Converted from sync to async pattern
  • Increased outer timeout from 30s to 45s for CI variability
  • Captured addresses before async operations to avoid race conditions

NodeLeavingAndExitingAndBeingRemovedSpec

  • Converted from sync to async pattern
  • Increased outer timeout from 15s to 45s for CI variability
  • Captured addresses before async operations to avoid race conditions

SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec, DownAllIndirectlyConnected5NodeSpec)

  • Root cause: Race condition between concurrent RunOnAsync blocks after partition - node1's verification could consume most of the timeout, leaving insufficient time for nodes 2-5 to terminate
  • Fix: Replace polling via AwaitConditionAsync with event-driven callback using cluster.RegisterOnMemberRemoved()
  • The callback fires immediately when the member is removed or cluster daemon stops
  • Eliminates race between polling interval and actual state change
  • Also converted remaining sync methods to async pattern

ClusterShardingRolePartitioningSpec

  • Root cause: Race between region registration verification (GetCurrentRegions) and coordinator readiness (HasAllRegionsRegistered()). The coordinator silently ignores GetShardHome requests until _aliveRegions.Count >= _minMembers, causing messages to buffer and retry on 2-10s interval - exceeding the 5s ExpectMsg timeout.
  • Fix: Wrap first message send in AwaitAssert to handle coordinator readiness, following the established pattern from ClusterShardingQueriesSpec

Design Principles

  • NO timeout jiggling - fixes use proper synchronization, not increased timeouts
  • NO test skipping - all tests remain enabled
  • NO public API changes - all fixes are internal to test code
  • Prefer async patterns - converted sync methods to async where beneficial
  • Event-driven over polling - use callbacks when available for immediate notification

Test Plan

  • LeaderDowningNodeThatIsUnreachableSpec passes 10/10 locally
  • NodeDowningAndBeingRemovedSpec passes 10/10 locally
  • NodeLeavingAndExitingAndBeingRemovedSpec passes 10/10 locally
  • CI passes on Windows multi-node tests
  • CI passes on Linux multi-node tests

Changes:
- LeaderDowningNodeThatIsUnreachableSpec: Fix bug where test tried to run on
  Second node after it was already exited (line 143)

- NodeDowningAndBeingRemovedSpec: Convert to async, increase outer timeout
  from 30s to 45s, add explicit timeouts to AwaitConditionAsync/AwaitAssertAsync

- NodeLeavingAndExitingAndBeingRemovedSpec: Convert to async, increase outer
  timeout from 15s to 45s for CI variability, add explicit timeouts

These tests are likely affected by PR akkadotnet#8011's MergeSeen filter fix which
changes gossip convergence timing.
Copy link
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed some of the changes Claude made here and the PR could use some clean-up

{
await EnterBarrierAsync("down-second-node");
await AwaitMembersUpAsync(2, ImmutableHashSet.Create(secondAddress), 30.Seconds());
}, _config.Second, _config.Third);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, looks like this was a bug

public async Task Node_that_is_downed_must_eventually_be_removed_from_membership()
{
AwaitClusterUp(_config.First, _config.Second, _config.Third);
await AwaitClusterUpAsync(CancellationToken.None, _config.First, _config.Second, _config.Third);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the CancellationToken.None here

{
// verify that the node is shut down
AwaitCondition(() => Cluster.IsTerminated);
await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated), max: 30.Seconds());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is running inside a Within / WithinAsync block, we don't need to also specify a max timeout here either. That should just get inherited.

ClusterView.Members.Select(c => c.Address).Should().NotContain(GetAddress(_config.Second));
ClusterView.Members.Select(c => c.Address).Should().NotContain(GetAddress(_config.Third));
});
ClusterView.Members.Select(c => c.Address).Should().NotContain(secondAddress);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Address caching optimization - just re-using the ones we resolved earlier.

public async Task Node_that_is_leaving_non_singleton_cluster_eventually_set_to_removed_and_removed_from_membership_ring_and_seen_table()
{
AwaitClusterUp(_config.First, _config.Second, _config.Third);
await AwaitClusterUpAsync(CancellationToken.None, _config.First, _config.Second, _config.Third);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the CancellationToken.None here

});
AwaitAssert(() =>
ClusterView.Members.Select(c => c.Address).Should().NotContain(secondAddress);
}, 30.Seconds());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the explicit 30 seconds since we're already inside a long-ish WithinAsync block

// verify that the second node is shut down
AwaitCondition(() => Cluster.IsTerminated);
EnterBarrier("second-shutdown");
await AwaitConditionAsync(() => Task.FromResult(Cluster.IsTerminated), max: 30.Seconds());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need the explicit 30 seconds since we're already inside a long-ish WithinAsync block

- Remove explicit timeout args from AwaitAssertAsync/AwaitConditionAsync
  calls inside WithinAsync blocks (timeouts are inherited from outer block)
- Move address caching outside WithinAsync block for cleaner code
- Keep CancellationToken.None as it's required by the API signature
@Aaronontheweb
Copy link
Member Author

Addressed the review feedback:

  • Removed explicit timeout arguments from AwaitAssertAsync/AwaitConditionAsync calls inside WithinAsync blocks (they inherit from the outer block)
  • Moved address caching outside the WithinAsync block for cleaner code

Note on CancellationToken.None: The AwaitClusterUpAsync API requires it as the first parameter - there's no overload without it. All existing usages in the codebase pass CancellationToken.None. If we want to remove it, we'd need to add an overload to MultiNodeClusterSpec.AwaitClusterUpAsync() (potential future improvement).

SBR Tests (IndirectlyConnected3NodeSpec, IndirectlyConnected5NodeSpec,
DownAllIndirectlyConnected5NodeSpec):
- Replace polling via AwaitConditionAsync with event-driven callback
- Use cluster.RegisterOnMemberRemoved() for immediate notification
- The callback fires as soon as the member is removed or cluster daemon stops
- Eliminates race between polling interval and actual state change
- Convert remaining sync methods to async pattern

ClusterShardingRolePartitioningSpec:
- Wrap first message send in AwaitAssert to handle coordinator readiness
- The coordinator may not respond to GetShardHome until HasAllRegionsRegistered()
- GetShardHome requests are silently ignored until _aliveRegions.Count >= _minMembers
- The retry pattern ensures we wait for coordinator readiness without timeout jiggling
@Aaronontheweb Aaronontheweb changed the title Fix flaky multi-node cluster tests for member removal Fix flaky multi-node cluster tests (member removal, SBR, sharding) Jan 25, 2026
- Convert test methods to return Task and use await
- Use AwaitClusterUpAsync, RunOnAsync, EnterBarrierAsync
- Use AwaitAssertAsync and ExpectMsgAsync patterns
- Maintains the coordinator readiness fix from previous commit
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) January 25, 2026 15:21
@Aaronontheweb Aaronontheweb merged commit 738e2ed into akkadotnet:dev Jan 25, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant