Skip to content

Fix a flaky test for issue 19916 ClusterShardLimitIT#20252

Open
liuguoqingfz wants to merge 1 commit intoopensearch-project:mainfrom
liuguoqingfz:flakytese-19916-ClusterShardLimitIT
Open

Fix a flaky test for issue 19916 ClusterShardLimitIT#20252
liuguoqingfz wants to merge 1 commit intoopensearch-project:mainfrom
liuguoqingfz:flakytese-19916-ClusterShardLimitIT

Conversation

@liuguoqingfz
Copy link
Contributor

@liuguoqingfz liuguoqingfz commented Dec 16, 2025

Description

Fix a timing/ack timeout flake: prepareClose(...).get() returns an AcknowledgedResponse that can be false if the close cluster-state update times out waiting for acknowledgements, even though the close may still eventually apply.

Related Issues

Resolves #19916
Related CI test run: https://build.ci.opensearch.org/job/gradle-check/66791/testReport/org.opensearch.cluster.shards/ClusterShardLimitIT/testOpenIndexOverLimit__p0___opensearch_experimental_feature_writable_warm_index_enabled___true___/

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Improved cluster shard limit test reliability by implementing polling-based state verification instead of direct acknowledgments.
    • Enhanced test assertions with explicit health checks and busy-wait conditions for better cluster state validation.
    • Updated test method signatures to properly handle exception declarations.

✏️ Tip: You can customize this high-level summary in your review settings.

… until the index is actually CLOSED before continuing.

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 16, 2025 10:49
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 16, 2025
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 16, 2025

📝 Walkthrough

Walkthrough

This change fixes a flaky test in ClusterShardLimitIT by replacing synchronous acknowledgement checks with explicit asynchronous polling to reliably wait for index state transitions, adding explicit cluster health verification, and updating method signatures to declare exception handling.

Changes

Cohort / File(s) Change Summary
Test reliability improvements
server/src/internalClusterTest/java/org/opensearch/cluster/shards/ClusterShardLimitIT.java
Replaced AcknowledgedResponse-based index closing with direct close call and busy-wait polling for CLOSED state confirmation; added explicit cluster health check after filling shards; replaced prepareState().get().getState() with prepareState().execute().actionGet().getState(); updated testOpenIndexOverLimit() signature to throw Exception; added imports for TimeUnit and assertThat

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single file modification focused on test reliability improvements
  • Pattern changes involve replacing synchronous checks with polling—straightforward conceptual shift
  • Main attention: Verify the polling logic correctly handles timeout scenarios and that health check addition doesn't introduce new race conditions

Suggested labels

bug, flaky-test

Suggested reviewers

  • dbwiddis
  • andrross
  • msfroh
  • gbbafna
  • cwperks
  • sachinpkale
  • shwetathareja
  • saratvemulapalli

Poem

🐰 A flaky test once danced in the night,
But now with polling, we've set things right!
No rushed acknowledgments, just patient waits—
The state will confirm what the cluster awaits! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically identifies the main change: fixing a flaky test in ClusterShardLimitIT related to issue 19916.
Linked Issues check ✅ Passed The code changes (asynchronous polling, busy-wait for CLOSED state, explicit health checks) directly address the flaky test issue caused by unreliable AcknowledgedResponse from prepareClose timeout scenarios.
Out of Scope Changes check ✅ Passed All changes are limited to the ClusterShardLimitIT test file and directly target fixing the flakiness issue; no out-of-scope modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The pull request description provides a clear and concise explanation of the fix, references the related issue (#19916), and confirms compliance with the Apache 2.0 license and DCO requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

✅ Gradle check result for b129421: SUCCESS

@codecov
Copy link

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.26%. Comparing base (930ae74) to head (b129421).
⚠️ Report is 54 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20252      +/-   ##
============================================
+ Coverage     73.20%   73.26%   +0.05%     
- Complexity    71745    71754       +9     
============================================
  Files          5795     5795              
  Lines        328304   328304              
  Branches      47283    47283              
============================================
+ Hits         240334   240517     +183     
+ Misses        68663    68467     -196     
- Partials      19307    19320      +13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Jan 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ClusterShardLimitIT

1 participant