[CELEBORN-2222][CIP-14] Support Retrying when createReader failed for CelebornInputStream in CppClient #3583
base: main
@HolyLow @SteNicholas @FMX @RexXiong Could you please help review this PR? Appreciate your help in improving this as needed!
int clientFetchMaxRetriesForEachReplica() const;

Timeout dataIoRetryWait() const;
Could this method name align with CelebornConf#networkIoRetryWaitMs, the I/O retry wait conf used by all modules?
try {
  VLOG(1) << "Create reader for location " << currentLocation->host << ":"
          << currentLocation->fetchPort;
  auto reader = createReader(*currentLocation);
This should check whether the partition location is excluded, which aligns with the logic of CelebornInputStream#createReaderWithRetry.
  return reader;
} catch (const std::exception& e) {
  lastException = std::current_exception();
  fetchChunkRetryCnt_++;
The shuffle client should exclude the failed fetch location.
    std::this_thread::sleep_for(
        std::chrono::milliseconds(retryWait_.count()));
  }
  LOG(WARNING) << "CreatePartitionReader failed " << fetchChunkRetryCnt_
Does this align with the failure handling of CelebornInputStream#createReaderWithRetry?
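To make the exclusion comments above concrete, here is a minimal sketch of how a client could remember failed fetch locations and consult that record before calling createReader. The class FetchExcludedWorkers, its method names, and the expiry window are all hypothetical; they are assumptions for illustration, not code from this PR or from the Java client.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of this PR): remembers workers whose fetch
// failed, keyed by "host:fetchPort", and forgets them after a fixed window.
class FetchExcludedWorkers {
 public:
  using Clock = std::chrono::steady_clock;

  explicit FetchExcludedWorkers(std::chrono::milliseconds expireAfter)
      : expireAfter_(expireAfter) {
    assert(expireAfter.count() >= 0);
  }

  // Record a failed fetch against the given worker.
  void exclude(const std::string& hostAndFetchPort,
               Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> guard(mutex_);
    excludedAt_[hostAndFetchPort] = now;
  }

  // True while the worker is still inside its exclusion window.
  bool isExcluded(const std::string& hostAndFetchPort,
                  Clock::time_point now = Clock::now()) {
    std::lock_guard<std::mutex> guard(mutex_);
    auto it = excludedAt_.find(hostAndFetchPort);
    if (it == excludedAt_.end()) {
      return false;
    }
    if (now - it->second >= expireAfter_) {
      excludedAt_.erase(it);  // Expired entries are dropped lazily.
      return false;
    }
    return true;
  }

 private:
  std::chrono::milliseconds expireAfter_;
  std::mutex mutex_;
  std::unordered_map<std::string, Clock::time_point> excludedAt_;
};
```

In a retry loop, the caller would check isExcluded before attempting createReader for a location and call exclude in the catch block, so that later attempts and other streams skip the failing worker until the window expires.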
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #3583      +/-   ##
==========================================
- Coverage   67.13%   67.04%   -0.09%
==========================================
  Files         357      357
  Lines       21860    21924     +64
  Branches     1943     1949      +6
==========================================
+ Hits        14674    14696     +22
- Misses       6166     6213     +47
+ Partials     1020     1015      -5
Thank you for your comments, @SteNicholas. I will take a look over the next couple of days. I suspect this PR may need some refactoring; I will notify you once it is done.
This PR implements retry support for createReader failures in the C++ client, matching the behavior of the Java implementation. The implementation includes:
Added configuration properties:
Added peer location support methods to PartitionLocation:
Implemented retry logic in createReaderWithRetry():
Added unit tests for new functionality
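
For readers unfamiliar with the pattern, the retry logic summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: Location, Reader, CreateReaderFn, and createReaderWithRetrySketch are hypothetical stand-ins, the retry budget is assumed to double when a peer exists, and the last failure is simply rethrown once the budget is exhausted.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical stand-ins for the client's PartitionLocation and reader types.
struct Location {
  std::string host;
  int fetchPort;
  const Location* peer;  // Non-owning pointer to the replica, or nullptr.
};

struct Reader {
  std::string source;  // Host the reader was created against.
};

using CreateReaderFn = std::function<Reader(const Location&)>;

// Tries to create a reader, alternating between a location and its peer.
// Assumption: with a peer, each side gets maxRetriesPerReplica attempts.
Reader createReaderWithRetrySketch(
    const Location* location,
    const CreateReaderFn& createReader,
    int maxRetriesPerReplica,
    std::chrono::milliseconds retryWait) {
  assert(location != nullptr && maxRetriesPerReplica > 0);
  const int maxAttempts =
      location->peer ? 2 * maxRetriesPerReplica : maxRetriesPerReplica;
  std::exception_ptr lastException;
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    try {
      return createReader(*location);
    } catch (const std::exception&) {
      lastException = std::current_exception();
    }
    if (attempt == maxAttempts) {
      break;  // Budget exhausted, no point in waiting again.
    }
    if (location->peer) {
      // Wait only after both replicas have failed in the same round.
      if (attempt % 2 == 0) {
        std::this_thread::sleep_for(retryWait);
      }
      location = location->peer;
    } else {
      std::this_thread::sleep_for(retryWait);
    }
  }
  // At least one attempt ran and failed, so lastException is set here.
  std::rethrow_exception(lastException);
}
```

Passing createReader as a callable keeps the loop testable without a network. A production version would also log each failure and exclude the failed location, as requested in the review comments.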
How was this patch tested?
Unit tests and compiling