
@afterincomparableyum

This PR implements retry support for createReader failures in the C++ client, matching the behavior of the Java implementation. The implementation includes:

  • Added configuration properties:

    • clientFetchMaxRetriesForEachReplica (default: 3)
    • dataIoRetryWait (default: 5s)
    • clientPushReplicateEnabled (default: false)
  • Added peer location support methods to PartitionLocation:

    • hasPeer() - Check if location has a peer replica
    • getPeer() - Get the peer location
    • hostAndFetchPort() - Get host:port string for logging
  • Implemented retry logic in createReaderWithRetry():

    • Retries up to fetchChunkMaxRetry_ times (doubled when replication is enabled, which is why clientPushReplicateEnabled is added in this PR)
    • Switches to peer location on failure when available
    • Sleeps between retries when both replicas tried or no peer exists
    • Resets retry counter when moving to new location or on success
  • Added unit tests for new functionality
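
The retry flow listed above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `PartitionLocation` and `Reader` are cut-down stand-ins, and the `ReaderFactory` callback and the free-function signature are assumptions made so the example is self-contained. The retry counter is a local variable here, so the "reset on success" step is implicit.

```cpp
#include <chrono>
#include <exception>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical, simplified stand-in for PartitionLocation.
struct PartitionLocation {
  std::string host;
  int fetchPort = 0;
  const PartitionLocation* peer = nullptr;  // non-owning link to the replica

  bool hasPeer() const { return peer != nullptr; }
  const PartitionLocation* getPeer() const { return peer; }
  std::string hostAndFetchPort() const {
    return host + ":" + std::to_string(fetchPort);
  }
};

// Hypothetical reader type; it only records which location served it.
struct Reader {
  std::string source;
};

// Assumed callback that opens a reader and throws on failure.
using ReaderFactory =
    std::function<std::unique_ptr<Reader>(const PartitionLocation&)>;

std::unique_ptr<Reader> createReaderWithRetry(
    const PartitionLocation& location,
    const ReaderFactory& createReader,
    int maxRetriesForEachReplica,
    bool replicateEnabled,
    std::chrono::milliseconds retryWait) {
  // With replication enabled there are two replicas, so the budget doubles.
  const int maxRetry = replicateEnabled ? maxRetriesForEachReplica * 2
                                        : maxRetriesForEachReplica;
  const PartitionLocation* current = &location;
  std::exception_ptr lastException;
  int retryCnt = 0;

  while (retryCnt < maxRetry) {
    try {
      return createReader(*current);
    } catch (const std::exception&) {
      lastException = std::current_exception();
      ++retryCnt;
      if (current->hasPeer()) {
        // Sleep only once both replicas have been tried in this round.
        if (retryCnt % 2 == 0) {
          std::this_thread::sleep_for(retryWait);
        }
        current = current->getPeer();
      } else {
        // No peer to fall back to: wait before retrying the same location.
        std::this_thread::sleep_for(retryWait);
      }
    }
  }
  if (lastException) {
    std::rethrow_exception(lastException);
  }
  throw std::runtime_error("createReaderWithRetry: no attempt was made");
}
```

In this sketch, a caller with a primary whose `peer` points at a replica (and vice versa) gets a reader from the replica on the second attempt if the primary throws. If every attempt fails, the last exception is rethrown after `maxRetry` attempts. Excluding failed locations, as the Java client does, is left out.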

How was this patch tested?

Unit tests and a successful build.

@afterincomparableyum
Author

@HolyLow @SteNicholas @FMX @RexXiong Could you please help review this PR? Appreciate your help in improving this as needed!


int clientFetchMaxRetriesForEachReplica() const;

Timeout dataIoRetryWait() const;
Member


Could this method name align with CelebornConf#networkIoRetryWaitMs, the I/O retry wait configuration used across all modules?

try {
VLOG(1) << "Create reader for location " << currentLocation->host << ":"
<< currentLocation->fetchPort;
auto reader = createReader(*currentLocation);
Member


This should check whether the partition location is excluded, which aligns with the logic of CelebornInputStream#createReaderWithRetry.

return reader;
} catch (const std::exception& e) {
lastException = std::current_exception();
fetchChunkRetryCnt_++;
Member


The shuffle client should exclude the failed fetch location.

std::this_thread::sleep_for(
std::chrono::milliseconds(retryWait_.count()));
}
LOG(WARNING) << "CreatePartitionReader failed " << fetchChunkRetryCnt_
Member


Does this align with the failure handling of CelebornInputStream#createReaderWithRetry?

@codecov

codecov bot commented Jan 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.04%. Comparing base (2dd1b7a) to head (5d32d94).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3583      +/-   ##
==========================================
- Coverage   67.13%   67.04%   -0.09%     
==========================================
  Files         357      357              
  Lines       21860    21924      +64     
  Branches     1943     1949       +6     
==========================================
+ Hits        14674    14696      +22     
- Misses       6166     6213      +47     
+ Partials     1020     1015       -5     


@afterincomparableyum
Author

Thank you for your comments, @SteNicholas. I will take a look over the next couple of days. I suspect this PR needs some refactoring, and I will notify you once it is done.
