@nvartolomei
Contributor

Replace the shuffle-based algorithm with rejection-free direct sampling.

Improvements:

  • O(1) instead of O(n) shuffle per call
  • Reuses thread-local RNG instead of creating new std::random_device and std::mt19937 on every invocation
  • Exactly one random draw per selection, no retries
  • No heap allocation (old version used std::vector)
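
The full diff is not shown on this page, so the following is only a minimal sketch of what rejection-free sampling of two distinct shards could look like, not the PR's actual code. It assumes `shard_id` is an unsigned integer, and the name `pick_two_distinct` and the `shard_count` parameter are hypothetical. The second draw comes from a range one element smaller and is shifted past the first pick, so the two values can never collide and no retry loop is needed.

```cpp
#include <cassert>
#include <random>
#include <utility>

// Hypothetical stand-in for ss::shard_id (assumed to be an unsigned integer).
using shard_id = unsigned;

// Picks two distinct shards uniformly at random from [0, shard_count).
// Precondition: shard_count >= 2.
inline std::pair<shard_id, shard_id> pick_two_distinct(shard_id shard_count) {
    assert(shard_count >= 2);
    using dist_t = std::uniform_int_distribution<shard_id>;

    // Seeded once per thread; later calls reuse the same engine.
    static thread_local std::mt19937 gen{std::random_device{}()};

    // First pick: any shard.
    const shard_id first = dist_t{0, shard_count - 1}(gen);

    // Second pick: one of the remaining shard_count - 1 slots.
    shard_id second = dist_t{0, shard_count - 2}(gen);

    // Skip over `first` so the two picks can never be equal.
    if (second >= first) {
        ++second;
    }
    return {first, second};
}
```

Each selection costs one distribution call, and nothing is allocated on the heap because no candidate list is built or shuffled.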

🤖 Generated with Claude Code

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x

Release Notes

  • none

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Copilot AI review requested due to automatic review settings December 23, 2025 00:00
@nvartolomei nvartolomei requested review from Lazin, bashtanov and oleiman and removed request for Copilot December 23, 2025 00:00
@vbotbuildovich
Collaborator

CI test results

test results on build#78308
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DatalakeClusterRestoreTest | test_restore_partition_spec | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/78308#019b489f-27a5-4aa1-b898-c26b12eef6a2 | FLAKY | 10/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0042, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec |
| WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/78308#019b489f-ec85-4df2-914c-fb66d841a83a | FLAKY | 9/11 | Test PASSES after retries. No significant increase in flaky rate (baseline=0.0697, p0=0.5145, reject_threshold=0.0100. adj_baseline=0.1949, p1=0.3915, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

[[nodiscard]] std::pair<ss::shard_id, ss::shard_id> pick_two_random_shards() {
    using dist_t = std::uniform_int_distribution<ss::shard_id>;

    static thread_local std::mt19937 gen{std::random_device{}()};
Contributor Author

gemini

Random Number Generation: std::random_device is often implemented as a read from /dev/urandom (syscall), which is too slow for a "Zero-Cost" abstraction, and std::mt19937 is large (2.5KB state).

Fix: Use a lightweight PRNG (like PCG or Xorshift) or simply a thread_local std::default_random_engine seeded once (not on every jitter call).
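
To make the suggestion concrete, here is a rough sketch of a small engine that is seeded once per thread and can be passed to standard distributions. The names `xorshift64` and `local_rng` are hypothetical and are not part of the PR or of any existing library. It uses Marsaglia's 13/7/17 xorshift variant, carries 8 bytes of state, and is not a vetted generator. A production version would more likely reach for PCG, xoshiro, or an existing in-house engine.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <random>

// Hypothetical minimal xorshift64 engine with 8 bytes of state.
// It provides result_type, min(), max() and operator(), so it can be
// used with the standard <random> distributions.
struct xorshift64 {
    using result_type = std::uint64_t;

    // The state must never be zero, so a zero seed is replaced by a constant.
    explicit xorshift64(std::uint64_t seed)
      : state(seed != 0 ? seed : 0x9E3779B97F4A7C15ull) {}

    // A nonzero state stays nonzero, so outputs lie in [1, 2^64 - 1].
    static constexpr result_type min() { return 1; }
    static constexpr result_type max() {
        return std::numeric_limits<result_type>::max();
    }

    result_type operator()() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }

    std::uint64_t state;
};

// Returns this thread's engine. std::random_device is consulted only
// the first time a given thread calls this function.
inline xorshift64& local_rng() {
    static thread_local xorshift64 gen{
      (std::uint64_t{std::random_device{}()} << 32) | std::random_device{}()};
    return gen;
}
```

A call site would then look like `std::uniform_int_distribution<unsigned>{0, 7}(local_rng())`.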

[[nodiscard]] std::pair<ss::shard_id, ss::shard_id> pick_two_random_shards() {
    using dist_t = std::uniform_int_distribution<ss::shard_id>;

    static thread_local std::mt19937 gen{std::random_device{}()};
Member

Can we use our random generator library? Travis updated it a few months ago to have different seeding policies to improve reproducibility, along with other improvements.
