fix: prevent GET retry storms cycling through same peers #3584
iduartgomez merged 4 commits into main
Conversation
…eers

Three changes to prevent the retry storm pattern where a single GET transaction generates 20-40+ requests cycling through the same peers:

1. Reduce HTL on retry instead of resetting to `max_hops_to_live`. Retries target peers at similar distance, so full-depth traversal wastes network resources. Now uses `min(max_htl, current_hop)` with a floor of 3 hops.
2. Filter fallback peers through the visited bloom filter, not just `tried_peers`. The `tried_peers` HashSet only tracks the current hop, so peers tried at previous hops could be re-injected as fallbacks. The bloom filter tracks ALL visited peers across all hops.
3. Mark retry targets in the visited bloom filter so downstream peers won't forward back to them and future retries won't re-select them.

Simulation showed 48 transactions with retry storms (max 41 requests per tx) even in ideal conditions. Production telemetry showed 787 transactions (8%) with 20-30 requests each.

Also fixes pre-existing clippy warnings in get.rs and subscribe/tests.rs.

Refs #3570
✅ Rule Review: No issues found. No rule violations detected — merge is not blocked by this check.
Address rule review: hardcoded `3` in retry HTL calculation should be a named constant per code-style.md numeric thresholds rule.
Address review findings from PR #3584:

- The original HTL reduction (`min(max_htl, current_hop)`) was a no-op at the originator, where `current_hop == max_hops_to_live`. Changed to divide by attempts_at_hop: `max_htl / attempts`, floored at `MIN_RETRY_HTL`. Each successive retry now has progressively shorter reach.
- Add 4 new unit tests covering the review gaps:
  - `retry_htl_decreases_with_attempts`: verifies HTL reduces each retry
  - `retry_htl_floor_at_min`: verifies the `MIN_RETRY_HTL` floor
  - `retry_dbf_fallback_skips_bloom_visited`: verifies bloom filter filtering
  - `retry_marks_target_in_bloom_filter`: verifies bloom filter marking
- Update the `gc_retry_full_flow` test to assert reduced HTL (it was asserting `htl == max_htl`, the old behavior).

51 GET tests pass (47 existing + 4 new). Refs #3570
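The division-based reduction this commit describes can be sketched as follows. The function name, signature, and zero-attempts guard here are illustrative assumptions (the real implementation lives in `get.rs`), but the arithmetic matches the commit message: `max_htl / attempts`, floored at `MIN_RETRY_HTL`.

```rust
/// Sketch of the revised retry-HTL calculation: each successive retry
/// gets progressively shorter reach. MIN_RETRY_HTL = 3 matches the
/// floor named in the PR; `retry_htl` itself is a hypothetical helper,
/// not the crate's real API.
const MIN_RETRY_HTL: usize = 3;

fn retry_htl(max_htl: usize, attempts_at_hop: usize) -> usize {
    // Guard against a zero attempt count, then divide and apply the floor.
    (max_htl / attempts_at_hop.max(1)).max(MIN_RETRY_HTL)
}
```

With `max_htl = 10`, successive retries get HTL 10, 5, 3, 3, … — unlike the original `min(max_htl, current_hop)` form, this shrinks even when the retry happens at the originator.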
sanity left a comment:
Review Findings
Ran four-perspective review (code-first, testing, skeptical, big-picture). The code changes are correct and well-targeted. Three findings worth noting:
1. No direct test coverage for the three new behaviors (all reviewers)
Existing tests pass by coincidence — `make_awaiting_op` hardcodes `current_hop=7` and creates empty bloom filters, so:
- HTL reduction: `min(7, max(7, 3)) = 7 = max_hops_to_live` — same result as the old code
- Bloom filter filtering: empty bloom → no peers filtered
- `mark_visited`: never verified in any assertion
Suggested tests (all pure unit, sub-millisecond):
- HTL reduction: `make_awaiting_op` with `current_hop=4`, retry with `max_htl=10`, assert `htl == 4`
- HTL floor: `current_hop=1`, assert `htl == MIN_RETRY_HTL` (3)
- Bloom filtering: pre-mark a fallback peer in `data.visited`, verify it's excluded from alternatives
- `mark_visited`: after retry, verify `visited.probably_visited(retry_target_addr)` is true in the emitted `GetMsg::Request`
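The two HTL assertions above can be modeled with a minimal, dependency-free sketch of the pre-follow-up reduction formula. `retry_htl` and `MIN_RETRY_HTL` here are stand-ins for the real helpers in `get.rs`, not the crate's actual API:

```rust
/// Stand-in for the original HTL reduction this PR first shipped:
/// min(max_htl, current_hop) with a floor of MIN_RETRY_HTL.
/// Names and signature are illustrative only.
const MIN_RETRY_HTL: usize = 3;

fn retry_htl(max_htl: usize, current_hop: usize) -> usize {
    max_htl.min(current_hop.max(MIN_RETRY_HTL))
}
```

Note that `retry_htl(7, 7) == 7` reproduces the reviewers' point: when `current_hop == max_hops_to_live == 7`, the reduction is a no-op, which is what the follow-up commit's division-based formula addresses.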
2. SUBSCRIBE has the identical three bugs (big-picture + skeptical)
`subscribe.rs` lines 649-737 (`retry_with_next_alternative`) has:
- Line 660: fallback peers only filtered through `tried_peers`, not the bloom filter
- Line 702: retry target not marked in the bloom filter
- Line 726: `htl: max_hops_to_live` resets HTL to max on retry
Understood this PR is scoped to GET ("Refs" not "Closes" #3570), but worth a follow-up.
3. `handle_abort` (line 717) uses raw `current_hop` without the `MIN_RETRY_HTL` floor
Different code path (connection failure vs timeout retry), so the asymmetry may be intentional. Worth a brief comment documenting why if so.
Overall: code is correct and surgical. The test gaps are the main concern — the three behavioral changes could be reverted without any test failing.
[AI-assisted - Claude]
All three review findings addressed:
- Finding 1 (tests + HTL no-op): fixed in d93f172
- Finding 2 (Subscribe): agreed — follow-up. Subscribe has the same pattern at …
- Finding 3 (…)
Apply the identical retry storm prevention from PR #3584 (GET) to the SUBSCRIBE operation's `retry_with_next_alternative`:

1. Filter fallback peers through the visited bloom filter (not just `tried_peers`)
2. Mark retry targets in the bloom filter so downstream won't forward back
3. Reduce HTL on retry (`max_htl / attempts`) instead of resetting to max

Also adds comprehensive tests for all three behaviors, plus updates the existing test that asserted the old full-HTL retry behavior.

Refs #3570
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem
GET operations can generate 20-40+ requests per transaction cycling through the same peers, creating retry storms that waste network resources and contribute to the 73% GET timeout rate (#3570).
Production telemetry shows 787 transactions (8%) with 20-30 requests each. Simulation confirmed 48 transactions with up to 41 requests per tx even in ideal conditions (no latency, no loss).
Root causes:
- `retry_with_next_alternative` resets HTL to `max_hops_to_live`, causing each retry to traverse the full network depth again
- Fallback peers are only filtered through `tried_peers` (current hop), not the bloom filter (all hops), so peers visited at previous hops get re-injected

Solution

Three targeted changes in `retry_with_next_alternative`:

1. Reduce HTL on retry — use `min(max_htl, current_hop)` with a floor of 3, instead of resetting to max. Retries target peers at similar distance, so full-depth traversal is wasteful.
2. Filter fallback peers through the visited bloom filter — in addition to `tried_peers`, also check `visited.probably_visited(addr)` before injecting fallback peers. This prevents re-trying peers from earlier hops.
3. Mark retry targets in the bloom filter — call `visited.mark_visited(addr)` for each retry target, so downstream peers won't forward back and future retries won't re-select them.

Testing
All 47 GET operation unit tests pass. The fix is minimal and surgical — only modifies `retry_with_next_alternative` in `get.rs`.

Diagnostic simulation test (in PR #3583) can verify the retry storm count decreases after this fix.
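The three changes can be sketched together in a toy model. Everything below is an illustrative assumption, not the freenet-core implementation: a `HashSet<u64>` stands in for the cross-hop bloom filter, peers are plain `u64` ids, and the HTL formula uses the follow-up commit's division-based reduction.

```rust
use std::collections::HashSet;

const MIN_RETRY_HTL: usize = 3;

// Toy model: `visited` stands in for the cross-hop bloom filter,
// `tried_peers` for the per-hop set. Types and names are illustrative.
struct RetryState {
    visited: HashSet<u64>,     // all peers seen across all hops
    tried_peers: HashSet<u64>, // peers tried at the current hop
}

impl RetryState {
    /// Pick the next retry target from `fallbacks`, applying the three
    /// fixes: skip visited peers, mark the chosen target as visited,
    /// and reduce HTL instead of resetting it to max.
    fn next_retry(
        &mut self,
        fallbacks: &[u64],
        max_htl: usize,
        attempts: usize,
    ) -> Option<(u64, usize)> {
        // Fix 2: filter through BOTH the per-hop set and the cross-hop filter.
        let target = fallbacks
            .iter()
            .copied()
            .find(|p| !self.tried_peers.contains(p) && !self.visited.contains(p))?;
        // Fix 3: mark the target so downstream peers won't forward back
        // and future retries won't re-select it.
        self.visited.insert(target);
        self.tried_peers.insert(target);
        // Fix 1: each successive retry gets progressively shorter reach.
        let htl = (max_htl / attempts.max(1)).max(MIN_RETRY_HTL);
        Some((target, htl))
    }
}
```

Once the fallback list is exhausted of unvisited peers, `next_retry` returns `None` rather than cycling back through peers from earlier hops — the behavior that produced the 20-40+ request storms.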
Fixes
Refs #3570