
fix: prevent GET retry storms cycling through same peers #3584

Merged
iduartgomez merged 4 commits into main from fix/get-retry-storm-3570 on Mar 17, 2026

Conversation

@iduartgomez (Collaborator)

Problem

GET operations can generate 20-40+ requests per transaction cycling through the same peers, creating retry storms that waste network resources and contribute to the 73% GET timeout rate (#3570).

Production telemetry shows 787 transactions (8%) with 20-30 requests each. Simulation confirmed 48 transactions with up to 41 requests per tx even in ideal conditions (no latency, no loss).

Root causes:

  1. retry_with_next_alternative resets HTL to max_hops_to_live, causing each retry to traverse the full network depth again
  2. Fallback peers are only filtered through tried_peers (current hop), not the bloom filter (all hops), so peers visited at previous hops get re-injected
  3. Retry targets aren't marked in the visited bloom filter, so downstream peers can forward back to them

Solution

Three targeted changes in retry_with_next_alternative:

  1. Reduce HTL on retry — use min(max_htl, current_hop) with floor of 3, instead of resetting to max. Retries target peers at similar distance, so full-depth traversal is wasteful.

  2. Filter fallback peers through visited bloom filter — in addition to tried_peers, also check visited.probably_visited(addr) before injecting fallback peers. This prevents re-trying peers from earlier hops.

  3. Mark retry targets in bloom filter — call visited.mark_visited(addr) for each retry target, so downstream peers won't forward back and future retries won't re-select them.
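The three changes can be sketched as below. The names `retry_htl`, `pick_retry_target`, and the HashSet-backed `VisitedFilter` are illustrative stand-ins, not the actual types in `get.rs` (the real visited filter is a probabilistic bloom filter, hence `probably_visited`):

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

/// Stand-in for the real visited bloom filter; a HashSet gives exact
/// membership, which is enough to illustrate the logic.
#[derive(Default)]
struct VisitedFilter(HashSet<SocketAddr>);

impl VisitedFilter {
    fn probably_visited(&self, addr: &SocketAddr) -> bool {
        self.0.contains(addr)
    }
    fn mark_visited(&mut self, addr: SocketAddr) {
        self.0.insert(addr);
    }
}

/// Change 1: floor for the reduced retry HTL.
const MIN_RETRY_HTL: usize = 3;

/// Change 1: cap retry HTL at the current hop instead of resetting to max.
fn retry_htl(max_htl: usize, current_hop: usize) -> usize {
    max_htl.min(current_hop).max(MIN_RETRY_HTL)
}

/// Changes 2 and 3: filter fallbacks through both tried_peers and the
/// visited filter, then mark the chosen target as visited.
fn pick_retry_target(
    fallbacks: &[SocketAddr],
    tried_peers: &HashSet<SocketAddr>,
    visited: &mut VisitedFilter,
) -> Option<SocketAddr> {
    let target = fallbacks
        .iter()
        .find(|&&a| !tried_peers.contains(&a) && !visited.probably_visited(&a))
        .copied()?;
    visited.mark_visited(target); // change 3: downstream won't forward back
    Some(target)
}

fn main() {
    // Change 1: retrying from hop 4 with max 10 keeps reach at 4, not 10.
    assert_eq!(retry_htl(10, 4), 4);
    // Floor: very shallow hops still get MIN_RETRY_HTL.
    assert_eq!(retry_htl(10, 1), 3);

    let a: SocketAddr = "10.0.0.1:31337".parse().unwrap();
    let b: SocketAddr = "10.0.0.2:31337".parse().unwrap();
    let mut visited = VisitedFilter::default();
    visited.mark_visited(a); // `a` was visited at an earlier hop
    let tried = HashSet::new();
    // Change 2: `a` is skipped via the visited filter; `b` is picked
    // and marked (change 3).
    let target = pick_retry_target(&[a, b], &tried, &mut visited);
    assert_eq!(target, Some(b));
    assert!(visited.probably_visited(&b));
}
```

A real bloom filter can report false positives, so this filtering may occasionally skip a peer that was never visited; that trade-off is acceptable here because skipping a fresh peer merely picks a different alternative.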

Testing

All 47 GET operation unit tests pass. The fix is minimal and surgical — only modifies retry_with_next_alternative in get.rs.

cargo test -p freenet --lib -- operations::get::tests
# 47 passed; 0 failed

Diagnostic simulation test (in PR #3583) can verify the retry storm count decreases after this fix.

Fixes

Refs #3570

fix: prevent GET retry storms cycling through same peers

Three changes to prevent the retry storm pattern where a single GET
transaction generates 20-40+ requests cycling through the same peers:

1. Reduce HTL on retry instead of resetting to max_hops_to_live.
   Retries target peers at similar distance, so full-depth traversal
   wastes network resources. Now uses min(max_htl, current_hop) with
   a floor of 3 hops.

2. Filter fallback peers through the visited bloom filter, not just
   tried_peers. The tried_peers HashSet only tracks the current hop,
   so peers tried at previous hops could be re-injected as fallbacks.
   The bloom filter tracks ALL visited peers across all hops.

3. Mark retry targets in the visited bloom filter so downstream peers
   won't forward back to them and future retries won't re-select them.

Simulation showed 48 transactions with retry storms (max 41 requests
per tx) even in ideal conditions. Production telemetry showed 787
transactions (8%) with 20-30 requests each.

Also fixes pre-existing clippy warnings in get.rs and subscribe/tests.rs.

Refs #3570

github-actions bot commented Mar 17, 2026

✅ Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 2 (crates/core/src/operations/get.rs, crates/core/src/operations/subscribe/tests.rs)

No rule violations detected — merge is not blocked by this check.


Checks performed and rationale:

| Check | Result |
| --- | --- |
| `std::time::Instant::now()` / `tokio::time::sleep()` in new code | None found |
| `rand::random()` / `rand::thread_rng()` in new code | None found |
| `tokio::net::UdpSocket` in new code | None found |
| Push-before-send ordering | `retry_with_next_alternative` returns `(GetOp, GetMsg)` — ordering enforced at call site, no violation |
| Deleted/commented-out tests | None; the gc-retry assertion is updated to reflect the semantic change, not suppressed |
| `.unwrap()` in production code | None in production paths; test code uses `.unwrap_or_else(` |
| Retry/backoff jitter | HTL reduction is not a timed backoff loop — jitter rule doesn't apply |
| `MIN_RETRY_HTL` as a hardcoded threshold | Properly defined as a named constant with doc comment explaining the topological rationale; not derived from a configurable value, so the "derive from config" rule doesn't apply |
| Catch-all `_ =>` in match arms | The PR removes catch-all `other =>` arms and replaces them with exhaustive matches — an improvement |
| New public APIs without doc comments | No new public APIs added |
| Test coverage for changed paths | 4 new targeted unit tests cover all three new behaviors (HTL reduction, bloom-filter fallback filtering, bloom-filter marking) |

Automated review against .claude/rules/. Critical and Warning findings block merge — check boxes or post /ack to acknowledge.

Address rule review: hardcoded `3` in retry HTL calculation should be
a named constant per code-style.md numeric thresholds rule.

Address review findings from PR #3584:

- The original HTL reduction (min(max_htl, current_hop)) was a no-op
  at the originator where current_hop == max_hops_to_live. Changed to
  divide by attempts_at_hop: max_htl / attempts, floored at MIN_RETRY_HTL.
  Each successive retry now has progressively shorter reach.

- Add 4 new unit tests covering the review gaps:
  - retry_htl_decreases_with_attempts: verifies HTL reduces each retry
  - retry_htl_floor_at_min: verifies MIN_RETRY_HTL floor
  - retry_dbf_fallback_skips_bloom_visited: verifies bloom filter filtering
  - retry_marks_target_in_bloom_filter: verifies bloom filter marking

- Update gc_retry_full_flow test to assert reduced HTL (was asserting
  htl == max_htl which is the old behavior).

51 GET tests pass (47 existing + 4 new).

Refs #3570
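The revised reduction described above can be sketched as follows; `retry_htl` and its parameter names are illustrative, not the exact signature in `get.rs`:

```rust
/// Floor on retry HTL, as named in the PR.
const MIN_RETRY_HTL: usize = 3;

/// Revised reduction: divide by the attempt count so reach shrinks on
/// each successive retry, even at the originator where
/// current_hop == max_hops_to_live (where the old min() form was a no-op).
fn retry_htl(max_htl: usize, attempts_at_hop: usize) -> usize {
    (max_htl / attempts_at_hop.max(1)).max(MIN_RETRY_HTL)
}

fn main() {
    // With max_htl = 10: attempt 1 → 10, attempt 2 → 5,
    // attempts 3+ → 3 (floored at MIN_RETRY_HTL).
    assert_eq!(retry_htl(10, 1), 10);
    assert_eq!(retry_htl(10, 2), 5);
    assert_eq!(retry_htl(10, 3), 3);
    assert_eq!(retry_htl(10, 4), 3);
}
```

Integer division makes the reach strictly non-increasing across attempts, which is what the `retry_htl_decreases_with_attempts` test asserts.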
@sanity (Collaborator) left a comment


Review Findings

Ran four-perspective review (code-first, testing, skeptical, big-picture). The code changes are correct and well-targeted. Three findings worth noting:

1. No direct test coverage for the three new behaviors (all reviewers)

Existing tests pass by coincidence — make_awaiting_op hardcodes current_hop=7 and creates empty bloom filters, so:

  • HTL reduction: min(7, max(7, 3)) = 7 = max_hops_to_live — same result as old code
  • Bloom filter filtering: empty bloom → no peers filtered
  • mark_visited: never verified in any assertion

Suggested tests (all pure unit, sub-millisecond):

  • HTL reduction: make_awaiting_op with current_hop=4, retry with max_htl=10, assert htl == 4
  • HTL floor: current_hop=1, assert htl == MIN_RETRY_HTL (3)
  • Bloom filtering: Pre-mark a fallback peer in data.visited, verify it's excluded from alternatives
  • mark_visited: After retry, verify visited.probably_visited(retry_target_addr) is true in the emitted GetMsg::Request
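The four suggested checks can be sketched as one self-contained unit, using the HTL formula as this PR originally wrote it and a HashSet stand-in for the bloom filter; `retry_htl` and `Visited` are illustrative names, and the real tests would go through `make_awaiting_op` and the emitted `GetMsg::Request` instead:

```rust
use std::collections::HashSet;
use std::net::SocketAddr;

const MIN_RETRY_HTL: usize = 3;

// HTL formula as described at this point in the review:
// min(max_htl, current_hop) with a floor of MIN_RETRY_HTL.
fn retry_htl(max_htl: usize, current_hop: usize) -> usize {
    max_htl.min(current_hop).max(MIN_RETRY_HTL)
}

// Exact-membership stand-in for the visited bloom filter.
#[derive(Default)]
struct Visited(HashSet<SocketAddr>);
impl Visited {
    fn probably_visited(&self, a: &SocketAddr) -> bool {
        self.0.contains(a)
    }
    fn mark_visited(&mut self, a: SocketAddr) {
        self.0.insert(a);
    }
}

fn main() {
    // HTL reduction: current_hop=4, max_htl=10 → htl == 4.
    assert_eq!(retry_htl(10, 4), 4);
    // HTL floor: current_hop=1 → htl == MIN_RETRY_HTL.
    assert_eq!(retry_htl(10, 1), MIN_RETRY_HTL);

    // Bloom filtering: a pre-marked fallback must be excluded.
    let fallback: SocketAddr = "10.0.0.5:31337".parse().unwrap();
    let fresh: SocketAddr = "10.0.0.6:31337".parse().unwrap();
    let mut visited = Visited::default();
    visited.mark_visited(fallback);
    let alternatives: Vec<_> = [fallback, fresh]
        .into_iter()
        .filter(|a| !visited.probably_visited(a))
        .collect();
    assert_eq!(alternatives, vec![fresh]);

    // mark_visited: after a retry picks `fresh`, it must read as visited.
    visited.mark_visited(fresh);
    assert!(visited.probably_visited(&fresh));
}
```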

2. SUBSCRIBE has the identical three bugs (big-picture + skeptical)

subscribe.rs lines 649-737 (retry_with_next_alternative) has:

  • Line 660: fallback peers only filtered through tried_peers, not bloom filter
  • Line 702: retry target not marked in bloom filter
  • Line 726: htl: max_hops_to_live resets HTL to max on retry

Understood this PR is scoped to GET ("Refs" not "Closes" #3570), but worth a follow-up.

3. handle_abort (line 717) uses raw current_hop without MIN_RETRY_HTL floor

Different code path (connection failure vs timeout retry), so the asymmetry may be intentional. Worth a brief comment documenting why if so.


Overall: code is correct and surgical. The test gaps are the main concern — the three behavioral changes could be reverted without any test failing.

[AI-assisted - Claude]

Add comment explaining why handle_abort uses current_hop directly
instead of the attempts-based reduction from retry_with_next_alternative.
Connection aborts are immediate failures, not timeout-based retries.

Addresses review finding #3 from PR #3584.
@iduartgomez (Collaborator, Author)

All three review findings addressed:

Finding 1 (tests + HTL no-op): Fixed in d93f172:

  • Changed HTL formula to max_htl / attempts_at_hop (works at originator where current_hop == max_hops_to_live)
  • Added 4 regression tests covering HTL reduction, MIN_RETRY_HTL floor, bloom filter filtering, and bloom filter marking
  • 51 tests pass (47 existing + 4 new)

Finding 2 (Subscribe): Agreed — follow-up. Subscribe has the same pattern at subscribe.rs:649.

Finding 3 (handle_abort): Added clarifying comment in 25f7a08 explaining that connection aborts are immediate failures, not timeout-based retries.

@iduartgomez iduartgomez enabled auto-merge March 17, 2026 14:26
@iduartgomez iduartgomez added this pull request to the merge queue Mar 17, 2026
Merged via the queue into main with commit cb51c62 Mar 17, 2026
10 of 11 checks passed
@iduartgomez iduartgomez deleted the fix/get-retry-storm-3570 branch March 17, 2026 14:55
sanity added a commit that referenced this pull request Mar 17, 2026
Apply the identical retry storm prevention from PR #3584 (GET) to the
SUBSCRIBE operation's retry_with_next_alternative:

1. Filter fallback peers through visited bloom filter (not just tried_peers)
2. Mark retry targets in bloom filter so downstream won't forward back
3. Reduce HTL on retry (max_htl / attempts) instead of resetting to max

Also adds comprehensive tests for all three behaviors, plus updates the
existing test that asserted the old full-HTL retry behavior.

Refs #3570

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>