fix: prevent GET retry storms cycling through same peers #3584
iduartgomez merged 4 commits into main
Conversation
…eers

Three changes to prevent the retry storm pattern where a single GET transaction generates 20-40+ requests cycling through the same peers:

1. Reduce HTL on retry instead of resetting to `max_hops_to_live`. Retries target peers at similar distance, so full-depth traversal wastes network resources. Now uses `min(max_htl, current_hop)` with a floor of 3 hops.
2. Filter fallback peers through the visited bloom filter, not just `tried_peers`. The `tried_peers` HashSet only tracks the current hop, so peers tried at previous hops could be re-injected as fallbacks. The bloom filter tracks ALL visited peers across all hops.
3. Mark retry targets in the visited bloom filter so downstream peers won't forward back to them and future retries won't re-select them.

Simulation showed 48 transactions with retry storms (max 41 requests per tx) even in ideal conditions. Production telemetry showed 787 transactions (8%) with 20-30 requests each.

Also fixes pre-existing clippy warnings in get.rs and subscribe/tests.rs.

Refs #3570
✅ Rule Review: No issues found. No rule violations detected — merge is not blocked by this check.
Address rule review: hardcoded `3` in retry HTL calculation should be a named constant per code-style.md numeric thresholds rule.
Address review findings from PR #3584:

- The original HTL reduction (`min(max_htl, current_hop)`) was a no-op at the originator, where `current_hop == max_hops_to_live`. Changed to divide by attempts_at_hop: `max_htl / attempts`, floored at `MIN_RETRY_HTL`. Each successive retry now has progressively shorter reach.
- Add 4 new unit tests covering the review gaps:
  - `retry_htl_decreases_with_attempts`: verifies HTL reduces each retry
  - `retry_htl_floor_at_min`: verifies the `MIN_RETRY_HTL` floor
  - `retry_dbf_fallback_skips_bloom_visited`: verifies bloom filter filtering
  - `retry_marks_target_in_bloom_filter`: verifies bloom filter marking
- Update the `gc_retry_full_flow` test to assert reduced HTL (it was asserting `htl == max_htl`, the old behavior).

51 GET tests pass (47 existing + 4 new). Refs #3570
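The division-based reduction this commit describes can be sketched as follows. The function name, signature, and zero-attempts guard here are illustrative assumptions (the real implementation lives in `get.rs`), but the arithmetic matches the commit message: `max_htl / attempts`, floored at `MIN_RETRY_HTL`.

```rust
/// Sketch of the revised retry-HTL calculation: each successive retry
/// gets progressively shorter reach. MIN_RETRY_HTL = 3 matches the
/// floor named in the PR; `retry_htl` itself is a hypothetical helper,
/// not the crate's real API.
const MIN_RETRY_HTL: usize = 3;

fn retry_htl(max_htl: usize, attempts_at_hop: usize) -> usize {
    // Guard against a zero attempt count, then divide and apply the floor.
    (max_htl / attempts_at_hop.max(1)).max(MIN_RETRY_HTL)
}
```

With `max_htl = 10`, successive retries get HTL 10, 5, 3, 3, … — unlike the original `min(max_htl, current_hop)` form, this shrinks even when the retry happens at the originator.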
sanity left a comment:
Review Findings
Ran four-perspective review (code-first, testing, skeptical, big-picture). The code changes are correct and well-targeted. Three findings worth noting:
1. No direct test coverage for the three new behaviors (all reviewers)
Existing tests pass by coincidence — `make_awaiting_op` hardcodes `current_hop=7` and creates empty bloom filters, so:
- HTL reduction: `min(7, max(7, 3)) = 7 = max_hops_to_live` — same result as the old code
- Bloom filter filtering: empty bloom → no peers filtered
- `mark_visited`: never verified in any assertion
Suggested tests (all pure unit, sub-millisecond):
- HTL reduction: `make_awaiting_op` with `current_hop=4`, retry with `max_htl=10`, assert `htl == 4`
- HTL floor: `current_hop=1`, assert `htl == MIN_RETRY_HTL` (3)
- Bloom filtering: pre-mark a fallback peer in `data.visited`, verify it's excluded from alternatives
- `mark_visited`: after retry, verify `visited.probably_visited(retry_target_addr)` is true in the emitted `GetMsg::Request`
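The two HTL assertions above can be modeled with a minimal, dependency-free sketch of the pre-follow-up reduction formula. `retry_htl` and `MIN_RETRY_HTL` here are stand-ins for the real helpers in `get.rs`, not the crate's actual API:

```rust
/// Stand-in for the original HTL reduction this PR first shipped:
/// min(max_htl, current_hop) with a floor of MIN_RETRY_HTL.
/// Names and signature are illustrative only.
const MIN_RETRY_HTL: usize = 3;

fn retry_htl(max_htl: usize, current_hop: usize) -> usize {
    max_htl.min(current_hop.max(MIN_RETRY_HTL))
}
```

Note that `retry_htl(7, 7) == 7` reproduces the reviewers' point: when `current_hop == max_hops_to_live == 7`, the reduction is a no-op, which is what the follow-up commit's division-based formula addresses.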
2. SUBSCRIBE has the identical three bugs (big-picture + skeptical)
`subscribe.rs` lines 649-737 (`retry_with_next_alternative`) has:
- Line 660: fallback peers only filtered through `tried_peers`, not the bloom filter
- Line 702: retry target not marked in the bloom filter
- Line 726: `htl: max_hops_to_live` resets HTL to max on retry
Understood this PR is scoped to GET ("Refs" not "Closes" #3570), but worth a follow-up.
3. `handle_abort` (line 717) uses raw `current_hop` without the `MIN_RETRY_HTL` floor
Different code path (connection failure vs timeout retry), so the asymmetry may be intentional. Worth a brief comment documenting why if so.
Overall: code is correct and surgical. The test gaps are the main concern — the three behavioral changes could be reverted without any test failing.
[AI-assisted - Claude]
All three review findings addressed:
- Finding 1 (tests + HTL no-op): fixed in d93f172
- Finding 2 (Subscribe): agreed — follow-up. Subscribe has the same pattern at …
- Finding 3 (…)
Apply the identical retry storm prevention from PR #3584 (GET) to the SUBSCRIBE operation's `retry_with_next_alternative`:

1. Filter fallback peers through the visited bloom filter (not just `tried_peers`)
2. Mark retry targets in the bloom filter so downstream won't forward back
3. Reduce HTL on retry (`max_htl / attempts`) instead of resetting to max

Also adds comprehensive tests for all three behaviors, plus updates the existing test that asserted the old full-HTL retry behavior.

Refs #3570
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Problem
GET operations can generate 20-40+ requests per transaction cycling through the same peers, creating retry storms that waste network resources and contribute to the 73% GET timeout rate (#3570).
Production telemetry shows 787 transactions (8%) with 20-30 requests each. Simulation confirmed 48 transactions with up to 41 requests per tx even in ideal conditions (no latency, no loss).
Root causes:
- `retry_with_next_alternative` resets HTL to `max_hops_to_live`, causing each retry to traverse the full network depth again
- Fallback peers are only filtered through `tried_peers` (current hop), not the bloom filter (all hops), so peers visited at previous hops get re-injected

Solution

Three targeted changes in `retry_with_next_alternative`:

1. Reduce HTL on retry — use `min(max_htl, current_hop)` with a floor of 3, instead of resetting to max. Retries target peers at similar distance, so full-depth traversal is wasteful.
2. Filter fallback peers through the visited bloom filter — in addition to `tried_peers`, also check `visited.probably_visited(addr)` before injecting fallback peers. This prevents re-trying peers from earlier hops.
3. Mark retry targets in the bloom filter — call `visited.mark_visited(addr)` for each retry target, so downstream peers won't forward back and future retries won't re-select them.

Testing
All 47 GET operation unit tests pass. The fix is minimal and surgical — only modifies `retry_with_next_alternative` in `get.rs`.

Diagnostic simulation test (in PR #3583) can verify the retry storm count decreases after this fix.
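The three changes can be sketched together in a toy model. Everything below is an illustrative assumption, not the freenet-core implementation: a `HashSet<u64>` stands in for the cross-hop bloom filter, peers are plain `u64` ids, and the HTL formula uses the follow-up commit's division-based reduction.

```rust
use std::collections::HashSet;

const MIN_RETRY_HTL: usize = 3;

// Toy model: `visited` stands in for the cross-hop bloom filter,
// `tried_peers` for the per-hop set. Types and names are illustrative.
struct RetryState {
    visited: HashSet<u64>,     // all peers seen across all hops
    tried_peers: HashSet<u64>, // peers tried at the current hop
}

impl RetryState {
    /// Pick the next retry target from `fallbacks`, applying the three
    /// fixes: skip visited peers, mark the chosen target as visited,
    /// and reduce HTL instead of resetting it to max.
    fn next_retry(
        &mut self,
        fallbacks: &[u64],
        max_htl: usize,
        attempts: usize,
    ) -> Option<(u64, usize)> {
        // Fix 2: filter through BOTH the per-hop set and the cross-hop filter.
        let target = fallbacks
            .iter()
            .copied()
            .find(|p| !self.tried_peers.contains(p) && !self.visited.contains(p))?;
        // Fix 3: mark the target so downstream peers won't forward back
        // and future retries won't re-select it.
        self.visited.insert(target);
        self.tried_peers.insert(target);
        // Fix 1: each successive retry gets progressively shorter reach.
        let htl = (max_htl / attempts.max(1)).max(MIN_RETRY_HTL);
        Some((target, htl))
    }
}
```

Once the fallback list is exhausted of unvisited peers, `next_retry` returns `None` rather than cycling back through peers from earlier hops — the behavior that produced the 20-40+ request storms.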
Fixes
Refs #3570