Conversation
The cwnd_wait loop in send_stream and pipe_stream had no timeout, causing indefinite hangs when the FixedRateController's loss_pause was active and retransmission ACKs never arrived (dead/degraded outbound connection). For large transfers (737 fragments for ~1MB), this manifested as "stream assembly timed out after 30s" on the receiver with no diagnostic from the sender.

Two fixes:

1. Add CWND_WAIT_TIMEOUT (20s) to both send_stream and pipe_stream. The sender now fails with a clear diagnostic before the receiver's 30s inactivity timeout fires.

2. Fix FixedRateController::current_cwnd() during loss_pause: both flightsize() and current_cwnd() read the same AtomicUsize, making the check `flightsize + packet_size <= cwnd` permanently false (X + positive <= X). Adding a one-packet margin allows flow restart when ACKs reduce flightsize, without waiting for loss_pause to fully clear.

Closes #3608

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
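The "permanently false" check described in this commit message can be reproduced in isolation. The sketch below is a minimal illustration only: `PausedController` and `can_send` are hypothetical names, and the only behavior taken from the commit message is that, during loss_pause, `flightsize()` and `current_cwnd()` read the same `AtomicUsize`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a controller that is in loss_pause.
struct PausedController {
    flightsize: AtomicUsize,
}

impl PausedController {
    fn flightsize(&self) -> usize {
        self.flightsize.load(Ordering::Acquire)
    }

    /// During loss_pause the window is reported as the current flightsize,
    /// so both accessors read the same atomic.
    fn current_cwnd(&self) -> usize {
        self.flightsize.load(Ordering::Acquire)
    }

    /// The sender-side check quoted in the commit message.
    fn can_send(&self, packet_size: usize) -> bool {
        self.flightsize() + packet_size <= self.current_cwnd()
    }
}
```

For any positive `packet_size` the check fails regardless of how far ACKs have reduced flightsize, so only clearing loss_pause itself can restart the flow.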
🔍 Rule Review: cwnd timeout for degraded connections
Critical: None. Warnings: None.
No Critical or Warning findings; merge is not blocked by this check.
The LOSS_PAUSE_CWND_MARGIN constant doc and CWND_WAIT_TIMEOUT doc already explain the rationale; inline comments were restating the same information. Replaced with brief references to the constants. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The margin (1500 bytes) effectively disabled loss_pause entirely: since each fragment is ~1422 bytes, the margin always allowed one fragment through, and since cwnd dynamically tracks flightsize, every subsequent fragment also passed. Loss_pause became a no-op.

The CWND_WAIT_TIMEOUT (20s) alone is the correct fix:
- Normal loss: ACKs clear loss_pause in 1-2s, sends resume
- Dead connection: timeout fires at 20s, clean error

Also adds a regression test that verifies send_stream returns ConnectionClosed and fires completion_tx when cwnd wait exceeds the timeout, and fixes a telemetry overestimate (bytes_sent was not capped at bytes_to_send).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
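The reasoning for reverting the margin can be checked with a small simulation. This is a sketch under stated assumptions: `paused_cwnd` and `fragments_admitted` are hypothetical helpers, the paused window is modeled as `flightsize + margin`, and no ACKs arrive during the run. The 1500-byte margin and ~1422-byte fragment size come from the commit message.

```rust
/// Margin value quoted in the commit message.
const LOSS_PAUSE_CWND_MARGIN: usize = 1500;
/// Approximate fragment size quoted in the commit message.
const FRAGMENT_SIZE: usize = 1422;

/// Assumed model: during loss_pause the window tracks flightsize plus a margin.
fn paused_cwnd(flightsize: usize, margin: usize) -> usize {
    flightsize + margin
}

/// Counts how many of `attempts` fragments pass the cwnd check while paused,
/// with no ACKs arriving to reduce flightsize.
fn fragments_admitted(initial_flightsize: usize, margin: usize, attempts: usize) -> usize {
    let mut flightsize = initial_flightsize;
    let mut admitted = 0;
    for _ in 0..attempts {
        if flightsize + FRAGMENT_SIZE <= paused_cwnd(flightsize, margin) {
            // The fragment is sent, flightsize grows, and the window grows with it.
            flightsize += FRAGMENT_SIZE;
            admitted += 1;
        }
    }
    admitted
}
```

With the margin in place every attempt is admitted, because the window moves up with each send; with a margin of zero none are, which is the original blocking behavior.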
Avoid two consecutive clock reads per timeout check — use one snapshot for both the timeout condition and elapsed calculation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
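A minimal sketch of the single-snapshot pattern, using only `std::time`. The helper name `timed_out_at` is hypothetical; the point is that one `now` value feeds both the timeout comparison and the elapsed duration that gets reported.

```rust
use std::time::{Duration, Instant};

/// Returns `Some(elapsed)` once `timeout` has passed since `wait_start`,
/// deriving both the decision and the reported duration from one `now`.
fn timed_out_at(wait_start: Instant, now: Instant, timeout: Duration) -> Option<Duration> {
    let elapsed = now.saturating_duration_since(wait_start);
    (elapsed >= timeout).then_some(elapsed)
}
```

A caller would read the clock once per loop iteration (`let now = Instant::now();`) and pass it in, so the logged elapsed time always matches the value that triggered the timeout.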
iduartgomez
left a comment
address #3608 (comment)
Problem
GET requests for large contracts (~1MB, e.g. the River UI container) consistently fail because streaming transfers die mid-transfer. The responding peer starts sending fragments but they stop arriving, causing stream assembly timeouts (30s).
User impact: River UI cannot load. Users see "GET request timed out after 30s". Affects users with otherwise healthy nodes (16+ ring peers). Multiple diagnostic reports confirm the pattern (E7NJG5, 23TN9A).
Root cause: The cwnd_wait loop in `send_stream` and `pipe_stream` has no timeout. When the `FixedRateController`'s `loss_pause` activates (on packet loss/RTO), it caps cwnd at flightsize, blocking all new data until an ACK clears it. This is correct for transient loss (ACKs arrive in 1-2s), but when the outbound connection degrades (consecutive retransmission failures with exponential RTO backoff: 1s→2s→4s→8s→16s = 31s), the sender stalls for >30s and the receiver's `STREAM_INACTIVITY_TIMEOUT` fires with no diagnostic from the sender.

Large transfers (737 fragments for ~1MB) are disproportionately affected because more fragments = more chances for loss events and longer total transfer time.
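The backoff arithmetic above can be verified with a short calculation. This sketch assumes a 1s initial RTO that doubles on each consecutive failure with no upper cap, as in the sequence quoted in the description; `total_backoff` is a hypothetical helper, and the 30s constant mirrors the receiver timeout named in the text.

```rust
use std::time::Duration;

/// Receiver-side inactivity timeout, as stated in the PR description.
const STREAM_INACTIVITY_TIMEOUT: Duration = Duration::from_secs(30);

/// Total time spent waiting across `retransmissions` consecutive failures,
/// assuming the RTO doubles each time and is never capped.
fn total_backoff(initial_rto: Duration, retransmissions: u32) -> Duration {
    (0..retransmissions)
        .map(|i| initial_rto * 2u32.pow(i))
        .sum()
}
```

Four failures add up to 15s, which is still inside the receiver's window; the fifth pushes the total to 31s, just past the 30s inactivity timeout.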
Approach
Add `CWND_WAIT_TIMEOUT` (20s) to both `send_stream` and `pipe_stream`. Set below the 30s `STREAM_INACTIVITY_TIMEOUT` so the sender fails first with a clear diagnostic message ("cwnd wait timed out — outbound connection likely dead"), rather than the receiver timing out silently.

The fix preserves the existing loss_pause behavior (complete block until ACK clears it) for transient loss recovery, while adding a safety net for dead connections.
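The shape of a bounded cwnd wait can be sketched as follows. This is not the PR's implementation: the real loop is not shown in this conversation, so the sketch uses a synchronous polling loop with only `std`, and `wait_for_cwnd`, `has_room`, `poll_interval`, and the `SendError` enum are hypothetical. The `ConnectionClosed` outcome and the idea of failing before the receiver's 30s timeout come from the description; the logged message only approximates the diagnostic wording.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical error type; the PR reports a `ConnectionClosed` result on timeout.
#[derive(Debug, PartialEq)]
enum SendError {
    ConnectionClosed,
}

/// Polls `has_room` until it reports window space or `timeout` elapses.
/// Returns how long the caller waited on success.
fn wait_for_cwnd(
    mut has_room: impl FnMut() -> bool,
    timeout: Duration,
    poll_interval: Duration,
) -> Result<Duration, SendError> {
    let start = Instant::now();
    loop {
        if has_room() {
            return Ok(start.elapsed());
        }
        // One clock read serves both the timeout check and the diagnostic.
        let elapsed = start.elapsed();
        if elapsed >= timeout {
            eprintln!(
                "cwnd wait timed out after {:?}, outbound connection likely dead",
                elapsed
            );
            return Err(SendError::ConnectionClosed);
        }
        // Never sleep past the deadline; `elapsed < timeout` holds here.
        thread::sleep(poll_interval.min(timeout - elapsed));
    }
}
```

In production the timeout argument would be the 20s `CWND_WAIT_TIMEOUT`; keeping it a parameter lets a test exercise the timeout path in milliseconds rather than waiting out the full interval.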
Testing
- `test_send_stream_cwnd_wait_timeout`: verifies `send_stream` returns `ConnectionClosed` when cwnd wait exceeds timeout, and that `completion_tx` fires on timeout
- Fixes the telemetry `bytes_sent` overestimate (now capped at `bytes_to_send`)

Closes #3608
[AI-assisted - Claude]