Skip to content

Attempt to fixing hanging reconnect#3261

Open
Ellerbach wants to merge 4 commits intomainfrom
fix-reconnect-hanging
Open

Attempt to fixing hanging reconnect#3261
Ellerbach wants to merge 4 commits intomainfrom
fix-reconnect-hanging

Conversation

@Ellerbach
Copy link
Copy Markdown
Member

@Ellerbach Ellerbach commented Mar 4, 2026

Description

Attempt to fixing hanging reconnect

Motivation and Context

How Has This Been Tested?

Screenshots

Types of changes

  • Improvement (non-breaking change that improves a feature, code or algorithm)
  • Bug fix (non-breaking change which fixes an issue with code or algorithm)
  • New feature (non-breaking change which adds functionality to code)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Config and build (change in the configuration and build system, has no impact on code or features)
  • Dev Containers (changes related with Dev Containers, has no impact on code or features)
  • Dependencies/declarations (update dependencies or assembly declarations and changes associated, has no impact on code or features)
  • Documentation (changes or updates in the documentation, has no impact on code or features)

Checklist

  • My code follows the code style of this project (only if there are changes in source code).
  • My changes require an update to the documentation (there are changes that require the docs website to be updated).
  • I have updated the documentation accordingly (the changes require an update on the docs in this repo).
  • I have read the CONTRIBUTING document.
  • I have tested everything locally and all new and existing tests passed (only if there are changes in source code).

Summary by CodeRabbit

  • Bug Fixes
    • Non-blocking socket connects now respect configured timeouts and avoid busy-waiting, improving reliability.
    • Connect waits poll socket state and surface real socket errors instead of spinning.
    • Send/receive paths yield to the network stack when operations would block, reducing CPU usage.
    • Disposed or closed sockets detected during async waits and return appropriate errors.
    • Improved cleanup to prevent stale event handling and related failures.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 4, 2026

Walkthrough

Adds timeout-driven waits and event-yielding to non-blocking connect/send/recv in the native socket layer: introduces SetupTimeoutFromTicks, select-based polling for writability/readability, WaitEvents yields on EWOULDBLOCK, SO_ERROR checks on failures, disposal checks, and ETIMEDOUT error paths.

Changes

Cohort / File(s) Summary
Non-blocking socket operations
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
BindConnectHelper: adds CLR_INT64 hbTimeout, SetupTimeoutFromTicks, select-based writability polling loop, re-entry/disposed-handle validation, SO_ERROR retrieval on error, and SOCK_ETIMEDOUT handling. SendRecvHelper: on EWOULDBLOCK yields via WaitEvents with timeout and raises ETIMEDOUT on expiry; overall reduces busy-wait spinning and synchronizes with network stack.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant NativeSocket
  participant SelectHelper
  participant WaitEvents
  Caller->>NativeSocket: initiate non-blocking Connect/Send/Recv
  NativeSocket->>NativeSocket: SetupTimeoutFromTicks(hbTimeout)
  loop poll until timeout/disposed/ready
    NativeSocket->>SelectHelper: poll socket writability/readability
    SelectHelper-->>NativeSocket: ready / not ready / error
    alt not ready
      NativeSocket->>WaitEvents: wait with timeout
      WaitEvents-->>NativeSocket: resumed (ready / timeout / disposed)
      alt timeout
        NativeSocket-->>Caller: return SOCK_ETIMEDOUT / error
      else disposed
        NativeSocket-->>Caller: return DISPOSED_HANDLE error
      else ready
        NativeSocket->>NativeSocket: gets SO_ERROR
        NativeSocket-->>Caller: success (S_OK) or error
      end
    else error
      NativeSocket->>NativeSocket: retrieve SO_ERROR
      NativeSocket-->>Caller: error
    else ready
      NativeSocket->>NativeSocket: continue flow
      NativeSocket-->>Caller: success (S_OK)
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Attempt to fixing hanging reconnect' directly addresses the main change—implementing timeout-driven wait mechanisms in socket connection and send/receive operations to prevent hanging reconnect scenarios.
Linked Issues check ✅ Passed The PR implements timeout-driven polling mechanisms in the native socket layer (BindConnectHelper and SendRecvHelper) to prevent blocking operations during MQTT reconnect, directly addressing the hanging reconnect issue described in #1692.
Out of Scope Changes check ✅ Passed All changes are scoped to the native socket implementation (sys_net_native_System_Net_Sockets_NativeSocket.cpp) and directly support the reconnect hanging fix objective without introducing unrelated modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-reconnect-hanging

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 497-500: The code directly clears the global bit
g_CLR_RT_ExecutionEngine.m_raisedEvents &= ~Event_Socket (around usages with
Event_Socket, WaitEvents, stack.m_owningThread, timeout, fRes), which can
consume wakeups for unrelated sockets and cause missed-wakeup races; remove any
direct mutations of m_raisedEvents in the socket helper paths and instead rely
on the execution engine's atomic event handling: call
g_CLR_RT_ExecutionEngine.WaitEvents(...) to perform the wait and let the engine
clear or mask Event_Socket atomically, and if per-socket/state bookkeeping is
needed maintain a local or thread-scoped flag (or use an API on
g_CLR_RT_ExecutionEngine) to mark consumption without touching m_raisedEvents
directly in functions that reference Event_Socket, stack.m_owningThread,
timeout, and fRes.
- Around line 466-511: When Helper__SelectSocket(handle, 1) returns
SOCK_SOCKET_ERROR inside BindConnectHelper, replace the current
ThrowOnError(stack, SOCK_SOCKET_ERROR) usage with a direct query of the
socket-specific error via SOCK_getsockopt(handle, SOL_SOCKET, SO_ERROR, ...) and
pass that error value into ThrowOnError (or a suitable throw helper) so you use
the socket's SO_ERROR rather than a possibly-stale global last-error; also add
the same DISPOSED_HANDLE guard check (as used in accept()) immediately after the
g_CLR_RT_ExecutionEngine.WaitEvents reschedule to avoid operating on a
rescheduled/disposed socket before continuing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: fda0a325-5b93-408a-9175-ddbd6d58e86b

📥 Commits

Reviewing files that changed from the base of the PR and between a2f1127 and 01935e0.

📒 Files selected for processing (1)
  • src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp (1)

471-480: ⚠️ Potential issue | 🟠 Major

m_customState is advanced, but SOCK_connect() still re-runs on re-entry.

Because WaitEvents() re-enters this native method from the top, Line 458 still executes on every wake. Updating m_customState here only changes bookkeeping; it does not stop the pending-connect path from issuing a second SOCK_connect(), which can then bypass the timeout flow created at Line 474.

Based on learnings: WaitEvents() is a cooperative rescheduling mechanism, so first-call-only setup must be guarded with stack.m_customState.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`
around lines 471 - 480, The native method currently advances stack.m_customState
after calling stack.SetupTimeoutFromTicks, but SOCK_connect is still executed on
every re-entry; wrap the one-time setup and the call to SOCK_connect in a guard
based on stack.m_customState so they only run on the first entry. Concretely,
check if (stack.m_customState == 0) before performing the pending-connect path
(including the SOCK_connect call), set stack.m_customState to 1 (or 2 per
existing flow) after the first-call setup so subsequent re-entrances (from
WaitEvents) skip re-invoking SOCK_connect and only execute the
timeout/eval-stack logic (e.g., leave stack.SetupTimeoutFromTicks and the
re-entry handling untouched).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 498-524: The early NANOCLR_SET_AND_LEAVE(S_OK) in the
selectResult==0 path prematurely exits before popping the timeout value and
re-polling the socket, leaving the timeout slot on the eval stack and treating a
global Event_Socket wake as a successful connect; modify the selectResult==0
branch in sys_net_native_System_Net_Sockets_NativeSocket.cpp so that after
WaitEvents returns and fRes is true you first pop the timeout value (call
stack.PopValue()), re-read and revalidate the socket handle (check
socket[FIELD__m_Handle].NumericByRef().s4 against DISPOSED_HANDLE), and only if
the handle is valid continue to re-poll for writability instead of immediately
calling NANOCLR_SET_AND_LEAVE(S_OK) — ensure NANOCLR_SET_AND_LEAVE is only used
after proper cleanup and revalidation to avoid leaving the timeout on the stack
and misinterpreting Event_Socket wakes.

---

Duplicate comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 471-480: The native method currently advances stack.m_customState
after calling stack.SetupTimeoutFromTicks, but SOCK_connect is still executed on
every re-entry; wrap the one-time setup and the call to SOCK_connect in a guard
based on stack.m_customState so they only run on the first entry. Concretely,
check if (stack.m_customState == 0) before performing the pending-connect path
(including the SOCK_connect call), set stack.m_customState to 1 (or 2 per
existing flow) after the first-call setup so subsequent re-entrances (from
WaitEvents) skip re-invoking SOCK_connect and only execute the
timeout/eval-stack logic (e.g., leave stack.SetupTimeoutFromTicks and the
re-entry handling untouched).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6db51845-b936-4e33-9b90-e6efb14e87cf

📥 Commits

Reviewing files that changed from the base of the PR and between 01935e0 and d6d4041.

📒 Files selected for processing (1)
  • src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp (1)

471-534: ⚠️ Potential issue | 🔴 Critical

m_customState is updated too late to prevent re-entered SOCK_connect() calls.

WaitEvents() re-enters this method from the top, so Line 458 still runs before this guard. That means a wake can invoke SOCK_connect() again on an in-progress connect, and a disposed socket can be passed into SOCK_connect() once before Lines 527-534 detect it. Move the DISPOSED_HANDLE check and the first-call-only branch ahead of the initial SOCK_connect() call so re-entry goes straight back into the select/timeout path.

Based on learnings: In nf-interpreter C++ native code, g_CLR_RT_ExecutionEngine.WaitEvents() is a cooperative rescheduling mechanism (not a blocking wait). It suspends the CLR thread and re-enters from the top on the next scheduler cycle. If you use a while (fRes) loop with WaitEvents, ensure you guard the first-call-only setup (e.g., SOCK_connect, PushValueI4) with stack.m_customState so those setups don’t run again on subsequent re-entries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`
around lines 471 - 534, The first-call setup (SOCK_connect and bumping
stack.m_customState) and the DISPOSED_HANDLE check must run before any work that
can be re-entered by g_CLR_RT_ExecutionEngine.WaitEvents(); move the
DISPOSED_HANDLE validation and the “first-call-only” branch that sets
stack.m_customState from 1→2 ahead of the initial SOCK_connect() call (and
ensure stack.SetupTimeoutFromTicks(timeout) still runs only once as currently
intended), so re-entries skip SOCK_connect and immediately enter the
select/timeout loop; update code around SOCK_connect, stack.m_customState,
DISPOSED_HANDLE, WaitEvents, and Helper__SelectSocket to reflect this ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 471-534: The first-call setup (SOCK_connect and bumping
stack.m_customState) and the DISPOSED_HANDLE check must run before any work that
can be re-entered by g_CLR_RT_ExecutionEngine.WaitEvents(); move the
DISPOSED_HANDLE validation and the “first-call-only” branch that sets
stack.m_customState from 1→2 ahead of the initial SOCK_connect() call (and
ensure stack.SetupTimeoutFromTicks(timeout) still runs only once as currently
intended), so re-entries skip SOCK_connect and immediately enter the
select/timeout loop; update code around SOCK_connect, stack.m_customState,
DISPOSED_HANDLE, WaitEvents, and Helper__SelectSocket to reflect this ordering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e04a664e-47aa-40e4-9ed5-7d1dd0d40ba2

📥 Commits

Reviewing files that changed from the base of the PR and between d6d4041 and f86de8d.

📒 Files selected for processing (1)
  • src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp (1)

471-480: ⚠️ Potential issue | 🟠 Major

m_customState is bumped too late to protect the non-blocking connect path.

Line 458 still calls SOCK_connect() on every WaitEvents() re-entry before this guard runs, so the resumed path can hit a second connect() instead of going straight back to the select/poll loop. That can turn an in-progress reconnect into a spurious socket error rather than a retry.

Based on learnings: g_CLR_RT_ExecutionEngine.WaitEvents() is a cooperative rescheduling mechanism, so first-call-only setup like SOCK_connect() must be guarded with stack.m_customState before re-entry can happen.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`
around lines 471 - 480, The non-blocking connect path must mark first-call state
before any cooperative reschedule to prevent a second SOCK_connect() on
re-entry: change the logic so that the first-call-only guard sets
stack.m_customState to the "in-progress" value immediately after performing the
initial SOCK_connect() (or before the call that can trigger WaitEvents()) rather
than after SetupTimeoutFromTicks; i.e., ensure the connect() call is wrapped
like if (stack.m_customState == 0) { /* perform SOCK_connect() */
stack.m_customState = 1; } and keep the existing transition from 1->2 after
SetupTimeoutFromTicks so resumed execution skips calling SOCK_connect() again.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 471-480: The non-blocking connect path must mark first-call state
before any cooperative reschedule to prevent a second SOCK_connect() on
re-entry: change the logic so that the first-call-only guard sets
stack.m_customState to the "in-progress" value immediately after performing the
initial SOCK_connect() (or before the call that can trigger WaitEvents()) rather
than after SetupTimeoutFromTicks; i.e., ensure the connect() call is wrapped
like if (stack.m_customState == 0) { /* perform SOCK_connect() */
stack.m_customState = 1; } and keep the existing transition from 1->2 after
SetupTimeoutFromTicks so resumed execution skips calling SOCK_connect() again.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d8250de5-15ca-4262-8e78-823c3d76ccea

📥 Commits

Reviewing files that changed from the base of the PR and between f86de8d and b0fcf31.

📒 Files selected for processing (1)
  • src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp

@alberk8
Copy link
Copy Markdown
Contributor

alberk8 commented Mar 22, 2026

@Ellerbach, Can the PR be merged or is there any other thing that need fixing or testing?

@Ellerbach
Copy link
Copy Markdown
Member Author

@Ellerbach, Can the PR be merged or is there any other thing that need fixing or testing?

I let @josesimoes decide on this one. I know it helps to fix the issue. But is there a way to get out of it just from the managed side?

@josesimoes
Copy link
Copy Markdown
Member

I'm still concerned about assertig the global flag and the hijacking the wait pattern like this. Want to test this myself. Unfortunately I won't be able to do it during this week. Please be patient.

@networkfusion
Copy link
Copy Markdown
Member

@alberk8 are you able to verify the last commit (including IDF version) that was able to handle this previously. I would like to read up on it when I get the chance, and it would also be useful for the PR description.

@alberk8
Copy link
Copy Markdown
Contributor

alberk8 commented Mar 25, 2026

@alberk8 are you able to verify the last commit (including IDF version) that was able to handle this previously. I would like to read up on it when I get the chance, and it would also be useful for the PR description.

Yes, it did work even with the global flag setting removed (from this PR, the IDF version is V5.5.3 which is the same as the current one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug when reconnecting to MQTT

5 participants