Conversation
WalkthroughAdds timeout-driven waits and event-yielding to non-blocking connect/send/recv in the native socket layer: introduces SetupTimeoutFromTicks, select-based polling for writability/readability, WaitEvents yields on EWOULDBLOCK, SO_ERROR checks on failures, disposal checks, and ETIMEDOUT error paths. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant NativeSocket
participant SelectHelper
participant WaitEvents
Caller->>NativeSocket: initiate non-blocking Connect/Send/Recv
NativeSocket->>NativeSocket: SetupTimeoutFromTicks(hbTimeout)
loop poll until timeout/disposed/ready
NativeSocket->>SelectHelper: poll socket writability/readability
SelectHelper-->>NativeSocket: ready / not ready / error
alt not ready
NativeSocket->>WaitEvents: wait with timeout
WaitEvents-->>NativeSocket: resumed (ready / timeout / disposed)
alt timeout
NativeSocket-->>Caller: return SOCK_ETIMEDOUT / error
else disposed
NativeSocket-->>Caller: return DISPOSED_HANDLE error
else ready
NativeSocket->>NativeSocket: gets SO_ERROR
NativeSocket-->>Caller: success (S_OK) or error
end
else error
NativeSocket->>NativeSocket: retrieve SO_ERROR
NativeSocket-->>Caller: error
else ready
NativeSocket->>NativeSocket: continue flow
NativeSocket-->>Caller: success (S_OK)
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 497-500: The code directly clears the global bit
g_CLR_RT_ExecutionEngine.m_raisedEvents &= ~Event_Socket (around usages with
Event_Socket, WaitEvents, stack.m_owningThread, timeout, fRes), which can
consume wakeups for unrelated sockets and cause missed-wakeup races; remove any
direct mutations of m_raisedEvents in the socket helper paths and instead rely
on the execution engine's atomic event handling: call
g_CLR_RT_ExecutionEngine.WaitEvents(...) to perform the wait and let the engine
clear or mask Event_Socket atomically, and if per-socket/state bookkeeping is
needed maintain a local or thread-scoped flag (or use an API on
g_CLR_RT_ExecutionEngine) to mark consumption without touching m_raisedEvents
directly in functions that reference Event_Socket, stack.m_owningThread,
timeout, and fRes.
- Around line 466-511: When Helper__SelectSocket(handle, 1) returns
SOCK_SOCKET_ERROR inside BindConnectHelper, replace the current
ThrowOnError(stack, SOCK_SOCKET_ERROR) usage with a direct query of the
socket-specific error via SOCK_getsockopt(handle, SOL_SOCKET, SO_ERROR, ...) and
pass that error value into ThrowOnError (or a suitable throw helper) so you use
the socket's SO_ERROR rather than a possibly-stale global last-error; also add
the same DISPOSED_HANDLE guard check (as used in accept()) immediately after the
g_CLR_RT_ExecutionEngine.WaitEvents reschedule to avoid operating on a
rescheduled/disposed socket before continuing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: fda0a325-5b93-408a-9175-ddbd6d58e86b
📒 Files selected for processing (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
Show resolved
Hide resolved
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp (1)
471-480:⚠️ Potential issue | 🟠 Major
m_customStateis advanced, butSOCK_connect()still re-runs on re-entry.Because
WaitEvents()re-enters this native method from the top, Line 458 still executes on every wake. Updatingm_customStatehere only changes bookkeeping; it does not stop the pending-connect path from issuing a secondSOCK_connect(), which can then bypass the timeout flow created at Line 474.Based on learnings:
WaitEvents()is a cooperative rescheduling mechanism, so first-call-only setup must be guarded withstack.m_customState.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp` around lines 471 - 480, The native method currently advances stack.m_customState after calling stack.SetupTimeoutFromTicks, but SOCK_connect is still executed on every re-entry; wrap the one-time setup and the call to SOCK_connect in a guard based on stack.m_customState so they only run on the first entry. Concretely, check if (stack.m_customState == 0) before performing the pending-connect path (including the SOCK_connect call), set stack.m_customState to 1 (or 2 per existing flow) after the first-call setup so subsequent re-entrances (from WaitEvents) skip re-invoking SOCK_connect and only execute the timeout/eval-stack logic (e.g., leave stack.SetupTimeoutFromTicks and the re-entry handling untouched).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 498-524: The early NANOCLR_SET_AND_LEAVE(S_OK) in the
selectResult==0 path prematurely exits before popping the timeout value and
re-polling the socket, leaving the timeout slot on the eval stack and treating a
global Event_Socket wake as a successful connect; modify the selectResult==0
branch in sys_net_native_System_Net_Sockets_NativeSocket.cpp so that after
WaitEvents returns and fRes is true you first pop the timeout value (call
stack.PopValue()), re-read and revalidate the socket handle (check
socket[FIELD__m_Handle].NumericByRef().s4 against DISPOSED_HANDLE), and only if
the handle is valid continue to re-poll for writability instead of immediately
calling NANOCLR_SET_AND_LEAVE(S_OK) — ensure NANOCLR_SET_AND_LEAVE is only used
after proper cleanup and revalidation to avoid leaving the timeout on the stack
and misinterpreting Event_Socket wakes.
---
Duplicate comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 471-480: The native method currently advances stack.m_customState
after calling stack.SetupTimeoutFromTicks, but SOCK_connect is still executed on
every re-entry; wrap the one-time setup and the call to SOCK_connect in a guard
based on stack.m_customState so they only run on the first entry. Concretely,
check if (stack.m_customState == 0) before performing the pending-connect path
(including the SOCK_connect call), set stack.m_customState to 1 (or 2 per
existing flow) after the first-call setup so subsequent re-entrances (from
WaitEvents) skip re-invoking SOCK_connect and only execute the
timeout/eval-stack logic (e.g., leave stack.SetupTimeoutFromTicks and the
re-entry handling untouched).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 6db51845-b936-4e33-9b90-e6efb14e87cf
📒 Files selected for processing (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp (1)
471-534:⚠️ Potential issue | 🔴 Critical
m_customStateis updated too late to prevent re-enteredSOCK_connect()calls.
WaitEvents()re-enters this method from the top, so Line 458 still runs before this guard. That means a wake can invokeSOCK_connect()again on an in-progress connect, and a disposed socket can be passed intoSOCK_connect()once before Lines 527-534 detect it. Move theDISPOSED_HANDLEcheck and the first-call-only branch ahead of the initialSOCK_connect()call so re-entry goes straight back into the select/timeout path.Based on learnings: In nf-interpreter C++ native code,
g_CLR_RT_ExecutionEngine.WaitEvents()is a cooperative rescheduling mechanism (not a blocking wait). It suspends the CLR thread and re-enters from the top on the next scheduler cycle. If you use a while (fRes) loop with WaitEvents, ensure you guard the first-call-only setup (e.g.,SOCK_connect,PushValueI4) withstack.m_customStateso those setups don’t run again on subsequent re-entries.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp` around lines 471 - 534, The first-call setup (SOCK_connect and bumping stack.m_customState) and the DISPOSED_HANDLE check must run before any work that can be re-entered by g_CLR_RT_ExecutionEngine.WaitEvents(); move the DISPOSED_HANDLE validation and the “first-call-only” branch that sets stack.m_customState from 1→2 ahead of the initial SOCK_connect() call (and ensure stack.SetupTimeoutFromTicks(timeout) still runs only once as currently intended), so re-entries skip SOCK_connect and immediately enter the select/timeout loop; update code around SOCK_connect, stack.m_customState, DISPOSED_HANDLE, WaitEvents, and Helper__SelectSocket to reflect this ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 471-534: The first-call setup (SOCK_connect and bumping
stack.m_customState) and the DISPOSED_HANDLE check must run before any work that
can be re-entered by g_CLR_RT_ExecutionEngine.WaitEvents(); move the
DISPOSED_HANDLE validation and the “first-call-only” branch that sets
stack.m_customState from 1→2 ahead of the initial SOCK_connect() call (and
ensure stack.SetupTimeoutFromTicks(timeout) still runs only once as currently
intended), so re-entries skip SOCK_connect and immediately enter the
select/timeout loop; update code around SOCK_connect, stack.m_customState,
DISPOSED_HANDLE, WaitEvents, and Helper__SelectSocket to reflect this ordering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e04a664e-47aa-40e4-9ed5-7d1dd0d40ba2
📒 Files selected for processing (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp (1)
471-480:⚠️ Potential issue | 🟠 Major
m_customStateis bumped too late to protect the non-blocking connect path.Line 458 still calls
SOCK_connect()on everyWaitEvents()re-entry before this guard runs, so the resumed path can hit a secondconnect()instead of going straight back to the select/poll loop. That can turn an in-progress reconnect into a spurious socket error rather than a retry.Based on learnings:
g_CLR_RT_ExecutionEngine.WaitEvents()is a cooperative rescheduling mechanism, so first-call-only setup likeSOCK_connect()must be guarded withstack.m_customStatebefore re-entry can happen.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp` around lines 471 - 480, The non-blocking connect path must mark first-call state before any cooperative reschedule to prevent a second SOCK_connect() on re-entry: change the logic so that the first-call-only guard sets stack.m_customState to the "in-progress" value immediately after performing the initial SOCK_connect() (or before the call that can trigger WaitEvents()) rather than after SetupTimeoutFromTicks; i.e., ensure the connect() call is wrapped like if (stack.m_customState == 0) { /* perform SOCK_connect() */ stack.m_customState = 1; } and keep the existing transition from 1->2 after SetupTimeoutFromTicks so resumed execution skips calling SOCK_connect() again.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp`:
- Around line 471-480: The non-blocking connect path must mark first-call state
before any cooperative reschedule to prevent a second SOCK_connect() on
re-entry: change the logic so that the first-call-only guard sets
stack.m_customState to the "in-progress" value immediately after performing the
initial SOCK_connect() (or before the call that can trigger WaitEvents()) rather
than after SetupTimeoutFromTicks; i.e., ensure the connect() call is wrapped
like if (stack.m_customState == 0) { /* perform SOCK_connect() */
stack.m_customState = 1; } and keep the existing transition from 1->2 after
SetupTimeoutFromTicks so resumed execution skips calling SOCK_connect() again.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d8250de5-15ca-4262-8e78-823c3d76ccea
📒 Files selected for processing (1)
src/DeviceInterfaces/System.Net/sys_net_native_System_Net_Sockets_NativeSocket.cpp
|
@Ellerbach, Can the PR be merged or is there any other thing that need fixing or testing? |
I let @josesimoes decide on this one. I know it helps to fix the issue. But is there a way to get out of it just from the managed side? |
|
I'm still concerned about assertig the global flag and the hijacking the wait pattern like this. Want to test this myself. Unfortunately I won't be able to do it during this week. Please be patient. |
|
@alberk8 are you able to verify the last commit (including IDF version) that was able to handle this previously. I would like to read up on it when I get the chance, and it would also be useful for the PR description. |
Yes, it did work even with the global flag setting removed (from this PR, the IDF version is V5.5.3 which is the same as the current one). |
Description
Attempt to fixing hanging reconnect
Motivation and Context
How Has This Been Tested?
Screenshots
Types of changes
Checklist
Summary by CodeRabbit