Summary
When two stress test cases (native and JIT) run back-to-back with 8 threads and a 2-minute duration, the second test case intermittently reports a single hook.attach_link() failure with error code 23 (EBPF_INSUFFICIENT_BUFFER).
Reproduction
.\ebpf_stress_tests_um.exe -tt=8 -td=2
- Running either test case alone with 8 threads: passes
- Running both with 4 threads: passes
- Running both with 8 threads: 1 failure in the second test case (~1 in 10,000 iterations)
- The failure occurs regardless of which test case runs second
Root Cause (Preliminary)
The error propagates through the NMR provider attach path. Each stress thread creates a single_instance_hook_t (NMR provider) that lives for the thread's duration. With 8 threads rapidly registering/deregistering providers across ~5000 iterations each, some internal state from the first test case appears to leak into the second test case, causing a brief window where attach_link() returns EBPF_INSUFFICIENT_BUFFER.
The ~_single_instance_hook() destructor properly calls NmrDeregisterProvider and waits for completion, so the leak may be deeper in the execution context or NMR layer.
Impact
- This is a pre-existing issue that was previously hidden because main() in ebpf_stress_tests.cpp discarded the session.run() return value (always exited 0)
- Only affects stress tests with high thread counts running multiple test cases sequentially
- Does not affect production code
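To illustrate the first point, here is a minimal, self-contained sketch of how a discarded session.run() result can mask a failure. It is not the actual code from ebpf_stress_tests.cpp: fake_session_t, main_discarding_result, and main_propagating_result are hypothetical names, and the only assumption taken from this report is that run() returns non-zero when a test case fails.

```cpp
// Hypothetical stand-in for the test framework's session object. Only the
// shape of run() matters here: it returns non-zero when a test case fails.
struct fake_session_t
{
    int failed_test_cases;

    int
    run() const
    {
        return failed_test_cases;
    }
};

// Behavior described in this report: the result of run() is dropped, so the
// process exit code is 0 even when a test case failed.
inline int
main_discarding_result(const fake_session_t& session)
{
    (void)session.run();
    return 0;
}

// Propagating the result makes a single failure visible in the exit code.
inline int
main_propagating_result(const fake_session_t& session)
{
    return session.run();
}
```

With one failing test case, the first variant still returns 0 while the second returns a non-zero value, which would explain why the intermittent attach_link() failure only became visible once the return value was honored.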
Workaround
Running the native test case before JIT reduces (but does not eliminate) the failure rate. The test case order has been adjusted in PR #5072.
Related