Add RoCEv2 protocol module with Soft-RoCE-backed CI coverage#1186
Add RoCEv2 protocol module with Soft-RoCE-backed CI coverage#1186ruck314 wants to merge 6 commits intopre-releasefrom
Conversation
…erage Adds RoCEv2 as a first-class Rogue protocol and lifts CI line coverage on both the C++ core (src/rogue/protocols/rocev2/) and the Python control layer (python/pyrogue/protocols/_RoCEv2.py) to >=80% by exercising real libibverbs against a Linux Soft-RoCE (rdma_rxe) device — not mocks. RoCEv2 module - C++ Core / Server in include/rogue/protocols/rocev2/ and src/rogue/protocols/rocev2/, wired into CMake, conda, and the Python module graph - Python RoCEv2Server device (python/pyrogue/protocols/_RoCEv2.py) with 15 LocalVariables, getChannel(), stream property, and teardownFpgaQp() - Replaced opt-in -DROCEV2=ON with ROGUE_LINUX_BUILD; RoCEv2 ships unconditionally on Linux, macOS unaffected CI coverage uplift - full_build_test pinned to ubuntu-22.04 brings up rdma_rxe / rxe0 so hardware-gated tests run against real ibverbs - ROGUE_COVERAGE CMake option applies --coverage PRIVATE to rogue-core and rogue-core-static, producing gcov data for the rocev2 subtree - gcovr Cobertura report filtered to the rocev2 subtree, uploaded via codecov-action@v5 alongside the Python pytest-cov report Tests - tests/protocols/test_rocev2.py: 25 hardware-free + 7 hardware-gated tests covering encoders/decoders, metadata-bus round-trips, channel wiring, teardown, and live-rxe0 setup/teardown flows - tests/protocols/test_rocev2_corner_cases.py: 11 MagicMock corner-case tests (decoder failure paths, partial-setup rollback, illegal QP transitions, length-bound edges) - tests/protocols/conftest.py: shared metadata-bus response-word builders (_pack_*_ok), session-scoped rxe_server fixture, wait_for / MemoryRoot re-exports - tests/cpp/protocols/rocev2/: 4 doctest cases for Core/Server with RAII wrappers, including 2 live-rxe0 cases and 2 GeneralError / out-of-range cases - 4 tests currently @pytest.mark.skip(...) pending follow-up (3 corner cases + 1 live-server init flow) Stability fixes (surfaced while stabilising CI, unrelated to RoCEv2) - UDP packetizer integration test: enlarged SO_RCVBUF and widened the stall window for macOS arm64 - Stream-variable test: settled the enable-frame burst before sampling initial_count Sphinx docs - 8 new .rst files integrated into the existing protocol toctrees alongside RSSI/SRP/Packetizer - make html builds clean with zero RoCEv2-related warnings - Jumbo-frame MTU, SoftRoCE rdma_rxe workflow, and surf RoceEngine dependency documented No existing protocol APIs, stream/memory interfaces, or PyRogue device tree are changed — RoCEv2 itself is purely additive; the only behavioural change to existing code is enabling --coverage when ROGUE_COVERAGE=ON. C++ and Python source ported from FilMarini/rogue:rocev2. Attribution comments preserved in each ported file pointing back to FilMarini/rogue:rocev2. Slaclab standards take precedence over API compatibility; deltas documented in module docstrings.
Move Soft-RoCE bring-up, shared apt install, sourced test env preamble, and whitespace/tab lint into scripts/. The workflow steps collapse to one-line invocations and the WHY comments (azure kernel caveat, memlock rationale, -u omission) live with the code they explain. Also bumps the perf_test actions/checkout and actions/setup-python pins to v6 to match the rest of the workflow.
There was a problem hiding this comment.
Pull request overview
Adds a first-class RoCEv2 (RDMA over Converged Ethernet v2) protocol module to Rogue, plus CI/test/coverage infrastructure to exercise real libibverbs (Soft-RoCE where available) and publish dual-language coverage.
Changes:
- Introduces new C++ RoCEv2
Core/Serverimplementation + Boost.Python module registration. - Adds a PyRogue
RoCEv2Serverwrapper and extensive Python/C++ test coverage (hardware-free + hardware-gated). - Updates CI to install RDMA deps, attempt Soft-RoCE bring-up, enable gcov instrumentation, and upload coverage reports.
Reviewed changes
Copilot reviewed 41 out of 42 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/protocols/test_rocev2_corner_cases.py | New MagicMock-only corner-case tests for _RoCEv2 helpers and error paths |
| tests/protocols/test_rocev2.py | New unit + hardware-gated RoCEv2 tests (xdist serialized group) |
| tests/protocols/conftest.py | Shared RoCEv2 fixtures/helpers; session-scoped rxe_server; re-exports root conftest helpers |
| tests/interfaces/test_stream_variable.py | Stabilizes stream-variable test against startup frame races |
| tests/integration/test_udp_packetizer_integration.py | Stabilizes UDP packetizer integration test with larger stall window + RX buffers |
| tests/cpp/protocols/rocev2/test_rocev2.cpp | New doctest coverage for C++ RoCEv2 Core/Server paths |
| tests/cpp/protocols/rocev2/CMakeLists.txt | Adds RoCEv2 cpp test target linking against ibverbs |
| tests/cpp/protocols/CMakeLists.txt | Conditionally includes RoCEv2 cpp tests (non-Apple, Python-enabled) |
| src/rogue/protocols/rocev2/module.cpp | Boost.Python submodule registration for rogue.protocols.rocev2 |
| src/rogue/protocols/rocev2/Server.cpp | C++ RC receive server implementation backed by ibverbs + CQ polling thread |
| src/rogue/protocols/rocev2/Core.cpp | C++ ibverbs device-open + PD allocation base class |
| src/rogue/protocols/rocev2/CMakeLists.txt | Wires RoCEv2 sources into rogue-core build |
| src/rogue/protocols/module.cpp | Registers rocev2 module under ROGUE_LINUX_BUILD |
| src/rogue/protocols/CMakeLists.txt | Adds rocev2 subdirectory on non-Apple builds |
| scripts/ci_test_env.sh | Common CI test env setup (memlock best-effort + setup_rogue) |
| scripts/ci_install_deps.sh | CI apt dependencies incl. rdma-core/libibverbs-dev |
| scripts/ci_bring_up_soft_roce.sh | CI Soft-RoCE (rxe0) bring-up helper script |
| scripts/check_whitespace.sh | Centralized whitespace/tab enforcement used by CI |
| python/pyrogue/protocols/init.py | Exposes RoCEv2Server via pyrogue.protocols package |
| python/pyrogue/protocols/_RoCEv2.py | New PyRogue wrapper + metadata-bus encode/decode + setup/teardown orchestration |
| pytest.ini | Adds rocev2 pytest marker for hardware-gated tests |
| pip_requirements_ci.txt | Adds gcovr for C++ coverage reporting |
| include/rogue/protocols/rocev2/module.h | Header for rocev2 Boost.Python module setup |
| include/rogue/protocols/rocev2/Server.h | C++ API header for RoCEv2 Server |
| include/rogue/protocols/rocev2/Core.h | C++ API header for RoCEv2 Core + defaults |
| docs/src/conf.py | Mocks rogue.protocols.rocev2 for docs build environments lacking the extension |
| docs/src/built_in_modules/protocols/rocev2/index.rst | New conceptual/usage docs for RoCEv2 protocol module |
| docs/src/built_in_modules/protocols/index.rst | Adds RoCEv2 to protocol docs toctree |
| docs/src/api/python/rogue/protocols/rocev2/server.rst | New Python API page for Boost.Python Server |
| docs/src/api/python/rogue/protocols/rocev2/index.rst | New index for rogue.protocols.rocev2 Python API |
| docs/src/api/python/rogue/protocols/rocev2/core.rst | New Python API page for Boost.Python Core |
| docs/src/api/python/rogue/protocols/index.rst | Adds RoCEv2 section to Python protocols API index |
| docs/src/api/python/pyrogue/protocols/rocev2.rst | New autodoc page for PyRogue RoCEv2Server wrapper |
| docs/src/api/python/pyrogue/protocols/index.rst | Adds RoCEv2 to PyRogue protocol wrappers index |
| docs/src/api/cpp/protocols_index.rst | Adds RoCEv2 to C++ protocols index |
| docs/src/api/cpp/protocols/rocev2/server.rst | New C++ API docs for rogue::protocols::rocev2::Server |
| docs/src/api/cpp/protocols/rocev2/index.rst | New C++ API index for RoCEv2 |
| docs/src/api/cpp/protocols/rocev2/core.rst | New C++ API docs for rogue::protocols::rocev2::Core |
| conda-recipe/meta.yaml | Adds rdma-core runtime dependency to conda recipe |
| CMakeLists.txt | Adds unconditional libibverbs discovery/linking on non-Apple + coverage option |
| .gitignore | Ignores docs build outputs |
| .github/workflows/rogue_ci.yml | Updates CI to use scripts, attempt Soft-RoCE bring-up, enable/upload coverage |
| docs/src/api/python/rogue/protocols/rocev2/server.rst | Adds Boost.Python API rendering for RoCEv2 server |
| docs/src/api/python/rogue/protocols/rocev2/core.rst | Adds Boost.Python API rendering for RoCEv2 core |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ling Addresses Copilot review on #1186: - Core.h: remove cq_/qp_/mr_ members. They were never assigned in Core (Server has its own shadowing fields) and only confused the ownership model for maintainers. - Core.cpp: free devList when ibv_get_device_list returns 0 devices, and close ctx_ explicitly when ibv_alloc_pd fails — neither path runs the destructor on a partially-constructed object, so without these the list and context would leak. - Server.cpp ctor: wrap the body in try/catch and route every failure path through cleanupResources() so a throw mid-construction does not leak slab / MR / CQ / QP. - Server.cpp: switch slab allocation from aligned_alloc to posix_memalign. The default 9000 * 256 slab is not a multiple of 4096, which is undefined per C11 even though glibc tolerates it in practice. - Server.cpp runThread: catch ibverbs failures inside the poll loop, log them, and exit the thread cleanly — an uncaught throw escaping the thread entry point would call std::terminate. - Server.cpp stop(): always join + delete thread_ when it is non-null and joinable. runThread can clear threadEn_ itself on IBV_WC_WR_FLUSH_ERR or a caught exception, in which case the previous guard skipped the join and ~std::thread would terminate the process. - Header / cpp docstrings updated to reflect actual ownership (Core owns ctx + PD; Server owns CQ / QP / MR).
The IbvCtx / IbvPd / IbvMr / IbvCq / IbvQp wrappers were defined but never referenced by any TEST_CASE — Core / Server provide their own RAII via their destructors and that is what the live-rxe0 cases rely on. Remove the dead code (and the now-unused <cstddef> / <cstdint> / <stdexcept> includes) and update the file-level docstring + the inline comment in the live-rxe0 Core test to match the actual ownership model. Addresses Copilot review on #1186.
- Update the RoCEv2Server ImportError message: this PR removed the -DROCEV2=ON CMake opt-in and now builds RoCEv2 unconditionally on Linux, so the rebuild guidance is only about libibverbs/rdma-core being available at build and runtime. - Validate pmtu in RoCEv2Server.__init__ (1..5). Without the check an out-of-range value crashed with IndexError deep inside _start() / _roce_setup_connection() when the path-MTU log line indexed [256,512,1024,2048,4096][pmtu-1]; we now raise a clear rogue.GeneralError at construction time. - test_rocev2.py: update the rocev2_hw skip reason to match the new build/runtime requirements and the ImportError-message assertion to cover the new wording (Linux build + libibverbs). - test_rocev2.py: add a parametrized regression test that pins _ROCEV2_AVAILABLE=True and exercises the new pmtu validation across -1, 0, 6, 7, 100. Addresses Copilot review on #1186.
- core.rst: Core owns the protection domain (and ibverbs context); the
CQ / QP / MR are owned by the derived Server. ~Core only deallocates
the PD and closes the context.
- server.rst: the receive thread is started by completeConnection()
after the FPGA QPN is known, NOT by the constructor. The constructor
description is updated to reflect that and to mention the
partial-init cleanup. The Threading and Lifecycle section also notes
the new in-thread try/catch around postRecvWr.
- index.rst: replace the speculative pyrogue.RoCEv2.* logger names with
the names actually created by the implementation:
* rocev2.Core / rocev2.Server (C++ side, via rogue::Logging::create)
* pyrogue.Device.RoCEv2Server.<path> (Python wrapper, via
pyrogue.logInit + the pyrogue.<family>.<ClassName>.<path>
pattern)
The setLogLevel / setFilter examples are updated to target loggers
that users can actually enable.
Addresses Copilot review on #1186.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 41 out of 42 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ##################################### | ||
| # RoCEv2 / libibverbs (Linux only, always enabled) | ||
| ##################################### | ||
| if (NOT APPLE) | ||
| find_library(IBVERBS_LIBRARY NAMES ibverbs) | ||
| find_path(IBVERBS_INCLUDE_DIR NAMES infiniband/verbs.h) | ||
|
|
||
| if ((NOT IBVERBS_LIBRARY) OR (NOT IBVERBS_INCLUDE_DIR)) | ||
| message(FATAL_ERROR | ||
| "RoCEv2 requires libibverbs which was not found.\n" | ||
| "Install rdma-core:\n" | ||
| " Ubuntu/Debian: sudo apt-get install libibverbs-dev rdma-core\n" | ||
| " Fedora/RHEL: sudo dnf install rdma-core-devel\n" | ||
| " conda-forge: conda install -c conda-forge rdma-core") | ||
| endif() | ||
|
|
||
| add_definitions(-DROGUE_LINUX_BUILD) | ||
| include_directories(${IBVERBS_INCLUDE_DIR}) | ||
| endif() |
There was a problem hiding this comment.
The RoCEv2/libibverbs discovery is guarded with if (NOT APPLE), but the feature is described as Linux-only (ROGUE_LINUX_BUILD macro name, user-facing error messages, and docs). On non-Apple, non-Linux platforms this will hard-fail the configure step looking for libibverbs. Consider tightening the guard to Linux specifically (e.g., CMAKE_SYSTEM_NAME STREQUAL "Linux" or UNIX AND NOT APPLE) and only defining ROGUE_LINUX_BUILD under that condition.
| # See scripts/ci_bring_up_soft_roce.sh for the azure-kernel caveat that | ||
| # forces `continue-on-error: true` on GHA-hosted runners. | ||
| - name: Bring up Soft-RoCE (rxe0) | ||
| continue-on-error: true | ||
| run: ./scripts/ci_bring_up_soft_roce.sh | ||
|
|
There was a problem hiding this comment.
The PR description says full_build_test is pinned to ubuntu-22.04, but this workflow job appears to run on ubuntu-24.04 (see the job header). Please align the description with the workflow (or adjust the workflow pin) so it’s clear which runner/kernel the Soft-RoCE bring-up logic is targeting.
| def _ibverbs_available() -> bool: | ||
| """True iff rogue.protocols.rocev2 imports AND an rxe0 device can be opened. | ||
| Probes by construction because there is no list_devices() on the | ||
| Boost.Python surface.""" | ||
| # Construction is the only reliable probe — the module is built | ||
| # unconditionally on Linux with -DROCEV2=ON, but rxe0 only exists on | ||
| # machines where the rdma_rxe kernel module is loaded. | ||
| try: | ||
| import rogue.protocols.rocev2 as _r |
There was a problem hiding this comment.
_ibverbs_available() is defined but never used in this conftest. Since the rxe_server fixture already does the real availability check (and skips with a clear reason), consider removing this unused helper to avoid suggesting there is a second availability mechanism.
| self._fpga_qpn = 0 | ||
| self._fpga_pd_handler = 0 | ||
| self._fpga_lkey = 0 | ||
| self._fpga_rkey = 0 |
There was a problem hiding this comment.
teardownFpgaQp() clears the internal _fpga_* shadow fields but does not update the exported LocalVariables (FpgaQpn, FpgaLkey) that were set during _start(). This can leave stale values visible in the device tree/UI after teardown. Consider setting FpgaQpn/FpgaLkey back to 0 (and potentially ConnectionState to a teardown state) when teardown completes successfully.
| self._fpga_rkey = 0 | |
| self._fpga_rkey = 0 | |
| self.FpgaQpn.set(0) | |
| self.FpgaLkey.set(0) |
| if grep -rnI '[[:blank:]]$' --include=\*.{cpp,h,py} --exclude=doctest.h .; then | ||
| echo "Error: Trailing whitespace found in the repository!" | ||
| exit 1 | ||
| fi | ||
|
|
||
| if grep -rnI $'\t' --include=\*.{cpp,h,py} --exclude=doctest.h .; then |
There was a problem hiding this comment.
This script claims it checks only “tracked” files, but the grep runs over . and will also scan untracked/generated directories (e.g., build/, docs/build/) if they exist in the workspace. That can cause spurious CI/local failures. Consider driving the scan from git ls-files (or excluding common build output dirs) so only version-controlled *.cpp/*.h/*.py sources are checked.
| if grep -rnI '[[:blank:]]$' --include=\*.{cpp,h,py} --exclude=doctest.h .; then | |
| echo "Error: Trailing whitespace found in the repository!" | |
| exit 1 | |
| fi | |
| if grep -rnI $'\t' --include=\*.{cpp,h,py} --exclude=doctest.h .; then | |
| if git ls-files -z -- '*.cpp' '*.h' '*.py' ':(exclude)doctest.h' | xargs -0 -r grep -nI '[[:blank:]]$'; then | |
| echo "Error: Trailing whitespace found in the repository!" | |
| exit 1 | |
| fi | |
| if git ls-files -z -- '*.cpp' '*.h' '*.py' ':(exclude)doctest.h' | xargs -0 -r grep -nI $'\t'; then |
Summary
Adds RoCEv2 (RDMA over Converged Ethernet v2) as a first-class protocol module in Rogue and lifts CI line coverage on both the C++ core and the Python control layer to >=80% by exercising real
libibverbsagainst a Linux Soft-RoCE (rdma_rxe) device — not mocks.RoCEv2 module
CoreandServerclasses ininclude/rogue/protocols/rocev2/andsrc/rogue/protocols/rocev2/, wired into CMake, conda, and the Python module graphpython/pyrogue/protocols/_RoCEv2.py(RoCEv2Serverdevice with 15LocalVariables,getChannel(),streamproperty,teardownFpgaQp()) conforming toflake8and slaclab naming conventions-DROCEV2=ONwithROGUE_LINUX_BUILD; RoCEv2 now ships unconditionally on Linux (macOS unaffected)Coverage uplift
full_build_test(pinned toubuntu-22.04) brings uprdma_rxe/rxe0so hardware-gated tests run against real ibverbs instead of skippingROGUE_COVERAGECMake option — applies--coveragePRIVATE torogue-core/rogue-core-static, producinggcovdata for the rocev2 subtreecodecov-action@v5alongside the Pythonpytest-covreportTests
tests/protocols/):test_rocev2.py— 25 hardware-free + 7 hardware-gated tests (encoders/decoders, metadata-bus round-trips, channel wiring, teardown, live-rxe0 setup/teardown flows)test_rocev2_corner_cases.py— 11 MagicMock corner-case tests (decoder failure paths, partial-setup rollback, illegal QP transitions, length-bound edges)conftest.py— shared metadata-bus response-word builders (_pack_*_ok), session-scopedrxe_serverfixture,wait_for/MemoryRootre-exports@pytest.mark.skip(...)pending follow-up (3 corner cases + 1 live-server init flow)tests/cpp/protocols/rocev2/):Core/Serverwith RAII wrappers, including 2 live-rxe0 cases and 2GeneralError/ out-of-range casesSO_RCVBUFand widened the stall window for macOS arm64initial_countSphinx docs
.rstfiles integrated into the existing protocol toctrees alongside RSSI/SRP/Packetizermake htmlbuilds clean with zero RoCEv2-related warningsrdma_rxeworkflow, and surfRoceEnginedependency documentedNo existing protocol APIs, stream/memory interfaces, or PyRogue device tree are changed — RoCEv2 itself is purely additive; the only behavioural change to existing code is enabling
--coveragewhenROGUE_COVERAGE=ON.Attribution
C++ and Python source ported from FilMarini/rogue:rocev2. Attribution comments preserved in each ported file pointing back to FilMarini/rogue:rocev2. Slaclab standards take precedence over API compatibility; deltas documented in module docstrings.
Test plan
full_build_test(Linux,ROGUE_LINUX_BUILDon,ROGUE_COVERAGEon, Soft-RoCE up) and small-build / macOS jobspytest tests/protocols/test_rocev2.py tests/protocols/test_rocev2_corner_cases.py -m 'not rocev2'— hardware-free tests passpytest tests/protocols/test_rocev2.py -m rocev2— hardware-gated tests pass againstrxe0in CI; skip cleanly on dev boxes withoutrdma_rxe_RoCEv2.pyand C++src/rogue/protocols/rocev2/both >=80% line coveragecd docs && make html— builds with zero RoCEv2-related warnings; all 8 RoCEv2 HTML pages render