Skip to content

Add RoCEv2 protocol module with Soft-RoCE-backed CI coverage#1186

Draft
ruck314 wants to merge 6 commits intopre-releasefrom
RoCEv2-dev
Draft

Add RoCEv2 protocol module with Soft-RoCE-backed CI coverage#1186
ruck314 wants to merge 6 commits intopre-releasefrom
RoCEv2-dev

Conversation

@ruck314
Copy link
Copy Markdown
Contributor

@ruck314 ruck314 commented Apr 17, 2026

Summary

Adds RoCEv2 (RDMA over Converged Ethernet v2) as a first-class protocol module in Rogue and lifts CI line coverage on both the C++ core and the Python control layer to >=80% by exercising real libibverbs against a Linux Soft-RoCE (rdma_rxe) device — not mocks.

RoCEv2 module

  • C++ coreCore and Server classes in include/rogue/protocols/rocev2/ and src/rogue/protocols/rocev2/, wired into CMake, conda, and the Python module graph
  • Python wrapperpython/pyrogue/protocols/_RoCEv2.py (RoCEv2Server device with 15 LocalVariables, getChannel(), stream property, teardownFpgaQp()) conforming to flake8 and slaclab naming conventions
  • Always-on Linux build — replaced opt-in -DROCEV2=ON with ROGUE_LINUX_BUILD; RoCEv2 now ships unconditionally on Linux (macOS unaffected)

Coverage uplift

  • Soft-RoCE in CIfull_build_test (pinned to ubuntu-22.04) brings up rdma_rxe / rxe0 so hardware-gated tests run against real ibverbs instead of skipping
  • ROGUE_COVERAGE CMake option — applies --coverage PRIVATE to rogue-core / rogue-core-static, producing gcov data for the rocev2 subtree
  • gcovr Cobertura report — C++ coverage filtered to the rocev2 subtree, uploaded via codecov-action@v5 alongside the Python pytest-cov report

Tests

  • Python (tests/protocols/):
    • test_rocev2.py — 25 hardware-free + 7 hardware-gated tests (encoders/decoders, metadata-bus round-trips, channel wiring, teardown, live-rxe0 setup/teardown flows)
    • test_rocev2_corner_cases.py — 11 MagicMock corner-case tests (decoder failure paths, partial-setup rollback, illegal QP transitions, length-bound edges)
    • conftest.py — shared metadata-bus response-word builders (_pack_*_ok), session-scoped rxe_server fixture, wait_for / MemoryRoot re-exports
    • 4 tests are currently @pytest.mark.skip(...) pending follow-up (3 corner cases + 1 live-server init flow)
  • C++ (tests/cpp/protocols/rocev2/):
    • 4 doctest cases for Core / Server with RAII wrappers, including 2 live-rxe0 cases and 2 GeneralError / out-of-range cases
  • Stability fixes surfaced while stabilising CI (unrelated to RoCEv2):
    • UDP packetizer integration test: enlarged SO_RCVBUF and widened the stall window for macOS arm64
    • Stream-variable test: settled the enable-frame burst before sampling initial_count

Sphinx docs

  • 8 new .rst files integrated into the existing protocol toctrees alongside RSSI/SRP/Packetizer
  • make html builds clean with zero RoCEv2-related warnings
  • Jumbo-frame MTU, SoftRoCE rdma_rxe workflow, and surf RoceEngine dependency documented

No existing protocol APIs, stream/memory interfaces, or PyRogue device tree are changed — RoCEv2 itself is purely additive; the only behavioural change to existing code is enabling --coverage when ROGUE_COVERAGE=ON.

Attribution

C++ and Python source ported from FilMarini/rogue:rocev2. Attribution comments preserved in each ported file pointing back to FilMarini/rogue:rocev2. Slaclab standards take precedence over API compatibility; deltas documented in module docstrings.

Test plan

  • CI passes: full_build_test (Linux, ROGUE_LINUX_BUILD on, ROGUE_COVERAGE on, Soft-RoCE up) and small-build / macOS jobs
  • pytest tests/protocols/test_rocev2.py tests/protocols/test_rocev2_corner_cases.py -m 'not rocev2' — hardware-free tests pass
  • pytest tests/protocols/test_rocev2.py -m rocev2 — hardware-gated tests pass against rxe0 in CI; skip cleanly on dev boxes without rdma_rxe
  • C++ doctests: live-rxe0 cases pass in CI; missing-device cases pass on dev boxes
  • Codecov reports: Python _RoCEv2.py and C++ src/rogue/protocols/rocev2/ both >=80% line coverage
  • cd docs && make html — builds with zero RoCEv2-related warnings; all 8 RoCEv2 HTML pages render
  • Confirm no regressions in existing protocol tests (UDP packetizer and stream-variable stability fixes verified)

@ruck314 ruck314 changed the title feat(rocev2): add RoCEv2 protocol module with CI tests and Sphinx docs feat(rocev2): add RoCEv2 protocol module with Soft-RoCE-backed CI coverage Apr 18, 2026
…erage

Adds RoCEv2 as a first-class Rogue protocol and lifts CI line coverage on
both the C++ core (src/rogue/protocols/rocev2/) and the Python control
layer (python/pyrogue/protocols/_RoCEv2.py) to >=80% by exercising real
libibverbs against a Linux Soft-RoCE (rdma_rxe) device — not mocks.

RoCEv2 module
- C++ Core / Server in include/rogue/protocols/rocev2/ and
  src/rogue/protocols/rocev2/, wired into CMake, conda, and the Python
  module graph
- Python RoCEv2Server device (python/pyrogue/protocols/_RoCEv2.py) with
  15 LocalVariables, getChannel(), stream property, and teardownFpgaQp()
- Replaced opt-in -DROCEV2=ON with ROGUE_LINUX_BUILD; RoCEv2 ships
  unconditionally on Linux, macOS unaffected

CI coverage uplift
- full_build_test pinned to ubuntu-22.04 brings up rdma_rxe / rxe0 so
  hardware-gated tests run against real ibverbs
- ROGUE_COVERAGE CMake option applies --coverage PRIVATE to rogue-core
  and rogue-core-static, producing gcov data for the rocev2 subtree
- gcovr Cobertura report filtered to the rocev2 subtree, uploaded via
  codecov-action@v5 alongside the Python pytest-cov report

Tests
- tests/protocols/test_rocev2.py: 25 hardware-free + 7 hardware-gated
  tests covering encoders/decoders, metadata-bus round-trips, channel
  wiring, teardown, and live-rxe0 setup/teardown flows
- tests/protocols/test_rocev2_corner_cases.py: 11 MagicMock corner-case
  tests (decoder failure paths, partial-setup rollback, illegal QP
  transitions, length-bound edges)
- tests/protocols/conftest.py: shared metadata-bus response-word builders
  (_pack_*_ok), session-scoped rxe_server fixture, wait_for / MemoryRoot
  re-exports
- tests/cpp/protocols/rocev2/: 4 doctest cases for Core/Server with RAII
  wrappers, including 2 live-rxe0 cases and 2 GeneralError /
  out-of-range cases
- 4 tests currently @pytest.mark.skip(...) pending follow-up (3 corner
  cases + 1 live-server init flow)

Stability fixes (surfaced while stabilising CI, unrelated to RoCEv2)
- UDP packetizer integration test: enlarged SO_RCVBUF and widened the
  stall window for macOS arm64
- Stream-variable test: settled the enable-frame burst before sampling
  initial_count

Sphinx docs
- 8 new .rst files integrated into the existing protocol toctrees
  alongside RSSI/SRP/Packetizer
- make html builds clean with zero RoCEv2-related warnings
- Jumbo-frame MTU, SoftRoCE rdma_rxe workflow, and surf RoceEngine
  dependency documented

No existing protocol APIs, stream/memory interfaces, or PyRogue device
tree are changed — RoCEv2 itself is purely additive; the only
behavioural change to existing code is enabling --coverage when
ROGUE_COVERAGE=ON.

C++ and Python source ported from FilMarini/rogue:rocev2. Attribution
comments preserved in each ported file pointing back to
FilMarini/rogue:rocev2. Slaclab standards take precedence over API
compatibility; deltas documented in module docstrings.
@ruck314 ruck314 changed the title feat(rocev2): add RoCEv2 protocol module with Soft-RoCE-backed CI coverage Add RoCEv2 protocol module with Soft-RoCE-backed CI coverage Apr 18, 2026
Move Soft-RoCE bring-up, shared apt install, sourced test env preamble,
and whitespace/tab lint into scripts/. The workflow steps collapse to
one-line invocations and the WHY comments (azure kernel caveat, memlock
rationale, -u omission) live with the code they explain. Also bumps the
perf_test actions/checkout and actions/setup-python pins to v6 to match
the rest of the workflow.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a first-class RoCEv2 (RDMA over Converged Ethernet v2) protocol module to Rogue, plus CI/test/coverage infrastructure to exercise real libibverbs (Soft-RoCE where available) and publish dual-language coverage.

Changes:

  • Introduces new C++ RoCEv2 Core/Server implementation + Boost.Python module registration.
  • Adds a PyRogue RoCEv2Server wrapper and extensive Python/C++ test coverage (hardware-free + hardware-gated).
  • Updates CI to install RDMA deps, attempt Soft-RoCE bring-up, enable gcov instrumentation, and upload coverage reports.

Reviewed changes

Copilot reviewed 41 out of 42 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
tests/protocols/test_rocev2_corner_cases.py New MagicMock-only corner-case tests for _RoCEv2 helpers and error paths
tests/protocols/test_rocev2.py New unit + hardware-gated RoCEv2 tests (xdist serialized group)
tests/protocols/conftest.py Shared RoCEv2 fixtures/helpers; session-scoped rxe_server; re-exports root conftest helpers
tests/interfaces/test_stream_variable.py Stabilizes stream-variable test against startup frame races
tests/integration/test_udp_packetizer_integration.py Stabilizes UDP packetizer integration test with larger stall window + RX buffers
tests/cpp/protocols/rocev2/test_rocev2.cpp New doctest coverage for C++ RoCEv2 Core/Server paths
tests/cpp/protocols/rocev2/CMakeLists.txt Adds RoCEv2 cpp test target linking against ibverbs
tests/cpp/protocols/CMakeLists.txt Conditionally includes RoCEv2 cpp tests (non-Apple, Python-enabled)
src/rogue/protocols/rocev2/module.cpp Boost.Python submodule registration for rogue.protocols.rocev2
src/rogue/protocols/rocev2/Server.cpp C++ RC receive server implementation backed by ibverbs + CQ polling thread
src/rogue/protocols/rocev2/Core.cpp C++ ibverbs device-open + PD allocation base class
src/rogue/protocols/rocev2/CMakeLists.txt Wires RoCEv2 sources into rogue-core build
src/rogue/protocols/module.cpp Registers rocev2 module under ROGUE_LINUX_BUILD
src/rogue/protocols/CMakeLists.txt Adds rocev2 subdirectory on non-Apple builds
scripts/ci_test_env.sh Common CI test env setup (memlock best-effort + setup_rogue)
scripts/ci_install_deps.sh CI apt dependencies incl. rdma-core/libibverbs-dev
scripts/ci_bring_up_soft_roce.sh CI Soft-RoCE (rxe0) bring-up helper script
scripts/check_whitespace.sh Centralized whitespace/tab enforcement used by CI
python/pyrogue/protocols/init.py Exposes RoCEv2Server via pyrogue.protocols package
python/pyrogue/protocols/_RoCEv2.py New PyRogue wrapper + metadata-bus encode/decode + setup/teardown orchestration
pytest.ini Adds rocev2 pytest marker for hardware-gated tests
pip_requirements_ci.txt Adds gcovr for C++ coverage reporting
include/rogue/protocols/rocev2/module.h Header for rocev2 Boost.Python module setup
include/rogue/protocols/rocev2/Server.h C++ API header for RoCEv2 Server
include/rogue/protocols/rocev2/Core.h C++ API header for RoCEv2 Core + defaults
docs/src/conf.py Mocks rogue.protocols.rocev2 for docs build environments lacking the extension
docs/src/built_in_modules/protocols/rocev2/index.rst New conceptual/usage docs for RoCEv2 protocol module
docs/src/built_in_modules/protocols/index.rst Adds RoCEv2 to protocol docs toctree
docs/src/api/python/rogue/protocols/rocev2/server.rst New Python API page for Boost.Python Server
docs/src/api/python/rogue/protocols/rocev2/index.rst New index for rogue.protocols.rocev2 Python API
docs/src/api/python/rogue/protocols/rocev2/core.rst New Python API page for Boost.Python Core
docs/src/api/python/rogue/protocols/index.rst Adds RoCEv2 section to Python protocols API index
docs/src/api/python/pyrogue/protocols/rocev2.rst New autodoc page for PyRogue RoCEv2Server wrapper
docs/src/api/python/pyrogue/protocols/index.rst Adds RoCEv2 to PyRogue protocol wrappers index
docs/src/api/cpp/protocols_index.rst Adds RoCEv2 to C++ protocols index
docs/src/api/cpp/protocols/rocev2/server.rst New C++ API docs for rogue::protocols::rocev2::Server
docs/src/api/cpp/protocols/rocev2/index.rst New C++ API index for RoCEv2
docs/src/api/cpp/protocols/rocev2/core.rst New C++ API docs for rogue::protocols::rocev2::Core
conda-recipe/meta.yaml Adds rdma-core runtime dependency to conda recipe
CMakeLists.txt Adds unconditional libibverbs discovery/linking on non-Apple + coverage option
.gitignore Ignores docs build outputs
.github/workflows/rogue_ci.yml Updates CI to use scripts, attempt Soft-RoCE bring-up, enable/upload coverage
docs/src/api/python/rogue/protocols/rocev2/server.rst Adds Boost.Python API rendering for RoCEv2 server
docs/src/api/python/rogue/protocols/rocev2/core.rst Adds Boost.Python API rendering for RoCEv2 core

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread include/rogue/protocols/rocev2/Core.h Outdated
Comment thread docs/src/api/cpp/protocols/rocev2/core.rst
Comment thread tests/cpp/protocols/rocev2/test_rocev2.cpp Outdated
Comment thread src/rogue/protocols/rocev2/Server.cpp Outdated
Comment thread src/rogue/protocols/rocev2/Server.cpp Outdated
Comment thread src/rogue/protocols/rocev2/Server.cpp Outdated
Comment thread python/pyrogue/protocols/_RoCEv2.py
Comment thread docs/src/built_in_modules/protocols/rocev2/index.rst Outdated
Comment thread src/rogue/protocols/rocev2/Server.cpp Outdated
Comment thread src/rogue/protocols/rocev2/Core.cpp
ruck314 added 4 commits April 17, 2026 22:40
…ling

Addresses Copilot review on #1186:

- Core.h: remove cq_/qp_/mr_ members.  They were never assigned in Core
  (Server has its own shadowing fields) and only confused the ownership
  model for maintainers.
- Core.cpp: free devList when ibv_get_device_list returns 0 devices, and
  close ctx_ explicitly when ibv_alloc_pd fails — neither path runs the
  destructor on a partially-constructed object, so without these the
  list and context would leak.
- Server.cpp ctor: wrap the body in try/catch and route every failure
  path through cleanupResources() so a throw mid-construction does not
  leak slab / MR / CQ / QP.
- Server.cpp: switch slab allocation from aligned_alloc to posix_memalign.
  The default 9000 * 256 slab is not a multiple of 4096, which is
  undefined per C11 even though glibc tolerates it in practice.
- Server.cpp runThread: catch ibverbs failures inside the poll loop,
  log them, and exit the thread cleanly — an uncaught throw escaping
  the thread entry point would call std::terminate.
- Server.cpp stop(): always join + delete thread_ when it is non-null
  and joinable.  runThread can clear threadEn_ itself on
  IBV_WC_WR_FLUSH_ERR or a caught exception, in which case the previous
  guard skipped the join and ~std::thread would terminate the process.
- Header / cpp docstrings updated to reflect actual ownership (Core owns
  ctx + PD; Server owns CQ / QP / MR).
The IbvCtx / IbvPd / IbvMr / IbvCq / IbvQp wrappers were defined but
never referenced by any TEST_CASE — Core / Server provide their own
RAII via their destructors and that is what the live-rxe0 cases rely
on.  Remove the dead code (and the now-unused <cstddef> / <cstdint> /
<stdexcept> includes) and update the file-level docstring + the inline
comment in the live-rxe0 Core test to match the actual ownership model.

Addresses Copilot review on #1186.
- Update the RoCEv2Server ImportError message: this PR removed the
  -DROCEV2=ON CMake opt-in and now builds RoCEv2 unconditionally on
  Linux, so the rebuild guidance is only about libibverbs/rdma-core
  being available at build and runtime.
- Validate pmtu in RoCEv2Server.__init__ (1..5).  Without the check an
  out-of-range value crashed with IndexError deep inside _start() /
  _roce_setup_connection() when the path-MTU log line indexed
  [256,512,1024,2048,4096][pmtu-1]; we now raise a clear
  rogue.GeneralError at construction time.
- test_rocev2.py: update the rocev2_hw skip reason to match the new
  build/runtime requirements and the ImportError-message assertion to
  cover the new wording (Linux build + libibverbs).
- test_rocev2.py: add a parametrized regression test that pins
  _ROCEV2_AVAILABLE=True and exercises the new pmtu validation across
  -1, 0, 6, 7, 100.

Addresses Copilot review on #1186.
- core.rst: Core owns the protection domain (and ibverbs context); the
  CQ / QP / MR are owned by the derived Server.  ~Core only deallocates
  the PD and closes the context.
- server.rst: the receive thread is started by completeConnection()
  after the FPGA QPN is known, NOT by the constructor.  The constructor
  description is updated to reflect that and to mention the
  partial-init cleanup.  The Threading and Lifecycle section also notes
  the new in-thread try/catch around postRecvWr.
- index.rst: replace the speculative pyrogue.RoCEv2.* logger names with
  the names actually created by the implementation:
    * rocev2.Core / rocev2.Server (C++ side, via rogue::Logging::create)
    * pyrogue.Device.RoCEv2Server.<path> (Python wrapper, via
      pyrogue.logInit + the pyrogue.<family>.<ClassName>.<path>
      pattern)
  The setLogLevel / setFilter examples are updated to target loggers
  that users can actually enable.

Addresses Copilot review on #1186.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread CMakeLists.txt
Comment on lines +54 to +72
#####################################
# RoCEv2 / libibverbs (Linux only, always enabled)
#####################################
if (NOT APPLE)
find_library(IBVERBS_LIBRARY NAMES ibverbs)
find_path(IBVERBS_INCLUDE_DIR NAMES infiniband/verbs.h)

if ((NOT IBVERBS_LIBRARY) OR (NOT IBVERBS_INCLUDE_DIR))
message(FATAL_ERROR
"RoCEv2 requires libibverbs which was not found.\n"
"Install rdma-core:\n"
" Ubuntu/Debian: sudo apt-get install libibverbs-dev rdma-core\n"
" Fedora/RHEL: sudo dnf install rdma-core-devel\n"
" conda-forge: conda install -c conda-forge rdma-core")
endif()

add_definitions(-DROGUE_LINUX_BUILD)
include_directories(${IBVERBS_INCLUDE_DIR})
endif()
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RoCEv2/libibverbs discovery is guarded with if (NOT APPLE), but the feature is described as Linux-only (ROGUE_LINUX_BUILD macro name, user-facing error messages, and docs). On non-Apple, non-Linux platforms this will hard-fail the configure step looking for libibverbs. Consider tightening the guard to Linux specifically (e.g., CMAKE_SYSTEM_NAME STREQUAL "Linux" or UNIX AND NOT APPLE) and only defining ROGUE_LINUX_BUILD under that condition.

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +60
# See scripts/ci_bring_up_soft_roce.sh for the azure-kernel caveat that
# forces `continue-on-error: true` on GHA-hosted runners.
- name: Bring up Soft-RoCE (rxe0)
continue-on-error: true
run: ./scripts/ci_bring_up_soft_roce.sh

Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description says full_build_test is pinned to ubuntu-22.04, but this workflow job appears to run on ubuntu-24.04 (see the job header). Please align the description with the workflow (or adjust the workflow pin) so it’s clear which runner/kernel the Soft-RoCE bring-up logic is targeting.

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +71
def _ibverbs_available() -> bool:
"""True iff rogue.protocols.rocev2 imports AND an rxe0 device can be opened.
Probes by construction because there is no list_devices() on the
Boost.Python surface."""
# Construction is the only reliable probe — the module is built
# unconditionally on Linux with -DROCEV2=ON, but rxe0 only exists on
# machines where the rdma_rxe kernel module is loaded.
try:
import rogue.protocols.rocev2 as _r
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ibverbs_available() is defined but never used in this conftest. Since the rxe_server fixture already does the real availability check (and skips with a clear reason), consider removing this unused helper to avoid suggesting there is a second availability mechanism.

Copilot uses AI. Check for mistakes.
self._fpga_qpn = 0
self._fpga_pd_handler = 0
self._fpga_lkey = 0
self._fpga_rkey = 0
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

teardownFpgaQp() clears the internal _fpga_* shadow fields but does not update the exported LocalVariables (FpgaQpn, FpgaLkey) that were set during _start(). This can leave stale values visible in the device tree/UI after teardown. Consider setting FpgaQpn/FpgaLkey back to 0 (and potentially ConnectionState to a teardown state) when teardown completes successfully.

Suggested change
self._fpga_rkey = 0
self._fpga_rkey = 0
self.FpgaQpn.set(0)
self.FpgaLkey.set(0)

Copilot uses AI. Check for mistakes.
Comment on lines +10 to +15
if grep -rnI '[[:blank:]]$' --include=\*.{cpp,h,py} --exclude=doctest.h .; then
echo "Error: Trailing whitespace found in the repository!"
exit 1
fi

if grep -rnI $'\t' --include=\*.{cpp,h,py} --exclude=doctest.h .; then
Copy link

Copilot AI Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script claims it checks only “tracked” files, but the grep runs over . and will also scan untracked/generated directories (e.g., build/, docs/build/) if they exist in the workspace. That can cause spurious CI/local failures. Consider driving the scan from git ls-files (or excluding common build output dirs) so only version-controlled *.cpp/*.h/*.py sources are checked.

Suggested change
if grep -rnI '[[:blank:]]$' --include=\*.{cpp,h,py} --exclude=doctest.h .; then
echo "Error: Trailing whitespace found in the repository!"
exit 1
fi
if grep -rnI $'\t' --include=\*.{cpp,h,py} --exclude=doctest.h .; then
if git ls-files -z -- '*.cpp' '*.h' '*.py' ':(exclude)doctest.h' | xargs -0 -r grep -nI '[[:blank:]]$'; then
echo "Error: Trailing whitespace found in the repository!"
exit 1
fi
if git ls-files -z -- '*.cpp' '*.h' '*.py' ':(exclude)doctest.h' | xargs -0 -r grep -nI $'\t'; then

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants