
Add IB pipelined sendrecv kernel with TiledBuffer (#2058) #2058

Open

goelayu wants to merge 3 commits into meta-pytorch:main from goelayu:export-D100357723

Conversation


goelayu commented Apr 13, 2026

Summary:

Refactor IB sendrecv kernel code from the torchcomms test directory into a proper reusable library at `comms/pipes/collectives/ib/`, mirroring the Triton IB sendrecv at `comms/pipes/triton/collectives/ib/`.

## Changes

- **New kernel library** (`collectives/ib/SendRecv.cu/.cuh`): Pipelined bidirectional sendrecv over IB/IBGDA using the torchcomms device API (NCCLx/GIN). Split into clean `send_tile()` and `recv_tile()` device functions. Uses `TiledBuffer<float>` from D100253564 for 16-byte-aligned tile partitioning, replacing manual offset math.
- **Benchmark** (`collectives/benchmarks/IbSendRecvBenchmark.cc`): Full benchmark harness with put BW reference (performance ceiling) and sendrecv sweeps across sizes, tile sizes, block counts, and pipeline depths.
- **Removed** old files from `torchcomms/tests/perf/cpp/` (PutBwBenchmark*).
- **Dropped** non-pipelined sendRecvBwKernel — pipelined version is the default.

## Key Design

- `send_tile()` / `recv_tile()` are separate `__device__` functions for clean composition
- Ring-buffer staging with configurable pipeline depth (PD)
- Signal protocol: DATA_READY (piggybacked on put), SLOT_FREE (receiver → sender), NIC_DONE counter (staging reuse); a sender-side sketch follows this list
- CUDA graph compatible: iteration count and signal_base as kernel args, monotonic counters
- snake_case naming following Pipes conventions
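
The protocol is easiest to see as a sender loop. Below is a minimal CUDA sketch under stated assumptions: `PD`, `slot_free_head`, and `nic_put_with_signal` are hypothetical stand-ins for the actual Pipes device API, not the code in this diff.

```cuda
constexpr int PD = 4;  // pipeline depth (ring-buffer staging slots)

// Monotonic SLOT_FREE counter, bumped by the receiver; a stand-in for the
// real per-pair signal state.
__device__ long long slot_free_head = 0;

// Stub for the IBGDA put: DATA_READY is piggybacked on the put itself, and
// the NIC_DONE counter advances when the NIC drains the staging slot.
__device__ void nic_put_with_signal(int slot, long long step) {}

__global__ void send_sketch(const float* src, float* staging,
                            size_t tile_elems, long long steps) {
  // The real kernel runs one such stream per sender block; a single stream
  // is shown here for clarity.
  for (long long step = 0; step < steps; ++step) {
    int slot = static_cast<int>(step % PD);
    // Backpressure: reuse a staging slot only after the receiver has freed
    // its previous occupant (head >= step - PD + 1).
    while (*(volatile long long*)&slot_free_head < step - PD + 1) {
    }
    // Cooperative copy of one tile into the ring-buffer staging slot.
    for (size_t i = threadIdx.x; i < tile_elems; i += blockDim.x) {
      staging[slot * tile_elems + i] =
          src[static_cast<size_t>(step) * tile_elems + i];
    }
    __syncthreads();
    if (threadIdx.x == 0) {
      nic_put_with_signal(slot, step);  // put + DATA_READY in one operation
    }
  }
}
```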

## Performance (H100, IB, p2p_disabled, 128 blocks)

| Size | Tile | BW (GB/s) |
|------|------|-----------|
| 32KB | 256B | 0.03 |
| 1MB | 8KB | 0.98 |
| 16MB | 128KB | 10.45 |
| 128MB | 1MB | 33.69 |
| 512MB | 4MB | 42.25 |
| 1GB | 8MB | 42.92 |
| 2GB | 16MB | 45.25 |
| 4GB | 32MB | 46.34 |

Raw put ceiling: ~47.8 GB/s. Best sendrecv: ~46.3 GB/s at 4GB (97% of ceiling).

Depends on D100253564 for TiledBuffer.

Differential Revision: D100357723

Subodh Iyengar added 2 commits April 11, 2026 23:20
Summary:
Fix benchmark crash by switching from `getDeviceTransports()` (device
pointer) to `buildP2pTransportDevice()` (host-side copy). The kernel
receives `P2pNvlTransportDevice` by value, so `cudaLaunchKernel` needs
to read the struct from host memory, not device memory.
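
A hedged illustration of the failure mode; the struct fields and launcher below are stand-ins, not the real transport:

```cuda
// Stand-in for the real transport struct (fields are made up).
struct P2pNvlTransportDevice {
  void* peer_buf;
  unsigned flags;
};

__global__ void bench_kernel(P2pNvlTransportDevice t) { /* uses t by value */ }

void launch(P2pNvlTransportDevice* device_ptr,  // what a device-pointer getter returns
            P2pNvlTransportDevice host_copy) {  // what a host-side builder returns
  // BAD: the by-value kernel argument is assembled on the HOST, so the
  // launch machinery dereferences this pointer with host loads; a device
  // address here faults.
  // bench_kernel<<<1, 1>>>(*device_ptr);

  // GOOD: a host-side copy can be read when filling the parameter buffer.
  bench_kernel<<<1, 1>>>(host_copy);
}
```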

Also remove unused `mpi_test_utils` dep from BUCK (MPI headers are no
longer directly included after the BenchmarkTestFixture migration).

Differential Revision: D100352817
Summary:
## Tile sendrecv protocol

Add a bidirectional pipelined data transfer protocol for NVLink P2P:

- Each GPU simultaneously sends AND receives by partitioning kernel blocks
  into sender/receiver roles (half send, half recv).
- Each sender/receiver block pair transfers an independent tile of data
  through a pipelined staging buffer using monotonic head/tail SignalState
  counters.
- Sender and receiver run on separate GPUs — signals exchanged via NVLink
  remote writes to peer signal buffers.
- Backpressure via `head >= step - pipelineDepth + 1` (CMP_GE).
- Memory ordering: `group.sync()` before `st.release.sys.global` signal;
  `ld.acquire.sys.global` wait forms a release-acquire pair with the sender
  (see the sketch after this list).
- Persistent step counters in device memory for multi-call correctness.
- Supports clustered kernel launch for improved NVLink utilization.
- Transport-internal tile signals and stepState (tileMaxBlocks config).
- Dynamic block count via per-block barrier_sync_threadgroup.
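
A minimal sketch of the release-acquire pair described above, assuming sm_70+ and 64-bit signal words (the real SignalState type in the transport may differ):

```cuda
// Sender side: publish a monotonic step counter after the tile data is
// written; st.release.sys.global orders prior stores before the signal.
__device__ void signal_release(unsigned long long* sig,
                               unsigned long long value) {
  asm volatile("st.release.sys.global.u64 [%0], %1;"
               :
               : "l"(sig), "l"(value)
               : "memory");
}

// Receiver side: spin with ld.acquire.sys.global until the counter reaches
// the target (CMP_GE), forming the release-acquire pair with the sender.
__device__ unsigned long long wait_signal_ge(unsigned long long* sig,
                                             unsigned long long target) {
  unsigned long long v;
  do {
    asm volatile("ld.acquire.sys.global.u64 %0, [%1];"
                 : "=l"(v)
                 : "l"(sig)
                 : "memory");
  } while (v < target);
  return v;
}
```

In the kernel, `group.sync()` runs before `signal_release()` so every thread's tile stores are complete before the counter is published.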

## TiledBuffer<T> abstraction

Add `comms::pipes::TiledBuffer<T>` — a typed view that partitions a buffer
into aligned tiles across blocks. Zero-cost abstraction.
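
A hedged sketch of the shape such a view could take (the actual `comms::pipes::TiledBuffer` API in this diff may differ):

```cuda
#include <cstddef>

template <typename T>
struct TiledBufferSketch {
  T* base;            // underlying buffer
  size_t tile_elems;  // elements per tile; tile_elems * sizeof(T) is kept
                      // 16-byte aligned so vectorized copies stay legal

  // Each block indexes its own tile; this compiles to the same pointer
  // arithmetic as manual offset math, hence zero-cost.
  __host__ __device__ T* tile(size_t block_idx) const {
    return base + block_idx * tile_elems;
  }
  __host__ __device__ size_t tile_bytes() const {
    return tile_elems * sizeof(T);
  }
};
```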

## Transport enhancements

- `tileMaxBlocks` config: transport allocates tile signals + stepState (host-side wiring sketched after this list)
- `p2pBarrierCount` config: allocates barrier buffer for cross-GPU sync
- `send_tile`/`recv_tile` with internal state (no user stepState needed)
- `__restrict__` hints on send_tile/recv_tile parameters
- `options()`, `local_state()`, `remote_state()` device getters
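
A hypothetical host-side wiring of these knobs; the options struct and builder signature below are assumptions inferred from the bullets, not the real API:

```cuda
#include <cstdint>

// Stand-in for the transport's option struct (assumed field names).
struct P2pTransportOptionsSketch {
  uint32_t tileMaxBlocks = 0;    // > 0: allocate tile signals + stepState
  uint32_t p2pBarrierCount = 0;  // > 0: allocate cross-GPU barrier buffer
};

void configure_sketch() {
  P2pTransportOptionsSketch opts;
  opts.tileMaxBlocks = 128;   // match the launch's block count
  opts.p2pBarrierCount = 1;   // one barrier buffer for cross-GPU sync
  // Hypothetically handed to the builder from the first commit:
  // auto transport = buildP2pTransportDevice(opts);
  // Device code could then call transport.send_tile(...) / recv_tile(...)
  // without user-managed stepState, per the bullets above.
}
```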

## Benchmark results (H100 NVLink, ncclx 2.29, bidirectional, clustered)

| Size | NCCL (GB/s) | Tile (GB/s) | Speedup |
|------|-------------|-------------|---------|
| 128K | 19 | 28 | 1.48x |
| 1M | 128 | 159 | 1.24x |
| 8M | 332 | 395 | 1.19x |
| 32M | 485 | 582 | 1.20x |
| 64M | 527 | 645 | 1.22x |
| 128M | 574 | 679 | 1.18x |
| 256M | 597 | 702 | 1.18x |
| 512M | 706 | 712 | 1.01x |
| 1G | 717 | 718 | 1.00x |

Differential Revision: D100253564
meta-cla bot added the CLA Signed label on Apr 13, 2026

meta-codesync bot commented Apr 13, 2026

@goelayu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100357723.

goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 13, 2026
meta-codesync bot changed the title from Add IB pipelined sendrecv kernel with TiledBuffer to Add IB pipelined sendrecv kernel with TiledBuffer (#2058) on Apr 13, 2026
goelayu force-pushed the export-D100357723 branch 2 times, most recently from 1cebcd9 to dcefc81, on April 13, 2026 at 22:24
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 13, 2026
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 14, 2026
goelayu force-pushed the export-D100357723 branch from dcefc81 to 5b416a8 on April 14, 2026 at 00:47
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 14, 2026
goelayu force-pushed the export-D100357723 branch 2 times, most recently from 35cfada to 50365fb, on April 14, 2026 at 01:14
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 14, 2026
goelayu force-pushed the export-D100357723 branch from 50365fb to 3ac2151 on April 14, 2026 at 01:29