Add IB pipelined sendrecv kernel with TiledBuffer (#2058)

Open · goelayu wants to merge 3 commits into meta-pytorch:main
Conversation
added 2 commits on April 11, 2026, 23:20
Summary: Fix a benchmark crash by switching from `getDeviceTransports()` (which returns a device pointer) to `buildP2pTransportDevice()` (which returns a host-side copy). The kernel receives `P2pNvlTransportDevice` by value, so `cudaLaunchKernel` must read the struct from host memory, not device memory. Also remove the unused `mpi_test_utils` dep from BUCK (MPI headers are no longer directly included after the BenchmarkTestFixture migration).

Differential Revision: D100352817
Summary:

## Tile sendrecv protocol

Add a bidirectional pipelined data transfer protocol for NVLink P2P:

- Each GPU simultaneously sends and receives by partitioning kernel blocks into sender/receiver roles (half send, half recv).
- Each sender/receiver block pair transfers an independent tile of data through a pipelined staging buffer using monotonic head/tail SignalState counters.
- Sender and receiver run on separate GPUs; signals are exchanged via NVLink remote writes to peer signal buffers.
- Backpressure via `head >= step - pipelineDepth + 1` (CMP_GE).
- Memory ordering: `group.sync()` before the `st.release.sys.global` signal; the `ld.acquire.sys.global` wait forms a release-acquire pair with the sender.
- Persistent step counters in device memory for multi-call correctness.
- Supports clustered kernel launch for improved NVLink utilization.
- Transport-internal tile signals and stepState (`tileMaxBlocks` config).
- Dynamic block count via a per-block `barrier_sync_threadgroup`.

## TiledBuffer<T> abstraction

Add `comms::pipes::TiledBuffer<T>`, a typed view that partitions a buffer into aligned tiles across blocks. Zero-cost abstraction.

## Transport enhancements

- `tileMaxBlocks` config: transport allocates tile signals + stepState
- `p2pBarrierCount` config: allocates barrier buffer for cross-GPU sync
- `send_tile`/`recv_tile` with internal state (no user stepState needed)
- `__restrict__` hints on send_tile/recv_tile parameters
- `options()`, `local_state()`, `remote_state()` device getters

## Benchmark results (H100 NVLink, ncclx 2.29, bidirectional, clustered)

| Size | NCCL | Tile | Speedup |
|------|------|------|---------|
| 128K | 19   | 28   | 1.48x   |
| 1M   | 128  | 159  | 1.24x   |
| 8M   | 332  | 395  | 1.19x   |
| 32M  | 485  | 582  | 1.20x   |
| 64M  | 527  | 645  | 1.22x   |
| 128M | 574  | 679  | 1.18x   |
| 256M | 597  | 702  | 1.18x   |
| 512M | 706  | 712  | 1.01x   |
| 1G   | 717  | 718  | 1.00x   |

Differential Revision: D100253564
Contributor
@goelayu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100357723.
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Apr 13, 2026
Summary: Pull Request resolved: meta-pytorch#2058

Refactor the IB sendrecv kernel code from the torchcomms test directory into a proper reusable library at `comms/pipes/collectives/ib/`, mirroring the Triton IB sendrecv at `comms/pipes/triton/collectives/ib/`.

## Changes

- **New kernel library** (`collectives/ib/SendRecv.cu/.cuh`): Pipelined bidirectional sendrecv over IB/IBGDA using the torchcomms device API (NCCLx/GIN). Split into clean `send_tile()` and `recv_tile()` device functions. Uses `TiledBuffer<float>` from D100253564 for 16-byte-aligned tile partitioning, replacing manual offset math.
- **Benchmark** (`collectives/benchmarks/IbSendRecvBenchmark.cc`): Full benchmark harness with a put-BW reference (performance ceiling) and sendrecv sweeps across sizes, tile sizes, block counts, and pipeline depths.
- **Removed** old files from `torchcomms/tests/perf/cpp/` (PutBwBenchmark*).
- **Dropped** the non-pipelined sendRecvBwKernel; the pipelined version is the default.

## Key Design

- `send_tile()` / `recv_tile()` are separate `__device__` functions for clean composition
- Ring-buffer staging with configurable pipeline depth (PD)
- Signal protocol: DATA_READY (piggybacked on put), SLOT_FREE (receiver → sender), NIC_DONE counter (staging reuse)
- CUDA graph compatible: iteration count and signal_base as kernel args, monotonic counters
- snake_case naming following Pipes conventions

## Performance (H100, IB, p2p_disabled, 128 blocks)

| Size  | Tile  | BW (GB/s) |
|-------|-------|-----------|
| 32KB  | 256B  | 0.03      |
| 1MB   | 8KB   | 0.98      |
| 16MB  | 128KB | 10.45     |
| 128MB | 1MB   | 33.69     |
| 512MB | 4MB   | 42.25     |
| 1GB   | 8MB   | 42.92     |
| 2GB   | 16MB  | 45.25     |
| 4GB   | 32MB  | 46.34     |

Raw put ceiling: ~47.8 GB/s. Best sendrecv: ~46.3 GB/s at 4GB (97% of ceiling).

Depends on D100253564 for TiledBuffer.

Differential Revision: D100357723
Compare: 1cebcd9 to dcefc81
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Apr 13, 2026
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Apr 14, 2026
Compare: dcefc81 to 5b416a8
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Apr 14, 2026
Compare: 35cfada to 50365fb
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request on Apr 14, 2026
Compare: 50365fb to 3ac2151