
Add IB pipelined sendrecv kernel with TiledBuffer (#2058) #2058

Open

goelayu wants to merge 3 commits into meta-pytorch:main from goelayu:export-D100357723

Conversation


goelayu commented Apr 13, 2026

Summary:

Refactor IB sendrecv kernel code from the torchcomms test directory into a proper reusable library at `comms/pipes/collectives/ib/`, mirroring the Triton IB sendrecv at `comms/pipes/triton/collectives/ib/`.

## Changes

- **New kernel library** (`collectives/ib/SendRecv.cu/.cuh`): Pipelined bidirectional sendrecv over IB/IBGDA using the torchcomms device API (NCCLx/GIN). Split into clean `send_tile()` and `recv_tile()` device functions. Uses `TiledBuffer<float>` from D100253564 for 16-byte-aligned tile partitioning, replacing manual offset math.
- **Benchmark** (`collectives/benchmarks/IbSendRecvBenchmark.cc`): Full benchmark harness with put BW reference (performance ceiling) and sendrecv sweeps across sizes, tile sizes, block counts, and pipeline depths.
- **Removed** old files from `torchcomms/tests/perf/cpp/` (PutBwBenchmark*).
- **Dropped** non-pipelined sendRecvBwKernel — pipelined version is the default.

## Key Design

- `send_tile()` / `recv_tile()` are separate `__device__` functions for clean composition
- Ring-buffer staging with configurable pipeline depth (PD)
- Signal protocol: DATA_READY (piggybacked on put), SLOT_FREE (receiver → sender), NIC_DONE counter (staging reuse); a sender-side sketch follows this list
- CUDA graph compatible: iteration count and signal_base as kernel args, monotonic counters
- snake_case naming following Pipes conventions
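
The protocol is easiest to see as a sender loop. Below is a minimal CUDA sketch under stated assumptions: `PD`, `slot_free_head`, and `nic_put_with_signal` are hypothetical stand-ins for the actual Pipes device API, not the code in this diff.

```cuda
constexpr int PD = 4;  // pipeline depth (ring-buffer staging slots)

// Monotonic SLOT_FREE counter, bumped by the receiver; a stand-in for the
// real per-pair signal state.
__device__ long long slot_free_head = 0;

// Stub for the IBGDA put: DATA_READY is piggybacked on the put itself, and
// the NIC_DONE counter advances when the NIC drains the staging slot.
__device__ void nic_put_with_signal(int slot, long long step) {}

__global__ void send_sketch(const float* src, float* staging,
                            size_t tile_elems, long long steps) {
  // The real kernel runs one such stream per sender block; a single stream
  // is shown here for clarity.
  for (long long step = 0; step < steps; ++step) {
    int slot = static_cast<int>(step % PD);
    // Backpressure: reuse a staging slot only after the receiver has freed
    // its previous occupant (head >= step - PD + 1).
    while (*(volatile long long*)&slot_free_head < step - PD + 1) {
    }
    // Cooperative copy of one tile into the ring-buffer staging slot.
    for (size_t i = threadIdx.x; i < tile_elems; i += blockDim.x) {
      staging[slot * tile_elems + i] =
          src[static_cast<size_t>(step) * tile_elems + i];
    }
    __syncthreads();
    if (threadIdx.x == 0) {
      nic_put_with_signal(slot, step);  // put + DATA_READY in one operation
    }
  }
}
```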

## Performance (H100, IB, p2p_disabled, 128 blocks)

| Size | Tile | BW (GB/s) |
|------|------|-----------|
| 32KB | 256B | 0.03 |
| 1MB | 8KB | 0.98 |
| 16MB | 128KB | 10.45 |
| 128MB | 1MB | 33.69 |
| 512MB | 4MB | 42.25 |
| 1GB | 8MB | 42.92 |
| 2GB | 16MB | 45.25 |
| 4GB | 32MB | 46.34 |

Raw put ceiling: ~47.8 GB/s. Best sendrecv: ~46.3 GB/s at 4GB (97% of ceiling).

Depends on D100253564 for TiledBuffer.

Differential Revision: D100357723

Subodh Iyengar added 2 commits April 11, 2026 23:20
Summary:
Fix benchmark crash by switching from `getDeviceTransports()` (device
pointer) to `buildP2pTransportDevice()` (host-side copy). The kernel
receives `P2pNvlTransportDevice` by value, so `cudaLaunchKernel` needs
to read the struct from host memory, not device memory.
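
A hedged illustration of the failure mode; the struct fields and launcher below are stand-ins, not the real transport:

```cuda
// Stand-in for the real transport struct (fields are made up).
struct P2pNvlTransportDevice {
  void* peer_buf;
  unsigned flags;
};

__global__ void bench_kernel(P2pNvlTransportDevice t) { /* uses t by value */ }

void launch(P2pNvlTransportDevice* device_ptr,  // what a device-pointer getter returns
            P2pNvlTransportDevice host_copy) {  // what a host-side builder returns
  // BAD: the by-value kernel argument is assembled on the HOST, so the
  // launch machinery dereferences this pointer with host loads; a device
  // address here faults.
  // bench_kernel<<<1, 1>>>(*device_ptr);

  // GOOD: a host-side copy can be read when filling the parameter buffer.
  bench_kernel<<<1, 1>>>(host_copy);
}
```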

Also remove unused `mpi_test_utils` dep from BUCK (MPI headers are no
longer directly included after the BenchmarkTestFixture migration).

Differential Revision: D100352817
Summary:
## Tile sendrecv protocol

Add a bidirectional pipelined data transfer protocol for NVLink P2P:

- Each GPU simultaneously sends AND receives by partitioning kernel blocks
  into sender/receiver roles (half send, half recv).
- Each sender/receiver block pair transfers an independent tile of data
  through a pipelined staging buffer using monotonic head/tail SignalState
  counters.
- Sender and receiver run on separate GPUs — signals exchanged via NVLink
  remote writes to peer signal buffers.
- Backpressure via `head >= step - pipelineDepth + 1` (CMP_GE).
- Memory ordering: `group.sync()` before `st.release.sys.global` signal;
  `ld.acquire.sys.global` wait forms a release-acquire pair with the sender
  (see the sketch after this list).
- Persistent step counters in device memory for multi-call correctness.
- Supports clustered kernel launch for improved NVLink utilization.
- Transport-internal tile signals and stepState (tileMaxBlocks config).
- Dynamic block count via per-block barrier_sync_threadgroup.
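
A minimal sketch of the release-acquire pair described above, assuming sm_70+ and 64-bit signal words (the real SignalState type in the transport may differ):

```cuda
// Sender side: publish a monotonic step counter after the tile data is
// written; st.release.sys.global orders prior stores before the signal.
__device__ void signal_release(unsigned long long* sig,
                               unsigned long long value) {
  asm volatile("st.release.sys.global.u64 [%0], %1;"
               :
               : "l"(sig), "l"(value)
               : "memory");
}

// Receiver side: spin with ld.acquire.sys.global until the counter reaches
// the target (CMP_GE), forming the release-acquire pair with the sender.
__device__ unsigned long long wait_signal_ge(unsigned long long* sig,
                                             unsigned long long target) {
  unsigned long long v;
  do {
    asm volatile("ld.acquire.sys.global.u64 %0, [%1];"
                 : "=l"(v)
                 : "l"(sig)
                 : "memory");
  } while (v < target);
  return v;
}
```

In the kernel, `group.sync()` runs before `signal_release()` so every thread's tile stores are complete before the counter is published.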

## TiledBuffer<T> abstraction

Add `comms::pipes::TiledBuffer<T>` — a typed view that partitions a buffer
into aligned tiles across blocks. Zero-cost abstraction.
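
A hedged sketch of the shape such a view could take (the actual `comms::pipes::TiledBuffer` API in this diff may differ):

```cuda
#include <cstddef>

template <typename T>
struct TiledBufferSketch {
  T* base;            // underlying buffer
  size_t tile_elems;  // elements per tile; tile_elems * sizeof(T) is kept
                      // 16-byte aligned so vectorized copies stay legal

  // Each block indexes its own tile; this compiles to the same pointer
  // arithmetic as manual offset math, hence zero-cost.
  __host__ __device__ T* tile(size_t block_idx) const {
    return base + block_idx * tile_elems;
  }
  __host__ __device__ size_t tile_bytes() const {
    return tile_elems * sizeof(T);
  }
};
```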

## Transport enhancements

- `tileMaxBlocks` config: transport allocates tile signals + stepState (host-side wiring sketched after this list)
- `p2pBarrierCount` config: allocates barrier buffer for cross-GPU sync
- `send_tile`/`recv_tile` with internal state (no user stepState needed)
- `__restrict__` hints on send_tile/recv_tile parameters
- `options()`, `local_state()`, `remote_state()` device getters
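
A hypothetical host-side wiring of these knobs; the options struct and builder signature below are assumptions inferred from the bullets, not the real API:

```cuda
#include <cstdint>

// Stand-in for the transport's option struct (assumed field names).
struct P2pTransportOptionsSketch {
  uint32_t tileMaxBlocks = 0;    // > 0: allocate tile signals + stepState
  uint32_t p2pBarrierCount = 0;  // > 0: allocate cross-GPU barrier buffer
};

void configure_sketch() {
  P2pTransportOptionsSketch opts;
  opts.tileMaxBlocks = 128;   // match the launch's block count
  opts.p2pBarrierCount = 1;   // one barrier buffer for cross-GPU sync
  // Hypothetically handed to the builder from the first commit:
  // auto transport = buildP2pTransportDevice(opts);
  // Device code could then call transport.send_tile(...) / recv_tile(...)
  // without user-managed stepState, per the bullets above.
}
```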

## Benchmark results (H100 NVLink, ncclx 2.29, bidirectional, clustered)

| Size | NCCL (GB/s) | Tile (GB/s) | Speedup |
|------|-------------|-------------|---------|
| 128K | 19 | 28 | 1.48x |
| 1M | 128 | 159 | 1.24x |
| 8M | 332 | 395 | 1.19x |
| 32M | 485 | 582 | 1.20x |
| 64M | 527 | 645 | 1.22x |
| 128M | 574 | 679 | 1.18x |
| 256M | 597 | 702 | 1.18x |
| 512M | 706 | 712 | 1.01x |
| 1G | 717 | 718 | 1.00x |

Differential Revision: D100253564
meta-cla bot added the CLA Signed label on Apr 13, 2026

meta-codesync bot commented Apr 13, 2026

@goelayu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100357723.

goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 13, 2026
meta-codesync bot changed the title from Add IB pipelined sendrecv kernel with TiledBuffer to Add IB pipelined sendrecv kernel with TiledBuffer (#2058) on Apr 13, 2026
goelayu force-pushed the export-D100357723 branch 2 times, most recently from 1cebcd9 to dcefc81, on April 13, 2026 at 22:24
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 13, 2026
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 14, 2026
goelayu force-pushed the export-D100357723 branch from dcefc81 to 5b416a8 on April 14, 2026 at 00:47
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 14, 2026
goelayu force-pushed the export-D100357723 branch 2 times, most recently from 35cfada to 50365fb, on April 14, 2026 at 01:14
goelayu pushed a commit to goelayu/torchcomms that referenced this pull request Apr 14, 2026
goelayu force-pushed the export-D100357723 branch from 50365fb to 3ac2151 on April 14, 2026 at 01:29