Fix naming convention and CUDA printf format in BarrierState/ChunkState #2367

Open
snarayankh wants to merge 7 commits into meta-pytorch:main from snarayankh:export-D103738398

Conversation

@snarayankh

Summary:
Pipes Guardian automated code quality fixes (3/3 LLM agreement, 0.95 confidence):

1. Rename `getBarrierBufferSize` to `get_barrier_buffer_size` to match the library's snake_case convention for all functions/methods. Updated all 3 call sites (definition, BarrierBench.cc, MultiPeerNvlTransport.cc).

2. Fix CUDA printf format mismatch in ChunkState::wait_ready_to_recv: replace `%zu` with `%llu` and add `static_cast<unsigned long long>()` for the `size_t` argument. CUDA device-side printf does not support the `z` length modifier, which can produce garbage output. This matches the correct pattern already used in BarrierState::wait.

Differential Revision: D103738398

snarayankh and others added 7 commits April 30, 2026 10:56
Summary:
Design doc for the CTPAT AllGather algorithm — a rail-parallel butterfly (recursive-doubling) variant for multi-GPU-per-node H100 topologies. CTPAT is implemented as a new `NCCL_ALLGATHER_P_ALGO::ctpat` variant inside `AllGatherP/`, reusing the existing persistent infrastructure (PersistArgs, allGatherCtrl, PipeSync, nvlCeBcast).

Key design decisions:
- Lives in AllGatherP/ (persistent path), not AllGather/ (eager path), to reuse handle exchange, PipeSync coordination, and CE broadcast orchestration
- v1 is structurally identical to ctrdpipeline — same butterfly pattern, same execution model
- Power-of-two nNodes only for v1; non-power-of-two deferred to Phase 3
- CUDA graph support deferred from v1 (ctgraph_pipeline hardcodes ctpipeline)
- No new CVARs beyond adding ctpat to NCCL_ALLGATHER_P_ALGO choices
- Eligibility validated in allGatherPInit(), not at exec time
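The butterfly schedule behind the design can be sketched as follows (a hypothetical illustration of recursive doubling, not the CTPAT implementation itself):

```cpp
#include <cassert>
#include <vector>

// Recursive-doubling (butterfly) partner schedule: with a power-of-two node
// count, node n exchanges with partner n XOR 2^s at step s, and the data each
// node owns doubles every step. After log2(nNodes) steps, every node holds
// all contributions — hence the v1 restriction to power-of-two nNodes.
std::vector<int> butterfly_partners(int node, int nNodes) {
  assert(nNodes > 0 && (nNodes & (nNodes - 1)) == 0); // v1 precondition
  std::vector<int> partners;
  for (int step = 1; step < nNodes; step <<= 1)
    partners.push_back(node ^ step);
  return partners;
}
```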

Differential Revision: D102874738
Summary:
Implements Phase 1 of the CTPAT AllGather algorithm — a rail-parallel butterfly (recursive-doubling) variant for multi-GPU-per-node H100 topologies.

CTPAT is added as a new `NCCL_ALLGATHER_P_ALGO::ctpat` variant inside AllGatherP/, reusing the existing persistent infrastructure (PersistArgs, allGatherCtrl, PipeSync, nvlCeBcast). For v1, `execPat()` is structurally identical to `execRecursiveDoubling()` — same butterfly steps, same GPE callback, same PipeSync overlap, same NVL CE broadcast.

Changes:
- Add `ctpat` to `NCCL_ALLGATHER_P_ALGO` choices in `nccl_cvars.yaml`
- Create `AllGatherP/PatImpl.cc` with `gpeFn` callback and `execPat()` method
- Add `execPat()` declaration and `algoName(ctpat)` to `AlgoImpl.h`
- Add `ctpat` dispatch case to `allGatherPExec()` in `AllGatherP.cc`
- Add PAT eligibility validation in `allGatherPInit()`: power-of-2 nNodes, nRanks%nLocalRanks==0, NVL connectivity for local peers

Usage: `NCCL_ALLGATHER_P_ALGO=ctpat` (explicit opt-in, no auto-selection for v1)
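The init-time eligibility rules listed above can be sketched as a pure predicate (illustrative names, not the real `allGatherPInit()` API; the NVL connectivity check for local peers is elided):

```cpp
// PAT eligibility per the v1 design: power-of-two node count and ranks
// evenly divided across nodes. Validated once at init, not at exec time.
bool pat_eligible(int nNodes, int nRanks, int nLocalRanks) {
  bool powerOfTwoNodes = nNodes > 0 && (nNodes & (nNodes - 1)) == 0;
  bool evenSplit = nLocalRanks > 0 && nRanks % nLocalRanks == 0;
  return powerOfTwoNodes && evenSplit;
}
```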

Design doc: D102874738

Differential Revision: D102877289
Summary:
Design doc for ctpatcopy — a staged-buffer variant of the AllGather PAT algorithm for NVL x IB on H100. Routes IB transfers through pre-registered staging buffers instead of putting directly to recvbuff, providing pre-registered buffer benefits and laying groundwork for future chunk-level pipelining.

Key design decisions:
- Separate NCCL_ALLGATHER_P_ALGO::ctpatcopy enum (not dynamic branch under ctpat)
- Step-granular staging with contiguous per-step layout (no slot reuse within step)
- Fully blocking per step via bidirectional sync: pipeSync (GPE→stream) + stepDoneSync (stream→GPE)
- Per-step cudaStreamSynchronize(copyStream) for icopy→iput ordering
- Extended PipeEnd kernel resets both pipeSync and stepDoneSync for persistent replay safety
- Requires nLocalRanks > 1 (stream-side copy-back only exists in multi-GPU path)
- Per-persistent-request staging via BufManager (32MB send + 32MB recv)
- Falls back to execPat() (zero-copy) when sendSize exceeds staging capacity

Design doc only — no code changes.
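Two of the decisions above can be sketched in a few lines (hypothetical names; the real logic lives in the ctpatcopy path and BufManager):

```cpp
#include <cstddef>

// (1) Fallback rule: the staged path is taken only while the send size fits
// the pre-registered staging capacity; otherwise zero-copy execPat() is used.
constexpr size_t kStagingCapacityBytes = 32ull << 20; // 32MB per direction

bool use_staged_path(size_t sendSize) {
  return sendSize <= kStagingCapacityBytes;
}

// (2) Step-granular volume: in a recursive-doubling allgather the data moved
// doubles each butterfly step, which is why the contiguous per-step layout
// avoids slot reuse within a step.
size_t step_transfer_bytes(size_t sendSize, int step) {
  return sendSize << step;
}
```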

Differential Revision: D102933592
Summary:
Implements ctpatcopy — a staged-buffer variant of the AllGather PAT algorithm for NVL x IB. Routes IB transfers through pre-registered staging buffers instead of putting directly to recvbuff.

Key design:
- Separate NCCL_ALLGATHER_P_ALGO::ctpatcopy enum (not dynamic branch under ctpat)
- Step-granular staging with contiguous per-step layout (no slot reuse within step)
- Fully blocking per step via bidirectional sync: pipeSync (GPE→stream) + stepDoneSync (stream→GPE)
- Per-step cudaStreamSynchronize(copyStream) for icopy→iput ordering
- Extended PipeEnd kernel resets both pipeSync and stepDoneSync for persistent replay safety
- Requires nLocalRanks > 1 (stream-side copy-back path)
- Per-persistent-request staging via BufManager (32MB send + 32MB recv)
- Falls back to execPat() (zero-copy) when sendSize exceeds staging capacity
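The bidirectional per-step handshake can be sketched host-side (hypothetical shape: the real flags live in GPU-visible memory and are toggled by kernels and GPE threads, not `std::atomic`):

```cpp
#include <atomic>

// pipeSync:     GPE proxy signals the stream that a step's staged data arrived
// stepDoneSync: stream signals the GPE that copy-back finished, so the next
//               step may reuse staging slots — making each step fully blocking
struct StepSync {
  std::atomic<int> pipeSync{0};
  std::atomic<int> stepDoneSync{0};

  void gpe_post_step()  { pipeSync.store(1, std::memory_order_release); }
  bool stream_poll()    { return pipeSync.load(std::memory_order_acquire) == 1; }
  void stream_done()    { stepDoneSync.store(1, std::memory_order_release); }
  bool gpe_poll_done()  { return stepDoneSync.load(std::memory_order_acquire) == 1; }
  // The extended PipeEnd kernel resets both flags so a persistent request
  // can be replayed safely.
  void pipe_end_reset() { pipeSync.store(0); stepDoneSync.store(0); }
};
```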

New files:
- AllGatherP/PatCopyImpl.cc: staged gpeFn + execPatCopy()
- AllGatherP/PatCopyImpl.cu: ncclKernelStepDone + ncclKernelAllGatherPPatCopyPipeEnd

Modified files:
- Types.h: StagingBufId, StagingInfo, StepDoneKernArgs, PatCopyPipeEndKernArgs, Resource extension
- AlgoImpl.h: execPatCopy() declaration, ctpatcopy in algoName()
- AllGatherP.cc: ctpatcopy dispatch, staging init/exchange/destroy lifecycle
- CtranGpe.h: algoResource field in allgatherp_init OpElem
- BUCK: PatCopyImpl.cu in hetero_ctran_device_lib
- nccl_cvars.yaml: ctpatcopy enum + staging buf size CVAR

Design doc: D102933592

Differential Revision: D102936132
…nchmark for MAST

Summary:
Under CUDA_VISIBLE_DEVICES isolation (e.g. torchrun on MAST), each process sees only 1 GPU and all report the same cudaDev. `cudaDeviceCanAccessPeer(dev, dev)` returns 0 per CUDA spec, causing the NVL backend to miss all local peers. This adds a check: if a peer reports the same cudaDev, assume NVL is available since topology grouped us as co-local.

Also hardens the AllgatherP benchmark for MAST execution:
- Add NVL availability check before running algos that require NVL (ctdirect, ctpipeline, ctpat, ctpatcopy)
- Add `dup2(STDERR_FILENO, STDOUT_FILENO)` so MAST captures benchmark output (MAST only collects stderr)
- Add `--mem_type` flag support for cuMem allocation in the 2x8 binary BUCK target
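The co-locality workaround can be sketched as a pure decision function (hypothetical name; the real check wraps `cudaDeviceCanAccessPeer`):

```cpp
// Under CUDA_VISIBLE_DEVICES isolation every process sees device 0, so a
// peer reporting the same cudaDev must have been grouped as co-local by the
// topology. Trust that grouping instead of cudaDeviceCanAccessPeer(dev, dev),
// which returns 0 per the CUDA spec when a device is queried against itself.
bool nvl_peer_available(int myDev, int peerDev, bool canAccessPeer) {
  if (myDev == peerDev)
    return true; // isolated processes: same visible dev implies co-local NVL
  return canAccessPeer;
}
```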

Differential Revision: D103240803
Summary:
Replace ctpatcopy's CE-based nvlCeBcast peer broadcast with a Pipes-based NVL dissemination kernel (NvlDissemKernel.cu) following the AllToAllv pattern. All local ranks launch a single cooperative kernel with per-peer matched send/recv subgroups inside the kernel via partition_interleaved.

The recv-staged ctpatcopy design remains unstable at 512KB-2MB across both CE and Pipes dissemination variants. The spikes are a property of the recv-staged execution shape (data from staging.recvBuf after PipeSync), not the specific dissemination mechanism. ctpat (direct IB into recvbuff) is stable at the same sizes.

Changes:
- New NvlDissemKernel.cu: single cooperative kernel with send_group/recv_group per peer, self-copy via put_group
- Types.h: NvlDissemKernArgs struct with forward-declared P2pNvlTransportDevice
- PatCopyImpl.cc: ENABLE_PIPES guard, CE fallback in #else
- BUCK: add NvlDissemKernel.cu to hetero_ctran_device_lib
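The subgrouping idea can be sketched as a round-robin partition (a hypothetical illustration of the interleaving, not the actual `partition_interleaved` device API):

```cpp
#include <vector>

// Hand out work units (e.g. thread blocks of the single cooperative kernel)
// round-robin across nPeers, so each peer's matched send/recv subgroup owns
// every nPeers-th unit.
std::vector<int> interleaved_units(int peerIdx, int nPeers, int nUnits) {
  std::vector<int> mine;
  for (int u = peerIdx; u < nUnits; u += nPeers)
    mine.push_back(u);
  return mine;
}
```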

Differential Revision: D103435262
Summary:
Pipes Guardian automated code quality fixes (3/3 LLM agreement, 0.95 confidence):

1. Rename `getBarrierBufferSize` to `get_barrier_buffer_size` to match the library's snake_case convention for all functions/methods. Updated all 3 call sites (definition, BarrierBench.cc, MultiPeerNvlTransport.cc).

2. Fix CUDA printf format mismatch in ChunkState::wait_ready_to_recv: replace `%zu` with `%llu` and add `static_cast<unsigned long long>()` for the `size_t` argument. CUDA device-side printf does not support the `z` length modifier, which can produce garbage output. This matches the correct pattern already used in BarrierState::wait.

Differential Revision: D103738398
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 4, 2026

meta-codesync Bot commented May 4, 2026

@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103738398.

