Fix naming convention and CUDA printf format in BarrierState/ChunkState #2367
Open
snarayankh wants to merge 7 commits into meta-pytorch:main
Conversation
Summary: Design doc for the CTPAT AllGather algorithm — a rail-parallel butterfly (recursive-doubling) variant for multi-GPU-per-node H100 topologies. CTPAT is implemented as a new `NCCL_ALLGATHER_P_ALGO::ctpat` variant inside `AllGatherP/`, reusing the existing persistent infrastructure (PersistArgs, allGatherCtrl, PipeSync, nvlCeBcast).

Key design decisions:
- Lives in `AllGatherP/` (persistent path), not `AllGather/` (eager path), to reuse handle exchange, PipeSync coordination, and CE broadcast orchestration
- v1 is structurally identical to ctrdpipeline — same butterfly pattern, same execution model
- Power-of-two nNodes only for v1; non-power-of-two deferred to Phase 3
- CUDA graph support deferred from v1 (ctgraph_pipeline hardcodes ctpipeline)
- No new CVARs beyond adding `ctpat` to the `NCCL_ALLGATHER_P_ALGO` choices
- Eligibility validated in `allGatherPInit()`, not at exec time

Differential Revision: D102874738
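The butterfly (recursive-doubling) schedule the design doc refers to can be sketched as follows. This is an illustrative model only, assuming power-of-two nNodes as the doc requires; `ButterflyStep` and `butterflySchedule` are hypothetical names, not part of the CTPAT code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of a recursive-doubling (butterfly) exchange schedule.
// At step s, node n exchanges with node n ^ (1 << s), and the amount of data
// held (and therefore sent) doubles each step; log2(nNodes) steps total.
struct ButterflyStep {
  int peer;       // node exchanged with at this step
  int blockCount; // number of per-node data blocks moved at this step
};

std::vector<ButterflyStep> butterflySchedule(int node, int nNodes) {
  std::vector<ButterflyStep> steps;
  for (int dist = 1; dist < nNodes; dist <<= 1) {
    steps.push_back({node ^ dist, dist}); // partner distance doubles each step
  }
  return steps;
}
```

For 8 nodes this yields 3 steps per node, which is why the power-of-two restriction keeps v1 simple: every node finishes with all blocks after exactly log2(nNodes) exchanges.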
Summary: Implements Phase 1 of the CTPAT AllGather algorithm — a rail-parallel butterfly (recursive-doubling) variant for multi-GPU-per-node H100 topologies. CTPAT is added as a new `NCCL_ALLGATHER_P_ALGO::ctpat` variant inside `AllGatherP/`, reusing the existing persistent infrastructure (PersistArgs, allGatherCtrl, PipeSync, nvlCeBcast). For v1, `execPat()` is structurally identical to `execRecursiveDoubling()` — same butterfly steps, same GPE callback, same PipeSync overlap, same NVL CE broadcast.

Changes:
- Add `ctpat` to `NCCL_ALLGATHER_P_ALGO` choices in `nccl_cvars.yaml`
- Create `AllGatherP/PatImpl.cc` with `gpeFn` callback and `execPat()` method
- Add `execPat()` declaration and `algoName(ctpat)` to `AlgoImpl.h`
- Add `ctpat` dispatch case to `allGatherPExec()` in `AllGatherP.cc`
- Add PAT eligibility validation in `allGatherPInit()`: power-of-2 nNodes, nRanks % nLocalRanks == 0, NVL connectivity for local peers

Usage: `NCCL_ALLGATHER_P_ALGO=ctpat` (explicit opt-in, no auto-selection for v1)

Design doc: D102874738

Differential Revision: D102877289
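The eligibility rules listed above boil down to simple arithmetic. A minimal sketch, assuming homogeneous nodes (the real checks live in `allGatherPInit()` and also verify NVL connectivity, which is omitted here; `patEligible` is an illustrative name):

```cpp
#include <cassert>

// Hypothetical sketch of the v1 PAT eligibility arithmetic:
//  - nRanks must divide evenly into nLocalRanks-sized nodes
//  - the resulting node count must be a power of two
bool patEligible(int nRanks, int nLocalRanks) {
  if (nLocalRanks <= 0 || nRanks % nLocalRanks != 0) {
    return false; // nodes must be homogeneous
  }
  int nNodes = nRanks / nLocalRanks;
  // Power-of-two test: n > 0 and n has exactly one bit set.
  return nNodes > 0 && (nNodes & (nNodes - 1)) == 0;
}
```

Validating this at init time rather than exec time means an ineligible topology is rejected once, when the persistent request is set up, instead of on every collective call.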
Summary: Design doc for ctpatcopy — a staged-buffer variant of the AllGather PAT algorithm for NVL x IB on H100. Routes IB transfers through pre-registered staging buffers instead of putting directly to recvbuff, providing pre-registered buffer benefits and laying groundwork for future chunk-level pipelining.

Key design decisions:
- Separate `NCCL_ALLGATHER_P_ALGO::ctpatcopy` enum (not a dynamic branch under ctpat)
- Step-granular staging with contiguous per-step layout (no slot reuse within a step)
- Fully blocking per step via bidirectional sync: pipeSync (GPE→stream) + stepDoneSync (stream→GPE)
- Per-step `cudaStreamSynchronize(copyStream)` for icopy→iput ordering
- Extended PipeEnd kernel resets both pipeSync and stepDoneSync for persistent replay safety
- Requires nLocalRanks > 1 (stream-side copy-back only exists in the multi-GPU path)
- Per-persistent-request staging via BufManager (32MB send + 32MB recv)
- Falls back to `execPat()` (zero-copy) when sendSize exceeds staging capacity

Design doc only — no code changes.

Differential Revision: D102933592
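The capacity-based fallback in the last bullet can be sketched as a simple selection rule. The 32MB figure comes from the summary; the enum and function names are hypothetical, for illustration only:

```cpp
#include <cassert>
#include <cstddef>

// Per-persistent-request staging capacity from the design summary:
// 32MB send + 32MB recv, allocated via BufManager.
constexpr std::size_t kStagingBytes = std::size_t{32} << 20;

enum class PatVariant { StagedCopy, ZeroCopy };

// Hypothetical sketch of the fallback rule: ctpatcopy routes IB traffic
// through staging buffers, but when the send payload no longer fits in the
// staging capacity it falls back to the zero-copy execPat() path.
PatVariant choosePatVariant(std::size_t sendSize) {
  return sendSize <= kStagingBytes ? PatVariant::StagedCopy
                                   : PatVariant::ZeroCopy;
}
```

Making this a static per-request decision (rather than a per-step one) keeps every step of a given persistent request on the same code path.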
Summary: Implements ctpatcopy — a staged-buffer variant of the AllGather PAT algorithm for NVL x IB. Routes IB transfers through pre-registered staging buffers instead of putting directly to recvbuff.

Key design:
- Separate `NCCL_ALLGATHER_P_ALGO::ctpatcopy` enum (not a dynamic branch under ctpat)
- Step-granular staging with contiguous per-step layout (no slot reuse within a step)
- Fully blocking per step via bidirectional sync: pipeSync (GPE→stream) + stepDoneSync (stream→GPE)
- Per-step `cudaStreamSynchronize(copyStream)` for icopy→iput ordering
- Extended PipeEnd kernel resets both pipeSync and stepDoneSync for persistent replay safety
- Requires nLocalRanks > 1 (stream-side copy-back path)
- Per-persistent-request staging via BufManager (32MB send + 32MB recv)
- Falls back to `execPat()` (zero-copy) when sendSize exceeds staging capacity

New files:
- `AllGatherP/PatCopyImpl.cc`: staged gpeFn + `execPatCopy()`
- `AllGatherP/PatCopyImpl.cu`: ncclKernelStepDone + ncclKernelAllGatherPPatCopyPipeEnd

Modified files:
- Types.h: StagingBufId, StagingInfo, StepDoneKernArgs, PatCopyPipeEndKernArgs, Resource extension
- AlgoImpl.h: `execPatCopy()` declaration, ctpatcopy in `algoName()`
- AllGatherP.cc: ctpatcopy dispatch, staging init/exchange/destroy lifecycle
- CtranGpe.h: algoResource field in allgatherp_init OpElem
- BUCK: PatCopyImpl.cu in hetero_ctran_device_lib
- nccl_cvars.yaml: ctpatcopy enum + staging buf size CVAR

Design doc: D102933592

Differential Revision: D102936132
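The "step-granular staging with contiguous per-step layout" can be illustrated with the butterfly's doubling property: step s moves 2^s blocks, so each step's staging region is sized for that step alone, with no slot reuse inside it. These helpers are a hypothetical sketch of that arithmetic, not the actual layout code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of per-step staging sizes in a recursive-doubling
// exchange: step s of the butterfly transfers 2^s blocks, laid out
// contiguously, so the step's staging region grows geometrically.
std::size_t stepStagingBytes(int step, std::size_t blockBytes) {
  return (std::size_t{1} << step) * blockBytes;
}

// Total staging needed if every step's region is kept distinct:
// sum_{s=0}^{nSteps-1} 2^s * blockBytes = (2^nSteps - 1) * blockBytes.
std::size_t totalStagingBytes(int nSteps, std::size_t blockBytes) {
  return ((std::size_t{1} << nSteps) - 1) * blockBytes;
}
```

Because each step is fully blocking (pipeSync then stepDoneSync), a step's region could in principle be reused by the next step; keeping the layout contiguous and reuse-free within a step is what lets the copies be issued without intra-step ordering hazards.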
…nchmark for MAST

Summary: Under CUDA_VISIBLE_DEVICES isolation (e.g. torchrun on MAST), each process sees only one GPU and all processes report the same cudaDev. `cudaDeviceCanAccessPeer(dev, dev)` returns 0 per the CUDA spec, causing the NVL backend to miss all local peers. This adds a check: if a peer reports the same cudaDev, assume NVL is available, since topology grouped them as co-local.

Also hardens the AllgatherP benchmark for MAST execution:
- Add an NVL availability check before running algos that require NVL (ctdirect, ctpipeline, ctpat, ctpatcopy)
- Add `dup2(STDERR_FILENO, STDOUT_FILENO)` so MAST captures benchmark output (MAST only collects stderr)
- Add `--mem_type` flag support for cuMem allocation in the 2x8 binary BUCK target

Differential Revision: D103240803
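The co-local fallback described above can be sketched host-side. The callback stands in for the real `cudaDeviceCanAccessPeer` query so the logic is testable without a GPU; `nvlPeerReachable` is an illustrative name, not the actual NVL backend function:

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch of the same-device fallback: under CUDA_VISIBLE_DEVICES
// isolation every process reports the same cudaDev, and
// cudaDeviceCanAccessPeer(dev, dev) is 0 by spec, which would make the NVL
// backend miss peers that topology grouped as co-local.
bool nvlPeerReachable(int myDev, int peerDev,
                      const std::function<bool(int, int)>& canAccessPeer) {
  if (myDev == peerDev) {
    // Peer reports our own device id: topology grouped us as co-local,
    // so assume NVL is available rather than trusting the CUDA query.
    return true;
  }
  return canAccessPeer(myDev, peerDev);
}
```

Outside the isolated case the normal peer-access query still decides, so the fallback only fires in the configuration that previously broke.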
Summary: Replace ctpatcopy's CE-based nvlCeBcast peer broadcast with a Pipes-based NVL dissemination kernel (NvlDissemKernel.cu) following the AllToAllv pattern. All local ranks launch a single cooperative kernel with per-peer matched send/recv subgroups formed inside the kernel via partition_interleaved.

The recv-staged ctpatcopy design remains unstable at 512KB-2MB across both CE and Pipes dissemination variants. The spikes are a property of the recv-staged execution shape (data from staging.recvBuf after PipeSync), not of the specific dissemination mechanism. ctpat (direct IB into recvbuff) is stable at the same sizes.

Changes:
- New NvlDissemKernel.cu: single cooperative kernel with send_group/recv_group per peer, self-copy via put_group
- Types.h: NvlDissemKernArgs struct with forward-declared P2pNvlTransportDevice
- PatCopyImpl.cc: ENABLE_PIPES guard, CE fallback in #else
- BUCK: add NvlDissemKernel.cu to hetero_ctran_device_lib

Differential Revision: D103435262
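An interleaved partition of the kernel's workers into per-peer subgroups can be sketched as follows. This only illustrates the assignment pattern the name partition_interleaved suggests; the actual Pipes primitive operates on device-side thread groups and is not shown here:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of an interleaved partition: W workers are split into
// nPeers subgroups, where the subgroup serving peer p owns workers
// {p, p + nPeers, p + 2*nPeers, ...}. Interleaving (rather than slicing
// contiguously) spreads each peer's subgroup evenly across the workers.
std::vector<int> interleavedSubgroup(int peer, int nPeers, int nWorkers) {
  std::vector<int> workers;
  for (int w = peer; w < nWorkers; w += nPeers) {
    workers.push_back(w);
  }
  return workers;
}
```

With all subgroups live inside one cooperative kernel, each peer's send and recv sides can be matched without a separate launch per peer, which is the point of replacing the CE-based broadcast.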
Summary: Pipes Guardian automated code quality fixes (3/3 LLM agreement, 0.95 confidence):

1. Rename `getBarrierBufferSize` to `get_barrier_buffer_size` to match the library's snake_case convention for all functions/methods. Updated all 3 call sites (definition, BarrierBench.cc, MultiPeerNvlTransport.cc).
2. Fix the CUDA printf format mismatch in `ChunkState::wait_ready_to_recv`: replace `%zu` with `%llu` and add a `static_cast<unsigned long long>()` for the `size_t` argument. CUDA device-side printf does not support the `z` length modifier, which can produce garbage output. This matches the correct pattern already used in `BarrierState::wait`.

Differential Revision: D103738398
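The printf fix in point 2 follows this pattern. Shown here as host-side C++ using `snprintf` so the result can be checked; in the actual `__device__` code the same cast-plus-`%llu` pattern is applied to a plain `printf`. The function name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// CUDA device-side printf does not support the 'z' length modifier, so
// printing a size_t with %zu can emit garbage on device. The portable fix:
// cast to unsigned long long and use %llu.
std::string formatChunkSize(std::size_t chunkBytes) {
  char buf[64];
  // Broken on device:  printf("chunk=%zu\n", chunkBytes);
  // Correct pattern:
  std::snprintf(buf, sizeof(buf), "chunk=%llu",
                static_cast<unsigned long long>(chunkBytes));
  return buf;
}
```

The cast matters as much as the specifier: on platforms where `size_t` is not the same width as `unsigned long long`, passing it through varargs under the wrong specifier is undefined behavior even host-side.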
@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103738398.