Fix naming convention and CUDA printf format in BarrierState/ChunkState #2367
Open
snarayankh wants to merge 7 commits into meta-pytorch:main
Conversation
Summary: Design doc for the CTPAT AllGather algorithm — a rail-parallel butterfly (recursive-doubling) variant for multi-GPU-per-node H100 topologies. CTPAT is implemented as a new `NCCL_ALLGATHER_P_ALGO::ctpat` variant inside `AllGatherP/`, reusing the existing persistent infrastructure (PersistArgs, allGatherCtrl, PipeSync, nvlCeBcast).

Key design decisions:
- Lives in `AllGatherP/` (persistent path), not `AllGather/` (eager path), to reuse handle exchange, PipeSync coordination, and CE broadcast orchestration
- v1 is structurally identical to ctrdpipeline — same butterfly pattern, same execution model
- Power-of-two nNodes only for v1; non-power-of-two deferred to Phase 3
- CUDA graph support deferred from v1 (ctgraph_pipeline hardcodes ctpipeline)
- No new CVARs beyond adding `ctpat` to the `NCCL_ALLGATHER_P_ALGO` choices
- Eligibility validated in `allGatherPInit()`, not at exec time

Differential Revision: D102874738
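The butterfly (recursive-doubling) schedule the design doc refers to can be sketched as follows. This is an illustrative model only, assuming power-of-two nNodes as the doc requires; `ButterflyStep` and `butterflySchedule` are hypothetical names, not part of the CTPAT code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of a recursive-doubling (butterfly) exchange schedule.
// At step s, node n exchanges with node n ^ (1 << s), and the amount of data
// held (and therefore sent) doubles each step; log2(nNodes) steps total.
struct ButterflyStep {
  int peer;       // node exchanged with at this step
  int blockCount; // number of per-node data blocks moved at this step
};

std::vector<ButterflyStep> butterflySchedule(int node, int nNodes) {
  std::vector<ButterflyStep> steps;
  for (int dist = 1; dist < nNodes; dist <<= 1) {
    steps.push_back({node ^ dist, dist}); // partner distance doubles each step
  }
  return steps;
}
```

For 8 nodes this yields 3 steps per node, which is why the power-of-two restriction keeps v1 simple: every node finishes with all blocks after exactly log2(nNodes) exchanges.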
Summary: Implements Phase 1 of the CTPAT AllGather algorithm — a rail-parallel butterfly (recursive-doubling) variant for multi-GPU-per-node H100 topologies. CTPAT is added as a new `NCCL_ALLGATHER_P_ALGO::ctpat` variant inside `AllGatherP/`, reusing the existing persistent infrastructure (PersistArgs, allGatherCtrl, PipeSync, nvlCeBcast). For v1, `execPat()` is structurally identical to `execRecursiveDoubling()` — same butterfly steps, same GPE callback, same PipeSync overlap, same NVL CE broadcast.

Changes:
- Add `ctpat` to `NCCL_ALLGATHER_P_ALGO` choices in `nccl_cvars.yaml`
- Create `AllGatherP/PatImpl.cc` with `gpeFn` callback and `execPat()` method
- Add `execPat()` declaration and `algoName(ctpat)` to `AlgoImpl.h`
- Add `ctpat` dispatch case to `allGatherPExec()` in `AllGatherP.cc`
- Add PAT eligibility validation in `allGatherPInit()`: power-of-2 nNodes, nRanks % nLocalRanks == 0, NVL connectivity for local peers

Usage: `NCCL_ALLGATHER_P_ALGO=ctpat` (explicit opt-in, no auto-selection for v1)

Design doc: D102874738

Differential Revision: D102877289
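The eligibility rules listed above boil down to simple arithmetic. A minimal sketch, assuming homogeneous nodes (the real checks live in `allGatherPInit()` and also verify NVL connectivity, which is omitted here; `patEligible` is an illustrative name):

```cpp
#include <cassert>

// Hypothetical sketch of the v1 PAT eligibility arithmetic:
//  - nRanks must divide evenly into nLocalRanks-sized nodes
//  - the resulting node count must be a power of two
bool patEligible(int nRanks, int nLocalRanks) {
  if (nLocalRanks <= 0 || nRanks % nLocalRanks != 0) {
    return false; // nodes must be homogeneous
  }
  int nNodes = nRanks / nLocalRanks;
  // Power-of-two test: n > 0 and n has exactly one bit set.
  return nNodes > 0 && (nNodes & (nNodes - 1)) == 0;
}
```

Validating this at init time rather than exec time means an ineligible topology is rejected once, when the persistent request is set up, instead of on every collective call.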
Summary: Design doc for ctpatcopy — a staged-buffer variant of the AllGather PAT algorithm for NVL x IB on H100. Routes IB transfers through pre-registered staging buffers instead of putting directly to recvbuff, providing pre-registered buffer benefits and laying groundwork for future chunk-level pipelining.

Key design decisions:
- Separate `NCCL_ALLGATHER_P_ALGO::ctpatcopy` enum (not a dynamic branch under ctpat)
- Step-granular staging with contiguous per-step layout (no slot reuse within a step)
- Fully blocking per step via bidirectional sync: pipeSync (GPE→stream) + stepDoneSync (stream→GPE)
- Per-step `cudaStreamSynchronize(copyStream)` for icopy→iput ordering
- Extended PipeEnd kernel resets both pipeSync and stepDoneSync for persistent replay safety
- Requires nLocalRanks > 1 (stream-side copy-back only exists in the multi-GPU path)
- Per-persistent-request staging via BufManager (32MB send + 32MB recv)
- Falls back to `execPat()` (zero-copy) when sendSize exceeds staging capacity

Design doc only — no code changes.

Differential Revision: D102933592
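The capacity-based fallback in the last bullet can be sketched as a simple selection rule. The 32MB figure comes from the summary; the enum and function names are hypothetical, for illustration only:

```cpp
#include <cassert>
#include <cstddef>

// Per-persistent-request staging capacity from the design summary:
// 32MB send + 32MB recv, allocated via BufManager.
constexpr std::size_t kStagingBytes = std::size_t{32} << 20;

enum class PatVariant { StagedCopy, ZeroCopy };

// Hypothetical sketch of the fallback rule: ctpatcopy routes IB traffic
// through staging buffers, but when the send payload no longer fits in the
// staging capacity it falls back to the zero-copy execPat() path.
PatVariant choosePatVariant(std::size_t sendSize) {
  return sendSize <= kStagingBytes ? PatVariant::StagedCopy
                                   : PatVariant::ZeroCopy;
}
```

Making this a static per-request decision (rather than a per-step one) keeps every step of a given persistent request on the same code path.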
Summary: Implements ctpatcopy — a staged-buffer variant of the AllGather PAT algorithm for NVL x IB. Routes IB transfers through pre-registered staging buffers instead of putting directly to recvbuff.

Key design:
- Separate `NCCL_ALLGATHER_P_ALGO::ctpatcopy` enum (not a dynamic branch under ctpat)
- Step-granular staging with contiguous per-step layout (no slot reuse within a step)
- Fully blocking per step via bidirectional sync: pipeSync (GPE→stream) + stepDoneSync (stream→GPE)
- Per-step `cudaStreamSynchronize(copyStream)` for icopy→iput ordering
- Extended PipeEnd kernel resets both pipeSync and stepDoneSync for persistent replay safety
- Requires nLocalRanks > 1 (stream-side copy-back path)
- Per-persistent-request staging via BufManager (32MB send + 32MB recv)
- Falls back to `execPat()` (zero-copy) when sendSize exceeds staging capacity

New files:
- `AllGatherP/PatCopyImpl.cc`: staged gpeFn + `execPatCopy()`
- `AllGatherP/PatCopyImpl.cu`: ncclKernelStepDone + ncclKernelAllGatherPPatCopyPipeEnd

Modified files:
- Types.h: StagingBufId, StagingInfo, StepDoneKernArgs, PatCopyPipeEndKernArgs, Resource extension
- AlgoImpl.h: `execPatCopy()` declaration, ctpatcopy in `algoName()`
- AllGatherP.cc: ctpatcopy dispatch, staging init/exchange/destroy lifecycle
- CtranGpe.h: algoResource field in allgatherp_init OpElem
- BUCK: PatCopyImpl.cu in hetero_ctran_device_lib
- nccl_cvars.yaml: ctpatcopy enum + staging buf size CVAR

Design doc: D102933592

Differential Revision: D102936132
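The "step-granular staging with contiguous per-step layout" can be illustrated with the butterfly's doubling property: step s moves 2^s blocks, so each step's staging region is sized for that step alone, with no slot reuse inside it. These helpers are a hypothetical sketch of that arithmetic, not the actual layout code:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch of per-step staging sizes in a recursive-doubling
// exchange: step s of the butterfly transfers 2^s blocks, laid out
// contiguously, so the step's staging region grows geometrically.
std::size_t stepStagingBytes(int step, std::size_t blockBytes) {
  return (std::size_t{1} << step) * blockBytes;
}

// Total staging needed if every step's region is kept distinct:
// sum_{s=0}^{nSteps-1} 2^s * blockBytes = (2^nSteps - 1) * blockBytes.
std::size_t totalStagingBytes(int nSteps, std::size_t blockBytes) {
  return ((std::size_t{1} << nSteps) - 1) * blockBytes;
}
```

Because each step is fully blocking (pipeSync then stepDoneSync), a step's region could in principle be reused by the next step; keeping the layout contiguous and reuse-free within a step is what lets the copies be issued without intra-step ordering hazards.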
…nchmark for MAST

Summary: Under CUDA_VISIBLE_DEVICES isolation (e.g. torchrun on MAST), each process sees only one GPU and all processes report the same cudaDev. `cudaDeviceCanAccessPeer(dev, dev)` returns 0 per the CUDA spec, causing the NVL backend to miss all local peers. This adds a check: if a peer reports the same cudaDev, assume NVL is available, since topology grouped them as co-local.

Also hardens the AllgatherP benchmark for MAST execution:
- Add an NVL availability check before running algos that require NVL (ctdirect, ctpipeline, ctpat, ctpatcopy)
- Add `dup2(STDERR_FILENO, STDOUT_FILENO)` so MAST captures benchmark output (MAST only collects stderr)
- Add `--mem_type` flag support for cuMem allocation in the 2x8 binary BUCK target

Differential Revision: D103240803
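The co-local fallback described above can be sketched host-side. The callback stands in for the real `cudaDeviceCanAccessPeer` query so the logic is testable without a GPU; `nvlPeerReachable` is an illustrative name, not the actual NVL backend function:

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch of the same-device fallback: under CUDA_VISIBLE_DEVICES
// isolation every process reports the same cudaDev, and
// cudaDeviceCanAccessPeer(dev, dev) is 0 by spec, which would make the NVL
// backend miss peers that topology grouped as co-local.
bool nvlPeerReachable(int myDev, int peerDev,
                      const std::function<bool(int, int)>& canAccessPeer) {
  if (myDev == peerDev) {
    // Peer reports our own device id: topology grouped us as co-local,
    // so assume NVL is available rather than trusting the CUDA query.
    return true;
  }
  return canAccessPeer(myDev, peerDev);
}
```

Outside the isolated case the normal peer-access query still decides, so the fallback only fires in the configuration that previously broke.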
Summary: Replace ctpatcopy's CE-based nvlCeBcast peer broadcast with a Pipes-based NVL dissemination kernel (NvlDissemKernel.cu) following the AllToAllv pattern. All local ranks launch a single cooperative kernel with per-peer matched send/recv subgroups formed inside the kernel via partition_interleaved.

The recv-staged ctpatcopy design remains unstable at 512KB-2MB across both CE and Pipes dissemination variants. The spikes are a property of the recv-staged execution shape (data from staging.recvBuf after PipeSync), not of the specific dissemination mechanism. ctpat (direct IB into recvbuff) is stable at the same sizes.

Changes:
- New NvlDissemKernel.cu: single cooperative kernel with send_group/recv_group per peer, self-copy via put_group
- Types.h: NvlDissemKernArgs struct with forward-declared P2pNvlTransportDevice
- PatCopyImpl.cc: ENABLE_PIPES guard, CE fallback in #else
- BUCK: add NvlDissemKernel.cu to hetero_ctran_device_lib

Differential Revision: D103435262
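An interleaved partition of the kernel's workers into per-peer subgroups can be sketched as follows. This only illustrates the assignment pattern the name partition_interleaved suggests; the actual Pipes primitive operates on device-side thread groups and is not shown here:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of an interleaved partition: W workers are split into
// nPeers subgroups, where the subgroup serving peer p owns workers
// {p, p + nPeers, p + 2*nPeers, ...}. Interleaving (rather than slicing
// contiguously) spreads each peer's subgroup evenly across the workers.
std::vector<int> interleavedSubgroup(int peer, int nPeers, int nWorkers) {
  std::vector<int> workers;
  for (int w = peer; w < nWorkers; w += nPeers) {
    workers.push_back(w);
  }
  return workers;
}
```

With all subgroups live inside one cooperative kernel, each peer's send and recv sides can be matched without a separate launch per peer, which is the point of replacing the CE-based broadcast.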
Summary: Pipes Guardian automated code quality fixes (3/3 LLM agreement, 0.95 confidence):

1. Rename `getBarrierBufferSize` to `get_barrier_buffer_size` to match the library's snake_case convention for all functions/methods. Updated all 3 call sites (definition, BarrierBench.cc, MultiPeerNvlTransport.cc).
2. Fix the CUDA printf format mismatch in `ChunkState::wait_ready_to_recv`: replace `%zu` with `%llu` and add a `static_cast<unsigned long long>()` for the `size_t` argument. CUDA device-side printf does not support the `z` length modifier, which can produce garbage output. This matches the correct pattern already used in `BarrierState::wait`.

Differential Revision: D103738398
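The printf fix in point 2 follows this pattern. Shown here as host-side C++ using `snprintf` so the result can be checked; in the actual `__device__` code the same cast-plus-`%llu` pattern is applied to a plain `printf`. The function name is illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// CUDA device-side printf does not support the 'z' length modifier, so
// printing a size_t with %zu can emit garbage on device. The portable fix:
// cast to unsigned long long and use %llu.
std::string formatChunkSize(std::size_t chunkBytes) {
  char buf[64];
  // Broken on device:  printf("chunk=%zu\n", chunkBytes);
  // Correct pattern:
  std::snprintf(buf, sizeof(buf), "chunk=%llu",
                static_cast<unsigned long long>(chunkBytes));
  return buf;
}
```

The cast matters as much as the specifier: on platforms where `size_t` is not the same width as `unsigned long long`, passing it through varargs under the wrong specifier is undefined behavior even host-side.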
@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103738398.