Skip to content

TL/CUDA: NVLS allgatherv#1240

Merged
janjust merged 4 commits intoopenucx:masterfrom
ikryukov:nvls_allgatherv
Jan 8, 2026
Merged

TL/CUDA: NVLS allgatherv#1240
janjust merged 4 commits intoopenucx:masterfrom
ikryukov:nvls_allgatherv

Conversation

@ikryukov
Copy link
Collaborator

What

NVLS allgatherv algorithm implementation.

Why ?

Adding NVLS support for allgatherv enables optimized data gathering across GPUs by utilizing multimem.st instructions to efficiently broadcast each rank's data to all participants through the NVLink SHARP.

How ?

  • Each rank writes its local data to a shared NVLS multicast buffer at its designated offset using vectorized multimem.st store operations (16 bytes at a time)
  • Pre-barrier ensures all ranks are ready before writing; post-barrier ensures all writes complete before reading
  • Final gathered data is copied from the unicast (UC) buffer to the destination
  • Requires 16-byte alignment for offsets and counts (enforced with validation)
  • Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)

@ikryukov ikryukov self-assigned this Dec 15, 2025
@ikryukov ikryukov requested review from Sergei-Lebedev and janjust and removed request for Sergei-Lebedev December 15, 2025 16:44
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 15, 2025

Greptile Summary

Implements NVLS (NVLink SHARP) support for allgatherv operations, enabling optimized data gathering across GPUs using multimem.st instructions to broadcast each rank's data through NVLink multicast buffers.

Key changes:

  • Each rank writes its data to shared NVLS multicast buffer at designated offset using vectorized 16-byte stores
  • Pre/post barriers synchronize ranks before writing and after completion
  • Validates 16-byte alignment requirements for offsets and counts
  • Checks that total data size fits within NVLS symmetric buffer limits
  • Final gathered data copied from unicast buffer to destination
  • Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)

Additional improvements:

  • Fixed missing ucc_tl_cuda_task_put cleanup on error in allreduce_nvls.c:219
  • Added buffer size validation check for allreduce NVLS (previously missing)

Confidence Score: 5/5

  • Safe to merge - implementation follows established NVLS patterns with proper validation and error handling
  • Well-structured implementation that mirrors existing NVLS collectives (allreduce, reduce_scatter). Includes comprehensive alignment validation, buffer size checks, proper resource cleanup, and fixes existing bugs in allreduce. The kernel logic is sound with correct barrier synchronization and vectorized memory operations.
  • No files require special attention

Important Files Changed

Filename Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c New NVLS allgatherv implementation with proper alignment validation, buffer size checks, and barrier synchronization
src/components/tl/cuda/kernels/allgatherv_kernel.cu CUDA kernel using vectorized multimem.st instructions for 16-byte writes with proper barriers and alignment assertions
src/components/tl/cuda/allreduce/allreduce_nvls.c Added buffer size validation and fixed missing task cleanup on error in allreduce NVLS

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (2)

  1. src/components/tl/cuda/allgatherv/allgatherv_nvls.c, line 228-248 (link)

    logic: alignment validation only checks current rank's offset/count, not all ranks - if any rank has misaligned data, the kernel will fail at runtime

  2. src/components/tl/cuda/allgatherv/allgatherv_nvls.c, line 216-226 (link)

    logic: missing validation that total_count_bytes fits within nvls_symmetric_size buffer - could overflow symmetric buffer

10 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

11 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@ikryukov
Copy link
Collaborator Author

/build

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR implements NVLS allgatherv algorithm for CUDA, enabling optimized data gathering across GPUs using NVLink SHARP with multimem.st instructions.

Key Changes:

  • Added new NVLS allgatherv algorithm implementation with proper 16-byte alignment validation
  • Each rank writes its data to shared NVLS multicast buffer using vectorized stores (16 bytes at a time)
  • Implemented pre/post barrier synchronization to ensure data consistency
  • Added proper buffer size validation and error handling
  • Includes bug fix in allreduce_nvls.c for missing task cleanup on event creation failure

Architecture:

  • Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)
  • Datatype-agnostic implementation treating all data as raw bytes in uint32_t units
  • Final data copied from unicast buffer to destination after all ranks complete writing

Confidence Score: 4/5

  • safe to merge with one alignment issue that should be verified
  • implementation follows established NVLS patterns and includes proper validation; one potential alignment issue in kernel needs verification
  • pay close attention to src/components/tl/cuda/kernels/allgatherv_kernel.cu line 39 for potential alignment issues

Important Files Changed

File Analysis

Filename Score Overview
src/components/tl/cuda/kernels/allgatherv_kernel.cu 4/5 new CUDA kernel for NVLS allgatherv; potential alignment issue with uint4 cast on line 39
src/components/tl/cuda/allgatherv/allgatherv_nvls.c 5/5 implements NVLS allgatherv algorithm with proper alignment validation, buffer management, and CUDA event handling
src/components/tl/cuda/tl_cuda.h 5/5 adds allgatherv_nvls task state structure with offset, count, and NVLS buffer pointers
src/components/tl/cuda/allreduce/allreduce_nvls.c 5/5 adds buffer size validation and fixes missing task cleanup on event creation failure

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR adds NVLS (NVLink SHARP) support for the allgatherv collective operation, enabling efficient GPU-to-GPU data gathering through hardware multicast capabilities. The implementation follows the established pattern from reduce_scatter and allreduce NVLS algorithms.

Key changes:

  • New NVLS allgatherv algorithm with vectorized CUDA kernel using multimem.st instructions for 16-byte aligned writes
  • Proper validation for 16-byte alignment requirements (offset and count must be divisible by 4 uint32_t elements)
  • Datatype-agnostic implementation converting all operations to uint32_t units
  • Pre/post barriers ensure proper synchronization across all ranks before writing and reading
  • Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)
  • Bug fix in allreduce_nvls.c: adds missing symmetric buffer size validation and task cleanup on error

Integration:

  • Properly integrated with existing TL/CUDA collective dispatcher
  • Build system updated for both source and kernel compilation
  • Follows existing NVLS algorithm patterns consistently

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk after review of the minor style suggestion
  • The implementation is solid and follows established patterns from existing NVLS algorithms. Proper validation, error handling, and alignment checks are in place. The code includes defensive assertions and clear documentation. One minor style improvement was suggested for kernel loop clarity. The bug fix in allreduce_nvls.c is a bonus improvement.
  • No files require special attention - all changes follow established patterns and include proper validation

Important Files Changed

File Analysis

Filename Score Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c 4/5 implements NVLS allgatherv with proper validation, alignment checks, and error handling. includes datatype-agnostic approach converting all data to uint32_t units
src/components/tl/cuda/kernels/allgatherv_kernel.cu 4/5 adds vectorized CUDA kernel using multimem.st for NVLS allgatherv with barriers. minor style improvement suggested for loop clarity
src/components/tl/cuda/tl_cuda_coll.c 5/5 registers NVLS init function in algorithm dispatcher, fixes incorrect ALLGATHER enum references to ALLGATHERV
src/components/tl/cuda/allreduce/allreduce_nvls.c 5/5 adds symmetric buffer size validation and fixes missing task cleanup on error path

@ikryukov
Copy link
Collaborator Author

ikryukov commented Jan 8, 2026

/build

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Implements NVLS allgatherv algorithm that uses NVIDIA's NVLink SHARP multicast capabilities to efficiently gather variable-sized data from all ranks. The implementation reuses the existing NVLS infrastructure and follows established patterns from allreduce and reduce_scatter NVLS implementations.

Key changes:

  • Core implementation in allgatherv_nvls.c validates 16-byte alignment requirements and buffer size limits before launching vectorized CUDA kernel
  • CUDA kernel uses multimem.st instructions to write each rank's data to shared multicast buffer at designated offsets, with pre/post barriers for synchronization
  • Gathered data is copied from unicast buffer to final destination
  • Added buffer size validation to allreduce_nvls.c and fixed memory leak on event creation failure
  • Properly integrated with build system and algorithm selection logic

Confidence Score: 5/5

  • This PR is safe to merge with high confidence
  • The implementation follows established NVLS patterns from existing collectives, includes proper validation for alignment and buffer sizes, handles error cases appropriately, and includes a beneficial fix to allreduce_nvls.c. The code is well-structured and consistent with the codebase conventions.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c 5/5 Implemented NVLS allgatherv with proper alignment checks and buffer validation
src/components/tl/cuda/allreduce/allreduce_nvls.c 5/5 Added symmetric buffer size validation and fixed memory leak on event creation failure
src/components/tl/cuda/kernels/allgatherv_kernel.cu 5/5 Implemented vectorized NVLS kernel using multimem.st instructions with proper barriers
src/components/tl/cuda/tl_cuda.h 5/5 Added allgatherv_nvls struct to task union with offset, count, and buffer handles
src/components/tl/cuda/tl_cuda_coll.c 5/5 Wired NVLS algorithm to init function in algorithm dispatcher

@janjust janjust enabled auto-merge (squash) January 8, 2026 16:06
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

Implements NVLS allgatherv algorithm enabling optimized GPU data gathering through NVLink SHARP multicast operations. Each rank writes its data to a shared NVLS buffer using vectorized 16-byte multimem.st instructions with pre- and post-barriers for synchronization. The implementation reuses existing NVLS infrastructure including symmetric MC/UC buffers and barrier primitives, with strict 16-byte alignment validation enforced throughout.

Confidence Score: 4/5

  • Safe to merge with one minor style improvement recommended
  • The implementation correctly follows established NVLS patterns from allreduce and reduce_scatter. Kernel logic properly handles vectorized memory operations with correct alignment assertions. Barrier synchronization uses proven primitives. One minor style issue with dead code in else branch (lines 49-57) that duplicates the if branch unnecessarily. All critical paths handle errors appropriately and alignment requirements are thoroughly validated.
  • src/components/tl/cuda/allgatherv/allgatherv_nvls.c has minor dead code that should be cleaned up

Important Files Changed

File Analysis

Filename Score Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c 4/5 Implements NVLS allgatherv algorithm with proper alignment validation and barrier synchronization; contains minor dead code in else branch
src/components/tl/cuda/kernels/allgatherv_kernel.cu 5/5 CUDA kernel correctly implements vectorized multimem.st operations with proper alignment assertions and barrier synchronization
src/components/tl/cuda/tl_cuda.h 5/5 Adds allgatherv_nvls task structure fields for MC/UC addresses, offsets, counts, and control

@ikryukov
Copy link
Collaborator Author

ikryukov commented Jan 8, 2026

/build

@janjust janjust merged commit 09f3e59 into openucx:master Jan 8, 2026
10 of 11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants