TL/CUDA: NVLS allgatherv by ikryukov · Pull Request #1240 · openucx/ucc

ikryukov · 2025-12-15T16:44:45Z

What

NVLS allgatherv algorithm implementation.

Why ?

Adding NVLS support for allgatherv enables optimized data gathering across GPUs by utilizing multimem.st instructions to efficiently broadcast each rank's data to all participants through the NVLink SHARP.

How ?

Each rank writes its local data to a shared NVLS multicast buffer at its designated offset using vectorized multimem.st store operations (16 bytes at a time)
Pre-barrier ensures all ranks are ready before writing; post-barrier ensures all writes complete before reading
Final gathered data is copied from the unicast (UC) buffer to the destination
Requires 16-byte alignment for offsets and counts (enforced with validation)
Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)

greptile-apps · 2025-12-15T16:48:12Z

Greptile Summary

Implements NVLS (NVLink SHARP) support for allgatherv operations, enabling optimized data gathering across GPUs using multimem.st instructions to broadcast each rank's data through NVLink multicast buffers.

Key changes:

Each rank writes its data to shared NVLS multicast buffer at designated offset using vectorized 16-byte stores
Pre/post barriers synchronize ranks before writing and after completion
Validates 16-byte alignment requirements for offsets and counts
Checks that total data size fits within NVLS symmetric buffer limits
Final gathered data copied from unicast buffer to destination
Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)

Additional improvements:

Fixed missing ucc_tl_cuda_task_put cleanup on error in allreduce_nvls.c:219
Added buffer size validation check for allreduce NVLS (previously missing)

Confidence Score: 5/5

Safe to merge - implementation follows established NVLS patterns with proper validation and error handling
Well-structured implementation that mirrors existing NVLS collectives (allreduce, reduce_scatter). Includes comprehensive alignment validation, buffer size checks, proper resource cleanup, and fixes existing bugs in allreduce. The kernel logic is sound with correct barrier synchronization and vectorized memory operations.
No files require special attention

Important Files Changed

Filename	Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c	New NVLS allgatherv implementation with proper alignment validation, buffer size checks, and barrier synchronization
src/components/tl/cuda/kernels/allgatherv_kernel.cu	CUDA kernel using vectorized multimem.st instructions for 16-byte writes with proper barriers and alignment assertions
src/components/tl/cuda/allreduce/allreduce_nvls.c	Added buffer size validation and fixed missing task cleanup on error in allreduce NVLS

greptile-apps

Additional Comments (2)

src/components/tl/cuda/allgatherv/allgatherv_nvls.c, line 228-248 (link)

logic: alignment validation only checks current rank's offset/count, not all ranks - if any rank has misaligned data, the kernel will fail at runtime
src/components/tl/cuda/allgatherv/allgatherv_nvls.c, line 216-226 (link)

logic: missing validation that total_count_bytes fits within nvls_symmetric_size buffer - could overflow symmetric buffer

_{10 files reviewed, 2 comments}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps

_{11 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

ikryukov · 2025-12-18T14:38:56Z

/build

greptile-apps

Greptile Overview

Greptile Summary

This PR implements NVLS allgatherv algorithm for CUDA, enabling optimized data gathering across GPUs using NVLink SHARP with multimem.st instructions.

Key Changes:

Added new NVLS allgatherv algorithm implementation with proper 16-byte alignment validation
Each rank writes its data to shared NVLS multicast buffer using vectorized stores (16 bytes at a time)
Implemented pre/post barrier synchronization to ensure data consistency
Added proper buffer size validation and error handling
Includes bug fix in allreduce_nvls.c for missing task cleanup on event creation failure

Architecture:

Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)
Datatype-agnostic implementation treating all data as raw bytes in uint32_t units
Final data copied from unicast buffer to destination after all ranks complete writing

Confidence Score: 4/5

safe to merge with one alignment issue that should be verified
implementation follows established NVLS patterns and includes proper validation; one potential alignment issue in kernel needs verification
pay close attention to src/components/tl/cuda/kernels/allgatherv_kernel.cu line 39 for potential alignment issues

Important Files Changed

File Analysis

Filename	Score	Overview
src/components/tl/cuda/kernels/allgatherv_kernel.cu	4/5	new CUDA kernel for NVLS allgatherv; potential alignment issue with uint4 cast on line 39
src/components/tl/cuda/allgatherv/allgatherv_nvls.c	5/5	implements NVLS allgatherv algorithm with proper alignment validation, buffer management, and CUDA event handling
src/components/tl/cuda/tl_cuda.h	5/5	adds allgatherv_nvls task state structure with offset, count, and NVLS buffer pointers
src/components/tl/cuda/allreduce/allreduce_nvls.c	5/5	adds buffer size validation and fixes missing task cleanup on event creation failure

src/components/tl/cuda/kernels/allgatherv_kernel.cu

greptile-apps

Greptile Overview

Greptile Summary

This PR adds NVLS (NVLink SHARP) support for the allgatherv collective operation, enabling efficient GPU-to-GPU data gathering through hardware multicast capabilities. The implementation follows the established pattern from reduce_scatter and allreduce NVLS algorithms.

Key changes:

New NVLS allgatherv algorithm with vectorized CUDA kernel using multimem.st instructions for 16-byte aligned writes
Proper validation for 16-byte alignment requirements (offset and count must be divisible by 4 uint32_t elements)
Datatype-agnostic implementation converting all operations to uint32_t units
Pre/post barriers ensure proper synchronization across all ranks before writing and reading
Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)
Bug fix in allreduce_nvls.c: adds missing symmetric buffer size validation and task cleanup on error

Integration:

Properly integrated with existing TL/CUDA collective dispatcher
Build system updated for both source and kernel compilation
Follows existing NVLS algorithm patterns consistently

Confidence Score: 4/5

This PR is safe to merge with minimal risk after review of the minor style suggestion
The implementation is solid and follows established patterns from existing NVLS algorithms. Proper validation, error handling, and alignment checks are in place. The code includes defensive assertions and clear documentation. One minor style improvement was suggested for kernel loop clarity. The bug fix in allreduce_nvls.c is a bonus improvement.
No files require special attention - all changes follow established patterns and include proper validation

Important Files Changed

File Analysis

Filename	Score	Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c	4/5	implements NVLS allgatherv with proper validation, alignment checks, and error handling. includes datatype-agnostic approach converting all data to uint32_t units
src/components/tl/cuda/kernels/allgatherv_kernel.cu	4/5	adds vectorized CUDA kernel using multimem.st for NVLS allgatherv with barriers. minor style improvement suggested for loop clarity
src/components/tl/cuda/tl_cuda_coll.c	5/5	registers NVLS init function in algorithm dispatcher, fixes incorrect ALLGATHER enum references to ALLGATHERV
src/components/tl/cuda/allreduce/allreduce_nvls.c	5/5	adds symmetric buffer size validation and fixes missing task cleanup on error path

src/components/tl/cuda/kernels/allgatherv_kernel.cu

ikryukov · 2026-01-08T11:35:13Z

/build

greptile-apps

Greptile Overview

Greptile Summary

Implements NVLS allgatherv algorithm that uses NVIDIA's NVLink SHARP multicast capabilities to efficiently gather variable-sized data from all ranks. The implementation reuses the existing NVLS infrastructure and follows established patterns from allreduce and reduce_scatter NVLS implementations.

Key changes:

Core implementation in allgatherv_nvls.c validates 16-byte alignment requirements and buffer size limits before launching vectorized CUDA kernel
CUDA kernel uses multimem.st instructions to write each rank's data to shared multicast buffer at designated offsets, with pre/post barriers for synchronization
Gathered data is copied from unicast buffer to final destination
Added buffer size validation to allreduce_nvls.c and fixed memory leak on event creation failure
Properly integrated with build system and algorithm selection logic

Confidence Score: 5/5

This PR is safe to merge with high confidence
The implementation follows established NVLS patterns from existing collectives, includes proper validation for alignment and buffer sizes, handles error cases appropriately, and includes a beneficial fix to allreduce_nvls.c. The code is well-structured and consistent with the codebase conventions.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c	5/5	Implemented NVLS allgatherv with proper alignment checks and buffer validation
src/components/tl/cuda/allreduce/allreduce_nvls.c	5/5	Added symmetric buffer size validation and fixed memory leak on event creation failure
src/components/tl/cuda/kernels/allgatherv_kernel.cu	5/5	Implemented vectorized NVLS kernel using multimem.st instructions with proper barriers
src/components/tl/cuda/tl_cuda.h	5/5	Added `allgatherv_nvls` struct to task union with offset, count, and buffer handles
src/components/tl/cuda/tl_cuda_coll.c	5/5	Wired NVLS algorithm to init function in algorithm dispatcher

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

greptile-apps

Greptile Overview

Greptile Summary

Implements NVLS allgatherv algorithm enabling optimized GPU data gathering through NVLink SHARP multicast operations. Each rank writes its data to a shared NVLS buffer using vectorized 16-byte multimem.st instructions with pre- and post-barriers for synchronization. The implementation reuses existing NVLS infrastructure including symmetric MC/UC buffers and barrier primitives, with strict 16-byte alignment validation enforced throughout.

Confidence Score: 4/5

Safe to merge with one minor style improvement recommended
The implementation correctly follows established NVLS patterns from allreduce and reduce_scatter. Kernel logic properly handles vectorized memory operations with correct alignment assertions. Barrier synchronization uses proven primitives. One minor style issue with dead code in else branch (lines 49-57) that duplicates the if branch unnecessarily. All critical paths handle errors appropriately and alignment requirements are thoroughly validated.
src/components/tl/cuda/allgatherv/allgatherv_nvls.c has minor dead code that should be cleaned up

Important Files Changed

File Analysis

Filename	Score	Overview
src/components/tl/cuda/allgatherv/allgatherv_nvls.c	4/5	Implements NVLS allgatherv algorithm with proper alignment validation and barrier synchronization; contains minor dead code in else branch
src/components/tl/cuda/kernels/allgatherv_kernel.cu	5/5	CUDA kernel correctly implements vectorized multimem.st operations with proper alignment assertions and barrier synchronization
src/components/tl/cuda/tl_cuda.h	5/5	Adds allgatherv_nvls task structure fields for MC/UC addresses, offsets, counts, and control

src/components/tl/cuda/allgatherv/allgatherv_nvls.c

ikryukov · 2026-01-08T19:17:16Z

/build

ikryukov self-assigned this Dec 15, 2025

ikryukov requested review from Sergei-Lebedev and janjust and removed request for Sergei-Lebedev December 15, 2025 16:44

ikryukov added the Ready-for-Review label Dec 15, 2025

greptile-apps bot reviewed Dec 15, 2025

View reviewed changes

ikryukov requested review from Sergei-Lebedev and nsarka December 15, 2025 16:58

greptile-apps bot reviewed Dec 15, 2025

View reviewed changes

nsarka approved these changes Dec 17, 2025

View reviewed changes

Sergei-Lebedev approved these changes Dec 18, 2025

View reviewed changes

janjust approved these changes Dec 18, 2025

View reviewed changes

ikryukov force-pushed the nvls_allgatherv branch from 144c7ca to e96ce94 Compare January 8, 2026 11:02

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

src/components/tl/cuda/kernels/allgatherv_kernel.cu Show resolved Hide resolved

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

src/components/tl/cuda/kernels/allgatherv_kernel.cu Show resolved Hide resolved

ikryukov force-pushed the nvls_allgatherv branch from bca6a68 to f75c492 Compare January 8, 2026 11:36

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

janjust enabled auto-merge (squash) January 8, 2026 16:06

ikryukov added 4 commits January 8, 2026 10:06

TL/CUDA: nvls allgatherv

46cc523

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

TL/CUDA: fix buf size validation

bd143d9

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

TL/CUDA: simplify and fix

36ddfca

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

TL/CUDA: add alignment assertion

2d44654

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

janjust force-pushed the nvls_allgatherv branch from f75c492 to 2d44654 Compare January 8, 2026 16:06

greptile-apps bot reviewed Jan 8, 2026

View reviewed changes

src/components/tl/cuda/allgatherv/allgatherv_nvls.c Show resolved Hide resolved

janjust merged commit 09f3e59 into openucx:master Jan 8, 2026
10 of 11 checks passed

Conversation

ikryukov commented Dec 15, 2025

What

Why ?

How ?

Uh oh!

greptile-apps bot commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Additional Comments (2)

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

ikryukov commented Dec 18, 2025

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

ikryukov commented Jan 8, 2026

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Overview

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Uh oh!

Uh oh!

ikryukov commented Jan 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

greptile-apps bot commented Dec 15, 2025 •

edited

Loading

greptile-apps bot left a comment •

edited

Loading