| Filename | Overview |
|---|---|
| src/components/tl/cuda/allgatherv/allgatherv_nvls.c | New NVLS allgatherv implementation with proper alignment validation, buffer size checks, and barrier synchronization |
| src/components/tl/cuda/kernels/allgatherv_kernel.cu | CUDA kernel using vectorized multimem.st instructions for 16-byte writes with proper barriers and alignment assertions |
| src/components/tl/cuda/allreduce/allreduce_nvls.c | Added buffer size validation and fixed missing task cleanup on error in allreduce NVLS |
Additional Comments (2)
- src/components/tl/cuda/allgatherv/allgatherv_nvls.c, lines 228-248: logic: the alignment validation only checks the current rank's offset/count, not all ranks; if any rank has misaligned data, the kernel will fail at runtime
- src/components/tl/cuda/allgatherv/allgatherv_nvls.c, lines 216-226: logic: missing validation that `total_count_bytes` fits within the `nvls_symmetric_size` buffer; could overflow the symmetric buffer
10 files reviewed, 2 comments
/build
144c7ca to e96ce94
Greptile Overview
Greptile Summary
This PR implements NVLS allgatherv algorithm for CUDA, enabling optimized data gathering across GPUs using NVLink SHARP with multimem.st instructions.
Key Changes:
- Added new NVLS allgatherv algorithm implementation with proper 16-byte alignment validation
- Each rank writes its data to shared NVLS multicast buffer using vectorized stores (16 bytes at a time)
- Implemented pre/post barrier synchronization to ensure data consistency
- Added proper buffer size validation and error handling
- Includes a bug fix in `allreduce_nvls.c` for missing task cleanup on event creation failure
Architecture:
- Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)
- Datatype-agnostic implementation treating all data as raw bytes in uint32_t units
- Final data copied from unicast buffer to destination after all ranks complete writing
Confidence Score: 4/5
- safe to merge with one alignment issue that should be verified
- implementation follows established NVLS patterns and includes proper validation; one potential alignment issue in kernel needs verification
- Pay close attention to `src/components/tl/cuda/kernels/allgatherv_kernel.cu` line 39 for potential alignment issues
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/components/tl/cuda/kernels/allgatherv_kernel.cu | 4/5 | new CUDA kernel for NVLS allgatherv; potential alignment issue with uint4 cast on line 39 |
| src/components/tl/cuda/allgatherv/allgatherv_nvls.c | 5/5 | implements NVLS allgatherv algorithm with proper alignment validation, buffer management, and CUDA event handling |
| src/components/tl/cuda/tl_cuda.h | 5/5 | adds allgatherv_nvls task state structure with offset, count, and NVLS buffer pointers |
| src/components/tl/cuda/allreduce/allreduce_nvls.c | 5/5 | adds buffer size validation and fixes missing task cleanup on event creation failure |
Greptile Overview
Greptile Summary
This PR adds NVLS (NVLink SHARP) support for the allgatherv collective operation, enabling efficient GPU-to-GPU data gathering through hardware multicast capabilities. The implementation follows the established pattern from reduce_scatter and allreduce NVLS algorithms.
Key changes:
- New NVLS allgatherv algorithm with a vectorized CUDA kernel using `multimem.st` instructions for 16-byte aligned writes
- Proper validation of the 16-byte alignment requirement (offset and count must be divisible by 4 uint32_t elements)
- Datatype-agnostic implementation converting all operations to uint32_t units
- Pre/post barriers ensure proper synchronization across all ranks before writing and reading
- Reuses existing NVLS infrastructure (symmetric MC/UC buffers, control structures, barrier primitives)
- Bug fix in allreduce_nvls.c: adds missing symmetric buffer size validation and task cleanup on error
Integration:
- Properly integrated with existing TL/CUDA collective dispatcher
- Build system updated for both source and kernel compilation
- Follows existing NVLS algorithm patterns consistently
Confidence Score: 4/5
- This PR is safe to merge with minimal risk after review of the minor style suggestion
- The implementation is solid and follows established patterns from existing NVLS algorithms. Proper validation, error handling, and alignment checks are in place. The code includes defensive assertions and clear documentation. One minor style improvement was suggested for kernel loop clarity. The bug fix in allreduce_nvls.c is a bonus improvement.
- No files require special attention - all changes follow established patterns and include proper validation
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/components/tl/cuda/allgatherv/allgatherv_nvls.c | 4/5 | implements NVLS allgatherv with proper validation, alignment checks, and error handling. includes datatype-agnostic approach converting all data to uint32_t units |
| src/components/tl/cuda/kernels/allgatherv_kernel.cu | 4/5 | adds vectorized CUDA kernel using multimem.st for NVLS allgatherv with barriers. minor style improvement suggested for loop clarity |
| src/components/tl/cuda/tl_cuda_coll.c | 5/5 | registers NVLS init function in algorithm dispatcher, fixes incorrect ALLGATHER enum references to ALLGATHERV |
| src/components/tl/cuda/allreduce/allreduce_nvls.c | 5/5 | adds symmetric buffer size validation and fixes missing task cleanup on error path |
/build
bca6a68 to f75c492
Greptile Overview
Greptile Summary
Implements NVLS allgatherv algorithm that uses NVIDIA's NVLink SHARP multicast capabilities to efficiently gather variable-sized data from all ranks. The implementation reuses the existing NVLS infrastructure and follows established patterns from allreduce and reduce_scatter NVLS implementations.
Key changes:
- Core implementation in
allgatherv_nvls.cvalidates 16-byte alignment requirements and buffer size limits before launching vectorized CUDA kernel - CUDA kernel uses
multimem.stinstructions to write each rank's data to shared multicast buffer at designated offsets, with pre/post barriers for synchronization - Gathered data is copied from unicast buffer to final destination
- Added buffer size validation to allreduce_nvls.c and fixed memory leak on event creation failure
- Properly integrated with build system and algorithm selection logic
Confidence Score: 5/5
- This PR is safe to merge with high confidence
- The implementation follows established NVLS patterns from existing collectives, includes proper validation for alignment and buffer sizes, handles error cases appropriately, and includes a beneficial fix to allreduce_nvls.c. The code is well-structured and consistent with the codebase conventions.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/components/tl/cuda/allgatherv/allgatherv_nvls.c | 5/5 | Implemented NVLS allgatherv with proper alignment checks and buffer validation |
| src/components/tl/cuda/allreduce/allreduce_nvls.c | 5/5 | Added symmetric buffer size validation and fixed memory leak on event creation failure |
| src/components/tl/cuda/kernels/allgatherv_kernel.cu | 5/5 | Implemented vectorized NVLS kernel using multimem.st instructions with proper barriers |
| src/components/tl/cuda/tl_cuda.h | 5/5 | Added allgatherv_nvls struct to task union with offset, count, and buffer handles |
| src/components/tl/cuda/tl_cuda_coll.c | 5/5 | Wired NVLS algorithm to init function in algorithm dispatcher |
Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
f75c492 to 2d44654
Greptile Overview
Greptile Summary
Implements NVLS allgatherv algorithm enabling optimized GPU data gathering through NVLink SHARP multicast operations. Each rank writes its data to a shared NVLS buffer using vectorized 16-byte multimem.st instructions with pre- and post-barriers for synchronization. The implementation reuses existing NVLS infrastructure including symmetric MC/UC buffers and barrier primitives, with strict 16-byte alignment validation enforced throughout.
Confidence Score: 4/5
- Safe to merge with one minor style improvement recommended
- The implementation correctly follows established NVLS patterns from allreduce and reduce_scatter. Kernel logic properly handles vectorized memory operations with correct alignment assertions. Barrier synchronization uses proven primitives. One minor style issue with dead code in else branch (lines 49-57) that duplicates the if branch unnecessarily. All critical paths handle errors appropriately and alignment requirements are thoroughly validated.
- src/components/tl/cuda/allgatherv/allgatherv_nvls.c has minor dead code that should be cleaned up
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/components/tl/cuda/allgatherv/allgatherv_nvls.c | 4/5 | Implements NVLS allgatherv algorithm with proper alignment validation and barrier synchronization; contains minor dead code in else branch |
| src/components/tl/cuda/kernels/allgatherv_kernel.cu | 5/5 | CUDA kernel correctly implements vectorized multimem.st operations with proper alignment assertions and barrier synchronization |
| src/components/tl/cuda/tl_cuda.h | 5/5 | Adds allgatherv_nvls task structure fields for MC/UC addresses, offsets, counts, and control |
/build
What
NVLS allgatherv algorithm implementation.
Why?
Adding NVLS support for allgatherv enables optimized data gathering across GPUs by utilizing `multimem.st` instructions to efficiently broadcast each rank's data to all participants through NVLink SHARP.
How?