fix: preserve RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES to prevent NCCL NVSwitch bugs#2252
dmvevents wants to merge 1 commit into NVIDIA-NeMo:main
Conversation
…NCCL NVSwitch bugs

Fixes NVIDIA-NeMo#1963. Related: NVIDIA-NeMo#1961.

The current code removes `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES` from the worker environment, which forces Ray to mask `CUDA_VISIBLE_DEVICES` per actor. This triggers known NCCL bugs on NVSwitch topologies (H200/P5en, H100/P5):

- cuMem import penalty causing a 2400x slower first AllReduce (nccl#1749)
- NVLS rank ordering corruption causing hangs (nccl#1906)
- Multi-channel P2P hangs at >8M elements

Fix: set `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` so each worker sees all GPUs and uses explicit `torch.cuda.set_device(local_rank)` instead. This mirrors torchrun behavior and works correctly on both NVSwitch and NVLink topologies.
📝 Walkthrough

The change modifies environment variable handling in worker initialization to explicitly preserve the `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES` variable rather than removing it.
🧹 Nitpick comments (1)
nemo_rl/distributed/worker_groups.py (1)
506-506: Use `setdefault()` to preserve caller-provided overrides while maintaining the safe default.

Line 506 unconditionally overwrites any value for `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES`, contradicting the comment which says "Preserve". The current merge pattern (lines 430–433) respects caller-provided `env_vars`, and the pattern of explicit `pop()` calls (lines 508–512) shows selective env var management. Using `setdefault()` aligns the implementation with both the comment's intent and the method's design pattern of respecting caller values while providing a safe default for this critical NCCL bug workaround.

♻️ Suggested refactor
```diff
- worker_env_vars["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"
+ worker_env_vars.setdefault(
+     "RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", "1"
+ )
```
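For readers comparing the two merge strategies the bot is contrasting, here is a minimal sketch with plain dicts standing in for the worker environment (the names are illustrative, not NeMo-RL's API):

```python
def build_env(caller_env: dict, use_setdefault: bool) -> dict:
    """Sketch: how the two strategies treat a caller-provided override.

    `caller_env` stands in for worker_env_vars after the caller's env_vars
    have been merged in; this helper is hypothetical, not NeMo-RL code.
    """
    env = dict(caller_env)
    if use_setdefault:
        # setdefault() keeps any caller-provided value and only fills a default
        env.setdefault("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES", "1")
    else:
        # unconditional assignment overwrites whatever the caller passed
        env["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"
    return env

caller = {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "0"}
# setdefault: the caller's "0" survives; assignment: it is forced to "1"
print(build_env(caller, use_setdefault=True))
print(build_env(caller, use_setdefault=False))
```

Whether preserving a caller's "0" is desirable here is exactly the point the maintainers dispute later in the thread.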
hi @guyueh1 @terrykong, could you help review this and check whether it has any side effects?
terrykong left a comment
Review Summary
The approach (preserving RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1) is sound and consistent with how vLLM and SGLang workers already behave. However, there is a critical regression in the default DTensor V1 path and several secondary issues that need attention before merge.
Historical context: The original pop() was introduced in PR #432 by @hemildesai for ray job submit support. After investigation, the KubeRay risk is low — NeMo-RL uses whole-node GPU allocations, and vLLM/SGLang already set this env var to 1.
```python
# penalty (nccl#1749) and NVLS rank ordering corruption (nccl#1906).
# Workers use explicit torch.cuda.set_device(local_rank) instead.
# See: https://github.com/NVIDIA-NeMo/RL/issues/1963
worker_env_vars["RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES"] = "1"
```
Critical: DTensor V1 LOCAL_RANK=0 regression
This unconditional `= "1"` assignment makes all GPUs visible to every worker. However, `dtensor_policy_worker.py:320-329` has a `LOCAL_RANK=0` hack that assumes only 1 GPU is visible:
```python
# torch==2.8 uses LOCAL_RANK to set the device here
# but CUDA_VISIBLE_DEVICES is set to only 1 gpu, so we need to temporarily set LOCAL_RANK to 0.
prev_local_rank = os.environ["LOCAL_RANK"]
os.environ["LOCAL_RANK"] = "0"
device_mesh = torch.distributed.device_mesh.init_device_mesh(...)
```

With all GPUs visible, `init_device_mesh` calls `set_device(0)` for every worker → all workers fight over GPU 0 → OOM. DTensor V1 is the default (`lm_policy.py:113`: `_v2=False`). DTensor V2 is unaffected (it uses `FSDP2Manager`, which reads the real `LOCAL_RANK`).
Fix: Remove the LOCAL_RANK=0 hack (lines 323-324). With all GPUs visible, init_device_mesh should use the real LOCAL_RANK=bundle_idx.
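The interaction the reviewer describes can be sketched with a toy model of device binding (illustrative only; `bound_gpu` is a hypothetical helper, not NeMo-RL code):

```python
def bound_gpu(local_rank, visible_devices, forced_local_rank=None):
    """Sketch: which physical GPU a worker ends up on when device setup
    effectively does set_device(LOCAL_RANK) against the visible GPU set.

    visible_devices: physical GPU ids Ray exposes to this worker.
    forced_local_rank: simulates the LOCAL_RANK=0 hack, if applied.
    """
    rank = forced_local_rank if forced_local_rank is not None else local_rank
    return visible_devices[rank]  # LOCAL_RANK indexes into the visible set

# Old behavior: Ray masks one GPU per worker, so the hack is harmless
assert bound_gpu(3, visible_devices=[3], forced_local_rank=0) == 3

# New behavior with all GPUs visible: the hack sends every worker to GPU 0
assert bound_gpu(3, visible_devices=[0, 1, 2, 3], forced_local_rank=0) == 0

# Using the real LOCAL_RANK instead binds each worker to its own GPU
assert bound_gpu(3, visible_devices=[0, 1, 2, 3]) == 3
```

This is why the hack was safe under per-actor masking but causes the GPU-0 pile-up once masking is removed.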
```python
# from masking CUDA_VISIBLE_DEVICES per actor. GPU masking triggers NCCL
# bugs on NVSwitch topologies (H200/P5en, H100/P5) including cuMem import
# penalty (nccl#1749) and NVLS rank ordering corruption (nccl#1906).
# Workers use explicit torch.cuda.set_device(local_rank) instead.
```
This comment is inaccurate — no production nemo_rl code calls torch.cuda.set_device(local_rank). Device binding happens implicitly via init_device_mesh / Megatron internals reading the LOCAL_RANK env var.
```diff
- # Workers use explicit torch.cuda.set_device(local_rank) instead.
+ # Workers rely on LOCAL_RANK env var for device selection via
+ # init_device_mesh / Megatron internals.
```
Additional Findings

vLLM Non-Parallel GPU Collision Risk
Stale Comments

These comments reference `CUDA_VISIBLE_DEVICES` masking behavior that no longer applies with this change:
Megatron Seed Computation Change
Re: CodeRabbit's `setdefault()` suggestion: the unconditional assignment is intentional. Every other site in the codebase that sets this variable uses unconditional assignment. The word "Preserve" in the comment means "keep this env var present" (vs. the old `pop()` that removed it).
hey @dmvevents. the above review was claude generated so we should just use it as a guide. i think this change makes sense, but we'll need to run the nightly CI to make sure this doesn't break anything since this change has a large impact
Fixes #1963. Related: #1961.
Summary
`worker_groups.py` removes `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES` from the worker environment (line 501), which forces Ray to set per-actor `CUDA_VISIBLE_DEVICES` masking (e.g., `CUDA_VISIBLE_DEVICES=3` for the 4th GPU). This triggers three confirmed NCCL bugs on NVSwitch topologies (H200 P5en, H100 P5):

1. cuMem import penalty (NVIDIA/nccl#1749) — `p2pMap()` iterates over all devices when importing cuMem handles with non-overlapping `CUDA_VISIBLE_DEVICES`. Causes a 3,660 ms first-operation penalty vs 1.5 ms with torchrun.
2. NVLS rank ordering corruption (NVIDIA/nccl#1906) — the allgather in `nvls.cc` is missing a user rank table when GPU indices are permuted by `CUDA_VISIBLE_DEVICES`. Causes a hang or silent data corruption on NVSwitch systems.
3. Multi-channel P2P hang at >8M elements — AllReduce hangs for tensors larger than ~32MB even with `NCCL_CUMEM_ENABLE=0` and `NCCL_NVLS_ENABLE=0`.

Fix
Preserve `RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` instead of removing it. This tells Ray not to mask `CUDA_VISIBLE_DEVICES`, so each worker sees all GPUs. Workers use explicit `torch.cuda.set_device(local_rank)` instead, which mirrors torchrun behavior and works correctly on both NVSwitch and NVLink topologies.

Benchmarks (same hardware, same NCCL)
Both tests: P5en.48xlarge (8x H200, NVSwitch), NCCL 2.27.5, EFA, same container.
Test plan
- `LOCAL_RANK` is already set in `worker_env_vars` (line 492), so workers can `torch.cuda.set_device()` correctly
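The device-selection scheme the PR description relies on can be sketched as a toy model (illustrative only; real workers would call `torch.cuda.set_device(local_rank)`, while this sketch just returns the index so it runs without CUDA):

```python
def select_device(local_rank: int, env: dict) -> int:
    """Sketch: which device index a worker selects under each Ray mode."""
    if env.get("RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES") == "1":
        # Ray leaves CUDA_VISIBLE_DEVICES unmasked: all GPUs are visible,
        # so local_rank maps 1:1 to a physical GPU (torchrun-style).
        return local_rank
    # Ray masks CUDA_VISIBLE_DEVICES per actor: each worker sees a single
    # GPU, which always appears as device 0 inside the worker.
    return 0

env = {"RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES": "1"}
print([select_device(r, env) for r in range(8)])  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

In the masked mode every worker selects device 0 of a different one-GPU visible set, which is exactly the per-actor permutation that triggers the NCCL issues cited above.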