cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0#3305
cp: fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0#3305
fix(perf): set NCCL env vars when nccl_ub enabled via recipe config (3283) into r0.4.0#3305Conversation
…3283) Signed-off-by: Dingqing Yang <dingqingy@nvidia.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughThe script now conditionally sets NCCL environment variables ( Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/performance/run_script.py`:
- Around line 99-101: The code uses os.environ.setdefault(...) for NCCL flags
which won't override existing values; change this to explicit assignment when
recipe.ddp.nccl_ub is truthy: set os.environ["NCCL_NVLS_ENABLE"] = "1" and
os.environ["NCCL_CTA_POLICY"] = "1" instead of setdefault so the flags are
enforced; update the block referencing recipe.ddp.nccl_ub and replace calls to
os.environ.setdefault with direct assignments to the two environment variables.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: b906de96-d88a-41c2-a6c5-5b59db86873a
📒 Files selected for processing (1)
scripts/performance/run_script.py
| if getattr(recipe.ddp, "nccl_ub", False): | ||
| os.environ.setdefault("NCCL_NVLS_ENABLE", "1") | ||
| os.environ.setdefault("NCCL_CTA_POLICY", "1") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Check new nccl_ub env logic in run_script.py:"
rg -n -C3 'nccl_ub|NCCL_NVLS_ENABLE|NCCL_CTA_POLICY|setdefault' scripts/performance/run_script.py
echo
echo "Check existing defaults in performance pipeline:"
rg -n -C3 '"NCCL_NVLS_ENABLE"\s*:\s*"0"|NCCL_CTA_POLICY' \
scripts/performance/utils/executors.py \
scripts/performance/setup_experiment.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2400
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check imports in run_script.py ==="
head -50 scripts/performance/run_script.py | grep -E "import|from"
echo
echo "=== Check function/module that calls run_script.py or sets up environment ==="
rg -l "run_script" scripts/performance/
echo
echo "=== Check other setdefault usage in scripts/performance/ ==="
rg -n "setdefault" scripts/performance/
echo
echo "=== Check direct assignment patterns for env vars in scripts/performance/ ==="
rg -n 'os\.environ\[.*\]\s*=' scripts/performance/ | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1117
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check if setup_experiment.py imports executors.py ==="
rg -n "import.*executor|from.*executor" scripts/performance/setup_experiment.py
echo
echo "=== Check how NCCL_NVLS_ENABLE is used in setup_experiment.py (context around line 340) ==="
sed -n '330,360p' scripts/performance/setup_experiment.py
echo
echo "=== Check executors.py to see how defaults are used ==="
sed -n '30,70p' scripts/performance/utils/executors.py
echo
echo "=== Check if run_script.py is executed as subprocess (checking for subprocess calls in setup_experiment) ==="
rg -n "subprocess|Popen|run\(" scripts/performance/setup_experiment.py | head -10Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3174
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check how executors handle env vars - look at function signatures and merging logic ==="
sed -n '70,150p' scripts/performance/utils/executors.py
echo
echo "=== Check kubeflow_executor to see how custom_env_vars is used ==="
rg -n "def kubeflow_executor" -A 30 scripts/performance/utils/executors.py
echo
echo "=== Check how env vars are passed to subprocess/container in executors ==="
rg -n "PERF_ENV_VARS|custom_env_vars" scripts/performance/utils/executors.py | head -20Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5308
🏁 Script executed:
#!/bin/bash
set -euo pivefail
echo "=== Check where run_script.py is actually invoked - look for command construction ==="
rg -n "run_script.py" scripts/performance/setup_experiment.py
echo
echo "=== Look at how the executor actually runs the command with env vars ==="
sed -n '200,250p' scripts/performance/utils/executors.py
echo
echo "=== Verify if env_vars dict is actually merged/passed to subprocess ==="
rg -n "env=" scripts/performance/utils/executors.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 124
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check where run_script.py is actually invoked - look for command construction ==="
rg -n "run_script.py" scripts/performance/setup_experiment.py
echo
echo "=== Look at how the executor actually runs the command with env vars ==="
sed -n '200,250p' scripts/performance/utils/executors.py
echo
echo "=== Verify if env_vars dict is actually merged/passed to subprocess ==="
rg -n "env=" scripts/performance/utils/executors.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 2108
Use explicit assignment instead of setdefault() for NCCL environment variable enforcement
On lines 100-101, os.environ.setdefault() will skip overriding any pre-existing values. When recipe.ddp.nccl_ub is true, the intent is to explicitly enable these settings. Using setdefault() allows pre-existing "0" values to persist, which contradicts the intent and violates the code style guideline to "be as explicit as possible."
Proposed fix
if getattr(recipe.ddp, "nccl_ub", False):
- os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
- os.environ.setdefault("NCCL_CTA_POLICY", "1")
+ os.environ["NCCL_NVLS_ENABLE"] = "1"
+ os.environ["NCCL_CTA_POLICY"] = "1"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if getattr(recipe.ddp, "nccl_ub", False): | |
| os.environ.setdefault("NCCL_NVLS_ENABLE", "1") | |
| os.environ.setdefault("NCCL_CTA_POLICY", "1") | |
| if getattr(recipe.ddp, "nccl_ub", False): | |
| os.environ["NCCL_NVLS_ENABLE"] = "1" | |
| os.environ["NCCL_CTA_POLICY"] = "1" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/performance/run_script.py` around lines 99 - 101, The code uses
os.environ.setdefault(...) for NCCL flags which won't override existing values;
change this to explicit assignment when recipe.ddp.nccl_ub is truthy: set
os.environ["NCCL_NVLS_ENABLE"] = "1" and os.environ["NCCL_CTA_POLICY"] = "1"
instead of setdefault so the flags are enforced; update the block referencing
recipe.ddp.nccl_ub and replace calls to os.environ.setdefault with direct
assignments to the two environment variables.
There was a problem hiding this comment.
@dingqingy-nv the AI comment for setdefault seems valid.
the value will not be updated to 1 if the env var already exists (regardless if its 1 or 0) in env inside container.
Why not use direct assignment?
There was a problem hiding this comment.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
|
for patch |
|
For context, this PR is to cherry-pick #3283, which is an interface fix for nccl_ub. At this point, none of our benchmark use this feature so this fix is not required. |
Summary
r0.4.0release branchNCCL_NVLS_ENABLE=1andNCCL_CTA_POLICY=1env vars whennccl_ubis enabled via recipe config (not just CLI)dryrunblock that exists onmainbut not onr0.4.0Test plan
r0.4.0baserecipe.ddp.nccl_ub=True🤖 Generated with Claude Code
Summary by CodeRabbit
Release Notes