Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions scripts/performance/run_script.py
Original file line number Diff line number Diff line change
Expand Up @@ -95,6 +95,11 @@ def main():
config_variant=args.config_variant,
)

# Set NCCL env vars for nccl_ub enabled via recipe config (not just CLI).
if getattr(recipe.ddp, "nccl_ub", False):
os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
os.environ.setdefault("NCCL_CTA_POLICY", "1")
Comment on lines +99 to +101
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "Check new nccl_ub env logic in run_script.py:"
rg -n -C3 'nccl_ub|NCCL_NVLS_ENABLE|NCCL_CTA_POLICY|setdefault' scripts/performance/run_script.py

echo
echo "Check existing defaults in performance pipeline:"
rg -n -C3 '"NCCL_NVLS_ENABLE"\s*:\s*"0"|NCCL_CTA_POLICY' \
  scripts/performance/utils/executors.py \
  scripts/performance/setup_experiment.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2400


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check imports in run_script.py ==="
head -50 scripts/performance/run_script.py | grep -E "import|from"

echo
echo "=== Check function/module that calls run_script.py or sets up environment ==="
rg -l "run_script" scripts/performance/

echo
echo "=== Check other setdefault usage in scripts/performance/ ==="
rg -n "setdefault" scripts/performance/

echo
echo "=== Check direct assignment patterns for env vars in scripts/performance/ ==="
rg -n 'os\.environ\[.*\]\s*=' scripts/performance/ | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1117


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check if setup_experiment.py imports executors.py ==="
rg -n "import.*executor|from.*executor" scripts/performance/setup_experiment.py

echo
echo "=== Check how NCCL_NVLS_ENABLE is used in setup_experiment.py (context around line 340) ==="
sed -n '330,360p' scripts/performance/setup_experiment.py

echo
echo "=== Check executors.py to see how defaults are used ==="
sed -n '30,70p' scripts/performance/utils/executors.py

echo
echo "=== Check if run_script.py is executed as subprocess (checking for subprocess calls in setup_experiment) ==="
rg -n "subprocess|Popen|run\(" scripts/performance/setup_experiment.py | head -10

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3174


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check how executors handle env vars - look at function signatures and merging logic ==="
sed -n '70,150p' scripts/performance/utils/executors.py

echo
echo "=== Check kubeflow_executor to see how custom_env_vars is used ==="
rg -n "def kubeflow_executor" -A 30 scripts/performance/utils/executors.py

echo
echo "=== Check how env vars are passed to subprocess/container in executors ==="
rg -n "PERF_ENV_VARS|custom_env_vars" scripts/performance/utils/executors.py | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5308


🏁 Script executed:

#!/bin/bash
set -euo pivefail

echo "=== Check where run_script.py is actually invoked - look for command construction ==="
rg -n "run_script.py" scripts/performance/setup_experiment.py

echo
echo "=== Look at how the executor actually runs the command with env vars ==="
sed -n '200,250p' scripts/performance/utils/executors.py

echo
echo "=== Verify if env_vars dict is actually merged/passed to subprocess ==="
rg -n "env=" scripts/performance/utils/executors.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 124


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check where run_script.py is actually invoked - look for command construction ==="
rg -n "run_script.py" scripts/performance/setup_experiment.py

echo
echo "=== Look at how the executor actually runs the command with env vars ==="
sed -n '200,250p' scripts/performance/utils/executors.py

echo
echo "=== Verify if env_vars dict is actually merged/passed to subprocess ==="
rg -n "env=" scripts/performance/utils/executors.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 2108


Use explicit assignment instead of setdefault() for NCCL environment variable enforcement

On lines 100-101, os.environ.setdefault() will skip overriding any pre-existing values. When recipe.ddp.nccl_ub is true, the intent is to explicitly enable these settings. Using setdefault() allows pre-existing "0" values to persist, which contradicts the intent and violates the code style guideline to "be as explicit as possible."

Proposed fix
    if getattr(recipe.ddp, "nccl_ub", False):
-        os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
-        os.environ.setdefault("NCCL_CTA_POLICY", "1")
+        os.environ["NCCL_NVLS_ENABLE"] = "1"
+        os.environ["NCCL_CTA_POLICY"] = "1"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if getattr(recipe.ddp, "nccl_ub", False):
os.environ.setdefault("NCCL_NVLS_ENABLE", "1")
os.environ.setdefault("NCCL_CTA_POLICY", "1")
if getattr(recipe.ddp, "nccl_ub", False):
os.environ["NCCL_NVLS_ENABLE"] = "1"
os.environ["NCCL_CTA_POLICY"] = "1"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/performance/run_script.py` around lines 99 - 101, The code uses
os.environ.setdefault(...) for NCCL flags which won't override existing values;
change this to explicit assignment when recipe.ddp.nccl_ub is truthy: set
os.environ["NCCL_NVLS_ENABLE"] = "1" and os.environ["NCCL_CTA_POLICY"] = "1"
instead of setdefault so the flags are enforced; update the block referencing
recipe.ddp.nccl_ub and replace calls to os.environ.setdefault with direct
assignments to the two environment variables.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dingqingy-nv the AI comment for setdefault seems valid.
the value will not be updated to 1 if the env var already exists (regardless if its 1 or 0) in env inside container.

Why not use direct assignment?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!


# Select forward step function based on the model family name.
if args.domain == "vlm":
forward_step_func = vlm_forward_step
Expand Down
Loading