Update Qwen3-VL pretrain perf configs for 30B and 235B #3327

tomlifu merged 3 commits into NVIDIA-NeMo:main
Conversation
Force-pushed from ae8eb97 to cecd5f7
📝 Walkthrough

Adds Qwen3.5-VL pretraining configuration modules with model-specific workload base configs for three model sizes (35B-A3B, 122B-A10B, 397B-A17B) across multiple GPU types. Refactors the existing Qwen3-VL configuration to use a unified helper-function pattern. Updates mock data generation to produce larger synthetic examples and changes recipe HF model paths to instruct variants.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py (1)
Lines 71-88: ⚠️ Potential issue | 🟠 Major: Apply workload parallelism before building the recipe.
`recipe_fn(...)` computes some derived fields from its incoming parallelism, but `set_workload_base_configs(...)` mutates that parallelism afterward. The concrete breakage is `pipeline_dtype`: `_qwen3_vl_common()` sets it from `pipeline_model_parallel_size` in `src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` line 121, while `set_workload_base_configs()` never recomputes it. That leaves configs like the GB300 variants with `pipeline_model_parallel_size=1` but a stale pipeline dtype from the recipe defaults.

🔧 Suggested fix

```diff
 cfg = recipe_fn(
     mock=mock,
     precision_config=precision_config,
     comm_overlap_config=CommOverlapConfig(tp_comm_overlap=tp_comm_overlap),
     moe_flex_dispatcher_backend=base_cfg.moe_flex_dispatcher_backend,
+    tensor_model_parallel_size=base_cfg.tensor_model_parallel_size,
+    pipeline_model_parallel_size=base_cfg.pipeline_model_parallel_size,
+    expert_model_parallel_size=base_cfg.expert_model_parallel_size,
+    context_parallel_size=base_cfg.context_parallel_size,
+    sequence_parallel=base_cfg.sequence_parallel,
+    global_batch_size=base_cfg.global_batch_size,
+    micro_batch_size=base_cfg.micro_batch_size,
 )
 set_workload_base_configs(cfg, base_cfg)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py` around lines 71 - 88, The recipe is built from stale parallelism because set_workload_base_configs(cfg, base_cfg) runs after recipe_fn(...) — move the workload-parallelism application before constructing the recipe so derived fields are computed from the final parallelism: call set_workload_base_configs(base_cfg) (i.e., apply base_cfg to cfg) prior to invoking recipe_fn, or alternatively call recipe_fn after set_workload_base_configs and then run set_qwen3_vl_common_configs(cfg); ensure pipeline_dtype (computed in _qwen3_vl_common()/set_qwen3_vl_common_configs) reflects pipeline_model_parallel_size from base_cfg.
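The ordering hazard described above can be reproduced in miniature. The sketch below uses toy stand-ins (`Cfg`, `recipe_fn`, `set_workload_base_configs` here are hypothetical, not the actual repo API) to show how a derived field goes stale when parallelism is mutated after the build:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cfg:
    pipeline_model_parallel_size: int
    pipeline_dtype: Optional[str] = None


def recipe_fn(pp: int) -> Cfg:
    """Build a config; derived fields are computed from the incoming parallelism."""
    cfg = Cfg(pipeline_model_parallel_size=pp)
    cfg.pipeline_dtype = "bfloat16" if pp > 1 else None  # derived at build time
    return cfg


def set_workload_base_configs(cfg: Cfg, pp: int) -> None:
    """Mutate parallelism afterward, without recomputing derived fields."""
    cfg.pipeline_model_parallel_size = pp


cfg = recipe_fn(pp=4)                 # pipeline_dtype becomes "bfloat16"
set_workload_base_configs(cfg, pp=1)  # PP is now 1, but the dtype is stale
print(cfg.pipeline_model_parallel_size, cfg.pipeline_dtype)  # 1 bfloat16
```

Passing the workload parallelism into the build itself, as the suggested fix does, keeps derived fields consistent with the final parallelism.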
🧹 Nitpick comments (4)
scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py (2)
Lines 127-136: The FP8-MX overlap guard is now dead code.

`set_qwen3_vl_common_configs()` already forces `overlap_param_gather = False` on `cfg.comm_overlap`, `cfg.ddp`, and `cfg.optimizer` for every config, so this branch no longer changes behavior. Either drop it or move the disabling back out of the common helper if it is meant to stay B200/FP8-MX-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py` around lines 127 - 136, The branch that disables overlap_param_gather for precision "fp8_mx" is now redundant because set_qwen3_vl_common_configs() already forces cfg.comm_overlap.overlap_param_gather = False, cfg.ddp.overlap_param_gather = False, and cfg.optimizer.overlap_param_gather = False; either remove the if precision == "fp8_mx" block in qwen3_vl_pretrain.py (around the _qwen3_vl_pretrain_config call) to eliminate dead code, or if the intent was to keep the disabling specific to B200/FP8-MX, move those three assignments out of set_qwen3_vl_common_configs() back into the FP8-MX-specific branch so the behavior remains configuration-specific (update references to cfg.comm_overlap, cfg.ddp, and cfg.optimizer accordingly).
Lines 61-69: Make the new config builders keyword-only.

These signatures take multiple `str` parameters, so positional calls are easy to mix up. Add a `*` separator to match the repo's Python API convention. As per coding guidelines, "Use `*` separator for functions with multiple same-type parameters".

Also applies to: 98-100, 109-110, 123-125, 140-142, 156-158, 166-168, 179-181, 189-191
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py` around lines 61 - 69, The config builder signatures (e.g., _qwen3_vl_pretrain_config) accept multiple same-typed str parameters and must be made keyword-only by adding a "*" separator in the parameter list (so positional callers cannot mix up arguments); modify the function signature to place a "*" after the required positional params (for example, keeping model_recipe_name positional but making gpu, recipe_fn, precision, mock, config_variant, tp_comm_overlap keyword-only) and apply the same change to the other config builder functions in this file so all multi-str parameter builders follow the repo convention while preserving existing type annotations and defaults.

scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py (1)
Lines 143-151: Minor: redundant `tp_comm_overlap=True` parameter.

Line 150 explicitly passes `tp_comm_overlap=True`, which is already the default value in `_qwen35_vl_pretrain_config`. This is harmless but redundant compared to the other non-H100 configs that rely on the default.

Optional cleanup:

```diff
 def qwen35_vl_35b_a3b_pretrain_config_h100(
     precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
 ) -> ConfigContainer:
     """H100, baseline config."""
     return _qwen35_vl_pretrain_config(
         "qwen35_vl_35b_a3b",
         "h100",
         qwen35_vl_35b_a3b_pretrain_mock_config,
         precision=precision,
         mock=mock,
         config_variant=config_variant,
-        tp_comm_overlap=True,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py` around lines 143 - 151, The qwen35_vl_35b_a3b_pretrain_config_h100 function redundantly passes tp_comm_overlap=True to _qwen35_vl_pretrain_config even though that parameter defaults to True; remove the explicit tp_comm_overlap=True argument from qwen35_vl_35b_a3b_pretrain_config_h100 so it relies on the default in _qwen35_vl_pretrain_config, keeping the function signature and other arguments unchanged.

scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py (1)
Lines 424-470: Consider sorting `__all__` to satisfy ruff linting.

Static analysis (RUF022) flags that `__all__` is not sorted. Per coding guidelines, ruff is used for linting. You can run `uv run ruff check --fix .` to auto-sort.

Also, note that H100 FP8_MX variants are not defined for any model size (no aliases like `QWEN35_VL_35B_A3B_PRETRAIN_CONFIG_H100_FP8_MX = ...`). If H100 FP8_MX support is needed, those aliases should be added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py` around lines 424 - 470, The __all__ list is not alphabetically sorted (RUF022) and several H100 FP8_MX aliases are missing; sort the entries in __all__ alphabetically to satisfy ruff and, if H100 FP8_MX variants are required, add the corresponding alias constants (e.g., QWEN35_VL_35B_A3B_PRETRAIN_CONFIG_H100_FP8_MX, QWEN35_VL_122B_A10B_PRETRAIN_CONFIG_H100_FP8_MX, QWEN35_VL_397B_A17B_PRETRAIN_CONFIG_H100_FP8_MX) mapping them to the appropriate existing config objects; after changes run ruff --fix (or uv run ruff check --fix .) to confirm the list is sorted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ec57236a-41d4-4054-9546-6bce98f5ec9e
📒 Files selected for processing (9)
- scripts/performance/configs/qwen_vl/__init__.py
- scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py
- scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py
- scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py
- scripts/performance/configs/qwen_vl/qwen3_vl_workload_base_configs.py
- src/megatron/bridge/data/vlm_datasets/mock_provider.py
- src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
- src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
- src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
/ok to test cecd5f7

/claude review
```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
```

Nit: these two lines are no-ops; `set_qwen3_vl_common_configs()` (called inside `_qwen3_vl_pretrain_config`) already sets `cuda_graph_impl` and `cuda_graph_scope` to these exact values. Only lines 118-119 (`num_layers_in_first/last_pipeline_stage`) are actually overriding anything here.

The same applies to the 30B GB200 function below (lines 175-176), where the entire post-call block is redundant.

Referenced context:

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
cfg.model.num_layers_in_first_pipeline_stage = 10
```
```diff
 moe_flex_dispatcher_backend="hybridep",
 cuda_graph_impl="transformer_engine",
-cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
+cuda_graph_scope=["moe_router", "moe_preprocess"],
```
The 235B GB200 base configs here were updated to remove "attn" from cuda_graph_scope (good), but the 30B GB200 BF16 and FP8_CS base configs at lines 175 and 185 still include "attn".
Not a runtime bug — set_qwen3_vl_common_configs() overrides the scope — but having stale values in the base configs is misleading and inconsistent with this PR's intent. Consider updating those too for consistency.
The refactor to `_qwen3_vl_pretrain_config` / `get_workload_base_config()` looks clean and consistent with the Qwen3.5-VL pattern. The HF path fix and `"attn"` scope removal make sense.
Two minor items flagged inline:

- Redundant cuda_graph overrides in GB200 functions: `set_qwen3_vl_common_configs()` already sets these, so the explicit re-assignments in both GB200 functions are no-ops.
- Stale `"attn"` scope in 30B GB200 base configs: the 235B GB200 base configs were updated, but the 30B GB200 BF16/FP8_CS configs in `qwen3_vl_workload_base_configs.py` (lines 175, 185) still include `"attn"`. Not a runtime issue since `set_qwen3_vl_common_configs` overrides it, but worth cleaning up for consistency.
/ok to test 1b50ae1
malay-nagda left a comment:

LGTM, but the minor comments from @claude seem relevant?
/ok to test 9f5f334
Add performance-optimized pretraining configs for Qwen3-VL 30B-A3B and 235B-A22B across GB300, GB200, B200, and H100 GPUs.

Key changes:

- Refactor qwen3_vl_pretrain.py to use get_workload_base_config() pattern (consistent with qwen35_vl_pretrain.py)
- Register Qwen3-VL pretrain configs in __init__.py
- Fix 235B HF path: Qwen/Qwen3-VL-235B-A22B -> Qwen/Qwen3-VL-235B-A22B-Instruct
- Disable "attn" CUDA graph scope for Qwen3VLModel (incompatible with position_embedding_type access); keep moe_router + moe_preprocess only
- GB200 235B: set MBS=2 and uneven PP split (10/12) for best throughput
- Set common VLM-specific configs: apply_rope_fusion=False, moe_router_force_load_balancing=True, disable overlap_grad/param_gather

Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
…ttn scope

Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Force-pushed from 9f5f334 to 66c52d5
/ok to test 66c52d5
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
What does this PR do?

- Refactors `qwen3_vl_pretrain.py` to use the `get_workload_base_config()` pattern (consistent with `qwen35_vl_pretrain.py`)
- Fixes the 235B HF path (`Qwen/Qwen3-VL-235B-A22B` → `Qwen/Qwen3-VL-235B-A22B-Instruct`)
- Disables the `"attn"` CUDA graph scope for Qwen3VLModel (causes a `position_embedding_type` AttributeError); keeps `moe_router` + `moe_preprocess` only
- Sets common VLM-specific configs: `apply_rope_fusion=False`, `moe_router_force_load_balancing=True`, disables `overlap_grad/param_gather`

Changelog
GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Improvements