Update Qwen3-VL pretrain perf configs for 30B and 235B #3327

tomlifu merged 3 commits into NVIDIA-NeMo:main
Conversation
Force-pushed from ae8eb97 to cecd5f7
📝 Walkthrough

Adds Qwen3.5-VL pretraining configuration modules with model-specific workload base configs for three model sizes (35B-A3B, 122B-A10B, 397B-A17B) across multiple GPU types. Refactors the existing Qwen3-VL configuration to use a unified helper-function pattern. Updates mock data generation to produce larger synthetic examples and changes recipe HF model paths to instruct variants.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed | ❌ 1 failed (1 warning)
Caution: Some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py (1)
Lines 71-88: ⚠️ Potential issue | 🟠 Major: Apply workload parallelism before building the recipe.
`recipe_fn(...)` computes some derived fields from its incoming parallelism, but `set_workload_base_configs(...)` mutates that parallelism afterward. The concrete breakage is `pipeline_dtype`: `_qwen3_vl_common()` sets it from `pipeline_model_parallel_size` in `src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py` line 121, while `set_workload_base_configs()` never recomputes it. That leaves configs like the GB300 variants with `pipeline_model_parallel_size=1` but a stale pipeline dtype from the recipe defaults.

🔧 Suggested fix

```diff
 cfg = recipe_fn(
     mock=mock,
     precision_config=precision_config,
     comm_overlap_config=CommOverlapConfig(tp_comm_overlap=tp_comm_overlap),
     moe_flex_dispatcher_backend=base_cfg.moe_flex_dispatcher_backend,
+    tensor_model_parallel_size=base_cfg.tensor_model_parallel_size,
+    pipeline_model_parallel_size=base_cfg.pipeline_model_parallel_size,
+    expert_model_parallel_size=base_cfg.expert_model_parallel_size,
+    context_parallel_size=base_cfg.context_parallel_size,
+    sequence_parallel=base_cfg.sequence_parallel,
+    global_batch_size=base_cfg.global_batch_size,
+    micro_batch_size=base_cfg.micro_batch_size,
 )
 set_workload_base_configs(cfg, base_cfg)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py` around lines 71 - 88, The recipe is built from stale parallelism because set_workload_base_configs(cfg, base_cfg) runs after recipe_fn(...) — move the workload-parallelism application before constructing the recipe so derived fields are computed from the final parallelism: call set_workload_base_configs(base_cfg) (i.e., apply base_cfg to cfg) prior to invoking recipe_fn, or alternatively call recipe_fn after set_workload_base_configs and then run set_qwen3_vl_common_configs(cfg); ensure pipeline_dtype (computed in _qwen3_vl_common()/set_qwen3_vl_common_configs) reflects pipeline_model_parallel_size from base_cfg.
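The ordering hazard described above can be reproduced in miniature. The sketch below uses toy stand-ins (`Cfg`, `recipe_fn`, `set_workload_base_configs` here are hypothetical, not the actual repo API) to show how a derived field goes stale when parallelism is mutated after the build:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cfg:
    pipeline_model_parallel_size: int
    pipeline_dtype: Optional[str] = None


def recipe_fn(pp: int) -> Cfg:
    """Build a config; derived fields are computed from the incoming parallelism."""
    cfg = Cfg(pipeline_model_parallel_size=pp)
    cfg.pipeline_dtype = "bfloat16" if pp > 1 else None  # derived at build time
    return cfg


def set_workload_base_configs(cfg: Cfg, pp: int) -> None:
    """Mutate parallelism afterward, without recomputing derived fields."""
    cfg.pipeline_model_parallel_size = pp


cfg = recipe_fn(pp=4)                 # pipeline_dtype becomes "bfloat16"
set_workload_base_configs(cfg, pp=1)  # PP is now 1, but the dtype is stale
print(cfg.pipeline_model_parallel_size, cfg.pipeline_dtype)  # 1 bfloat16
```

Passing the workload parallelism into the build itself, as the suggested fix does, keeps derived fields consistent with the final parallelism.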
🧹 Nitpick comments (4)
scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py (2)
Lines 127-136: The FP8-MX overlap guard is now dead code.

`set_qwen3_vl_common_configs()` already forces `overlap_param_gather = False` on `cfg.comm_overlap`, `cfg.ddp`, and `cfg.optimizer` for every config, so this branch no longer changes behavior. Either drop it or move the disabling back out of the common helper if it is meant to stay B200/FP8-MX-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py` around lines 127 - 136, The branch that disables overlap_param_gather for precision "fp8_mx" is now redundant because set_qwen3_vl_common_configs() already forces cfg.comm_overlap.overlap_param_gather = False, cfg.ddp.overlap_param_gather = False, and cfg.optimizer.overlap_param_gather = False; either remove the if precision == "fp8_mx" block in qwen3_vl_pretrain.py (around the _qwen3_vl_pretrain_config call) to eliminate dead code, or if the intent was to keep the disabling specific to B200/FP8-MX, move those three assignments out of set_qwen3_vl_common_configs() back into the FP8-MX-specific branch so the behavior remains configuration-specific (update references to cfg.comm_overlap, cfg.ddp, and cfg.optimizer accordingly).
Lines 61-69: Make the new config builders keyword-only.

These signatures take multiple `str` parameters, so positional calls are easy to mix up. Add a `*` separator to match the repo's Python API convention. As per coding guidelines, "Use `*` separator for functions with multiple same-type parameters".

Also applies to: 98-100, 109-110, 123-125, 140-142, 156-158, 166-168, 179-181, 189-191
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py` around lines 61 - 69, The config builder signatures (e.g., _qwen3_vl_pretrain_config) accept multiple same-typed str parameters and must be made keyword-only by adding a "*" separator in the parameter list (so positional callers cannot mix up arguments); modify the function signature to place a "*" after the required positional params (for example, keeping model_recipe_name positional but making gpu, recipe_fn, precision, mock, config_variant, tp_comm_overlap keyword-only) and apply the same change to the other config builder functions in this file so all multi-str parameter builders follow the repo convention while preserving existing type annotations and defaults.

scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py (1)
Lines 143-151: Minor: redundant `tp_comm_overlap=True` parameter.

Line 150 explicitly passes `tp_comm_overlap=True`, which is already the default value in `_qwen35_vl_pretrain_config`. This is harmless but redundant compared to the other non-H100 configs that rely on the default.

Optional cleanup:

```diff
 def qwen35_vl_35b_a3b_pretrain_config_h100(
     precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
 ) -> ConfigContainer:
     """H100, baseline config."""
     return _qwen35_vl_pretrain_config(
         "qwen35_vl_35b_a3b",
         "h100",
         qwen35_vl_35b_a3b_pretrain_mock_config,
         precision=precision,
         mock=mock,
         config_variant=config_variant,
-        tp_comm_overlap=True,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py` around lines 143 - 151, The qwen35_vl_35b_a3b_pretrain_config_h100 function redundantly passes tp_comm_overlap=True to _qwen35_vl_pretrain_config even though that parameter defaults to True; remove the explicit tp_comm_overlap=True argument from qwen35_vl_35b_a3b_pretrain_config_h100 so it relies on the default in _qwen35_vl_pretrain_config, keeping the function signature and other arguments unchanged.

scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py (1)
Lines 424-470: Consider sorting `__all__` to satisfy ruff linting.

Static analysis (RUF022) flags that `__all__` is not sorted. Per coding guidelines, ruff is used for linting. You can run `uv run ruff check --fix .` to auto-sort.

Also, note that H100 FP8_MX variants are not defined for any model size (no aliases like `QWEN35_VL_35B_A3B_PRETRAIN_CONFIG_H100_FP8_MX = ...`). If H100 FP8_MX support is needed, those aliases should be added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py` around lines 424 - 470, The __all__ list is not alphabetically sorted (RUF022) and several H100 FP8_MX aliases are missing; sort the entries in __all__ alphabetically to satisfy ruff and, if H100 FP8_MX variants are required, add the corresponding alias constants (e.g., QWEN35_VL_35B_A3B_PRETRAIN_CONFIG_H100_FP8_MX, QWEN35_VL_122B_A10B_PRETRAIN_CONFIG_H100_FP8_MX, QWEN35_VL_397B_A17B_PRETRAIN_CONFIG_H100_FP8_MX) mapping them to the appropriate existing config objects; after changes run ruff --fix (or uv run ruff check --fix .) to confirm the list is sorted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ec57236a-41d4-4054-9546-6bce98f5ec9e
📒 Files selected for processing (9)
- scripts/performance/configs/qwen_vl/__init__.py
- scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py
- scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py
- scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py
- scripts/performance/configs/qwen_vl/qwen3_vl_workload_base_configs.py
- src/megatron/bridge/data/vlm_datasets/mock_provider.py
- src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
- src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
- src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py
/ok to test cecd5f7

/claude review
```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
```

Nit: these two lines are no-ops; `set_qwen3_vl_common_configs()` (called inside `_qwen3_vl_pretrain_config`) already sets `cuda_graph_impl` and `cuda_graph_scope` to these exact values. Only lines 118-119 (`num_layers_in_first/last_pipeline_stage`) are actually overriding anything here.

The same applies to the 30B GB200 function below (lines 175-176), where the entire post-call block is redundant.

Referenced context:

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
cfg.model.num_layers_in_first_pipeline_stage = 10
```
```diff
 moe_flex_dispatcher_backend="hybridep",
 cuda_graph_impl="transformer_engine",
-cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
+cuda_graph_scope=["moe_router", "moe_preprocess"],
```
The 235B GB200 base configs here were updated to remove "attn" from cuda_graph_scope (good), but the 30B GB200 BF16 and FP8_CS base configs at lines 175 and 185 still include "attn".
Not a runtime bug — set_qwen3_vl_common_configs() overrides the scope — but having stale values in the base configs is misleading and inconsistent with this PR's intent. Consider updating those too for consistency.
The refactor to `_qwen3_vl_pretrain_config` / `get_workload_base_config()` looks clean and consistent with the Qwen3.5-VL pattern. The HF path fix and `"attn"` scope removal make sense.
Two minor items flagged inline:

- Redundant cuda_graph overrides in GB200 functions: `set_qwen3_vl_common_configs()` already sets these, so the explicit re-assignments in both GB200 functions are no-ops.
- Stale `"attn"` scope in 30B GB200 base configs: the 235B GB200 base configs were updated, but the 30B GB200 BF16/FP8_CS configs in `qwen3_vl_workload_base_configs.py` (lines 175, 185) still include `"attn"`. Not a runtime issue since `set_qwen3_vl_common_configs` overrides it, but worth cleaning up for consistency.
/ok to test 1b50ae1
malay-nagda left a comment:

LGTM, but the minor comments from @claude seem relevant?
/ok to test 9f5f334
Add performance-optimized pretraining configs for Qwen3-VL 30B-A3B and 235B-A22B across GB300, GB200, B200, and H100 GPUs.

Key changes:

- Refactor qwen3_vl_pretrain.py to use get_workload_base_config() pattern (consistent with qwen35_vl_pretrain.py)
- Register Qwen3-VL pretrain configs in __init__.py
- Fix 235B HF path: Qwen/Qwen3-VL-235B-A22B -> Qwen/Qwen3-VL-235B-A22B-Instruct
- Disable "attn" CUDA graph scope for Qwen3VLModel (incompatible with position_embedding_type access); keep moe_router + moe_preprocess only
- GB200 235B: set MBS=2 and uneven PP split (10/12) for best throughput
- Set common VLM-specific configs: apply_rope_fusion=False, moe_router_force_load_balancing=True, disable overlap_grad/param_gather

Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
…ttn scope

Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Force-pushed from 9f5f334 to 66c52d5
/ok to test 66c52d5
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
What does this PR do?

- Refactors `qwen3_vl_pretrain.py` to use the `get_workload_base_config()` pattern (consistent with `qwen35_vl_pretrain.py`)
- Fixes the 235B HF path (`Qwen/Qwen3-VL-235B-A22B` → `Qwen/Qwen3-VL-235B-A22B-Instruct`)
- Disables the `"attn"` CUDA graph scope for Qwen3VLModel (causes a `position_embedding_type` AttributeError); keeps `moe_router` + `moe_preprocess` only
- Sets common VLM-specific configs: `apply_rope_fusion=False`, `moe_router_force_load_balancing=True`, disables `overlap_grad/param_gather`

Changelog
GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Improvements