
Update Qwen3-VL pretrain perf configs for 30B and 235B#3327

Merged
tomlifu merged 3 commits into NVIDIA-NeMo:main from tomlifu:lifuz/qwen3-vl-pretrain-perf
Apr 15, 2026

Conversation

@tomlifu
Contributor

@tomlifu tomlifu commented Apr 14, 2026

What does this PR do ?

  • Add performance-optimized pretraining configs for Qwen3-VL 30B-A3B and 235B-A22B across GB300, GB200, B200, and H100 GPUs
  • Refactor qwen3_vl_pretrain.py to use get_workload_base_config() pattern (consistent with qwen35_vl_pretrain.py)
  • Fix 235B HF path (Qwen/Qwen3-VL-235B-A22B → Qwen/Qwen3-VL-235B-A22B-Instruct)
  • Disable "attn" CUDA graph scope for Qwen3VLModel (causes position_embedding_type AttributeError) — keep moe_router + moe_preprocess only
  • GB200 235B: set MBS=2 and uneven PP split (10/12) for best throughput
  • Set common VLM-specific configs: apply_rope_fusion=False, moe_router_force_load_balancing=True, disable overlap_grad/param_gather
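The common VLM-specific settings listed above can be sketched as a small helper. This is an illustration only: the function name and the exact sub-config layout (`cfg.model`, `cfg.ddp`) are assumptions, not the actual `set_qwen3_vl_common_configs` implementation in the repo.

```python
from types import SimpleNamespace

def apply_qwen3_vl_common_settings(cfg):
    """Sketch of the common VLM-specific settings described in this PR.

    Hypothetical helper; field names mirror the PR description, and the
    real helper in scripts/performance/configs/qwen_vl/ may differ.
    """
    # RoPE fusion is disabled for the VLM path.
    cfg.model.apply_rope_fusion = False
    # Force-balanced MoE routing for stable perf measurements.
    cfg.model.moe_router_force_load_balancing = True
    # "attn" is excluded from the CUDA graph scope: capturing attention
    # trips a position_embedding_type AttributeError on Qwen3VLModel.
    cfg.model.cuda_graph_impl = "transformer_engine"
    cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
    # Gradient-reduce / param-gather overlap is disabled for these workloads.
    cfg.ddp.overlap_grad_reduce = False
    cfg.ddp.overlap_param_gather = False
    return cfg

cfg = SimpleNamespace(model=SimpleNamespace(), ddp=SimpleNamespace())
apply_qwen3_vl_common_settings(cfg)
```

Applying one shared helper keeps these invariants identical across every GPU/precision variant instead of repeating them per factory function.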

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added pretraining configurations for Qwen3.5-VL models (35B, 122B, and 397B variants) across multiple GPU types with BF16 and FP8 precision support.
  • Improvements

    • Updated Qwen3-VL recipes to use the Instruct variant checkpoint for better instruction-following capabilities.
    • Enhanced mock data generation for vision language model datasets with improved variability.
    • Improved parallel group size handling for encoder-parallel configurations.

@copy-pr-bot

copy-pr-bot bot commented Apr 14, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tomlifu tomlifu force-pushed the lifuz/qwen3-vl-pretrain-perf branch from ae8eb97 to cecd5f7 Compare April 14, 2026 22:19
@tomlifu tomlifu self-assigned this Apr 14, 2026
@tomlifu tomlifu added 26.04.01 r0.4.0 Auto-cherrypick to release branch. Apply before merge; cherrypick happens after merge. performance performance/release Performance items related with NeMo release area:perf Performance optimizations and benchmarking labels Apr 14, 2026
@coderabbitai
Contributor

coderabbitai bot commented Apr 14, 2026

📝 Walkthrough

Walkthrough

Adds Qwen3.5-VL pretraining configuration modules with model-specific workload base configs for three model sizes (35B-A3B, 122B-A10B, 397B-A17B) across multiple GPU types. Refactors existing Qwen3-VL configuration to use a unified helper function pattern. Updates mock data generation to produce larger synthetic examples and modifies recipe HF model paths to instruct variants.

Changes

Cohort / File(s) / Summary

  • Qwen3.5-VL Configuration Addition
    Files: scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py, scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py, scripts/performance/configs/qwen_vl/__init__.py
    Summary: Introduces a new Qwen3.5-VL pretraining config module with a shared common config routine and GPU-specific factory functions (35B-A3B, 122B-A10B, 397B-A17B across gb300/b300/gb200/b200/h100). Adds 45 new workload base config constants with hardware-dependent parallelism and batch size settings. Updates module exports to include all new config symbols.
  • Qwen3-VL Configuration Refactoring
    Files: scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py, scripts/performance/configs/qwen_vl/qwen3_vl_workload_base_configs.py
    Summary: Refactors Qwen3-VL pretrain functions to use a unified _qwen3_vl_pretrain_config helper, replacing hardcoded constant selection. Renames exported factories from *_mock_config_* to *_pretrain_config_*. Updates GB200 presets to set micro_batch_size=2 and adjust the CUDA graph scope.
  • Recipe and Bridge Updates
    Files: src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py, src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py, src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py, src/megatron/bridge/data/vlm_datasets/mock_provider.py
    Summary: Updates the Qwen3-VL recipe HF model path from the base to the instruct variant. Modifies encoder-parallel group detection in the step forward pass. Expands mock data generation to produce 1000 synthetic examples with variable response lengths.
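The mock-data expansion noted above (1000 synthetic examples with variable response lengths) can be sketched as follows. This is an illustration only: the function name, field names, seed, and length range are assumptions, not the actual mock_provider.py code.

```python
import random

def make_mock_vlm_examples(n=1000, seed=0):
    """Generate n synthetic VLM examples with variable response lengths.

    Hypothetical sketch of the idea, not the real provider: varying the
    response length per example exercises padding/packing paths that a
    fixed-length mock dataset would never hit.
    """
    rng = random.Random(seed)  # deterministic for reproducible runs
    examples = []
    for i in range(n):
        response_len = rng.randint(16, 256)  # length varies per example
        examples.append({
            "id": i,
            "prompt": f"describe image {i}",
            "response_tokens": [rng.randrange(32000) for _ in range(response_len)],
        })
    return examples

data = make_mock_vlm_examples()
print(len(data))  # 1000
```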

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

performance, area:perf, feature, area:recipe, performance/release, needs-review

Suggested reviewers

  • cuichenx
  • yaoyu-33
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes: ⚠️ Warning
    Explanation: PR contains major performance-optimized configuration changes but lacks test results, performance benchmarks, or convergence validation information required for such changes.
    Resolution: Add test results, before/after performance benchmarks on relevant hardware (GB300, GB200, B200, H100), and configuration context to validate no regression occurred.

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed
    Explanation: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed
    Explanation: The title 'Update Qwen3-VL pretrain perf configs for 30B and 235B' accurately summarizes the main objective of adding performance-optimized pretraining configurations for Qwen3-VL 30B and 235B variants across multiple GPU platforms.
  • Docstring Coverage: ✅ Passed
    Explanation: Docstring coverage is 94.12%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py (1)

71-88: ⚠️ Potential issue | 🟠 Major

Apply workload parallelism before building the recipe.

recipe_fn(...) computes some derived fields from its incoming parallelism, but set_workload_base_configs(...) mutates that parallelism afterward. The concrete breakage is pipeline_dtype: _qwen3_vl_common() sets it from pipeline_model_parallel_size in src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py Line 121, while set_workload_base_configs() never recomputes it. That leaves configs like the GB300 variants with pipeline_model_parallel_size=1 but a stale pipeline dtype from the recipe defaults.

🔧 Suggested fix
     cfg = recipe_fn(
         mock=mock,
         precision_config=precision_config,
         comm_overlap_config=CommOverlapConfig(tp_comm_overlap=tp_comm_overlap),
         moe_flex_dispatcher_backend=base_cfg.moe_flex_dispatcher_backend,
+        tensor_model_parallel_size=base_cfg.tensor_model_parallel_size,
+        pipeline_model_parallel_size=base_cfg.pipeline_model_parallel_size,
+        expert_model_parallel_size=base_cfg.expert_model_parallel_size,
+        context_parallel_size=base_cfg.context_parallel_size,
+        sequence_parallel=base_cfg.sequence_parallel,
+        global_batch_size=base_cfg.global_batch_size,
+        micro_batch_size=base_cfg.micro_batch_size,
     )
     set_workload_base_configs(cfg, base_cfg)
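The ordering problem flagged above can be reproduced with a toy config: a field derived at build time goes stale if the parallelism is mutated afterwards. Everything below is illustrative (ToyConfig, build_recipe); it is not the real recipe API, only a minimal model of the pipeline_dtype issue.

```python
from dataclasses import dataclass

@dataclass
class ToyConfig:
    pipeline_model_parallel_size: int = 4
    pipeline_dtype: str = ""  # derived at build time

def build_recipe(pp_size: int) -> ToyConfig:
    cfg = ToyConfig(pipeline_model_parallel_size=pp_size)
    # Derived field, computed once from the incoming parallelism
    # (mirrors pipeline_dtype being set from pipeline_model_parallel_size
    # inside the recipe function).
    cfg.pipeline_dtype = "bf16" if pp_size > 1 else "none"
    return cfg

# Buggy order: build first, then overwrite parallelism afterwards.
cfg = build_recipe(pp_size=4)
cfg.pipeline_model_parallel_size = 1   # workload base config applied too late
print(cfg.pipeline_dtype)              # "bf16": stale, inconsistent with PP=1

# Fixed order: feed the workload parallelism into the builder up front.
cfg_fixed = build_recipe(pp_size=1)
print(cfg_fixed.pipeline_dtype)        # "none": consistent with PP=1
```

This is why the suggested fix passes the base-config parallelism into recipe_fn(...) rather than patching the config after the fact.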
🧹 Nitpick comments (4)
scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py (2)

127-136: The FP8-MX overlap guard is now dead code.

set_qwen3_vl_common_configs() already forces overlap_param_gather = False on cfg.comm_overlap, cfg.ddp, and cfg.optimizer for every config, so this branch no longer changes behavior. Either drop it or move the disabling back out of the common helper if it is meant to stay B200/FP8-MX-specific.


61-69: Make the new config builders keyword-only.

These signatures take multiple str parameters, so positional calls are easy to mix up. Add a * separator to match the repo’s Python API convention.

As per coding guidelines, "Use * separator for functions with multiple same-type parameters".

Also applies to: 98-100, 109-110, 123-125, 140-142, 156-158, 166-168, 179-181, 189-191

scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py (1)

143-151: Minor: Redundant tp_comm_overlap=True parameter.

Line 150 explicitly passes tp_comm_overlap=True, which is already the default value in _qwen35_vl_pretrain_config. This is harmless but redundant compared to other non-H100 configs that rely on the default.

Optional cleanup
 def qwen35_vl_35b_a3b_pretrain_config_h100(
     precision: str = "bf16", mock: bool = True, config_variant: str = "v1"
 ) -> ConfigContainer:
     """H100, baseline config."""
     return _qwen35_vl_pretrain_config(
         "qwen35_vl_35b_a3b", "h100", qwen35_vl_35b_a3b_pretrain_mock_config,
         precision=precision, mock=mock, config_variant=config_variant,
-        tp_comm_overlap=True,
     )
scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py (1)

424-470: Consider sorting __all__ to satisfy ruff linting.

Static analysis (RUF022) flags that __all__ is not sorted. Per coding guidelines, ruff is used for linting. You can run uv run ruff check --fix . to auto-sort.

Also, note that H100 FP8_MX variants are not defined for any model size (no aliases like QWEN35_VL_35B_A3B_PRETRAIN_CONFIG_H100_FP8_MX = ...). If H100 FP8_MX support is needed, those aliases should be added.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ec57236a-41d4-4054-9546-6bce98f5ec9e

📥 Commits

Reviewing files that changed from the base of the PR and between 9b38f65 and ae8eb97.

📒 Files selected for processing (9)
  • scripts/performance/configs/qwen_vl/__init__.py
  • scripts/performance/configs/qwen_vl/qwen35_vl_pretrain.py
  • scripts/performance/configs/qwen_vl/qwen35_vl_workload_base_configs.py
  • scripts/performance/configs/qwen_vl/qwen3_vl_pretrain.py
  • scripts/performance/configs/qwen_vl/qwen3_vl_workload_base_configs.py
  • src/megatron/bridge/data/vlm_datasets/mock_provider.py
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
  • src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py
  • src/megatron/bridge/recipes/qwen_vl/qwen3_vl.py

@yaoyu-33
Contributor

/ok to test cecd5f7

@yaoyu-33
Contributor

/claude review

Comment on lines +116 to +117
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
Contributor


Nit: These two lines are no-ops — set_qwen3_vl_common_configs() (called inside _qwen3_vl_pretrain_config) already sets cuda_graph_impl and cuda_graph_scope to these exact values. Only lines 118-119 (num_layers_in_first/last_pipeline_stage) are actually overriding anything here.

Same applies to the 30B GB200 function below (lines 175-176) where the entire post-call block is redundant.

Suggested change
-cfg.model.cuda_graph_impl = "transformer_engine"
-cfg.model.cuda_graph_scope = ["moe_router", "moe_preprocess"]
 cfg.model.num_layers_in_first_pipeline_stage = 10

 moe_flex_dispatcher_backend="hybridep",
 cuda_graph_impl="transformer_engine",
-cuda_graph_scope=["attn", "moe_router", "moe_preprocess"],
+cuda_graph_scope=["moe_router", "moe_preprocess"],
Contributor


The 235B GB200 base configs here were updated to remove "attn" from cuda_graph_scope (good), but the 30B GB200 BF16 and FP8_CS base configs at lines 175 and 185 still include "attn".

Not a runtime bug — set_qwen3_vl_common_configs() overrides the scope — but having stale values in the base configs is misleading and inconsistent with this PR's intent. Consider updating those too for consistency.

Contributor

@claude claude bot left a comment


The refactor to _qwen3_vl_pretrain_config / get_workload_base_config() looks clean and consistent with the Qwen3.5-VL pattern. HF path fix and "attn" scope removal make sense.

Two minor items flagged inline:

  1. Redundant cuda_graph overrides in GB200 functions: set_qwen3_vl_common_configs() already sets these, so the explicit re-assignments in both GB200 functions are no-ops.
  2. Stale "attn" scope in 30B GB200 base configs — the 235B GB200 base configs were updated but the 30B GB200 BF16/FP8_CS configs in qwen3_vl_workload_base_configs.py (lines 175, 185) still include "attn". Not a runtime issue since set_qwen3_vl_common_configs overrides it, but worth cleaning up for consistency.

@tomlifu
Contributor Author

tomlifu commented Apr 15, 2026

/ok to test 1b50ae1

malay-nagda
malay-nagda previously approved these changes Apr 15, 2026
Contributor

@malay-nagda malay-nagda left a comment


LGTM.

but, minor comments from @claude seem relevant?

@tomlifu
Contributor Author

tomlifu commented Apr 15, 2026

/ok to test 9f5f334

Lifu Zhang added 3 commits April 14, 2026 22:11
Add performance-optimized pretraining configs for Qwen3-VL 30B-A3B and
235B-A22B across GB300, GB200, B200, and H100 GPUs.

Key changes:
- Refactor qwen3_vl_pretrain.py to use get_workload_base_config() pattern
  (consistent with qwen35_vl_pretrain.py)
- Register Qwen3-VL pretrain configs in __init__.py
- Fix 235B HF path: Qwen/Qwen3-VL-235B-A22B -> Qwen/Qwen3-VL-235B-A22B-Instruct
- Disable "attn" CUDA graph scope for Qwen3VLModel (incompatible with
  position_embedding_type access) — keep moe_router + moe_preprocess only
- GB200 235B: set MBS=2 and uneven PP split (10/12) for best throughput
- Set common VLM-specific configs: apply_rope_fusion=False,
  moe_router_force_load_balancing=True, disable overlap_grad/param_gather

Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
…ttn scope

Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
@tomlifu tomlifu force-pushed the lifuz/qwen3-vl-pretrain-perf branch from 9f5f334 to 66c52d5 Compare April 15, 2026 05:13
@tomlifu
Contributor Author

tomlifu commented Apr 15, 2026

/ok to test 66c52d5

@tomlifu tomlifu enabled auto-merge (squash) April 15, 2026 16:52
@tomlifu tomlifu merged commit 7778d4c into NVIDIA-NeMo:main Apr 15, 2026
57 checks passed
svcnvidia-nemo-ci pushed a commit that referenced this pull request Apr 15, 2026
Signed-off-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Co-authored-by: Lifu Zhang <lifuz@login-lyris02.lyris.clusters.nvidia.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>