
[model, training] fix: align Qwen3-VL padding for HybridEP #3294

Open
neiblegy wants to merge 1 commit into NVIDIA-NeMo:main from neiblegy:fix/qwen3-vl-hybridep-padding

Conversation

@neiblegy

@neiblegy neiblegy commented Apr 13, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • align Qwen3-VL batch padding with HybridEP per-rank token alignment requirements

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added support for Hybrid Exchange-Padding (HybridEP) backend configuration for advanced padding scenarios
    • Enabled sequence parallel processing capabilities with improved padding computation
  • Refactor

    • Improved padding logic with enhanced configuration parameter support

Signed-off-by: protoss.gao <protoss.gao@shopee.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

Added a helper function to compute padding-related values incorporating hybrid EP and sequence parallel settings. Extended pack_or_pad_batch_sequences() with new parameters for these features. Modified forward_step() to detect hybrid EP from model config and propagate both settings through the padding function.

Changes

Cohort / File(s) Summary
Qwen3 VL Padding Configuration
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
Added _get_qwen3_vl_padding_multiple() helper function to compute padding multiples based on hybrid EP and sequence parallel flags. Extended pack_or_pad_batch_sequences() signature with use_hybridep and sequence_parallel parameters. Updated forward_step() to detect hybrid EP from model config and pass both settings to padding function.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): PR modifies padding calculation logic affecting numerics and HybridEP token alignment, but lacks test results, new tests, performance metrics, or convergence validation documentation. Resolution: add test results demonstrating no regressions, including convergence validation for HybridEP scenarios, unit tests for the new padding functions, or performance metrics comparing the old and new logic.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title directly describes the main change: fixing and aligning Qwen3-VL padding specifically for HybridEP support.
  • Docstring Coverage (✅ Passed): docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)

210-222: ⚠️ Potential issue | 🟠 Major

Don’t bypass the new HybridEP alignment when PP forces seq_length.

Line 221 overwrites the HybridEP-aware target_len with seq_length. If that configured length is not divisible by divisible_by, PP+HybridEP can still hit the same per-rank token-alignment failure this change is meant to avoid.

Proposed fix
     target_len = math.ceil(cur_len / divisible_by) * divisible_by
     if force_to_pad_to_seq_len:
+        if seq_length % divisible_by != 0:
+            raise ValueError(
+                f"seq_length={seq_length} must be divisible by {divisible_by} "
+                "for the current TP/CP/SP/HybridEP padding requirements."
+            )
         target_len = seq_length
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 210 - 222,
The code overwrites the HybridEP-aware target_len with seq_length when
force_to_pad_to_seq_len is true, which can break the divisible_by alignment
computed by _get_qwen3_vl_padding_multiple; instead, when
force_to_pad_to_seq_len is set, compute target_len by rounding seq_length up to
the nearest multiple of divisible_by (e.g., target_len = ceil(seq_length /
divisible_by) * divisible_by) so the HybridEP/PP alignment remains respected;
update the logic around target_len (the block that sets target_len and checks
force_to_pad_to_seq_len) to use divisible_by rather than directly assigning
seq_length.
🧹 Nitpick comments (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)

193-197: Make the padding flags keyword-only.

Adding two more positional bool args makes this API easier to misuse, and the current caller already passes them by name. A * here would match the repository convention and prevent accidental misordering.

Proposed fix
 def pack_or_pad_batch_sequences(
     tokens: torch.Tensor,
     labels: torch.Tensor,
     loss_mask: torch.Tensor,
     attention_mask: torch.Tensor,
     position_ids: torch.Tensor,
     this_pg_collection,
+    *,
     use_fp8_padding: bool = False,
     use_hybridep: bool = False,
     sequence_parallel: bool = False,
     force_to_pad_to_seq_len: bool = False,
     seq_length: int = None,

As per coding guidelines, "Use * separator for functions with multiple same-type parameters".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 193 - 197,
The boolean padding/parallel flags should be keyword-only to prevent accidental
misordering: update the function/method signature that declares use_fp8_padding,
use_hybridep, sequence_parallel, force_to_pad_to_seq_len, seq_length (in
qwen3_vl_step.py) by inserting a '*' before those parameters so callers must
pass use_fp8_padding, use_hybridep, sequence_parallel, force_to_pad_to_seq_len,
and seq_length by name; ensure no other callers rely on positional ordering and
run tests/formatting after the change.
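The keyword-only convention this comment refers to can be shown with a toy signature (names simplified; this is not the real `pack_or_pad_batch_sequences`):

```python
def pack_sequences(tokens, *, use_hybridep=False, sequence_parallel=False):
    """Flags after the '*' separator can only be passed by keyword."""
    return use_hybridep, sequence_parallel

# Keyword call works and is unambiguous:
print(pack_sequences("toks", use_hybridep=True))  # (True, False)

# Positional booleans are rejected, preventing accidental misordering:
try:
    pack_sequences("toks", True)
except TypeError as exc:
    print(f"TypeError: {exc}")
```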

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e0112263-bf0e-4e9e-b6a8-b4ceb8601b52

📥 Commits

Reviewing files that changed from the base of the PR and between 8dfe249 and e6e9be4.

📒 Files selected for processing (1)
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py

Comment on lines +41 to +70
def _get_qwen3_vl_padding_multiple(
    *,
    batch_size: int,
    tp_size: int,
    cp_size: int,
    use_fp8_padding: bool,
    use_hybridep: bool,
    sequence_parallel: bool,
) -> tuple[int, int]:
    """Return the sequence padding multiple and local token divisor.

    HybridEP requires the per-rank token count after CP/SP sharding to be
    divisible by 128, otherwise the runtime JIT kernels can fail to compile.
    """
    base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base_divisible_by = math.lcm(base_divisible_by, 16)

    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size

    divisible_by = base_divisible_by
    if use_hybridep:
        required_seq_multiple = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

    return divisible_by, local_token_divisor
Contributor


⚠️ Potential issue | 🟡 Minor

Return only the value that callers actually use.

local_token_divisor is never used in any later computation, so the second tuple element just forces the caller to bind an unused name and triggers Ruff at line 210.

Proposed fix
 def _get_qwen3_vl_padding_multiple(
@@
-) -> tuple[int, int]:
-    """Return the sequence padding multiple and local token divisor.
+) -> int:
+    """Return the sequence padding multiple.
@@
-    return divisible_by, local_token_divisor
+    return divisible_by

-    divisible_by, local_token_divisor = _get_qwen3_vl_padding_multiple(
+    divisible_by = _get_qwen3_vl_padding_multiple(
         batch_size=batch_size,
         tp_size=tp_size,
         cp_size=cp_size,
         use_fp8_padding=use_fp8_padding,
         use_hybridep=use_hybridep,
         sequence_parallel=sequence_parallel,
     )

As per coding guidelines, "Use ruff for linting and formatting Python code. Run uv run ruff check --fix . and uv run ruff format . to fix most issues. CI does not auto-fix linting and formatting issues."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _get_qwen3_vl_padding_multiple(
    *,
    batch_size: int,
    tp_size: int,
    cp_size: int,
    use_fp8_padding: bool,
    use_hybridep: bool,
    sequence_parallel: bool,
) -> int:
    """Return the sequence padding multiple.

    HybridEP requires the per-rank token count after CP/SP sharding to be
    divisible by 128, otherwise the runtime JIT kernels can fail to compile.
    """
    base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base_divisible_by = math.lcm(base_divisible_by, 16)

    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size

    divisible_by = base_divisible_by
    if use_hybridep:
        required_seq_multiple = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

    return divisible_by
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 41 - 70,
The helper _get_qwen3_vl_padding_multiple currently returns a tuple
(divisible_by, local_token_divisor) but callers only use divisible_by; change
the function to return just divisible_by (remove local_token_divisor from the
return and from the public contract), keep the internal computation for
local_token_divisor if needed for computing divisible_by, and update all call
sites that unpack two values to accept a single return (e.g., replace
"divisible_by, _ = _get_qwen3_vl_padding_multiple(...)" or similar with
"divisible_by = _get_qwen3_vl_padding_multiple(...)"). Ensure the function
signature/type hints reflect -> int and run ruff formatting/lint fixes.

@neiblegy
Author

Problem

When running Qwen3-VL MoE training with HybridEP enabled, training can fail on the first MoE dispatch/combine path with a runtime JIT compilation error from deep_ep.

In our reproduction, the failure happens with a setup like:

  • moe_token_dispatcher_type=flex
  • moe_flex_dispatcher_backend=hybridep
  • tp=1
  • ep=8
  • pp=1
  • micro_batch_size=4

The underlying issue is that qwen3_vl_step.py only pads sequences according to the existing TP/CP/FP8 alignment logic, and only forces padding to model.seq_length when PP > 1. For HybridEP, however, the per-rank token count must additionally satisfy stricter alignment requirements for the runtime JIT kernels.

As a result, some valid batches can produce a per-rank token count that is not divisible by the chunk size expected by HybridEP. In our case, this led to a runtime compile failure with an error like:

MAX_NUM_OF_TOKENS_PER_RANK must be multiple of NUM_OF_TOKENS_PER_CHUNK

So this is not a model-quality issue or a user misconfiguration of MoE routing itself; it is a padding/alignment gap in the Qwen3-VL step path when HybridEP is used.
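To see how the fix closes this gap for the repro config (tp=1, cp=1, no sequence parallelism, no FP8 padding, micro_batch_size=4), here is a standalone sketch of the PR's padding-multiple formula; variable names mirror `_get_qwen3_vl_padding_multiple`, but the script itself is illustrative:

```python
import math

batch_size, tp_size, cp_size = 4, 1, 1
use_hybridep = True

base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size  # 1
local_token_divisor = cp_size  # 1 (no sequence parallelism)

divisible_by = base_divisible_by
if use_hybridep:
    # Per-rank tokens scale as batch_size * seq_len / local_token_divisor and
    # must be a multiple of 128, so seq_len must be a multiple of:
    required_seq_multiple = (128 * local_token_divisor) // math.gcd(
        batch_size, 128 * local_token_divisor
    )  # 128 // gcd(4, 128) = 32
    divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

print(divisible_by)  # 32
```

Padding every sequence to a multiple of 32 then yields 4 × 32k = 128k tokens per rank, a multiple of HybridEP's 128-token chunk size, so the `MAX_NUM_OF_TOKENS_PER_RANK` assertion holds.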

@neiblegy
Author

With the updated HybridEP, the previous runtime JIT failure no longer reproduces even without the Megatron-Bridge-side fix. This suggests the root cause has been fixed upstream in HybridEP / DeepEP rather than in Megatron-Bridge.

updated HybridEP version: 34152ae

@neiblegy neiblegy closed this Apr 13, 2026
@yaoyu-33 yaoyu-33 reopened this Apr 15, 2026
@yaoyu-33 yaoyu-33 added needs-review PR is ready for code review and waiting on a reviewer x-shopee labels Apr 15, 2026

Labels

community-request needs-review PR is ready for code review and waiting on a reviewer x-shopee

Projects

None yet

3 participants