
[model, training] fix: align Qwen3-VL padding for HybridEP #3294

Open
neiblegy wants to merge 1 commit into NVIDIA-NeMo:main from neiblegy:fix/qwen3-vl-hybridep-padding

Conversation

@neiblegy

@neiblegy neiblegy commented Apr 13, 2026

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • align Qwen3-VL batch padding with HybridEP per-rank token alignment requirements

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Added support for Hybrid Exchange-Padding (HybridEP) backend configuration for advanced padding scenarios
    • Enabled sequence parallel processing capabilities with improved padding computation
  • Refactor

    • Improved padding logic with enhanced configuration parameter support

Signed-off-by: protoss.gao <protoss.gao@shopee.com>
@copy-pr-bot

copy-pr-bot bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

Added a helper function to compute padding-related values incorporating hybrid EP and sequence parallel settings. Extended pack_or_pad_batch_sequences() with new parameters for these features. Modified forward_step() to detect hybrid EP from model config and propagate both settings through the padding function.

Changes

Cohort / File(s) Summary
Qwen3 VL Padding Configuration
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
Added _get_qwen3_vl_padding_multiple() helper function to compute padding multiples based on hybrid EP and sequence parallel flags. Extended pack_or_pad_batch_sequences() signature with use_hybridep and sequence_parallel parameters. Updated forward_step() to detect hybrid EP from model config and pass both settings to padding function.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

  • Test Results For Major Changes (⚠️ Warning): PR modifies padding calculation logic affecting numerics and HybridEP token alignment, but lacks test results, new tests, performance metrics, or convergence validation documentation. Resolution: add test results demonstrating no regressions, including convergence validation for HybridEP scenarios, unit tests for the new padding functions, or performance metrics comparing the old and new logic.

✅ Passed checks (3 passed)

  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title directly describes the main change: fixing and aligning Qwen3-VL padding specifically for HybridEP support.
  • Docstring Coverage (✅ Passed): docstring coverage is 100.00%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)

210-222: ⚠️ Potential issue | 🟠 Major

Don’t bypass the new HybridEP alignment when PP forces seq_length.

Line 221 overwrites the HybridEP-aware target_len with seq_length. If that configured length is not divisible by divisible_by, PP+HybridEP can still hit the same per-rank token-alignment failure this change is meant to avoid.

Proposed fix
     target_len = math.ceil(cur_len / divisible_by) * divisible_by
     if force_to_pad_to_seq_len:
+        if seq_length % divisible_by != 0:
+            raise ValueError(
+                f"seq_length={seq_length} must be divisible by {divisible_by} "
+                "for the current TP/CP/SP/HybridEP padding requirements."
+            )
         target_len = seq_length
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 210 - 222,
The code overwrites the HybridEP-aware target_len with seq_length when
force_to_pad_to_seq_len is true, which can break the divisible_by alignment
computed by _get_qwen3_vl_padding_multiple; instead, when
force_to_pad_to_seq_len is set, compute target_len by rounding seq_length up to
the nearest multiple of divisible_by (e.g., target_len = ceil(seq_length /
divisible_by) * divisible_by) so the HybridEP/PP alignment remains respected;
update the logic around target_len (the block that sets target_len and checks
force_to_pad_to_seq_len) to use divisible_by rather than directly assigning
seq_length.
🧹 Nitpick comments (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)

193-197: Make the padding flags keyword-only.

Adding two more positional bool args makes this API easier to misuse, and the current caller already passes them by name. A * here would match the repository convention and prevent accidental misordering.

Proposed fix
 def pack_or_pad_batch_sequences(
     tokens: torch.Tensor,
     labels: torch.Tensor,
     loss_mask: torch.Tensor,
     attention_mask: torch.Tensor,
     position_ids: torch.Tensor,
     this_pg_collection,
+    *,
     use_fp8_padding: bool = False,
     use_hybridep: bool = False,
     sequence_parallel: bool = False,
     force_to_pad_to_seq_len: bool = False,
     seq_length: int = None,

As per coding guidelines, "Use * separator for functions with multiple same-type parameters".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 193 - 197,
The boolean padding/parallel flags should be keyword-only to prevent accidental
misordering: update the function/method signature that declares use_fp8_padding,
use_hybridep, sequence_parallel, force_to_pad_to_seq_len, seq_length (in
qwen3_vl_step.py) by inserting a '*' before those parameters so callers must
pass use_fp8_padding, use_hybridep, sequence_parallel, force_to_pad_to_seq_len,
and seq_length by name; ensure no other callers rely on positional ordering and
run tests/formatting after the change.
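The keyword-only convention this comment refers to can be shown with a toy signature (names simplified; this is not the real `pack_or_pad_batch_sequences`):

```python
def pack_sequences(tokens, *, use_hybridep=False, sequence_parallel=False):
    """Flags after the '*' separator can only be passed by keyword."""
    return use_hybridep, sequence_parallel

# Keyword call works and is unambiguous:
print(pack_sequences("toks", use_hybridep=True))  # (True, False)

# Positional booleans are rejected, preventing accidental misordering:
try:
    pack_sequences("toks", True)
except TypeError as exc:
    print(f"TypeError: {exc}")
```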

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e0112263-bf0e-4e9e-b6a8-b4ceb8601b52

📥 Commits

Reviewing files that changed from the base of the PR and between 8dfe249 and e6e9be4.

📒 Files selected for processing (1)
  • src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py

Comment on lines +41 to +70
def _get_qwen3_vl_padding_multiple(
    *,
    batch_size: int,
    tp_size: int,
    cp_size: int,
    use_fp8_padding: bool,
    use_hybridep: bool,
    sequence_parallel: bool,
) -> tuple[int, int]:
    """Return the sequence padding multiple and local token divisor.

    HybridEP requires the per-rank token count after CP/SP sharding to be
    divisible by 128, otherwise the runtime JIT kernels can fail to compile.
    """
    base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base_divisible_by = math.lcm(base_divisible_by, 16)

    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size

    divisible_by = base_divisible_by
    if use_hybridep:
        required_seq_multiple = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

    return divisible_by, local_token_divisor
Contributor


⚠️ Potential issue | 🟡 Minor

Return only the value that callers actually use.

local_token_divisor is never used in any later computation, so the second tuple element just forces the caller to bind an unused name and triggers Ruff at line 210.

Proposed fix
 def _get_qwen3_vl_padding_multiple(
@@
-) -> tuple[int, int]:
-    """Return the sequence padding multiple and local token divisor.
+) -> int:
+    """Return the sequence padding multiple.
@@
-    return divisible_by, local_token_divisor
+    return divisible_by

-    divisible_by, local_token_divisor = _get_qwen3_vl_padding_multiple(
+    divisible_by = _get_qwen3_vl_padding_multiple(
         batch_size=batch_size,
         tp_size=tp_size,
         cp_size=cp_size,
         use_fp8_padding=use_fp8_padding,
         use_hybridep=use_hybridep,
         sequence_parallel=sequence_parallel,
     )

As per coding guidelines, "Use ruff for linting and formatting Python code. Run uv run ruff check --fix . and uv run ruff format . to fix most issues. CI does not auto-fix linting and formatting issues."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _get_qwen3_vl_padding_multiple(
    *,
    batch_size: int,
    tp_size: int,
    cp_size: int,
    use_fp8_padding: bool,
    use_hybridep: bool,
    sequence_parallel: bool,
) -> int:
    """Return the sequence padding multiple.

    HybridEP requires the per-rank token count after CP/SP sharding to be
    divisible by 128, otherwise the runtime JIT kernels can fail to compile.
    """
    base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base_divisible_by = math.lcm(base_divisible_by, 16)

    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size

    divisible_by = base_divisible_by
    if use_hybridep:
        required_seq_multiple = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

    return divisible_by
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 41 - 70,
The helper _get_qwen3_vl_padding_multiple currently returns a tuple
(divisible_by, local_token_divisor) but callers only use divisible_by; change
the function to return just divisible_by (remove local_token_divisor from the
return and from the public contract), keep the internal computation for
local_token_divisor if needed for computing divisible_by, and update all call
sites that unpack two values to accept a single return (e.g., replace
"divisible_by, _ = _get_qwen3_vl_padding_multiple(...)" or similar with
"divisible_by = _get_qwen3_vl_padding_multiple(...)"). Ensure the function
signature/type hints reflect -> int and run ruff formatting/lint fixes.

@neiblegy
Author

Problem

When running Qwen3-VL MoE training with HybridEP enabled, training can fail on the first MoE dispatch/combine path with a runtime JIT compilation error from deep_ep.

In our reproduction, the failure happens with a setup like:

  • moe_token_dispatcher_type=flex
  • moe_flex_dispatcher_backend=hybridep
  • tp=1
  • ep=8
  • pp=1
  • micro_batch_size=4

The underlying issue is that qwen3_vl_step.py only pads sequences according to the existing TP/CP/FP8 alignment logic, and only forces padding to model.seq_length when PP > 1. For HybridEP, however, the per-rank token count must additionally satisfy stricter alignment requirements for the runtime JIT kernels.

As a result, some valid batches can produce a per-rank token count that is not divisible by the chunk size expected by HybridEP. In our case, this led to a runtime compile failure with an error like:

MAX_NUM_OF_TOKENS_PER_RANK must be multiple of NUM_OF_TOKENS_PER_CHUNK

So this is not a model-quality issue or a user misconfiguration of MoE routing itself; it is a padding/alignment gap in the Qwen3-VL step path when HybridEP is used.
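To see how the fix closes this gap for the repro config (tp=1, cp=1, no sequence parallelism, no FP8 padding, micro_batch_size=4), here is a standalone sketch of the PR's padding-multiple formula; variable names mirror `_get_qwen3_vl_padding_multiple`, but the script itself is illustrative:

```python
import math

batch_size, tp_size, cp_size = 4, 1, 1
use_hybridep = True

base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size  # 1
local_token_divisor = cp_size  # 1 (no sequence parallelism)

divisible_by = base_divisible_by
if use_hybridep:
    # Per-rank tokens scale as batch_size * seq_len / local_token_divisor and
    # must be a multiple of 128, so seq_len must be a multiple of:
    required_seq_multiple = (128 * local_token_divisor) // math.gcd(
        batch_size, 128 * local_token_divisor
    )  # 128 // gcd(4, 128) = 32
    divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

print(divisible_by)  # 32
```

Padding every sequence to a multiple of 32 then yields 4 × 32k = 128k tokens per rank, a multiple of HybridEP's 128-token chunk size, so the `MAX_NUM_OF_TOKENS_PER_RANK` assertion holds.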

@neiblegy
Author

With the updated HybridEP, the previous runtime JIT failure no longer reproduces even without the Megatron-Bridge-side fix. This suggests the root cause has been fixed upstream in HybridEP / DeepEP rather than in Megatron-Bridge.

updated HybridEP version: 34152ae

@neiblegy neiblegy closed this Apr 13, 2026
@yaoyu-33 yaoyu-33 reopened this Apr 15, 2026
@yaoyu-33 yaoyu-33 added needs-review PR is ready for code review and waiting on a reviewer x-shopee labels Apr 15, 2026

Labels

community-request needs-review PR is ready for code review and waiting on a reviewer x-shopee

Projects

None yet

3 participants