[model, training] fix: align Qwen3-VL padding for HybridEP #3294
neiblegy wants to merge 1 commit into NVIDIA-NeMo:main from
Conversation
Signed-off-by: protoss.gao <protoss.gao@shopee.com>
📝 Walkthrough
Added a helper function to compute padding-related values incorporating hybrid EP and sequence-parallel settings.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)
210-222: ⚠️ Potential issue | 🟠 Major

Don't bypass the new HybridEP alignment when PP forces `seq_length`.

Line 221 overwrites the HybridEP-aware `target_len` with `seq_length`. If that configured length is not divisible by `divisible_by`, PP+HybridEP can still hit the same per-rank token-alignment failure this change is meant to avoid.

Proposed fix

```diff
     target_len = math.ceil(cur_len / divisible_by) * divisible_by
     if force_to_pad_to_seq_len:
+        if seq_length % divisible_by != 0:
+            raise ValueError(
+                f"seq_length={seq_length} must be divisible by {divisible_by} "
+                "for the current TP/CP/SP/HybridEP padding requirements."
+            )
         target_len = seq_length
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 210 - 222, The code overwrites the HybridEP-aware target_len with seq_length when force_to_pad_to_seq_len is true, which can break the divisible_by alignment computed by _get_qwen3_vl_padding_multiple; instead, when force_to_pad_to_seq_len is set, compute target_len by rounding seq_length up to the nearest multiple of divisible_by (e.g., target_len = ceil(seq_length / divisible_by) * divisible_by) so the HybridEP/PP alignment remains respected; update the logic around target_len (the block that sets target_len and checks force_to_pad_to_seq_len) to use divisible_by rather than directly assigning seq_length.
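The rounding the prompt describes can be sketched as a small standalone helper (the function name here is illustrative, not the repository's actual code):

```python
import math


def round_up_to_multiple(length: int, divisible_by: int) -> int:
    """Round a sequence length up to the nearest multiple of divisible_by."""
    return math.ceil(length / divisible_by) * divisible_by


# e.g. a configured seq_length of 4100 with a HybridEP-derived multiple of 256
# pads up to 4352 instead of keeping the misaligned 4100.
print(round_up_to_multiple(4100, 256))  # 4352
```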
🧹 Nitpick comments (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py (1)
193-197: Make the padding flags keyword-only.

Adding two more positional `bool` args makes this API easier to misuse, and the current caller already passes them by name. A `*` here would match the repository convention and prevent accidental misordering.

Proposed fix

```diff
 def pack_or_pad_batch_sequences(
     tokens: torch.Tensor,
     labels: torch.Tensor,
     loss_mask: torch.Tensor,
     attention_mask: torch.Tensor,
     position_ids: torch.Tensor,
     this_pg_collection,
+    *,
     use_fp8_padding: bool = False,
     use_hybridep: bool = False,
     sequence_parallel: bool = False,
     force_to_pad_to_seq_len: bool = False,
     seq_length: int = None,
```

As per coding guidelines, "Use `*` separator for functions with multiple same-type parameters".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 193 - 197, The boolean padding/parallel flags should be keyword-only to prevent accidental misordering: update the function/method signature that declares use_fp8_padding, use_hybridep, sequence_parallel, force_to_pad_to_seq_len, seq_length (in qwen3_vl_step.py) by inserting a '*' before those parameters so callers must pass use_fp8_padding, use_hybridep, sequence_parallel, force_to_pad_to_seq_len, and seq_length by name; ensure no other callers rely on positional ordering and run tests/formatting after the change.
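For readers unfamiliar with the `*` separator, a minimal sketch of the keyword-only pattern (a placeholder function, not the real signature):

```python
def pad_batch(tokens, *, use_fp8_padding=False, use_hybridep=False):
    """Everything after `*` must be passed by keyword."""
    return use_fp8_padding, use_hybridep


print(pad_batch([1, 2], use_hybridep=True))  # keyword call works: (False, True)
try:
    pad_batch([1, 2], True)  # positional flag is rejected
except TypeError as exc:
    print("TypeError:", exc)
```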
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e0112263-bf0e-4e9e-b6a8-b4ceb8601b52
📒 Files selected for processing (1)
src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py
```python
def _get_qwen3_vl_padding_multiple(
    *,
    batch_size: int,
    tp_size: int,
    cp_size: int,
    use_fp8_padding: bool,
    use_hybridep: bool,
    sequence_parallel: bool,
) -> tuple[int, int]:
    """Return the sequence padding multiple and local token divisor.

    HybridEP requires the per-rank token count after CP/SP sharding to be
    divisible by 128, otherwise the runtime JIT kernels can fail to compile.
    """
    base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base_divisible_by = math.lcm(base_divisible_by, 16)

    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size

    divisible_by = base_divisible_by
    if use_hybridep:
        required_seq_multiple = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

    return divisible_by, local_token_divisor
```
Return only the value that callers actually use.
local_token_divisor never escapes any later computation, so the second tuple element just forces the caller to bind an unused name and triggers Ruff at Line 210.
Proposed fix

```diff
 def _get_qwen3_vl_padding_multiple(
 @@
-) -> tuple[int, int]:
-    """Return the sequence padding multiple and local token divisor.
+) -> int:
+    """Return the sequence padding multiple.
 @@
-    return divisible_by, local_token_divisor
+    return divisible_by
```

```diff
-    divisible_by, local_token_divisor = _get_qwen3_vl_padding_multiple(
+    divisible_by = _get_qwen3_vl_padding_multiple(
         batch_size=batch_size,
         tp_size=tp_size,
         cp_size=cp_size,
         use_fp8_padding=use_fp8_padding,
         use_hybridep=use_hybridep,
         sequence_parallel=sequence_parallel,
     )
```

As per coding guidelines, "Use ruff for linting and formatting Python code. Run `uv run ruff check --fix .` and `uv run ruff format .` to fix most issues. CI does not auto-fix linting and formatting issues."
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _get_qwen3_vl_padding_multiple(
    *,
    batch_size: int,
    tp_size: int,
    cp_size: int,
    use_fp8_padding: bool,
    use_hybridep: bool,
    sequence_parallel: bool,
) -> int:
    """Return the sequence padding multiple.

    HybridEP requires the per-rank token count after CP/SP sharding to be
    divisible by 128, otherwise the runtime JIT kernels can fail to compile.
    """
    base_divisible_by = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base_divisible_by = math.lcm(base_divisible_by, 16)

    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size

    divisible_by = base_divisible_by
    if use_hybridep:
        required_seq_multiple = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base_divisible_by, required_seq_multiple)

    return divisible_by
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/qwen_vl/qwen3_vl_step.py` around lines 41 - 70,
The helper _get_qwen3_vl_padding_multiple currently returns a tuple
(divisible_by, local_token_divisor) but callers only use divisible_by; change
the function to return just divisible_by (remove local_token_divisor from the
return and from the public contract), keep the internal computation for
local_token_divisor if needed for computing divisible_by, and update all call
sites that unpack two values to accept a single return (e.g., replace
"divisible_by, _ = _get_qwen3_vl_padding_multiple(...)" or similar with
"divisible_by = _get_qwen3_vl_padding_multiple(...)"). Ensure the function
signature/type hints reflect -> int and run ruff formatting/lint fixes.
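To make the lcm/gcd arithmetic in the helper concrete, here is a standalone reproduction of its logic with one worked configuration (the parallel sizes below are chosen purely for illustration):

```python
import math


def padding_multiple(batch_size, tp_size, cp_size,
                     use_fp8_padding, use_hybridep, sequence_parallel):
    """Reproduce the padding-multiple logic from the PR's helper."""
    base = tp_size * cp_size * 2 if cp_size > 1 else tp_size
    if use_fp8_padding:
        base = math.lcm(base, 16)
    local_token_divisor = cp_size
    if sequence_parallel:
        local_token_divisor *= tp_size
    divisible_by = base
    if use_hybridep:
        required = (128 * local_token_divisor) // math.gcd(
            batch_size, 128 * local_token_divisor
        )
        divisible_by = math.lcm(base, required)
    return divisible_by


# TP=2, CP=2, SP on, HybridEP on, batch_size=4:
#   base = 2 * 2 * 2 = 8; local_token_divisor = 2 * 2 = 4
#   required = (128 * 4) // gcd(4, 512) = 512 // 4 = 128
#   divisible_by = lcm(8, 128) = 128
print(padding_multiple(4, 2, 2, False, True, True))  # 128
```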
With the updated HybridEP, the previous runtime JIT failure no longer reproduces even without the Megatron-Bridge-side fix. This suggests the root cause has been fixed upstream in HybridEP / DeepEP rather than in Megatron-Bridge. Updated HybridEP version: 34152ae


What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information
Summary by CodeRabbit
New Features
Refactor