[model] fix: Qwen3.5-VL MTP standard attn specs patch #3330
HollowMan6 wants to merge 2 commits into NVIDIA-NeMo:main
Conversation
Pull request overview
Fixes Qwen3.5-VL MTP (multi-token prediction) initialization by ensuring the MTP block spec's self-attention specs are patched to use Qwen3VLSelfAttention, preventing incorrect RoPE/mRoPE rotary embedding behavior.
Changes:
- Patch the mtp_block_spec(...) output with _patch_standard_attention_specs(...) before constructing the Qwen3.5-VL dense and MoE models.
- Extend _patch_standard_attention_specs to recurse into nested MTP layer specs.
- Add unit tests validating recursion into MTP specs and that provide() passes patched specs to the model constructor.
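The recursive patching described above can be sketched as follows. This is a minimal illustration, not the actual Megatron Bridge code: ModuleSpec, SelfAttention, and Qwen3VLSelfAttention here are simplified stand-ins for the real Megatron classes.

```python
from dataclasses import dataclass, field

# Simplified stand-ins for the real Megatron classes (illustrative only).
class SelfAttention: ...
class Qwen3VLSelfAttention: ...

@dataclass
class ModuleSpec:
    module: type
    submodules: dict = field(default_factory=dict)

def patch_standard_attention_specs(spec: ModuleSpec) -> ModuleSpec:
    # Swap the standard attention class for the Qwen3-VL variant, then
    # descend into nested specs (e.g. an MTP layer spec) so no branch of
    # the spec tree keeps the standard RoPE path.
    if spec.module is SelfAttention:
        spec.module = Qwen3VLSelfAttention
    for sub in spec.submodules.values():
        if isinstance(sub, ModuleSpec):
            patch_standard_attention_specs(sub)
    return spec

# A decoder spec whose nested MTP sub-spec still points at standard attention:
mtp = ModuleSpec(SelfAttention)
decoder = ModuleSpec(Qwen3VLSelfAttention, {"mtp_layer": mtp})
patch_standard_attention_specs(decoder)
print(mtp.module is Qwen3VLSelfAttention)  # True
```

The key point is the recursion: before this fix, only the top-level spec was rewritten, so the nested MTP spec silently kept the standard attention class.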
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/megatron/bridge/models/qwen_vl/qwen35_vl_provider.py | Patches MTP block spec attention modules to Qwen3VLSelfAttention and adds recursive patching support. |
| tests/unit_tests/models/qwen_vl/test_qwen35_vl_provider.py | Adds unit tests covering recursive patching into MTP specs and provide() behavior. |
The main decoder was patched to use `Qwen3VLSelfAttention`, but the separately generated MTP spec was still using the standard Megatron SelfAttention, so MTP took the wrong RoPE path and threw an error in rotary embedding. Fixed by recursively patching standard attention specs inside MTP specs as well, and by applying that patch before constructing both the dense and MoE Qwen3.5-VL models. Signed-off-by: Hollow Man <hollowman@opensuse.org>
Force-pushed 0709738 to fd9dad0
…n inputs Signed-off-by: Hollow Man <hollowman@opensuse.org>
What does this PR do?
The main decoder was patched to use Qwen3VLSelfAttention, but the separately generated MTP spec was still using the standard Megatron SelfAttention, so MTP took the wrong RoPE path and threw an error in rotary embedding.
Changelog
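The ordering aspect of the fix can be sketched as below: the patch must run before the spec is handed to the model constructor, so both the dense and MoE variants are built from already-patched specs. All names here are illustrative stand-ins, not the real Megatron Bridge API.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins only; not the real Megatron Bridge classes.
class SelfAttention: ...
class Qwen3VLSelfAttention: ...

@dataclass
class ModuleSpec:
    module: type
    submodules: dict = field(default_factory=dict)

def _patch_standard_attention_specs(spec):
    # Recursively replace standard attention with the Qwen3-VL variant.
    if spec.module is SelfAttention:
        spec.module = Qwen3VLSelfAttention
    for sub in spec.submodules.values():
        if isinstance(sub, ModuleSpec):
            _patch_standard_attention_specs(sub)
    return spec

def provide(mtp_block_spec, build_model):
    # The fix: patch *before* handing the spec to the constructor, so the
    # built model takes the mRoPE-aware rotary embedding path in MTP too.
    return build_model(_patch_standard_attention_specs(mtp_block_spec))

# Capture what the (stubbed) constructor receives:
captured = {}
provide(ModuleSpec(SelfAttention), lambda spec: captured.setdefault("spec", spec))
print(captured["spec"].module.__name__)  # Qwen3VLSelfAttention
```

This mirrors what the unit tests check: provide() must pass the patched spec, not the original, to the model constructor.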
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information