cp: fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (#3331) into r0.4.0 #3332

svcnvidia-nemo-ci wants to merge 1 commit into r0.4.0

Conversation
…oss (#3331)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
/ok to test 9238344
📝 Walkthrough

A single-file modification that temporarily overrides a tokenizer's padding side during VLM collation.
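To illustrate why the collate path forces right-padding, here is a toy sketch (not the PR's code; `pad` and the sample token values are hypothetical): with left-padding, content positions shift, so any label or loss mask computed against the unpadded token indices ends up pointing at the wrong tokens.

```python
# Toy illustration: pad a token sequence on either side and show that
# left-padding shifts content positions, breaking index-based label masks.
PAD = 0

def pad(tokens, length, side):
    """Pad `tokens` to `length` with PAD on the given side."""
    fill = [PAD] * (length - len(tokens))
    return fill + tokens if side == "left" else tokens + fill

tokens = [101, 7, 8, 9]          # e.g. [BOS, t1, t2, t3]
label_positions = [1, 2, 3]      # indices of t1..t3 before padding

right = pad(tokens, 6, "right")  # [101, 7, 8, 9, 0, 0]
left = pad(tokens, 6, "left")    # [0, 0, 101, 7, 8, 9]

# Right-padding keeps the original indices valid; left-padding shifts them.
assert [right[i] for i in label_positions] == [7, 8, 9]
assert [left[i] for i in label_positions] != [7, 8, 9]
```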
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/data/vlm_datasets/collate.py (1)
Lines 543-558: ⚠️ Potential issue | 🟠 Major

**Restore `padding_side` in a `finally` block.**

If `processor.apply_chat_template(...)` fails on lines 547-554, lines 556-558 never run and the shared tokenizer stays pinned to `"right"` for later calls in the same worker. That leaks state outside the collate path.

Suggested fix:

```diff
     saved_padding_side = getattr(tokenizer, "padding_side", None)
     if tokenizer is not None:
         tokenizer.padding_side = "right"
-
-    batch = processor.apply_chat_template(
-        [example["conversation"] for example in examples],
-        tokenize=True,
-        padding=can_pad,
-        truncation=True,
-        return_tensors="pt",
-        return_dict=True,
-    )
-
-    # Restore original padding side so generation paths are unaffected.
-    if tokenizer is not None and saved_padding_side is not None:
-        tokenizer.padding_side = saved_padding_side
+    try:
+        batch = processor.apply_chat_template(
+            [example["conversation"] for example in examples],
+            tokenize=True,
+            padding=can_pad,
+            truncation=True,
+            return_tensors="pt",
+            return_dict=True,
+        )
+    finally:
+        # Restore original padding side so generation paths are unaffected.
+        if tokenizer is not None and saved_padding_side is not None:
+            tokenizer.padding_side = saved_padding_side
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 543-558: save the original padding_side from tokenizer (saved_padding_side = getattr(tokenizer, "padding_side", None)), set tokenizer.padding_side = "right" only if tokenizer is not None, then call processor.apply_chat_template([...], tokenize=True, ...); ensure the restoration of tokenizer.padding_side uses a finally block so that if processor.apply_chat_template raises, the original saved_padding_side is restored; reference the tokenizer variable, saved_padding_side, processor.apply_chat_template, and padding_side in your change.
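The save/override/restore pattern the review asks for generalizes to any shared attribute. Below is a minimal sketch (not from the PR; `temporary_attr` and `FakeTokenizer` are hypothetical names) using `contextlib.contextmanager`, where the restore sits in a `finally` block and therefore runs even if the body raises:

```python
from contextlib import contextmanager

@contextmanager
def temporary_attr(obj, name, value):
    """Temporarily set obj.<name> to value, restoring the original on exit.

    Restoration happens in a finally block, so it runs even when the
    with-body raises -- the guarantee the review comment asks for.
    """
    sentinel = object()
    saved = getattr(obj, name, sentinel)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        if saved is sentinel:
            delattr(obj, name)       # attribute did not exist before
        else:
            setattr(obj, name, saved)

class FakeTokenizer:
    padding_side = "left"

tok = FakeTokenizer()
with temporary_attr(tok, "padding_side", "right"):
    assert tok.padding_side == "right"   # overridden inside the block
assert tok.padding_side == "left"        # restored afterwards
```

Wrapping the `apply_chat_template` call in such a context manager would keep the collate function free of manual save/restore bookkeeping.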
ℹ️ Review info
⚙️ Run configuration
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 806f3518-de5a-4218-a72e-075df0ec7b5f
📒 Files selected for processing (1)
src/megatron/bridge/data/vlm_datasets/collate.py
beep boop [🤖]: Hi @yaoyu-33 👋,