
cp: fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (#3331) into r0.4.0 #3332

Draft
svcnvidia-nemo-ci wants to merge 1 commit into r0.4.0 from cherry-pick-3331-r0.4.0

Conversation

@svcnvidia-nemo-ci
Contributor

@svcnvidia-nemo-ci svcnvidia-nemo-ci commented Apr 15, 2026

beep boop [🤖]: Hi @yaoyu-33 👋,

we've cherry-picked #3331 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Corrected batch processing to ensure data integrity in sequence preparation and masking operations during data collation.

fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (#3331)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test 9238344

@copy-pr-bot

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

This single-file change temporarily overrides the tokenizer's padding_side setting to "right" while the collate function applies the chat template, then restores the original setting afterward, preventing left-padded sequences from corrupting downstream sequence packing and masking logic.

Changes

Cohort / File(s) Summary
Padding Side Fix
src/megatron/bridge/data/vlm_datasets/collate.py
Modified default_collate_fn to temporarily set tokenizer's padding_side to "right" before processor.apply_chat_template() call, then restore the original padding_side afterward to prevent left-padding corruption in downstream sequence packing/masking operations.
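The save/set/restore flow summarized above can be sketched in isolation. Note that `DummyTokenizer` and `collate` below are illustrative stand-ins for a Hugging Face tokenizer and the real `default_collate_fn`, not code from the repository:

```python
class DummyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer."""
    def __init__(self):
        self.padding_side = "left"  # e.g. configured for generation


def collate(tokenizer, tokenize_fn):
    """Force right-padding for the duration of tokenize_fn, then restore."""
    saved_padding_side = getattr(tokenizer, "padding_side", None)
    if tokenizer is not None:
        tokenizer.padding_side = "right"
    try:
        return tokenize_fn()
    finally:
        # Restore so generation paths that rely on left-padding are unaffected,
        # even if tokenize_fn raised.
        if tokenizer is not None and saved_padding_side is not None:
            tokenizer.padding_side = saved_padding_side


tok = DummyTokenizer()
result = collate(tok, lambda: tok.padding_side)
print(result, tok.padding_side)  # right-padded inside the call, restored after
```

The `finally` clause is the point of contention in the review below: without it, an exception inside tokenization would leave the shared tokenizer pinned to right-padding.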

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR lacks test-results documentation. Although this is a 12-line bug fix cherry-picked from PR #3331, the description does not include test results, regression validation, or performance metrics for a padding change that affects the data pipeline and model numerics. Resolution: update the PR description to document test results and regression validation, or reference the original PR #3331's testing, to confirm the padding fix resolves token loss without introducing convergence regressions.
  • Title check — ❓ Inconclusive. Explanation: the title references a cherry-pick operation and includes the original PR number and branch, making it unclear as a standalone description of the change. Resolution: consider simplifying the title to focus on the core change, e.g. 'fix: force right-padding in VLM collate to prevent token loss', without the cherry-pick notation and PR metadata.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).


Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/data/vlm_datasets/collate.py (1)

543-558: ⚠️ Potential issue | 🟠 Major

Restore padding_side in a finally block.

If processor.apply_chat_template(...) fails on Lines 547-554, Lines 556-558 never run and the shared tokenizer stays pinned to "right" for later calls in the same worker. That leaks state outside the collate path.

Suggested fix
     saved_padding_side = getattr(tokenizer, "padding_side", None)
     if tokenizer is not None:
         tokenizer.padding_side = "right"
-
-    batch = processor.apply_chat_template(
-        [example["conversation"] for example in examples],
-        tokenize=True,
-        padding=can_pad,
-        truncation=True,
-        return_tensors="pt",
-        return_dict=True,
-    )
-
-    # Restore original padding side so generation paths are unaffected.
-    if tokenizer is not None and saved_padding_side is not None:
-        tokenizer.padding_side = saved_padding_side
+    try:
+        batch = processor.apply_chat_template(
+            [example["conversation"] for example in examples],
+            tokenize=True,
+            padding=can_pad,
+            truncation=True,
+            return_tensors="pt",
+            return_dict=True,
+        )
+    finally:
+        # Restore original padding side so generation paths are unaffected.
+        if tokenizer is not None and saved_padding_side is not None:
+            tokenizer.padding_side = saved_padding_side
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 543 - 558,
Save the original padding_side from tokenizer (saved_padding_side =
getattr(tokenizer, "padding_side", None)), set tokenizer.padding_side = "right"
only if tokenizer is not None, then call processor.apply_chat_template([...],
tokenize=True, ...); ensure the restoration of tokenizer.padding_side uses a
finally block so that if processor.apply_chat_template raises, the original
saved_padding_side is restored; reference the tokenizer variable,
saved_padding_side, processor.apply_chat_template, and padding_side in your
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 806f3518-de5a-4218-a72e-075df0ec7b5f

📥 Commits

Reviewing files that changed from the base of the PR and between f4d10a3 and 9238344.

📒 Files selected for processing (1)
  • src/megatron/bridge/data/vlm_datasets/collate.py

@ko3n1g
Contributor

ko3n1g commented Apr 15, 2026

patch

@ko3n1g ko3n1g marked this pull request as draft April 15, 2026 10:49