
cp: fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (#3331) into r0.4.0 #3332

Draft
svcnvidia-nemo-ci wants to merge 1 commit into r0.4.0 from cherry-pick-3331-r0.4.0

Conversation

@svcnvidia-nemo-ci
Contributor

@svcnvidia-nemo-ci svcnvidia-nemo-ci commented Apr 15, 2026

beep boop [🤖]: Hi @yaoyu-33 👋,

we've cherry-picked #3331 into r0.4.0 for you! 🚀

Please review and approve this cherry-pick at your convenience!

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Corrected batch processing to ensure data integrity in sequence preparation and masking operations during data collation.

fix(gemma3-vl): force right-padding in VLM collate to prevent token loss (#3331)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
@svcnvidia-nemo-ci
Contributor Author

/ok to test 9238344

@copy-pr-bot

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 15, 2026

📝 Walkthrough

Walkthrough

This single-file change temporarily overrides the tokenizer's padding_side setting to "right" while the collate function applies the chat template, then restores the original setting afterward, preventing left-padded sequences from corrupting downstream sequence packing and masking logic.

Changes

Cohort / File(s) Summary
Padding Side Fix
src/megatron/bridge/data/vlm_datasets/collate.py
Modified default_collate_fn to temporarily set tokenizer's padding_side to "right" before processor.apply_chat_template() call, then restore the original padding_side afterward to prevent left-padding corruption in downstream sequence packing/masking operations.
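The save/set/restore flow summarized above can be sketched in isolation. Note that `DummyTokenizer` and `collate` below are illustrative stand-ins for a Hugging Face tokenizer and the real `default_collate_fn`, not code from the repository:

```python
class DummyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer."""
    def __init__(self):
        self.padding_side = "left"  # e.g. configured for generation


def collate(tokenizer, tokenize_fn):
    """Force right-padding for the duration of tokenize_fn, then restore."""
    saved_padding_side = getattr(tokenizer, "padding_side", None)
    if tokenizer is not None:
        tokenizer.padding_side = "right"
    try:
        return tokenize_fn()
    finally:
        # Restore so generation paths that rely on left-padding are unaffected,
        # even if tokenize_fn raised.
        if tokenizer is not None and saved_padding_side is not None:
            tokenizer.padding_side = saved_padding_side


tok = DummyTokenizer()
result = collate(tok, lambda: tok.padding_side)
print(result, tok.padding_side)  # right-padded inside the call, restored after
```

The `finally` clause is the point of contention in the review below: without it, an exception inside tokenization would leave the shared tokenizer pinned to right-padding.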

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Test Results For Major Changes — ⚠️ Warning. Explanation: the PR lacks test-results documentation. Although this is a 12-line bug fix cherry-picked from PR #3331, the description does not include test results, regression validation, or performance metrics for a padding change that affects the data pipeline and model numerics. Resolution: update the PR description to document test results and regression validation, or reference the original PR #3331's testing, to confirm the padding fix resolves token loss without introducing convergence regressions.
  • Title check — ❓ Inconclusive. Explanation: the title references a cherry-pick operation and includes the original PR number and branch, making it unclear as a standalone description of the change. Resolution: consider simplifying the title to focus on the core change, e.g. 'fix: force right-padding in VLM collate to prevent token loss', without the cherry-pick notation and PR metadata.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.
  • Docstring Coverage — ✅ Passed. Docstring coverage is 100.00%, which is sufficient (required threshold: 80.00%).


Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/data/vlm_datasets/collate.py (1)

543-558: ⚠️ Potential issue | 🟠 Major

Restore padding_side in a finally block.

If processor.apply_chat_template(...) fails on Lines 547-554, Lines 556-558 never run and the shared tokenizer stays pinned to "right" for later calls in the same worker. That leaks state outside the collate path.

Suggested fix
     saved_padding_side = getattr(tokenizer, "padding_side", None)
     if tokenizer is not None:
         tokenizer.padding_side = "right"
-
-    batch = processor.apply_chat_template(
-        [example["conversation"] for example in examples],
-        tokenize=True,
-        padding=can_pad,
-        truncation=True,
-        return_tensors="pt",
-        return_dict=True,
-    )
-
-    # Restore original padding side so generation paths are unaffected.
-    if tokenizer is not None and saved_padding_side is not None:
-        tokenizer.padding_side = saved_padding_side
+    try:
+        batch = processor.apply_chat_template(
+            [example["conversation"] for example in examples],
+            tokenize=True,
+            padding=can_pad,
+            truncation=True,
+            return_tensors="pt",
+            return_dict=True,
+        )
+    finally:
+        # Restore original padding side so generation paths are unaffected.
+        if tokenizer is not None and saved_padding_side is not None:
+            tokenizer.padding_side = saved_padding_side
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/data/vlm_datasets/collate.py` around lines 543 - 558,
Save the original padding_side from tokenizer (saved_padding_side =
getattr(tokenizer, "padding_side", None)), set tokenizer.padding_side = "right"
only if tokenizer is not None, then call processor.apply_chat_template([...],
tokenize=True, ...); ensure the restoration of tokenizer.padding_side uses a
finally block so that if processor.apply_chat_template raises, the original
saved_padding_side is restored; reference the tokenizer variable,
saved_padding_side, processor.apply_chat_template, and padding_side in your
change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 806f3518-de5a-4218-a72e-075df0ec7b5f

📥 Commits

Reviewing files that changed from the base of the PR and between f4d10a3 and 9238344.

📒 Files selected for processing (1)
  • src/megatron/bridge/data/vlm_datasets/collate.py

@ko3n1g
Contributor

ko3n1g commented Apr 15, 2026

patch

@ko3n1g ko3n1g marked this pull request as draft April 15, 2026 10:49