Fix: convert chunk_overlap from pixel to latent frames in autoregressive generation #137

Merged
aryamancodes merged 1 commit into nvidia-cosmos:main from rubenohana:fix/autoregressive-chunk-overlap-latent-conversion
Mar 19, 2026
@rubenohana
Contributor

Summary

generate_autoregressive_from_batch in video2world.py passes chunk_overlap directly as num_latent_conditional_frames to the model. However, chunk_overlap is in pixel frames while num_latent_conditional_frames expects latent frames. With the tokenizer's 4:1 causal temporal compression, this causes a mismatch between what the model conditions on and what pixel-space slicing assumes, producing artifacts at chunk boundaries during autoregressive video generation.

For example, with chunk_overlap=5:

  • Before (bug): model conditions on 5 latent frames = 17 pixel frames, but only 5 pixel frames are skipped in concatenation
  • After (fix): model conditions on get_latent_num_frames(5) = 2 latent frames = 5 pixel frames, matching the 5 pixel frames skipped

The fix converts chunk_overlap to latent frames only when passing to the model, and uses the original pixel value for all pixel-space operations. This matches the approach already used in cosmos-transfer2.5 for the equivalent chunked generation logic.
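
The arithmetic behind the chunk_overlap=5 example can be checked with a small sketch. It assumes the 4:1 causal compression maps the first pixel frame to its own latent frame and every further group of 4 pixel frames to one more latent frame. The helper bodies below are written for illustration under that assumption and are not copied from the tokenizer; get_pixel_num_frames in particular is a hypothetical inverse helper.

```python
# Assumed 4:1 causal temporal compression: the first pixel frame gets its own
# latent frame, and each further group of 4 pixel frames adds one latent frame.
TEMPORAL_COMPRESSION = 4


def get_latent_num_frames(num_pixel_frames: int) -> int:
    """Pixel frames -> latent frames (assumed formula, not copied from the repo)."""
    return 1 + (num_pixel_frames - 1) // TEMPORAL_COMPRESSION


def get_pixel_num_frames(num_latent_frames: int) -> int:
    """Latent frames -> pixel frames they cover (hypothetical inverse helper)."""
    return 1 + (num_latent_frames - 1) * TEMPORAL_COMPRESSION


chunk_overlap = 5  # pixel frames

# Before the fix: the pixel count is read as if it were a latent count.
frames_conditioned_before = get_pixel_num_frames(chunk_overlap)

# After the fix: convert to latent frames first, then see what that covers.
latent_overlap = get_latent_num_frames(chunk_overlap)
frames_conditioned_after = get_pixel_num_frames(latent_overlap)

print(frames_conditioned_before, latent_overlap, frames_conditioned_after)
```

Under these assumptions the three values come out as 17, 2 and 5, matching the before/after numbers quoted above.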

Why it went unnoticed

The default is chunk_overlap=1, and get_latent_num_frames(1) = 1, so the conversion is the identity for this value. The bug only manifests when chunk_overlap >= 2.

Changes

  • Convert chunk_overlap (pixel) to latent frames via tokenizer.get_latent_num_frames() before passing to the model
  • Keep all pixel-space operations (effective_chunk_size, chunk concatenation, input buffer update) using the original chunk_overlap value
  • Update docstring to clarify that chunk_overlap and chunk_size are in pixel frames
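
The split described in the list above (latent frames for the model call, pixel frames everywhere else) can be sketched with a toy loop. This is not the code in video2world.py: generate_autoregressive, toy_model, PAD and the integer "frames" are hypothetical stand-ins, and the conversion formula is an assumption about the 4:1 causal tokenizer.

```python
from typing import Callable, List

TEMPORAL_COMPRESSION = 4  # assumed 4:1 causal temporal compression
PAD = -1  # placeholder for frames the model is expected to overwrite


def get_latent_num_frames(num_pixel_frames: int) -> int:
    # Assumed formula; the real tokenizer method may differ in detail.
    return 1 + (num_pixel_frames - 1) // TEMPORAL_COMPRESSION


def generate_autoregressive(
    generate_chunk: Callable[[List[int], int], List[int]],
    initial_frames: List[int],
    num_output_frames: int,
    chunk_size: int,
    chunk_overlap: int,
) -> List[int]:
    """Toy chunked generation; chunk_size and chunk_overlap are in pixel frames."""
    # Convert once, and only for the model-facing argument.
    num_latent_conditional_frames = get_latent_num_frames(chunk_overlap)

    buffer = list(initial_frames) + [PAD] * (chunk_size - len(initial_frames))
    output: List[int] = []
    while len(output) < num_output_frames:
        chunk = generate_chunk(buffer, num_latent_conditional_frames)
        # Pixel-space concatenation: later chunks drop their overlapping frames.
        output.extend(chunk if not output else chunk[chunk_overlap:])
        # Pixel-space buffer update: the chunk's tail seeds the next chunk.
        buffer = chunk[-chunk_overlap:] + [PAD] * (chunk_size - chunk_overlap)
    return output[:num_output_frames]


def toy_model(buffer: List[int], num_latent_conditional_frames: int) -> List[int]:
    """Stand-in model: keeps the conditioned pixel frames, then counts upward."""
    k = 1 + (num_latent_conditional_frames - 1) * TEMPORAL_COMPRESSION
    new_frames = [buffer[k - 1] + 1 + i for i in range(len(buffer) - k)]
    return buffer[:k] + new_frames


video = generate_autoregressive(
    toy_model, initial_frames=[0, 1, 2, 3, 4],
    num_output_frames=13, chunk_size=9, chunk_overlap=5,
)
print(video)
```

With chunk_overlap=5 the toy model conditions on 2 latent frames (5 pixel frames) and the loop skips the same 5 pixel frames when concatenating, so the stitched sequence has no repeated or missing frame indices.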

Test plan

  • Verify chunk_overlap=1 (default) produces identical output to the unfixed code (no behavior change for the common case)
  • Verify chunk_overlap=5 produces smooth transitions at chunk boundaries (no artifacts)
  • Verify output frame count is correct for various num_output_frames / chunk_overlap combinations
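
For the frame-count check, one possible way to derive expected values is sketched below. It assumes the first chunk contributes chunk_size pixel frames, every later chunk contributes chunk_size - chunk_overlap new frames, and the result is trimmed to num_output_frames. The helper names are hypothetical, and the real function may define num_output_frames differently.

```python
import math


def num_chunks_needed(num_output_frames: int, chunk_size: int, chunk_overlap: int) -> int:
    """Model calls needed to cover num_output_frames (all arguments in pixel frames)."""
    if num_output_frames <= chunk_size:
        return 1
    effective_chunk_size = chunk_size - chunk_overlap  # new frames per later chunk
    remaining = num_output_frames - chunk_size
    return 1 + math.ceil(remaining / effective_chunk_size)


def frames_before_trim(num_chunks: int, chunk_size: int, chunk_overlap: int) -> int:
    """Pixel frames available after concatenation, before trimming to the request."""
    return chunk_size + (num_chunks - 1) * (chunk_size - chunk_overlap)


# (num_output_frames, chunk_size, chunk_overlap) combinations to sanity-check.
for n_out, size, overlap in [(9, 9, 5), (13, 9, 5), (20, 9, 1)]:
    n_chunks = num_chunks_needed(n_out, size, overlap)
    available = frames_before_trim(n_chunks, size, overlap)
    # Enough frames must be generated for the final trim to the requested length.
    assert available >= n_out
    print(n_out, size, overlap, n_chunks, available)
```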

@aryamancodes
Contributor

Looks like you're failing lint. Can you run just lint?

@aryamancodes
Contributor

Thanks for the contribution and nice catch! MR LGTM!

…ive generation

chunk_overlap (pixel frames) was passed directly as num_latent_conditional_frames
to the model without conversion through tokenizer.get_latent_num_frames(). With
4:1 causal temporal compression, this caused a mismatch between model conditioning
and pixel-space slicing, producing artifacts at chunk boundaries.

The fix converts chunk_overlap to latent frames only for the model, keeping all
pixel-space operations using the original value. This matches the approach already
used in cosmos-transfer2.5.

Also adds input validation, a warning for non-latent-aligned overlap values,
and unit tests.
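
The validation and warning code is not shown in this conversation. As a rough sketch of what a check for non-latent-aligned overlap values could look like, assuming the same 4:1 causal formula: an overlap is treated as aligned when converting it to latent frames and back gives the same pixel count (1, 5, 9, ...). The function name, the bounds checked, and the messages are hypothetical.

```python
import warnings

TEMPORAL_COMPRESSION = 4  # assumed 4:1 causal temporal compression


def check_chunk_overlap(chunk_overlap: int, chunk_size: int) -> int:
    """Validate a pixel-frame overlap and return it converted to latent frames.

    Hypothetical helper; the checks actually added in the PR may differ.
    """
    if chunk_overlap < 1 or chunk_overlap >= chunk_size:
        raise ValueError(
            f"chunk_overlap must be in [1, chunk_size - 1], got {chunk_overlap}"
        )
    latent_overlap = 1 + (chunk_overlap - 1) // TEMPORAL_COMPRESSION
    covered_pixels = 1 + (latent_overlap - 1) * TEMPORAL_COMPRESSION
    if covered_pixels != chunk_overlap:
        # The model would condition on fewer pixel frames than are skipped.
        warnings.warn(
            f"chunk_overlap={chunk_overlap} is not latent-aligned: "
            f"{latent_overlap} latent frame(s) cover only {covered_pixels} pixel frame(s)"
        )
    return latent_overlap


aligned = check_chunk_overlap(5, chunk_size=9)  # aligned value, no warning expected
print(aligned)
```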
@rubenohana rubenohana force-pushed the fix/autoregressive-chunk-overlap-latent-conversion branch from 0a57a6d to 761fc10 Compare March 19, 2026 21:23
@rubenohana
Contributor Author

No problems! I just ran ruff format to fix the lint issue. I think it should be good now!

@aryamancodes aryamancodes self-requested a review March 19, 2026 22:13
@aryamancodes aryamancodes merged commit 315e424 into nvidia-cosmos:main Mar 19, 2026
1 check passed
@rubenohana rubenohana deleted the fix/autoregressive-chunk-overlap-latent-conversion branch March 19, 2026 22:16

2 participants