Fix: convert chunk_overlap from pixel to latent frames in autoregressive generation #137
Merged
aryamancodes merged 1 commit into nvidia-cosmos:main on Mar 19, 2026
Contributor: Looks like you're failing lint. Can you run …
Contributor: Thanks for the contribution and nice catch! MR LGTM!
…ive generation
chunk_overlap (pixel frames) was passed directly as num_latent_conditional_frames to the model without conversion through tokenizer.get_latent_num_frames(). With 4:1 causal temporal compression, this caused a mismatch between model conditioning and pixel-space slicing, producing artifacts at chunk boundaries. The fix converts chunk_overlap to latent frames only for the model, keeping all pixel-space operations using the original value. This matches the approach already used in cosmos-transfer2.5. Also adds input validation, a warning for non-latent-aligned overlap values, and unit tests.
0a57a6d to 761fc10
Contributor (Author): No problems! I just ran …
aryamancodes approved these changes on Mar 19, 2026
Summary
generate_autoregressive_from_batch in video2world.py passes chunk_overlap directly as num_latent_conditional_frames to the model. However, chunk_overlap is in pixel frames, while num_latent_conditional_frames expects latent frames. With the tokenizer's 4:1 causal temporal compression, this causes a mismatch between what the model conditions on and what pixel-space slicing assumes, producing artifacts at chunk boundaries during autoregressive video generation.
For example, with chunk_overlap=5: get_latent_num_frames(5) = 2 latent frames = 5 pixel frames, matching the 5 pixel frames skipped.
The fix converts chunk_overlap to latent frames only when passing it to the model, and uses the original pixel value for all pixel-space operations. This matches the approach already used in cosmos-transfer2.5 for the equivalent chunked generation logic.
Why it went unnoticed
The default is chunk_overlap=1, and get_latent_num_frames(1) = 1, so the conversion is the identity for this value. The bug only manifests when chunk_overlap >= 2.
Changes
- Convert chunk_overlap (pixel) to latent frames via tokenizer.get_latent_num_frames() before passing to the model
- Keep pixel-space operations (effective_chunk_size, chunk concatenation, input buffer update) using the original chunk_overlap value
- Clarify that chunk_overlap and chunk_size are in pixel frames
Test plan
- chunk_overlap=1 (default) produces identical output to the unfixed code (no behavior change for the common case)
- chunk_overlap=5 produces smooth transitions at chunk boundaries (no artifacts)
- Test num_output_frames/chunk_overlap combinations
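The frame arithmetic from the Summary can be sketched in a few lines. This is only an illustration, not the code from this PR: latent_num_frames is a hypothetical stand-in for tokenizer.get_latent_num_frames(), and the formula (n - 1) // 4 + 1 is an assumption chosen because it reproduces the two values quoted above (1 -> 1 and 5 -> 2) for 4:1 causal compression. pixel_num_frames and conditioning_latent_frames are likewise made-up helper names, and the alignment check is one plausible reading of "non-latent-aligned".

```python
import warnings

# Assumption: 4:1 causal temporal compression, as stated in the PR description.
TEMPORAL_COMPRESSION = 4


def latent_num_frames(pixel_frames: int) -> int:
    """Hypothetical stand-in for tokenizer.get_latent_num_frames().

    Assumed causal mapping: the first pixel frame gets its own latent frame,
    and every further group of 4 pixel frames adds one more.
    """
    if pixel_frames < 1:
        raise ValueError("pixel_frames must be >= 1")
    return (pixel_frames - 1) // TEMPORAL_COMPRESSION + 1


def pixel_num_frames(latent_frames: int) -> int:
    """Pixel frames covered by `latent_frames` under the same assumed mapping."""
    if latent_frames < 1:
        raise ValueError("latent_frames must be >= 1")
    return (latent_frames - 1) * TEMPORAL_COMPRESSION + 1


def conditioning_latent_frames(chunk_overlap: int) -> int:
    """Convert a pixel-frame overlap to the latent count passed to the model.

    Warns when the overlap does not map back to the same number of pixel
    frames, i.e. when it is not latent-aligned (1, 5, 9, ... here).
    """
    latent = latent_num_frames(chunk_overlap)
    if pixel_num_frames(latent) != chunk_overlap:
        warnings.warn(
            f"chunk_overlap={chunk_overlap} is not latent-aligned; the model "
            f"will condition on {pixel_num_frames(latent)} pixel frame(s)."
        )
    return latent


if __name__ == "__main__":
    for overlap in (1, 5):
        fixed = conditioning_latent_frames(overlap)
        # Unfixed behaviour: the pixel value was used as a latent count.
        unfixed_pixels = pixel_num_frames(overlap)
        print(
            f"overlap={overlap}: fixed -> {fixed} latent "
            f"({pixel_num_frames(fixed)} pixel); "
            f"unfixed -> {overlap} latent ({unfixed_pixels} pixel)"
        )
```

Under these assumptions, chunk_overlap=1 is unchanged by the conversion, while passing chunk_overlap=5 unconverted would correspond to 17 pixel frames of conditioning against only 5 pixel frames skipped, which is the kind of mismatch the Summary describes.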