Conversation
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical issue in the sequence parallelism implementation for the HunyuanVideo-1.5 model, specifically concerning how attention masks and encoder hidden states are managed during distributed processing. The changes ensure that padded tokens do not interfere with attention calculations and streamline the sequence parallel operations by leveraging direct PyTorch and distributed group functionality, ultimately fixing issue #655.
Code Review
This pull request aims to fix an issue with sequence parallelism (SP) for HunyuanVideo-1.5 by correctly modifying how encoder_hidden_states are handled, specifically by removing padding and using USP with a joint_strategy when the sequence length is not divisible by the SP world size. However, a critical security concern has been identified: the current batch truncation implementation is flawed as it relies on the mask of the first item in the batch. This can lead to incorrect processing or data loss for other items in the batch with different prompt lengths, particularly in multi-tenant environments where requests are batched together. Furthermore, the implementation introduces a critical regression by removing necessary padding and unpadding logic for hidden_states and image_rotary_emb, which will cause errors during the all_gather operation if the sequence length is not divisible by the SP world size. Minor import and style issues also require attention.
```python
encoder_attention_mask = torch.stack(new_encoder_attention_mask)

# sequence parallel
hidden_states = torch.chunk(hidden_states, sp_world_size, dim=1)[sp_world_rank]
```
The padding for hidden_states has been removed. If hidden_states.shape[1] is not divisible by sp_world_size, torch.chunk will create uneven chunks across sequence parallel ranks. This will cause the all_gather operation at the end of the forward pass to fail. The padding logic from the previous version should be restored to ensure correctness. You'll also need to store the pad_amount to use for unpadding later.
```python
hidden_states_pad_amount = (sp_world_size - hidden_states.shape[1] % sp_world_size) % sp_world_size
if hidden_states_pad_amount > 0:
    pad_shape = list(hidden_states.shape)
    pad_shape[1] = hidden_states_pad_amount
    padding = torch.zeros(pad_shape, dtype=hidden_states.dtype, device=hidden_states.device)
    hidden_states = torch.cat([hidden_states, padding], dim=1)
hidden_states = torch.chunk(hidden_states, sp_world_size, dim=1)[sp_world_rank]
```
hidden_states.shape[1] represents the sequence length of hidden_states, i.e., the number of image tokens. Since HunyuanVideo-1.5 typically generates videos at resolutions like 480p and 720p, this sequence length is usually divisible by common sp_world_size values (e.g., 2, 4, 8, ...).
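The divisibility concern can be checked with a small pure-Python sketch of `torch.chunk`'s splitting rule (chunks of size ceil(n / chunks), so the last rank can receive a shorter chunk). The helper names here are illustrative, not part of xfuser:

```python
# Sketch of why torch.chunk along the sequence dim breaks all_gather when the
# sequence length is not divisible by the SP world size: ranks end up with
# unequal chunk sizes. chunk_sizes / pad_amount are illustrative helpers.
import math

def chunk_sizes(seq_len: int, sp_world_size: int) -> list:
    """Chunk sizes each SP rank would get from torch.chunk along dim=1."""
    size = math.ceil(seq_len / sp_world_size)
    sizes = []
    remaining = seq_len
    for _ in range(sp_world_size):
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

def pad_amount(seq_len: int, sp_world_size: int) -> int:
    """Padding needed so every rank receives an equal chunk."""
    return (sp_world_size - seq_len % sp_world_size) % sp_world_size

print(chunk_sizes(10, 4))                      # uneven: [3, 3, 3, 1]
print(pad_amount(10, 4))                       # pad by 2 -> 12 tokens total
print(chunk_sizes(10 + pad_amount(10, 4), 4))  # even: [3, 3, 3, 3]
```

For the resolutions mentioned above, `pad_amount` is 0 and padding is a no-op, which is why removing it appears safe in those tests.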
xfuser/model_executor/models/transformers/transformer_hunyuan_video15.py (outdated review comments, resolved)
```python
encoder_attention_mask = encoder_attention_mask[0].to(torch.bool)
encoder_hidden_states_indices = torch.arange(
    encoder_hidden_states.shape[1],
    device=encoder_hidden_states.device)
encoder_hidden_states_indices = encoder_hidden_states_indices[
    encoder_attention_mask]
encoder_hidden_states = encoder_hidden_states[
    ..., encoder_hidden_states_indices, :]
```
The truncation of encoder_hidden_states based on encoder_attention_mask[0] assumes that all items in a batch have the same number of valid tokens. In a multi-tenant or shared environment where requests with different prompt lengths are batched together, this logic will lead to either truncation of valid tokens for some users (if the first item has a shorter prompt) or inclusion of padding tokens for others (if the first item has a longer prompt). This violates the integrity of the model's output and can be used to influence other users' results.
Specifically, at lines 306-313, the code uses the mask of the first item in the batch to determine the indices of valid tokens for the entire batch. If subsequent items in the batch have more valid tokens than the first item, those tokens will be lost. If they have fewer, padding tokens will be included in the attention computation, which the PR explicitly aimed to avoid.
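To see the failure mode concretely, here is a pure-Python sketch (plain lists standing in for tensors) of indexing the whole batch with only the first item's mask:

```python
# Illustration of the bug: the FIRST batch item's mask picks the valid token
# indices for EVERY batch item. Nested lists stand in for tensors here.

def truncate_with_first_mask(hidden_states, attention_mask):
    keep = [bool(v) for v in attention_mask[0]]  # mask of batch item 0 only
    return [[tok for tok, k in zip(seq, keep) if k] for seq in hidden_states]

mask = [[1, 1, 0, 0],   # item 0: 2 valid tokens
        [1, 1, 1, 0]]   # item 1: 3 valid tokens
hs = [["a0", "a1", "a2", "a3"],
      ["b0", "b1", "b2", "b3"]]

# Item 1's third valid token "b2" is silently dropped.
print(truncate_with_first_mask(hs, mask))  # [['a0', 'a1'], ['b0', 'b1']]
```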
Yeah this now breaks support for BS > 1 where the prompts aren't identical in length.
For HunyuanVideo, we are using this:
```python
encoder_attention_mask = encoder_attention_mask.to(torch.bool).any(dim=0)
encoder_hidden_states = encoder_hidden_states[:, encoder_attention_mask, :]
```
this should do the same trick while still retaining support for BS > 1.
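The batch-union trick above can be verified with a minimal pure-Python sketch (plain lists standing in for `encoder_hidden_states` of shape batch x seq x dim and `encoder_attention_mask` of shape batch x seq):

```python
# Sketch of the suggested fix: a position is kept if ANY batch item marks it
# valid (the .any(dim=0) union), so no item loses valid tokens.

def union_mask(attention_mask):
    """Equivalent of mask.to(bool).any(dim=0) over a batch of masks."""
    return [any(col) for col in zip(*attention_mask)]

def truncate(hidden_states, attention_mask):
    keep = union_mask(attention_mask)
    return [[tok for tok, k in zip(seq, keep) if k] for seq in hidden_states]

mask = [[1, 1, 0, 0],   # item 0: 2 valid tokens
        [1, 1, 1, 0]]   # item 1: 3 valid tokens
hs = [["a0", "a1", "a2", "a3"],
      ["b0", "b1", "b2", "b3"]]

# Both items keep 3 positions; item 0 retains one padding slot ("a2"),
# but item 1's valid token "b2" is no longer dropped.
print(truncate(hs, mask))  # [['a0', 'a1', 'a2'], ['b0', 'b1', 'b2']]
```

The trade-off: shorter prompts in the batch still carry a few padding positions up to the longest prompt's length, but no valid token is ever discarded.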
```python
from xfuser.core.distributed import (
    get_runtime_state,
    get_classifier_free_guidance_world_size,
    get_classifier_free_guidance_rank,
    get_cfg_group,
    get_sequence_parallel_world_size,
    get_sequence_parallel_rank,
    get_sp_group,
    get_sequence_parallel_world_size,
    get_pipeline_parallel_world_size
)
```
There are a couple of issues with the imports in this block:

- `get_sequence_parallel_world_size` is imported twice.
- `get_classifier_free_guidance_world_size`, `get_classifier_free_guidance_rank`, and `get_cfg_group` are imported but are not used anywhere in this file.

Please clean up the imports to remove the duplicate and the unused functions for better code maintainability.
```python
from xfuser.core.distributed import (
    get_runtime_state,
    get_sequence_parallel_world_size,
    get_sequence_parallel_rank,
    get_sp_group,
    get_pipeline_parallel_world_size,
)
```

```python
encoder_hidden_states = encoder_hidden_states[
    ..., encoder_hidden_states_indices, :]

if encoder_hidden_states.shape[1] % sp_world_size!= 0:
```
For better readability and to follow common Python style guides (like PEP 8), please add spaces around the `!=` operator:

```python
if encoder_hidden_states.shape[1] % sp_world_size != 0:
```

References

- PEP 8 recommends using spaces around binary operators for better readability. (link)
Thanks for the fix! Yeah, the padding we do does have an effect on the output, so this is a great fix for that :) However, I'm not sure the assumption holds that we can remove padding from `hidden_states`.
Thank you for your comment. Padding is necessary when the sequence length of `hidden_states` is not divisible by `sp_world_size`. I retested the following video generation resolutions, where the sequence length is not divisible by `sp_world_size`:
hunyuan_video_15_t2v_result_ulysses4_ring1_tc_False_592x976.mp4
hunyuan_video1.5_592x976_81_steps50.mp4
What?

Fix issue #655, primarily by modifying the SP in `transformers_hunyuan_video15.py`.

How?

In the original implementation, both `encoder_hidden_states` and `encoder_attention_mask` were processed by `_chunk_and_pad_sequence`. However, USP does not support the `attention_mask` argument, which caused the padded (masked) portions in `encoder_attention_mask` to incorrectly participate in the attention computation.

Referencing the SP implementation in `transformers_hunyuan_video.py`, I fixed this issue by first truncating the padded portions from `encoder_hidden_states`, and using USP with `joint_strategy` when the sequence length of `encoder_hidden_states` is not divisible by `ulysses_degree`.

Test
The test code is the same as the one in #655.
prompt: A cat holding a paper with words "Hello, world!"
hunyuan_video_15_base.mp4
hunyuan_video_15_t2v_result_ulysses4_ring1_tc_False_720x1280.mp4
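The dispatch described under "How?" above (truncate padded text tokens, then fall back to `joint_strategy` only when needed) can be sketched in plain Python; the function name below is hypothetical, not the actual xfuser API:

```python
# Illustrative sketch: after truncating padded text tokens, USP with
# joint_strategy is used only when the remaining encoder sequence cannot be
# split evenly across Ulysses ranks. needs_joint_strategy is a hypothetical
# helper name, not part of xfuser.

def needs_joint_strategy(encoder_seq_len: int, ulysses_degree: int) -> bool:
    """True when the truncated encoder sequence is not evenly divisible."""
    return encoder_seq_len % ulysses_degree != 0

print(needs_joint_strategy(1000, 4))  # False: splits evenly, plain chunking works
print(needs_joint_strategy(1001, 4))  # True: take the joint_strategy path
```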