
Add Qwen2_VL #2599

Open
sam-shubham wants to merge 22 commits into keras-team:master from sam-shubham:qwen2_vl

Conversation


@sam-shubham sam-shubham commented Feb 14, 2026

Description of the change

Qwen2 VL Integration

#2323

Notebooks

Running out of memory (OOM) on Kaggle; I'm not able to use Colab in my region, so this needs review.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.


google-cla bot commented Feb 14, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @sam-shubham, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request initiates the integration of the Qwen2-VL multimodal model into KerasHub. It introduces the core architectural elements necessary for handling both text and visual inputs, leveraging a sophisticated multimodal rotary position embedding scheme. The changes lay the groundwork for a powerful new model capable of processing and understanding diverse data types within the Keras ecosystem.

Highlights

  • Qwen2-VL Model Integration: Introduced the foundational components for integrating the Qwen2-VL multimodal large language model, including its unique attention mechanism and backbone architecture.
  • Multimodal Rotary Position Embedding (M-RoPE): Implemented a novel M-RoPE mechanism that decomposes positional embeddings into temporal, height, and width components, enabling unified position encoding for text, images, and video.
  • Vision Encoder Implementation: Added a comprehensive Vision Transformer (ViT) based vision encoder, featuring 3D patch embedding, 2D rotary position embeddings, vision attention blocks, and spatial patch merging.
  • Modular Keras Layers: Developed several new Keras layers specific to Qwen2-VL, such as Qwen2VLAttention, Qwen2VLTransformerDecoder, Qwen2VLPatchEmbed, and Qwen2VLVisionEncoder components, promoting modularity and reusability.
  • Extensive Unit Testing: Included dedicated unit tests for the Qwen2VLBackbone (text-only mode) and various Qwen2VLVisionEncoder components to ensure correctness and stability of the new implementations.
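The M-RoPE idea in the highlights above (per-axis position tables merged by an `mrope_section` split) can be illustrated with a small numpy sketch. This is a hypothetical rendition of the scheme, not the PR's actual code:

```python
import numpy as np

def mrope_cos_sin(position_ids, head_dim, base=10000.0):
    """Per-axis RoPE cos/sin tables.

    position_ids: (3, seq_len) ints, one row each for the temporal,
    height, and width positions (all three rows identical for text).
    Returns cos and sin of shape (3, seq_len, head_dim).
    """
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    freqs = position_ids[..., None] * inv_freq  # (3, seq, head_dim // 2)
    emb = np.concatenate([freqs, freqs], axis=-1)
    return np.cos(emb), np.sin(emb)

def merge_mrope_sections(per_axis, mrope_section):
    """Interleave the three axis tables along the head dimension.

    mrope_section (e.g. [16, 24, 24]) covers half the head dim and is
    repeated for both rotary halves; section i is taken from axis i % 3.
    per_axis: (3, seq_len, head_dim) -> (seq_len, head_dim).
    """
    sections = list(mrope_section) * 2
    chunks = np.split(per_axis, np.cumsum(sections)[:-1], axis=-1)
    return np.concatenate(
        [chunk[i % 3] for i, chunk in enumerate(chunks)], axis=-1
    )
```

For text-only input all three position rows are equal, so the merged table reduces to ordinary 1-D RoPE, which is why the backbone can run in a text-only mode.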


Changelog
  • keras_hub/src/models/qwen2_vl/__init__.py
    • Added Apache 2.0 license header.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_attention.py
    • Added Qwen2VLAttention layer, implementing multi-head attention with Multimodal RoPE (M-RoPE) and support for Grouped-Query Attention (GQA) and sliding window attention.
    • Included helper functions _rotate_half and _apply_multimodal_rotary_pos_emb for M-RoPE application.
    • Provided _cumsum_sections utility for splitting tensor sections.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone.py
    • Added Qwen2VLBackbone class, which integrates the vision encoder with the text decoder and supports both multimodal and text-only inputs.
    • Implemented _compute_mrope_embeddings function to generate M-RoPE cosine and sine embeddings from position IDs.
    • Defined _qwen2_vl_kernel_initializer for consistent weight initialization.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone_test.py
    • Added Qwen2VLBackboneTextOnlyTest class to verify the functionality of the Qwen2-VL backbone in text-only mode.
    • Included tests for basic backbone operation, model saving, and architectural characteristics.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_decoder.py
    • Added Qwen2VLTransformerDecoder layer, a single Transformer decoder block utilizing Qwen2VLAttention and a SwiGLU MLP.
    • Implemented _compute_self_attention_mask to generate causal attention masks, combining padding and causal masking.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_layernorm.py
    • Added Qwen2VLLayerNorm as an alias to the existing QwenLayerNorm for consistency within the Qwen2-VL module.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder.py
    • Added Qwen2VLVisionEncoder class, representing the full Vision Transformer encoder for Qwen2-VL.
    • Implemented Qwen2VLPatchEmbed for 3D convolution-based patch embedding.
    • Introduced Qwen2VLVisionRotaryEmbedding for 2D rotary position embeddings in the vision encoder.
    • Created Qwen2VLVisionMLP for the feedforward network within vision blocks.
    • Developed Qwen2VLVisionAttention for multi-head attention with 2D rotary position embeddings.
    • Added Qwen2VLVisionBlock for individual transformer blocks in the vision encoder.
    • Implemented Qwen2VLPatchMerger for spatial patch merging.
    • Included _apply_rotary_pos_emb_vision helper function for applying rotary embeddings to vision query and key tensors.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder_test.py
    • Added unit tests for Qwen2VLPatchEmbed, Qwen2VLVisionRotaryEmbedding, Qwen2VLVisionBlock, Qwen2VLPatchMerger, and Qwen2VLVisionEncoder to verify their functionality and output shapes.
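The _rotate_half and _apply_multimodal_rotary_pos_emb helpers listed in the changelog follow the standard rotary-embedding recipe. A minimal numpy sketch of that pattern (illustrative only, not the PR's code):

```python
import numpy as np

def rotate_half(x):
    # Map the two halves of the last axis (a, b) -> (-b, a).
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rotary(x, cos, sin):
    # Standard RoPE: elementwise rotation of each (a, b) feature pair.
    return x * cos + rotate_half(x) * sin
```

In the M-RoPE variant, cos and sin are the section-merged multimodal tables rather than plain 1-D position tables; the application step itself is unchanged.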
Activity
  • The pull request is a Work In Progress (WIP) for Qwen2-VL integration, as indicated by the title and description.
  • A Colab Notebook for high-level usage is also marked as Work In Progress.
  • The author has completed all checklist items, including adding necessary unit tests, verifying compatibility with all backends (TensorFlow, JAX, PyTorch), rebasing on the main branch, following Keras Hub model and API design guidelines, and signing the Contributor License Agreement.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Qwen2-VL model, a significant contribution that includes the backbone, a custom attention mechanism with M-RoPE, and a vision encoder. The code is well-structured and generally follows the repository's style guide for file organization and naming. However, as this is a work-in-progress, there are several critical issues that need to be addressed. The most important ones are the missing vision processing logic in the backbone's forward pass and a non-vectorized loop in the vision encoder that will prevent graph compilation and be highly inefficient. I've also identified a bug in the decoder's cache handling and other areas for improvement regarding code clarity and performance. My detailed comments below provide specific suggestions for these points.

@sam-shubham sam-shubham changed the title Qwen2 vl Add Qwen2_VL Feb 16, 2026
@sam-shubham sam-shubham marked this pull request as ready for review February 16, 2026 04:17
@sam-shubham
Author

Hi @divyashreepathihalli @laxmareddyp, can I get a review on this?

@samudraneel05

Unfortunately this is duplicated effort: #2574 has an incorrect title but plans to make the same addition. The maintainers will have to take a call on which PR to proceed with. Maybe you can contact and collaborate with @jaytiwarihub.

@sam-shubham
Author

Hi @jaytiwarihub, I'd love to discuss and contribute together. Let me know if I can help.

@jaytiwarihub
Contributor

@sam-shubham I see you uploaded the presets and tokenizer too in #2599.

In my PR, I did not add backbone_presets in qwen2_vl_backbone.py or causal_lm_presets in qwen2_vl_causal_lm.py, so it clearly lacks the "qwen2_vl_7b_instruct" preset. Instead, I used placeholders which return raw token ids.

Why? Because the model works if someone wants to train it from scratch (random weights).

Also, in #2574, Qwen2-VL reuses the Qwen 2.5 tokenizer (for text). Here you use Qwen2VLInterleaveEmbeddings to handle vision_features, scattering them into the specific slots in text_embeddings, and you implemented _compute_mrope_embeddings, which confirms you are generating the 3D rotary embeddings required for Qwen2-VL.

One thing that looks off: in __init__.py you need to import the other classes (CausalLM, Preprocessor, ImageConverter) so they are available in the public API. Most users will use Qwen2VLCausalLM, not the Backbone; if you hide it, they can't use your model!

Overall it looks good to me. I opened #2574 first, but I haven't modified it accordingly because I'm waiting for review.

Let maintainers take a call on which PR to proceed with. cc @samudraneel05 thanks for updating!

@sam-shubham
Author


Thanks @jaytiwarihub for the kind review. I will be exposing those APIs as well (CausalLM, Preprocessor, ImageConverter), especially CausalLM. Thanks for highlighting this.

Looking forward to the maintainers' call; let's see which PR they go with.

@sachinprasadhs sachinprasadhs mentioned this pull request Feb 19, 2026
@samudraneel05

Hey @sam-shubham, there's been some miscommunication on the assignment of issues. I coded up support for this model too, and I have numeric verifications working on Kaggle. I was wondering if we could collaborate and potentially co-author this PR? I'm happy to do the debugging and the numerics verification on top of your work.

@sam-shubham
Author

sam-shubham commented Feb 19, 2026

Thanks @samudraneel05, glad you want to collaborate, but I think the PR is already complete except for exposing some APIs, i.e. one commit away. I'd be glad if you review it and let me know what you are adding.

@samudraneel05

Sure, happy to do a review and if things pop out we can go along from there!

@sam-shubham
Author

> Sure, happy to do a review and if things pop out we can go along from there!

Sure!!


@samudraneel05 samudraneel05 left a comment


I reviewed about 7 files in my initial pass; please have a look and let me know if we agree or differ on anything.

Comment on lines +1 to +2
"""Tests for Qwen2-VL Backbone."""


Suggested change:

```diff
-"""Tests for Qwen2-VL Backbone."""
+import pytest
```

The pytest import should be present here rather than just the docstring.

@sam-shubham
Author

Hi @jaytiwarihub,

I just went through the exports of the other models, and I believe there might be a structural misunderstanding.

From what I see, the inner __init__.py follows the standard pattern.
src/models/qwen2_vl/__init__.py only imports Qwen2VLBackbone and registers backbone_presets, which is exactly the same pattern used by other models (Gemma, Llama, PaliGemma, etc.).

No model in the codebase registers causal_lm_presets separately.

If I’ve misunderstood anything, please let me know — happy to revisit this.

@jaytiwarihub
Contributor

@sam-shubham you are absolutely correct, my bad.

@sam-shubham
Author

Hi @samudraneel05, thanks for highlighting those silly lines; I will take care of them. The rest of the fixes are impressive, thanks for pointing them out.

@samudraneel05

samudraneel05 commented Feb 19, 2026

No worries, I have some other comments too; once you get these in order I can expand more.

Feel free to shoot me a DM on my socials for better collaboration.

@sam-shubham
Author

@samudraneel05 sure !! thanks

@sam-shubham
Author

Thanks @samudraneel05, I addressed most of what you pointed out! Let me know if I missed something.

@sam-shubham
Author

Hi @sachinprasadhs, @divyashreepathihalli, @laxmareddyp, can you please review whether it needs anything more?

@sachinprasadhs sachinprasadhs self-requested a review February 20, 2026 01:27
@sam-shubham
Author

Hi @sachinprasadhs, any update on this?

@sachinprasadhs sachinprasadhs added the new model For PRs that contribute a new model to the Keras Hub registry. label Feb 24, 2026

@sachinprasadhs sachinprasadhs left a comment


Thanks for your contribution. I have reviewed a few files and made some comments. Please check.

mrope_section: List of 3 ints specifying how many half-head-dim
elements are allocated to [temporal, height, width].
rope_max_wavelength: float. Max wavelength for RoPE base.
kernel_initializer: Initializer for the kernel weights.

Add defaults to the arguments above.

elements are allocated to [temporal, height, width].
rope_max_wavelength: float. Max wavelength for RoPE base.
kernel_initializer: Initializer for the kernel weights.
bias_initializer: Initializer for the bias weights.

Add defaults to the arguments above.
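For illustration, the defaults requested above could be spelled out in the docstring like this (the concrete values here are hypothetical stand-ins and should match the actual model config):

```python
# Hypothetical docstring fragment showing the requested style of defaults.
ARGS_DOC = """
    mrope_section: list of 3 ints. How many half-head-dim elements are
        allocated to [temporal, height, width]. Defaults to
        `[16, 24, 24]` (hypothetical; use the checkpoint's value).
    rope_max_wavelength: float. Max wavelength for the RoPE base.
        Defaults to `10000`.
    kernel_initializer: Initializer for the kernel weights. Defaults
        to `"glorot_uniform"`.
    bias_initializer: Initializer for the bias weights. Defaults to
        `"zeros"`.
"""
```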


@keras_hub_export("keras_hub.models.Qwen2VLBackbone")
class Qwen2VLBackbone(Backbone):
"""Qwen2-VL core network with optional vision encoder."""

Add:

  1. A description of the model.
  2. Argument names, their details, and default values.
  3. A usage example and a loading-from-preset example.

Refer to other implementations if needed.

text_only_model = vision_encoder is None
head_dim = hidden_dim // num_query_heads

token_embedding = ReversibleEmbedding(

Add comment on top

# === Layers ===

name="sequence_output_layernorm",
)

token_id_input = keras.Input(

Add the comment below:
# === Functional Model ===

Comment on lines +257 to +258
del vision_embeddings_shape
del vision_indices_shape

Remove this

mrope_position_ids, head_dim, rope_max_wavelength, mrope_section
):
"""Compute M-RoPE cos/sin embeddings from position IDs."""
del mrope_section # Sections are applied in attention, not here.

Remove this


from keras_hub.src.models.qwen.qwen_layernorm import QwenLayerNorm

Qwen2VLLayerNorm = QwenLayerNorm

You could instead subclass it, i.e. class Qwen2VLLayerNorm(QwenLayerNorm), with the description and example defined inside the class docstring and just pass at the end.
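The difference between the alias and the suggested subclass can be shown with a plain-Python sketch (QwenLayerNorm here is a stand-in class, not the real Keras layer):

```python
class QwenLayerNorm:
    """RMS-style layer norm used by the Qwen models (stand-in class)."""

# Alias: reuses the class directly, so the name, docstring, and any
# serialization registration still say "QwenLayerNorm".
Qwen2VLLayerNormAlias = QwenLayerNorm

# Subclass: identical behavior, but the class carries its own name
# and its own Qwen2-VL-specific docstring, as suggested.
class Qwen2VLLayerNorm(QwenLayerNorm):
    """Qwen2-VL RMS layer norm (same math as QwenLayerNorm)."""
    pass
```

With the subclass, repr() and error messages show the Qwen2-VL name, and the class can carry its own documentation and serialization registration.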

@sam-shubham
Author

Hi @sachinprasadhs, I addressed the issues you commented on.

Let me know if I missed something.

@sam-shubham
Author

sam-shubham commented Feb 28, 2026

And for the channels_first question:

The data_format="channels_first" is intentional here: the Qwen2-VL model internally always processes vision patches in (batch, C, T, H, W) format, matching the HuggingFace reference implementation. The Keras Conv3D layer with data_format="channels_first" already handles cross-backend compatibility internally; Keras transposes under the hood for the Torch/JAX backends automatically.

I tried changing it but ended up re-transposing already-transposed data (this commit).

Changing the internal pipeline to support both formats would require restructuring how patches are produced throughout the backbone (Qwen2VLFlattenVisionInputs, the images input shape, etc.), which would be a larger change. I've added a comment explaining why channels_first is used there.
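For reference, a Conv3D whose kernel size equals its stride on a (batch, C, T, H, W) tensor is just a non-overlapping patch projection, which can be sketched in numpy (hypothetical patch sizes, not the PR's code):

```python
import numpy as np

def patch_embed_channels_first(pixels, weight, patch=(2, 2, 2)):
    """Non-overlapping 3D patch embedding on channels-first input.

    Equivalent to Conv3D(kernel_size=patch, strides=patch) with
    data_format="channels_first" and an appropriately reshaped kernel.
    pixels: (B, C, T, H, W); weight: (C * pt * ph * pw, embed_dim).
    Returns (B, num_patches, embed_dim).
    """
    b, c, t, h, w = pixels.shape
    pt, ph, pw = patch
    x = pixels.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    # Bring the patch-grid axes forward and group the per-patch axes:
    # (B, nT, nH, nW, C, pt, ph, pw)
    x = x.transpose(0, 2, 4, 6, 1, 3, 5, 7)
    x = x.reshape(b, -1, c * pt * ph * pw)
    return x @ weight
```

Because the patch axes are indexed off fixed positions in the channels-first layout, supporting channels_last as well would mean re-deriving this indexing throughout the pipeline, which is the restructuring concern described in the comment.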
