
Add Qwen2_VL #2599

Open
sam-shubham wants to merge 22 commits into keras-team:master from sam-shubham:qwen2_vl

Conversation


@sam-shubham sam-shubham commented Feb 14, 2026

Description of the change

Qwen2 VL Integration

#2323

Notebooks

Running out of memory (OOM) on Kaggle; I'm not able to use Colab in my region, so this needs review.

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.


google-cla bot commented Feb 14, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@gemini-code-assist
Contributor

Summary of Changes

Hello @sam-shubham, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request initiates the integration of the Qwen2-VL multimodal model into KerasHub. It introduces the core architectural elements necessary for handling both text and visual inputs, leveraging a sophisticated multimodal rotary position embedding scheme. The changes lay the groundwork for a powerful new model capable of processing and understanding diverse data types within the Keras ecosystem.

Highlights

  • Qwen2-VL Model Integration: Introduced the foundational components for integrating the Qwen2-VL multimodal large language model, including its unique attention mechanism and backbone architecture.
  • Multimodal Rotary Position Embedding (M-RoPE): Implemented a novel M-RoPE mechanism that decomposes positional embeddings into temporal, height, and width components, enabling unified position encoding for text, images, and video.
  • Vision Encoder Implementation: Added a comprehensive Vision Transformer (ViT) based vision encoder, featuring 3D patch embedding, 2D rotary position embeddings, vision attention blocks, and spatial patch merging.
  • Modular Keras Layers: Developed several new Keras layers specific to Qwen2-VL, such as Qwen2VLAttention, Qwen2VLTransformerDecoder, Qwen2VLPatchEmbed, and Qwen2VLVisionEncoder components, promoting modularity and reusability.
  • Extensive Unit Testing: Included dedicated unit tests for the Qwen2VLBackbone (text-only mode) and various Qwen2VLVisionEncoder components to ensure correctness and stability of the new implementations.
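The M-RoPE idea in the highlights above (per-axis position tables merged by an `mrope_section` split) can be illustrated with a small numpy sketch. This is a hypothetical rendition of the scheme, not the PR's actual code:

```python
import numpy as np

def mrope_cos_sin(position_ids, head_dim, base=10000.0):
    """Per-axis RoPE cos/sin tables.

    position_ids: (3, seq_len) ints, one row each for the temporal,
    height, and width positions (all three rows identical for text).
    Returns cos and sin of shape (3, seq_len, head_dim).
    """
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    freqs = position_ids[..., None] * inv_freq  # (3, seq, head_dim // 2)
    emb = np.concatenate([freqs, freqs], axis=-1)
    return np.cos(emb), np.sin(emb)

def merge_mrope_sections(per_axis, mrope_section):
    """Interleave the three axis tables along the head dimension.

    mrope_section (e.g. [16, 24, 24]) covers half the head dim and is
    repeated for both rotary halves; section i is taken from axis i % 3.
    per_axis: (3, seq_len, head_dim) -> (seq_len, head_dim).
    """
    sections = list(mrope_section) * 2
    chunks = np.split(per_axis, np.cumsum(sections)[:-1], axis=-1)
    return np.concatenate(
        [chunk[i % 3] for i, chunk in enumerate(chunks)], axis=-1
    )
```

For text-only input all three position rows are equal, so the merged table reduces to ordinary 1-D RoPE, which is why the backbone can run in a text-only mode.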


Changelog
  • keras_hub/src/models/qwen2_vl/__init__.py
    • Added Apache 2.0 license header.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_attention.py
    • Added Qwen2VLAttention layer, implementing multi-head attention with Multimodal RoPE (M-RoPE) and support for Grouped-Query Attention (GQA) and sliding window attention.
    • Included helper functions _rotate_half and _apply_multimodal_rotary_pos_emb for M-RoPE application.
    • Provided _cumsum_sections utility for splitting tensor sections.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone.py
    • Added Qwen2VLBackbone class, which integrates the vision encoder with the text decoder and supports both multimodal and text-only inputs.
    • Implemented _compute_mrope_embeddings function to generate M-RoPE cosine and sine embeddings from position IDs.
    • Defined _qwen2_vl_kernel_initializer for consistent weight initialization.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_backbone_test.py
    • Added Qwen2VLBackboneTextOnlyTest class to verify the functionality of the Qwen2-VL backbone in text-only mode.
    • Included tests for basic backbone operation, model saving, and architectural characteristics.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_decoder.py
    • Added Qwen2VLTransformerDecoder layer, a single Transformer decoder block utilizing Qwen2VLAttention and a SwiGLU MLP.
    • Implemented _compute_self_attention_mask to generate causal attention masks, combining padding and causal masking.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_layernorm.py
    • Added Qwen2VLLayerNorm as an alias to the existing QwenLayerNorm for consistency within the Qwen2-VL module.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder.py
    • Added Qwen2VLVisionEncoder class, representing the full Vision Transformer encoder for Qwen2-VL.
    • Implemented Qwen2VLPatchEmbed for 3D convolution-based patch embedding.
    • Introduced Qwen2VLVisionRotaryEmbedding for 2D rotary position embeddings in the vision encoder.
    • Created Qwen2VLVisionMLP for the feedforward network within vision blocks.
    • Developed Qwen2VLVisionAttention for multi-head attention with 2D rotary position embeddings.
    • Added Qwen2VLVisionBlock for individual transformer blocks in the vision encoder.
    • Implemented Qwen2VLPatchMerger for spatial patch merging.
    • Included _apply_rotary_pos_emb_vision helper function for applying rotary embeddings to vision query and key tensors.
  • keras_hub/src/models/qwen2_vl/qwen2_vl_vision_encoder_test.py
    • Added unit tests for Qwen2VLPatchEmbed, Qwen2VLVisionRotaryEmbedding, Qwen2VLVisionBlock, Qwen2VLPatchMerger, and Qwen2VLVisionEncoder to verify their functionality and output shapes.
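The _rotate_half and _apply_multimodal_rotary_pos_emb helpers listed in the changelog follow the standard rotary-embedding recipe. A minimal numpy sketch of that pattern (illustrative only, not the PR's code):

```python
import numpy as np

def rotate_half(x):
    # Map the two halves of the last axis (a, b) -> (-b, a).
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rotary(x, cos, sin):
    # Standard RoPE: elementwise rotation of each (a, b) feature pair.
    return x * cos + rotate_half(x) * sin
```

In the M-RoPE variant, cos and sin are the section-merged multimodal tables rather than plain 1-D position tables; the application step itself is unchanged.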
Activity
  • The pull request is a Work In Progress (WIP) for Qwen2-VL integration, as indicated by the title and description.
  • A Colab Notebook for high-level usage is also marked as Work In Progress.
  • The author has completed all checklist items, including adding necessary unit tests, verifying compatibility with all backends (TensorFlow, JAX, PyTorch), rebasing on the main branch, following Keras Hub model and API design guidelines, and signing the Contributor License Agreement.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces the Qwen2-VL model, a significant contribution that includes the backbone, a custom attention mechanism with M-RoPE, and a vision encoder. The code is well-structured and generally follows the repository's style guide for file organization and naming. However, as this is a work-in-progress, there are several critical issues that need to be addressed. The most important ones are the missing vision processing logic in the backbone's forward pass and a non-vectorized loop in the vision encoder that will prevent graph compilation and be highly inefficient. I've also identified a bug in the decoder's cache handling and other areas for improvement regarding code clarity and performance. My detailed comments below provide specific suggestions for these points.

@sam-shubham sam-shubham changed the title Qwen2 vl Add Qwen2_VL Feb 16, 2026
@sam-shubham sam-shubham marked this pull request as ready for review February 16, 2026 04:17
@sam-shubham
Author

Hi @divyashreepathihalli @laxmareddyp, can I get a review on this?

@samudraneel05

Unfortunately this is duplicated effort: #2574 has an incorrect title but plans to make the same addition. The maintainers will have to take a call on which PR to proceed with. Maybe you can contact and collaborate with @jaytiwarihub.

@sam-shubham
Author

Hi @jaytiwarihub, I'd love to discuss and contribute together. Let me know if I can help.

@jaytiwarihub
Contributor

@sam-shubham I see you uploaded the presets and tokenizer too in #2599.

In my PR, I did not add backbone_presets in qwen2_vl_backbone.py or causal_lm_presets in qwen2_vl_causal_lm.py, so it clearly lacks the "qwen2_vl_7b_instruct" preset. Instead, I used placeholders which return raw token ids.

Why? Because the model works if someone wants to train it from scratch (random weights).

Also, in #2574, Qwen2-VL reuses the Qwen 2.5 tokenizer (for text). Here you use Qwen2VLInterleaveEmbeddings to handle vision_features, scattering them into the specific slots in text_embeddings, and you implemented _compute_mrope_embeddings, which confirms you are generating the 3D rotary embeddings required for Qwen2-VL.

One thing that looks off: in __init__.py you need to import the other classes (CausalLM, Preprocessor, ImageConverter) so they are available in the public API. Most users will use Qwen2VLCausalLM, not the Backbone; if you hide it, they can't use your model!

Overall it looks good to me. I opened #2574 first, but I haven't modified it accordingly because I'm waiting for review.

Let maintainers take a call on which PR to proceed with. cc @samudraneel05 thanks for updating!

@sam-shubham
Author


Thanks @jaytiwarihub for the kind review. I will be exposing those APIs as well (CausalLM, Preprocessor, ImageConverter), especially CausalLM. Thanks for highlighting this.

Looking forward to the maintainers' call; let's see which PR they go with.

@sachinprasadhs sachinprasadhs mentioned this pull request Feb 19, 2026
@samudraneel05

Hey @sam-shubham, there's been some miscommunication on the assignment of issues. I coded up support for this model too, and I have numeric verifications working on Kaggle. I was wondering if we could collaborate and potentially co-author this PR? I'm happy to do the debugging and the numerics verification on top of your work.

@sam-shubham
Author

sam-shubham commented Feb 19, 2026

Thanks @samudraneel05, glad you want to collaborate, but I think the PR is already complete except for exposing some APIs, i.e. one commit away. I'd be glad if you review it and let me know what you are adding.

@samudraneel05

Sure, happy to do a review and if things pop out we can go along from there!

@sam-shubham
Author

> Sure, happy to do a review and if things pop out we can go along from there!

Sure!!


@samudraneel05 samudraneel05 left a comment


I reviewed about 7 files in my initial pass; please have a look and let me know if we agree or differ on anything.

Comment on lines +1 to +2
"""Tests for Qwen2-VL Backbone."""


Suggested change:

```diff
-"""Tests for Qwen2-VL Backbone."""
+import pytest
```

The pytest import should be present here rather than just the docstring.

@sam-shubham
Author

Hi @jaytiwarihub,

I just went through the exports of the other models, and I believe there might be a structural misunderstanding.

From what I see, the inner __init__.py follows the standard pattern.
src/models/qwen2_vl/__init__.py only imports Qwen2VLBackbone and registers backbone_presets, which is exactly the same pattern used by other models (Gemma, Llama, PaliGemma, etc.).

No model in the codebase registers causal_lm_presets separately.

If I’ve misunderstood anything, please let me know — happy to revisit this.

@jaytiwarihub
Contributor

@sam-shubham you are absolutely correct, my bad.

@sam-shubham
Author

Hi @samudraneel05, thanks for highlighting those silly lines; I will take care of them. The rest of the fixes are impressive, thanks for pointing them out.

@samudraneel05

samudraneel05 commented Feb 19, 2026

No worries, I have some other comments too; once you get these in order I can expand more.

Feel free to shoot me a DM on my socials for better collaboration.

@sam-shubham
Author

@samudraneel05 sure !! thanks

@sam-shubham
Author

Thanks @samudraneel05, I addressed most of what you pointed out! Let me know if I missed something.

@sam-shubham
Author

Hi @sachinprasadhs, @divyashreepathihalli, @laxmareddyp, can you please review whether it needs anything more?

@sachinprasadhs sachinprasadhs self-requested a review February 20, 2026 01:27
@sam-shubham
Author

Hi @sachinprasadhs, any update on this?

@sachinprasadhs sachinprasadhs added the new model For PRs that contribute a new model to the Keras Hub registry. label Feb 24, 2026

@sachinprasadhs sachinprasadhs left a comment


Thanks for your contribution. I have reviewed a few files and made some comments. Please check.

mrope_section: List of 3 ints specifying how many half-head-dim
elements are allocated to [temporal, height, width].
rope_max_wavelength: float. Max wavelength for RoPE base.
kernel_initializer: Initializer for the kernel weights.

Add defaults to the arguments above.

elements are allocated to [temporal, height, width].
rope_max_wavelength: float. Max wavelength for RoPE base.
kernel_initializer: Initializer for the kernel weights.
bias_initializer: Initializer for the bias weights.

Add defaults to the arguments above.
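For illustration, the defaults requested above could be spelled out in the docstring like this (the concrete values here are hypothetical stand-ins and should match the actual model config):

```python
# Hypothetical docstring fragment showing the requested style of defaults.
ARGS_DOC = """
    mrope_section: list of 3 ints. How many half-head-dim elements are
        allocated to [temporal, height, width]. Defaults to
        `[16, 24, 24]` (hypothetical; use the checkpoint's value).
    rope_max_wavelength: float. Max wavelength for the RoPE base.
        Defaults to `10000`.
    kernel_initializer: Initializer for the kernel weights. Defaults
        to `"glorot_uniform"`.
    bias_initializer: Initializer for the bias weights. Defaults to
        `"zeros"`.
"""
```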


@keras_hub_export("keras_hub.models.Qwen2VLBackbone")
class Qwen2VLBackbone(Backbone):
"""Qwen2-VL core network with optional vision encoder."""

Add:

  1. A description of the model.
  2. Argument names, their details, and default values.
  3. A usage example and a loading-from-preset example.

Refer to other implementations if needed.

text_only_model = vision_encoder is None
head_dim = hidden_dim // num_query_heads

token_embedding = ReversibleEmbedding(

Add comment on top

# === Layers ===

name="sequence_output_layernorm",
)

token_id_input = keras.Input(

Add the comment below:
# === Functional Model ===

Comment on lines +257 to +258
del vision_embeddings_shape
del vision_indices_shape

Remove this

mrope_position_ids, head_dim, rope_max_wavelength, mrope_section
):
"""Compute M-RoPE cos/sin embeddings from position IDs."""
del mrope_section # Sections are applied in attention, not here.

Remove this


from keras_hub.src.models.qwen.qwen_layernorm import QwenLayerNorm

Qwen2VLLayerNorm = QwenLayerNorm

You could instead subclass it, i.e. class Qwen2VLLayerNorm(QwenLayerNorm), with the description and example defined inside the class docstring and just pass at the end.
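The difference between the alias and the suggested subclass can be shown with a plain-Python sketch (QwenLayerNorm here is a stand-in class, not the real Keras layer):

```python
class QwenLayerNorm:
    """RMS-style layer norm used by the Qwen models (stand-in class)."""

# Alias: reuses the class directly, so the name, docstring, and any
# serialization registration still say "QwenLayerNorm".
Qwen2VLLayerNormAlias = QwenLayerNorm

# Subclass: identical behavior, but the class carries its own name
# and its own Qwen2-VL-specific docstring, as suggested.
class Qwen2VLLayerNorm(QwenLayerNorm):
    """Qwen2-VL RMS layer norm (same math as QwenLayerNorm)."""
    pass
```

With the subclass, repr() and error messages show the Qwen2-VL name, and the class can carry its own documentation and serialization registration.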

@sam-shubham
Author

Hi @sachinprasadhs, I addressed the issues you commented on.

Let me know if I missed something.

@sam-shubham
Author

sam-shubham commented Feb 28, 2026

And for the channels_first question:

The data_format="channels_first" is intentional here: the Qwen2-VL model internally always processes vision patches in (batch, C, T, H, W) format, matching the HuggingFace reference implementation. The Keras Conv3D layer with data_format="channels_first" already handles cross-backend compatibility internally; Keras transposes under the hood for the Torch/JAX backends automatically.

I tried changing it but ended up re-transposing already-transposed data (this commit).

Changing the internal pipeline to support both formats would require restructuring how patches are produced throughout the backbone (Qwen2VLFlattenVisionInputs, the images input shape, etc.), which would be a larger change. I've added a comment explaining why channels_first is used there.
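For reference, a Conv3D whose kernel size equals its stride on a (batch, C, T, H, W) tensor is just a non-overlapping patch projection, which can be sketched in numpy (hypothetical patch sizes, not the PR's code):

```python
import numpy as np

def patch_embed_channels_first(pixels, weight, patch=(2, 2, 2)):
    """Non-overlapping 3D patch embedding on channels-first input.

    Equivalent to Conv3D(kernel_size=patch, strides=patch) with
    data_format="channels_first" and an appropriately reshaped kernel.
    pixels: (B, C, T, H, W); weight: (C * pt * ph * pw, embed_dim).
    Returns (B, num_patches, embed_dim).
    """
    b, c, t, h, w = pixels.shape
    pt, ph, pw = patch
    x = pixels.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    # Bring the patch-grid axes forward and group the per-patch axes:
    # (B, nT, nH, nW, C, pt, ph, pw)
    x = x.transpose(0, 2, 4, 6, 1, 3, 5, 7)
    x = x.reshape(b, -1, c * pt * ph * pw)
    return x @ weight
```

Because the patch axes are indexed off fixed positions in the channels-first layout, supporting channels_last as well would mean re-deriving this indexing throughout the pipeline, which is the restructuring concern described in the comment.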
