Conversation
sachinprasadhs
left a comment
There was a problem hiding this comment.
Thanks! I took a look at a few files and left comments. Please address those comments and mark them as resolved once complete. I will perform another review after the comments have been handled.
- removed whisper dependency - reduced tf usage - made test values smaller in line with moonshine/whisper
|
/gemini review |
There was a problem hiding this comment.
Code Review
This is an impressive and well-structured contribution, adding the multimodal Qwen3-Omni model to KerasHub. The code adheres well to the repository's extensive style guide, including modular components, comprehensive tests, and the necessary converters. I have one main piece of feedback regarding code duplication and a related bug in the Qwen3OmniBackbone implementation, which I've detailed in a specific comment. Overall, this is a high-quality pull request.
|
hi, i've addressed the current set of comments. ready for review @sachinprasadhs! |
sachinprasadhs
left a comment
There was a problem hiding this comment.
Thanks for addressing all the comments, overall it looks good.
Few small comments and would it be possible to attach the screenshots of numerics matching with 1e-3 level.
Also, add one usage example notebook for different types of input and expected output format.
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces the Qwen3Omni multimodal model, a significant addition to the library. The implementation is comprehensive, covering the backbone, audio and vision encoders, preprocessors, and the causal language modeling task. The code is well-structured and adheres to the repository's conventions for new model contributions. I have identified a few areas for improvement related to performance and style, which are detailed in the review comments. These include an opportunity to optimize the application of RoPE during cached generation and to make the creation of positional embeddings in the audio encoder more efficient. I also noted a minor deviation from the style guide regarding attribute naming. Overall, this is a well-executed contribution.
Description of the change
Add Qwen3-Omni model to Keras-Hub. It has a thinker-talker architecture So far, I've tried to implement the thinker component, which is the core text transformer with audio and vision encoder integration.
Reference
Fixes #2413, Fixes #2523, and fixes #2530
Colab Notebook
Checklist
I have some follow-up questions, which I'll elaborate on in the comments under this.