
Add Deepseek 3.1 & Deepseek 3.1 base #2607

Open
cheachu wants to merge 10 commits into keras-team:master from cheachu:deepseek-v3.1

Conversation

@cheachu cheachu commented Feb 22, 2026

Description of the change

This PR introduces the DeepSeek-V3.1 model architecture to Keras Hub, fully compliant with the Keras 3 backend-agnostic design (TensorFlow, JAX, PyTorch).
Key architectural components implemented in this PR:

  • Multi-head Latent Attention (MLA): Implemented the MLA absorption trick to significantly reduce KV cache size during inference without materializing full per-head Key/Value tensors.
  • DeepSeekMoE (Mixture-of-Experts): Implemented the auxiliary-loss-free top-K routing mechanism using Sigmoid affinity scores. Crucially, the expert routing and computation loop has been fully vectorized using ops.einsum and batched kernel tensors, ensuring it is 100% XLA-compatible (jit_compile=True) and avoids the severe graph bloat associated with iterating over 256 experts.
  • YaRN RoPE Scaling: Added Yet another RoPE extensioN (YaRN) for effective long-context scaling.
  • RMSNorm & SwiGLU FFN: Implemented precise RMS normalization (using float32 casting for numerical stability) and dense SwiGLU layers for the initial non-MoE transformer layers.
  • Tokenizer & CausalLM: Added the BytePairTokenizer (based on the official DeepSeek implementation) along with the CausalLM and CausalLMPreprocessor wrappers for text generation.
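The routing scheme described above can be sketched as follows. This is a minimal NumPy stand-in for the keras.ops version (function and weight names are illustrative, not the PR's actual API): sigmoid affinities, top-K selection, normalization of only the selected scores, and a dense combine matrix so a single einsum evaluates all experts with no Python loop.

```python
import numpy as np

def moe_forward(x, gate_w, w_in, w_out, top_k=2, epsilon=1e-9):
    # Sigmoid affinity score of every token for every expert.
    scores = 1.0 / (1.0 + np.exp(-x @ gate_w))          # (tokens, experts)
    # Keep the top-K experts per token.
    top_idx = np.argsort(scores, axis=-1)[:, -top_k:]   # (tokens, K)
    top_scores = np.take_along_axis(scores, top_idx, axis=-1)
    # Normalize only the selected K scores (no auxiliary balance loss).
    weights = top_scores / (top_scores.sum(-1, keepdims=True) + epsilon)
    # Scatter the weights into a dense (tokens, experts) combine matrix so
    # every expert runs in one batched einsum instead of a loop over experts.
    combine = np.zeros_like(scores)
    np.put_along_axis(combine, top_idx, weights, axis=-1)
    hidden = np.maximum(np.einsum("td,edh->teh", x, w_in), 0.0)  # placeholder
    # activation; the real experts use SwiGLU.
    expert_out = np.einsum("teh,ehd->ted", hidden, w_out)
    return np.einsum("te,ted->td", combine, expert_out)
```

Note that this dense form computes every expert for every token, which is fine for a readability sketch; the vectorized production kernel keeps the same einsum structure while remaining XLA-compilable.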

Reference

Checklist

  • I have added all the necessary unit tests for my change.
  • I have verified that my change does not break existing code and works with all backends (TensorFlow, JAX, and PyTorch).
  • My PR is based on the latest changes of the main branch (if unsure, rebase the code).
  • I have followed the Keras Hub Model contribution guidelines in making these changes.
  • I have followed the Keras Hub API design guidelines in making these changes.
  • I have signed the Contributor License Agreement.

@gemini-code-assist
Contributor

Summary of Changes

Hello @cheachu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands Keras Hub's capabilities by integrating the advanced DeepSeek-V3.1 large language model. The implementation focuses on cutting-edge architectural components like Multi-head Latent Attention for efficient memory usage during inference, a highly optimized Mixture-of-Experts system for scalable computation, and advanced positional encoding techniques for handling longer contexts. This addition provides Keras users with a powerful, state-of-the-art language model that is fully compatible with Keras 3's multi-backend design.

Highlights

  • DeepSeek-V3.1 Model Integration: Introduced the DeepSeek-V3.1 model architecture to Keras Hub, ensuring full compliance with the Keras 3 backend-agnostic design (TensorFlow, JAX, PyTorch).
  • Multi-head Latent Attention (MLA): Implemented MLA with an absorption trick to significantly reduce KV cache size during inference by compressing keys and values through a shared low-rank latent vector, avoiding materialization of full per-head Key/Value tensors.
  • DeepSeekMoE (Mixture-of-Experts): Integrated an auxiliary-loss-free top-K routing mechanism using Sigmoid affinity scores, with expert routing and computation fully vectorized using 'ops.einsum' for 100% XLA-compatibility and reduced graph bloat.
  • YaRN RoPE Scaling: Added Yet another RoPE extensioN (YaRN) to support effective long-context scaling by applying differential scaling to RoPE frequencies based on wavelength.
  • RMSNorm & SwiGLU FFN: Implemented precise RMS normalization (using float32 casting for numerical stability) and dense SwiGLU layers for the initial non-MoE transformer layers.
  • Tokenizer & CausalLM: Included the BytePairTokenizer (based on the official DeepSeek implementation) along with 'CausalLM' and 'CausalLMPreprocessor' wrappers for text generation.
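The latent-compression idea behind MLA can be illustrated roughly as follows. This is a hypothetical sketch, not the PR's code: only the shared low-rank latent is cached, and per-head keys/values are expanded from it when attending; with the absorption trick the key up-projection is folded into the query path so full per-head K is never materialized at all.

```python
import numpy as np

def mla_compress(h, w_dkv, w_uk, w_uv):
    """Cache a shared low-rank latent instead of full per-head K/V.

    h: (seq, dim); w_dkv: (dim, latent);
    w_uk, w_uv: (latent, num_heads * head_dim).
    """
    c_kv = h @ w_dkv   # (seq, latent): this is all the KV cache stores
    k = c_kv @ w_uk    # per-head keys, expanded lazily at attention time
    v = c_kv @ w_uv    # per-head values, likewise expanded on the fly
    return c_kv, k, v
```

With latent much smaller than num_heads * head_dim, the cached tensor is a fraction of the size of conventional K and V caches.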

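The RMSNorm mentioned in the highlights can be sketched like this, with the reduction done in float32 for numerical stability regardless of the compute dtype. A minimal NumPy stand-in for the Keras layer; names are illustrative.

```python
import numpy as np

def rms_norm(x, scale, epsilon=1e-6):
    # Cast to float32 so the mean of squares is accumulated at full
    # precision even when x is float16/bfloat16.
    x32 = x.astype(np.float32)
    variance = np.mean(np.square(x32), axis=-1, keepdims=True)
    normed = x32 / np.sqrt(variance + epsilon)
    # Apply the learned scale, then cast back to the input dtype.
    return (normed * scale).astype(x.dtype)
```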

Changelog
  • keras_hub/api/models/__init__.py
    • Added imports for DeepSeekV31Backbone, DeepSeekV31CausalLM, DeepSeekV31CausalLMPreprocessor, and DeepSeekV31Tokenizer to expose them in the API.
  • keras_hub/api/tokenizers/__init__.py
    • Added import for DeepSeekV31Tokenizer to make it accessible via the tokenizers API.
  • keras_hub/src/models/deepseek_v31/__init__.py
    • Added new file to export DeepSeekV31 components and register their presets.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_attention.py
    • Added new file implementing the DeepSeek V31 Multi-head Latent Attention (MLA) layer, including YaRN RoPE scaling.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_backbone.py
    • Added new file defining the DeepSeek V31 core transformer backbone, incorporating MLA and a configurable Mixture-of-Experts (MoE) or dense FFN.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_backbone_test.py
    • Added new file containing unit tests for the DeepSeekV31Backbone, including cache building and parameter count verification.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_causal_lm.py
    • Added new file implementing the DeepSeek V31 Causal Language Model, which pairs the backbone with an LM head and supports MLA-compatible KV caching for generation.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_causal_lm_preprocessor.py
    • Added new file for the DeepSeek V31 Causal LM preprocessor, designed to prepare text inputs for the DeepSeekV31CausalLM.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_causal_lm_preprocessor_test.py
    • Added new file with unit tests for the DeepSeekV31CausalLMPreprocessor, verifying preprocessing logic and special token handling.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_causal_lm_test.py
    • Added new file containing unit tests for the DeepSeekV31CausalLM, covering basic functionality and text generation.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_decoder_block.py
    • Added new file defining the DeepSeek V31 transformer decoder block, including DeepSeekV31RMSNorm and DeepSeekV31DenseFeedForward layers.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_moe.py
    • Added new file implementing the DeepSeek V31 Mixture-of-Experts (MoE) layer with sigmoid-based routing and vectorized expert computation.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_presets.py
    • Added new file to define preset configurations and metadata for DeepSeek V31 backbone, tokenizer, and preprocessor models.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_tokenizer.py
    • Added new file implementing the DeepSeek V31 BytePairTokenizer, including special token handling and SentencePiece proto support.
  • keras_hub/src/models/deepseek_v31/deepseek_v31_tokenizer_test.py
    • Added new file with unit tests for the DeepSeekV31Tokenizer, covering basic tokenization and special token properties.
Activity
  • No human activity (comments, reviews) was detected on this pull request yet.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces the DeepSeek-V3.1 model architecture to Keras Hub, including Multi-head Latent Attention (MLA), DeepSeekMoE, YaRN RoPE Scaling, RMSNorm, SwiGLU FFN, Tokenizer, and CausalLM. A critical security vulnerability was identified in the DeepSeekV31Tokenizer class where the proto argument, accepting a file path to a SentencePiece model, is loaded without checking Keras's safe_mode. This bypasses security protections against arbitrary file reads during model deserialization. It is recommended to include the necessary safe_mode check. Additionally, there are minor issues related to naming conventions in the presets file and a potential for improved clarity in the _yarn_inv_freq method's constant.

Comment on lines +59 to +64
    if proto is not None:
        try:
            import sentencepiece as spm

            sp = spm.SentencePieceProcessor()
            sp.Load(proto)
gemini-code-assist (Contributor), severity: medium (security)

The proto argument allows loading a SentencePiece model from an arbitrary file path. This bypasses Keras's safe_mode protection, which is intended to prevent arbitrary file reads during deserialization of untrusted models. The parent class BytePairTokenizer implements a check for safe_mode when loading from a path, and DeepSeekV31Tokenizer should do the same for the proto argument to maintain a consistent security posture.

Suggested change
    - if proto is not None:
    -     try:
    -         import sentencepiece as spm
    -         sp = spm.SentencePieceProcessor()
    -         sp.Load(proto)
    + if proto is not None:
    +     from keras.src.saving import serialization_lib
    +
    +     if isinstance(proto, str) and serialization_lib.in_safe_mode():
    +         raise ValueError(
    +             "Requested the loading of a SentencePiece proto file outside of the "
    +             "model archive. This carries a potential risk of loading "
    +             "arbitrary and sensitive files and thus it is disallowed "
    +             "by default. If you trust the source of the artifact, you "
    +             "can override this error by passing `safe_mode=False` to "
    +             "the loading function, or calling "
    +             "`keras.config.enable_unsafe_deserialization()`."
    +         )
    +     try:
    +         import sentencepiece as spm
    +         sp = spm.SentencePieceProcessor()
    +         sp.Load(proto)


    # Wavelength = 2π / freq. High-freq → small wavelength, low-freq →
    # large wavelength. YaRN applies more scaling to low-freq dimensions.
    wavelengths = 2.0 * 3.14159265358979 / freqs
gemini-code-assist (Contributor), severity: medium

The constant 3.14159265358979 is a magic number. It would be more readable and maintainable to define this as a named constant, e.g., PI = 3.14159265358979.

Suggested change
    - wavelengths = 2.0 * 3.14159265358979 / freqs
    + PI = 3.14159265358979
    + wavelengths = 2.0 * PI / freqs


    # Metadata for loading pretrained model weights and configurations.
    backbone_presets = {
        "deepseek_v3_base": {
gemini-code-assist (Contributor), severity: medium

The preset name deepseek_v3_base does not align with the model name DeepSeekV31 used throughout the codebase. The repository's naming conventions (Rule 55) state that preset names should be in snake_case and follow the pattern <model_name>_<component_type>.py. To maintain consistency, this should be deepseek_v31_base.

backbone_presets = {
    "deepseek_v31_base": {

                "37B activated parameters."
            ),
            "params": 671000000000,
            "path": "deepseek_v3",
gemini-code-assist (Contributor), severity: medium

The path metadata field should reflect the model's specific version, deepseek_v31, for clarity and consistency with the model's naming. This helps in easily identifying the model version when loading presets.

            "path": "deepseek_v31",

            "params": 671000000000,
            "path": "deepseek_v3",
            "model_type": "MoE",
            "tokenizer": "DeepSeekV3Tokenizer",
gemini-code-assist (Contributor), severity: medium

The tokenizer name DeepSeekV3Tokenizer should be updated to DeepSeekV31Tokenizer to match the model's version and maintain consistency across the project. This ensures that the correct tokenizer is referenced for the DeepSeekV3.1 model.

            "tokenizer": "DeepSeekV31Tokenizer",

    preprocessor_presets = {
        "deepseek_v3_base": {
            "metadata": {
                "description": "DeepSeek V3 preprocessor.",
gemini-code-assist (Contributor), severity: medium

The path metadata field should reflect the model's specific version, deepseek_v31, for clarity and consistency with the model's naming. This helps in easily identifying the model version when loading presets.

            "path": "deepseek_v31",

            "metadata": {
                "description": "DeepSeek V3 preprocessor.",
                "path": "deepseek_v3",
            },
gemini-code-assist (Contributor), severity: medium

The Kaggle handle should reflect the model's specific version, deepseek_v31, for clarity and consistency with the model's naming. This helps in easily identifying the model version when loading presets.

        "kaggle_handle": "kaggle://deepseek-ai/deepseek-v31/preprocessor/1",

                "path": "deepseek_v3",
            },
            "kaggle_handle": "kaggle://deepseek-ai/deepseek-v3/preprocessor/1",
        },
gemini-code-assist (Contributor), severity: medium

The preset name deepseek_v3 does not align with the model name DeepSeekV31 used throughout the codebase. The repository's naming conventions (Rule 55) state that preset names should be in snake_case and follow the pattern <model_name>_<component_type>.py. To maintain consistency, this should be deepseek_v31.

    "deepseek_v31": {

        },
        "deepseek_v3": {
            "metadata": {
                "description": "DeepSeek V3 preprocessor.",
gemini-code-assist (Contributor), severity: medium

The path metadata field should reflect the model's specific version, deepseek_v31, for clarity and consistency with the model's naming. This helps in easily identifying the model version when loading presets.

            "path": "deepseek_v31",

            "metadata": {
                "description": "DeepSeek V3 preprocessor.",
                "path": "deepseek_v3",
            },
gemini-code-assist (Contributor), severity: medium

The Kaggle handle should reflect the model's specific version, deepseek_v31, for clarity and consistency with the model's naming. This helps in easily identifying the model version when loading presets.

        "kaggle_handle": "kaggle://deepseek-ai/deepseek-v31/preprocessor/1",


cheachu commented Feb 22, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an impressive and comprehensive contribution, adding the DeepSeek-V3.1 model to Keras Hub. The implementation is well-structured, follows the repository's modular design, and makes good use of Keras 3 features for backend-agnostic code. The vectorized MoE layer and MLA implementation are particularly well done. I have a few suggestions to improve code clarity, consistency, and testing, primarily related to adhering to the style guide and fixing an empty test case. Overall, great work on this complex model implementation.

Comment on lines +41 to +51
    def test_backbone_basics(self):
        original_assert_dtype_equal = self.assertDTypeEqual

        def assert_dtype_flexible(tensor, expected_dtype, msg=None):
            actual_dtype = str(tensor.dtype)
            allowed_dtypes = ["float16", "bfloat16"]
            if actual_dtype not in allowed_dtypes:
                self.fail(
                    msg
                    or f"Tensor dtype {actual_dtype} not in allowed {allowed_dtypes}"  # noqa: E501
                )
gemini-code-assist (Contributor), severity: high

This test case test_backbone_basics is currently empty and does not perform any checks. It should be implemented to validate the backbone's basic functionality using self.run_backbone_test, as shown in the repository's testing guidelines (lines 462-468).

    def test_backbone_basics(self):
        self.run_backbone_test(
            cls=DeepSeekV31Backbone,
            init_kwargs=self.init_kwargs,
            input_data=self.input_data,
            expected_output_shape=(2, 5, 64),
        )
References
  1. The style guide requires using helper methods like self.run_backbone_test() to verify basic usage and shape inference for backbone models. (link)


    # Wavelength = 2π / freq. High-freq → small wavelength, low-freq →
    # large wavelength. YaRN applies more scaling to low-freq dimensions.
    PI = 3.14159265358979
gemini-code-assist (Contributor), severity: medium

For better precision and code clarity, it's recommended to use math.pi instead of a hardcoded value for PI. You can remove this line and use math.pi directly in the calculation on the next line. You'll also need to add import math at the top of the file.
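As a rough illustration of the wavelength-based scheme this comment refers to, using math.pi as suggested: the sketch below is hypothetical (the cutoff wavelengths and linear ramp are placeholders; real YaRN derives them from rotation counts relative to the original training context length).

```python
import math
import numpy as np

def yarn_inv_freq(head_dim, base=10000.0, factor=4.0,
                  low_wavelen=256.0, high_wavelen=32.0):
    # Standard RoPE inverse frequencies over the even dimensions.
    freqs = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    # Wavelength = 2*pi / freq; math.pi instead of a magic number.
    wavelengths = 2.0 * math.pi / freqs
    # Ramp from 0 (short wavelengths, left unscaled) to 1 (long
    # wavelengths, fully interpolated by `factor`).
    ramp = np.clip(
        (wavelengths - high_wavelen) / (low_wavelen - high_wavelen), 0.0, 1.0
    )
    return freqs * (1.0 - ramp) + (freqs / factor) * ramp
```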

Comment on lines +48 to +49
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
gemini-code-assist (Contributor), severity: medium

The __init__ method should have explicit arguments instead of *args, **kwargs to improve clarity and align with the repository's style guide (see lines 233-240 of CONTRIBUTING_MODELS.md).

Suggested change
    - def __init__(self, *args, **kwargs):
    -     super().__init__(*args, **kwargs)
    + def __init__(
    +     self,
    +     tokenizer,
    +     sequence_length=1024,
    +     add_start_token=True,
    +     add_end_token=True,
    +     **kwargs,
    + ):
    +     super().__init__(
    +         tokenizer=tokenizer,
    +         sequence_length=sequence_length,
    +         add_start_token=add_start_token,
    +         add_end_token=add_end_token,
    +         **kwargs,
    +     )
References
  1. The style guide provides an example for Preprocessor __init__ methods with explicit arguments for tokenizer and sequence_length. (link)


    # Normalize only the selected K scores (eq. 13).
    top_k_weights = top_k_scores / (
        ops.sum(top_k_scores, axis=-1, keepdims=True) + 1e-9
    )
gemini-code-assist (Contributor), severity: medium

The epsilon value 1e-9 is hardcoded. For consistency with other layers like DeepSeekV31RMSNorm and for better maintainability, consider making this a configurable parameter in the __init__ method with a default value. This would involve adding an epsilon argument to __init__ and updating get_config.
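One way to realize this suggestion could look roughly like the sketch below, with NumPy standing in for keras.ops. The class and method names are illustrative, not the PR's actual API; the real layer would accept epsilon in __init__ and report it from get_config.

```python
import numpy as np

class TopKGate:
    """Top-K score normalization with a configurable epsilon."""

    def __init__(self, epsilon=1e-9):
        self.epsilon = epsilon

    def normalize(self, top_k_scores):
        # Normalize only the selected K scores; epsilon guards against
        # an all-zero row without being a hardcoded magic number.
        denom = top_k_scores.sum(axis=-1, keepdims=True) + self.epsilon
        return top_k_scores / denom

    def get_config(self):
        # Serializing epsilon keeps reloaded models numerically identical.
        return {"epsilon": self.epsilon}
```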

@sachinprasadhs added the "new model" label (For PRs that contribute a new model to the Keras Hub registry) on Feb 24, 2026
… format checks + removed the hardcoded pi and epsilon values

cheachu commented Feb 25, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive and high-quality implementation of the DeepSeek-V3.1 model, including its novel Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE) components. The code is well-structured, follows the repository's style guide closely, and includes thorough testing. My feedback is minor and focuses on a couple of style improvements to enhance code consistency and readability.


    # Wavelength = 2π / freq. High-freq → small wavelength, low-freq →
    # large wavelength. YaRN applies more scaling to low-freq dimensions.
    import math
gemini-code-assist (Contributor), severity: medium

The math module is imported locally within the _yarn_inv_freq method. According to Python style guidelines (PEP 8), imports should be placed at the top of the file. Please move import math to the top of the file to make dependencies clear and avoid re-importing.

References
  1. The repository style guide requires using ruff for code formatting (line 736). ruff would flag this as an out-of-place import (E402). Moving imports to the top of the file is a standard Python convention (PEP 8) that improves readability by making dependencies clear at a glance. (link)

cheachu (Author)

Cool :)
Will do that!


cheachu commented Feb 25, 2026

Hi @sachinprasadhs 👋
Can you review this? :)

