Support for Qwen 3.5 Model Series (Dense and Quantized)#3396
woodRock wants to merge 1 commit into huggingface:main from
Conversation
- Implement Qwen 3.5 hybrid architecture (Gated DeltaNet + Gated Attention)
- Optimize linear attention prefill using GEMM kernels
- Add GGUF support for quantized Qwen 3.5 variants
- Update examples to support the new model series
Would love to have support for this. The jina ranking models also need it. Thanks for adding support @woodRock
Is the model usable now? I really want to add Qwen 3.5 as a local alternative to openclaw.
I was able to integrate this branch successfully in my project. Works great.
@rupurt Hi, how's the speed?
Slow as a dawg on CPU, but pretty decent on Apple hardware. It takes about 1 second per reranking result for me in sift on CPU.
So cool, thanks for doing this. Is there quantized support for the 35B?
Hello, I ran your examples on both the dense and quantized models, and they worked. However, when I try different prompts, the output starts repeating. I tried Dense 0.8B, 2B, 4B, 9B, and 27B with different temperature, top_k, and repeat_penalty settings, and the repetition still happened. The difference is that with bigger models the repetition starts later. Do you have any idea what the issue is? Thanks.
This PR implements support for the Qwen 3.5 model series, covering the dense variants (0.8B, 2B, 4B, 9B, and 27B) and their corresponding GGUF quantized versions.
Qwen 3.5 introduces a sophisticated hybrid architecture that alternates between Gated Attention (standard softmax-based attention) and Gated DeltaNet (a linear attention/SSM variant).
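As a rough illustration of such a hybrid stack, the sketch below picks a layer kind per index. The 3:1 DeltaNet-to-attention ratio and the `layer_kind` helper are assumptions for illustration only, not the actual Qwen 3.5 schedule or this PR's code.

```rust
/// Which kind of layer sits at a given depth in a hybrid stack.
#[derive(Debug, PartialEq, Clone, Copy)]
enum LayerKind {
    GatedAttention, // standard softmax attention with an output gate
    GatedDeltaNet,  // linear-attention / SSM-style recurrent layer
}

/// Hypothetical schedule: every 4th layer uses full attention,
/// the rest use Gated DeltaNet.
fn layer_kind(idx: usize) -> LayerKind {
    if idx % 4 == 3 {
        LayerKind::GatedAttention
    } else {
        LayerKind::GatedDeltaNet
    }
}

fn main() {
    let schedule: Vec<LayerKind> = (0..8).map(layer_kind).collect();
    println!("{:?}", schedule);
}
```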
Key Changes:
- Implemented the hybrid Gated DeltaNet + Gated Attention architecture.
- Added GGUF support, including proper mapping for SSM-specific tensors (ssm_in, ssm_conv1d, ssm_alpha, ssm_beta, etc.).
- Handled the rope_parameters structure used in Qwen 3.5.
- Updated the examples to support the new model series.
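To illustrate the kind of tensor-name mapping a GGUF loader needs, here is a minimal sketch. The source-side suffixes (`in_proj.weight`, `conv1d.weight`, etc.) and the `gguf_ssm_name` helper are hypothetical; only the `ssm_*` target names come from the list above.

```rust
/// Map a hypothetical source tensor-name suffix to its GGUF ssm_* name.
/// Returns None for suffixes that have no SSM-specific mapping.
fn gguf_ssm_name(suffix: &str) -> Option<&'static str> {
    let name = match suffix {
        "in_proj.weight" => "ssm_in.weight",
        "conv1d.weight" => "ssm_conv1d.weight",
        "alpha" => "ssm_alpha",
        "beta" => "ssm_beta",
        _ => return None,
    };
    Some(name)
}

fn main() {
    println!("{:?}", gguf_ssm_name("alpha"));
    println!("{:?}", gguf_ssm_name("q_proj.weight"));
}
```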
Performance & Optimization:

The Gated DeltaNet layers involve a recurrent state update that is inherently sequential ($S_t = S_{t-1} \times g_t + \dots$). The linear-attention prefill was refactored to use matrix multiplications (GEMM) instead of broadcasting/sums, and all loop-invariants (dtype conversions, exponentials) were hoisted out of the loop. This significantly reduces kernel launch overhead on Metal and CUDA.
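A toy, framework-free sketch of the GEMM reformulation: accumulating T rank-1 updates `S += k_t v_t^T` one token at a time gives the same matrix as the single product `K^T V`, so one GEMM call can replace T broadcast-and-sum kernel launches. The dimensions and helper names below are hypothetical, not this PR's code.

```rust
/// Rank-1 update loop: one broadcast-and-sum per token.
/// S[i][j] += k[t][i] * v[t][j] for each t in turn.
fn rank1_loop(k: &[Vec<f32>], v: &[Vec<f32>], d: usize) -> Vec<Vec<f32>> {
    let mut s = vec![vec![0.0; d]; d];
    for t in 0..k.len() {
        for i in 0..d {
            for j in 0..d {
                s[i][j] += k[t][i] * v[t][j];
            }
        }
    }
    s
}

/// Same result as a single GEMM: (K^T V)[i][j] = sum_t K[t][i] * V[t][j].
/// On GPU this is one matmul kernel instead of T separate launches.
fn gemm_kt_v(k: &[Vec<f32>], v: &[Vec<f32>], d: usize) -> Vec<Vec<f32>> {
    let mut s = vec![vec![0.0; d]; d];
    for i in 0..d {
        for j in 0..d {
            for t in 0..k.len() {
                s[i][j] += k[t][i] * v[t][j];
            }
        }
    }
    s
}

fn main() {
    let k = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let v = vec![vec![5.0, 6.0], vec![7.0, 8.0]];
    println!("{:?}", rank1_loop(&k, &v, 2));
    println!("{:?}", gemm_kt_v(&k, &v, 2));
}
```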
Despite the GEMM-based operations, the prefill (prompt processing) phase is still bound by the sequential recurrence. For maximum prefill speeds, this architecture would benefit from a custom Fused Parallel Scan kernel (similar to the Selective Scan in Mamba) to parallelize the recurrence across the sequence dimension.
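A scalar toy of why such a scan is possible: each step of $S_t = g_t S_{t-1} + u_t$ is an affine map $x \mapsto g x + u$, and affine maps compose associatively, so all prefixes can be computed with a parallel scan of $O(\log T)$ depth. This is the idea behind fused scan kernels like Mamba's selective scan; the code below is a minimal single-threaded illustration, not this PR's implementation.

```rust
/// One recurrence step as an affine map x -> g * x + u.
#[derive(Debug, Clone, Copy)]
struct Affine {
    g: f32,
    u: f32,
}

/// Compose two steps: applying `a` first, then `b`, gives
/// b(a(x)) = (b.g * a.g) * x + (b.g * a.u + b.u).
fn combine(a: Affine, b: Affine) -> Affine {
    Affine { g: a.g * b.g, u: b.g * a.u + b.u }
}

/// Hillis-Steele inclusive scan; on parallel hardware each `stride`
/// round runs all updates concurrently, so depth is O(log T).
fn scan(mut xs: Vec<Affine>) -> Vec<Affine> {
    let n = xs.len();
    let mut stride = 1;
    while stride < n {
        let prev = xs.clone();
        for i in stride..n {
            xs[i] = combine(prev[i - stride], prev[i]);
        }
        stride *= 2;
    }
    xs
}

fn main() {
    // With S_0 = 0, the u-component of prefix t equals S_t.
    let steps = vec![
        Affine { g: 0.5, u: 1.0 },
        Affine { g: 0.9, u: 2.0 },
        Affine { g: 0.8, u: 3.0 },
    ];
    for p in scan(steps) {
        println!("{}", p.u);
    }
}
```

The sequential decode loop and this scan compute identical states; the scan just exposes the parallelism along the sequence dimension that a fused kernel would exploit during prefill.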
Testing:
Verified with Qwen/Qwen3.5-0.8B (BF16) and unsloth/Qwen3.5-0.8B-GGUF (Q4_K_M) on Metal.
Standard Dense:

```shell
cargo run --features metal --example qwen --release -- --model 3.5-0.8b --prompt "Tell me a short story."
```

Quantized GGUF:

```shell
cargo run --features metal --example quantized-qwen3_5 --release -- --which 0.8b --prompt "Explain the concept of quantum entanglement."
```

Closes #3393