
Support for Qwen 3.5 Model Series (Dense and Quantized) #3396

Open
woodRock wants to merge 1 commit into huggingface:main from woodRock:feat-qwen3.5-support

Conversation


@woodRock woodRock commented Mar 6, 2026

This PR implements support for the Qwen 3.5 model series, covering the dense variants (0.8B, 2B, 4B, 9B, and 27B) and their corresponding GGUF quantized versions.

Qwen 3.5 introduces a sophisticated hybrid architecture that alternates between Gated Attention (standard softmax-based attention) and Gated DeltaNet (a linear attention/SSM variant).
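For illustration, the alternation can be sketched as a per-layer dispatch. The `full_attention_interval` name and the every-Nth-layer rule below are assumptions for this sketch, not the actual config key or layout:

```rust
// Sketch: dispatch each decoder layer to one of the two mixer types.
// The interval-based pattern and the `full_attention_interval` name are
// hypothetical; the real layout comes from the model configuration.

#[derive(Debug, PartialEq)]
enum Mixer {
    GatedAttention, // standard softmax attention
    GatedDeltaNet,  // linear-attention / SSM variant
}

fn mixer_for_layer(layer_idx: usize, full_attention_interval: usize) -> Mixer {
    // Hypothetical rule: every `full_attention_interval`-th layer uses
    // softmax attention; the rest use the linear DeltaNet mixer.
    if (layer_idx + 1) % full_attention_interval == 0 {
        Mixer::GatedAttention
    } else {
        Mixer::GatedDeltaNet
    }
}

fn main() {
    // With interval 4, layers 3, 7, 11, ... get full attention.
    let layout: Vec<Mixer> = (0..8).map(|i| mixer_for_layer(i, 4)).collect();
    println!("{:?}", layout);
}
```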

Key Changes:

  • Core Implementation: Added candle-transformers/src/models/qwen3_5.rs to handle the hybrid
    architecture.
  • Quantization Support: Added candle-transformers/src/models/quantized_qwen3_5.rs with full GGUF
    support, including proper mapping for SSM-specific tensors (ssm_in, ssm_conv1d, ssm_alpha,
    ssm_beta, etc.).
  • Metadata Handling: Updated configuration logic to handle the nested text_config and
    rope_parameters structure used in Qwen 3.5.
  • Examples:
    • Updated the existing qwen example to support the new 3.5 sizes.
    • Added a new quantized-qwen3_5 example for running GGUF models.
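As a rough illustration of the SSM tensor mapping, a translation from GGUF tensor-name suffixes to the loader's internal names might look like the sketch below. The suffix strings follow the ones named in this PR (ssm_in, ssm_conv1d, ssm_alpha, ssm_beta); the internal target names are illustrative assumptions, not the actual paths in quantized_qwen3_5.rs:

```rust
// Sketch: map GGUF tensor-name suffixes for a DeltaNet block onto the
// names the loader expects. Target names on the right are hypothetical.

fn map_ssm_tensor(suffix: &str) -> Option<&'static str> {
    match suffix {
        "ssm_in"     => Some("linear_attn.in_proj.weight"),
        "ssm_conv1d" => Some("linear_attn.conv1d.weight"),
        "ssm_alpha"  => Some("linear_attn.alpha"),
        "ssm_beta"   => Some("linear_attn.beta"),
        // Anything else falls through to the generic attention/MLP mapping.
        _ => None,
    }
}

fn main() {
    assert_eq!(map_ssm_tensor("ssm_in"), Some("linear_attn.in_proj.weight"));
    assert_eq!(map_ssm_tensor("attn_q"), None);
    println!("ok");
}
```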

Performance & Optimization:
The Gated DeltaNet layers involve a recurrent state update that is inherently sequential ($S_t = S_{t-1} \times g_t + \dots$).

  • Optimized Tensor Path: To maximize speed without custom kernels, I refactored the sequential loop
    to use matrix multiplications (GEMM) instead of broadcasts and sums, and hoisted loop-invariant
    work (dtype conversions, exponentials) out of the loop. This significantly reduces kernel launch
    overhead on Metal and CUDA.
  • Future Work (Fused Kernels): While the current implementation is highly optimized for pure tensor
    operations, the prefill (prompt processing) phase is still $O(N)$. To achieve production-grade
    prefill speeds, this architecture would benefit from a custom Fused Parallel Scan kernel (similar
    to the Selective Scan in Mamba) to parallelize the recurrence across the sequence dimension.
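To show why the recurrence is sequential, here is a scalar sketch of the state update. The real state is matrix-valued and the update involves outer products; this reduces it to one state element per step ($S_t = g_t \cdot S_{t-1} + v_t$) purely for illustration:

```rust
// Sketch: the per-token DeltaNet state update, reduced to scalars.
// Each step reads the state written by the previous one, which is the
// dependency a fused parallel-scan kernel would restructure.

fn scan(gates: &[f32], inputs: &[f32]) -> Vec<f32> {
    let mut s = 0.0f32;
    let mut out = Vec::with_capacity(inputs.len());
    for (&g, &v) in gates.iter().zip(inputs) {
        // S_t = g_t * S_{t-1} + v_t
        s = g * s + v;
        out.push(s);
    }
    out
}

fn main() {
    let states = scan(&[0.5, 0.5, 0.5], &[1.0, 1.0, 1.0]);
    println!("{:?}", states); // [1.0, 1.5, 1.75]
}
```

A Blelloch-style parallel scan (as in Mamba's selective scan) exploits the associativity of this update to process the sequence in O(log N) parallel steps instead of N sequential ones.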

Testing:
Verified with Qwen/Qwen3.5-0.8B (BF16) and unsloth/Qwen3.5-0.8B-GGUF (Q4_K_M) on Metal.

Standard Dense:

cargo run --features metal --example qwen --release -- --model 3.5-0.8b --prompt "Tell me a short story."

Quantized GGUF:

cargo run --features metal --example quantized-qwen3_5 --release -- --which 0.8b --prompt "Explain the concept of quantum entanglement."

Closes #3393

- Implement Qwen 3.5 hybrid architecture (Gated DeltaNet + Gated Attention)
- Optimize linear attention prefill using GEMM kernels
- Add GGUF support for quantized Qwen 3.5 variants
- Update examples to support the new model series

rupurt commented Mar 11, 2026

Would love to have support for this. The jina ranking models also need it. Thanks for adding support @woodRock

@lucasjinreal

Is the model able to run now? I really want to add Qwen 3.5 as a local alternative to openclaw.

@rupurt

rupurt commented Mar 13, 2026

I was able to integrate this branch successfully in my project. Works great

@lucasjinreal

@rupurt hi, how's the speed?

@rupurt

rupurt commented Mar 14, 2026

Slow as a dog on CPU, but pretty decent on Apple hardware. Takes about 1 second per reranking result for me in sift on CPU.

@AlpineVibrations

So cool, thanks for doing this. Is there quantized support for the 35B?

@minybot

minybot commented Mar 20, 2026

Hello, I ran your examples on both the dense and quantized models, and they worked. However, when I try different prompts, the output starts repeating. I tried the dense 0.8B, 2B, 4B, 9B, and 27B models with different temperature, top_k, and repeat_penalty settings, and the repetition still occurred. The difference is that with bigger models the repetition starts later. Do you have any idea what the issue is? Thanks.
