We are very pleased to announce that KTransformers now supports Qwen3-Next-80B-A3B-Thinking and Qwen3-Next-80B-A3B-Instruct.
- Official Qwen3-Next-80B-A3B-Thinking release
- Official Qwen3-Next-80B-A3B-Instruct release
Running the model with all 512 experts requires approximately 320 GB of system memory and 6 GB of GPU memory.
```bash
# download gguf
huggingface-cli download --resume-download Qwen/Qwen3-Next-80B-A3B-Instruct
```
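If you prefer to script the download, the same repo can be fetched with the `huggingface_hub` Python API (a sketch assuming `huggingface_hub` is installed; `download_model` is our own helper name):

```python
REPO_ID = "Qwen/Qwen3-Next-80B-A3B-Instruct"

def download_model(repo_id=REPO_ID, local_dir=None):
    # Imported lazily so the module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    # snapshot_download resumes interrupted downloads by default,
    # matching the --resume-download behavior of huggingface-cli.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

if __name__ == "__main__":
    print(download_model())
```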
To install KTransformers, follow the official Installation Guide.
```bash
python ktransformers/server/main.py \
  --port 10021 \
  --model_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --gguf_path path-to-Qwen3-Next-80B-A3B-Thinking \
  --model_name Qwen3NextForCausalLM \
  --optimize_config_path <local_path>/ktransformers/optimize/optimize_rules/Qwen3Next-serve.yaml \
  --max_new_tokens 1024 \
  --cache_lens 32768 \
  --chunk_size 256 \
  --max_batch_size 4 \
  --no-use_cuda_graph \
  --backend_type balance_serve
```

Once the server is running, send a request:

```bash
curl -X POST http://localhost:10021/v1/chat/completions \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "hello"}
    ],
    "model": "Qwen3-Next-80B-A3B-Instruct",
    "temperature": 0.3,
    "top_p": 1.0,
    "stream": true
  }'
```
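For programmatic access, here is a minimal Python sketch of a streaming client against the same OpenAI-compatible endpoint. The port (10021), payload fields, and model name mirror the curl example above; the helper names are our own:

```python
import json

def build_chat_payload(model, user_content, temperature=0.3, top_p=1.0, stream=True):
    # Builds the same JSON body as the curl example above.
    return {
        "messages": [{"role": "user", "content": user_content}],
        "model": model,
        "temperature": temperature,
        "top_p": top_p,
        "stream": stream,
    }

def extract_delta(sse_line):
    # Parse one "data: {...}" line of the server-sent-event stream and
    # return the incremental text, or None for keep-alives and [DONE].
    if not sse_line.startswith("data: "):
        return None
    body = sse_line[len("data: "):].strip()
    if body == "[DONE]":
        return None
    chunk = json.loads(body)
    return chunk["choices"][0].get("delta", {}).get("content")

if __name__ == "__main__":
    # Stream a completion from the locally running server.
    import urllib.request

    req = urllib.request.Request(
        "http://localhost:10021/v1/chat/completions",
        data=json.dumps(
            build_chat_payload("Qwen3-Next-80B-A3B-Instruct", "hello")
        ).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            text = extract_delta(raw.decode("utf-8"))
            if text:
                print(text, end="", flush=True)
    print()
```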
Because Qwen3-Next uses linear attention, CUDA Graph optimization is not yet supported (hence the `--no-use_cuda_graph` flag above), but it's coming soon! 🚀