Releases: intel/llm-scaler
llm-scaler-vllm beta release 0.14.0-b8.1
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.14.0-b8.1
What’s new
- vLLM:
- Added support for Qwen3.5-27B, Qwen3.5-35B-A3B and Qwen3.5-122B-A10B (FP8/INT4 online quantization, GPTQ)
- Added support for Qwen3-ASR-1.7B
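For reference, online quantization in vLLM is normally requested at serve time rather than baked into the checkpoint. A hedged example (the model path is a placeholder from the list above, and exact flag support may vary by build):

```shell
# FP8 online quantization requested at serve time via vLLM's
# standard --quantization flag; the model path is illustrative.
vllm serve Qwen/Qwen3.5-27B --quantization fp8
```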
llm-scaler-omni beta release 0.1.0-b6
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b6
What’s new
- ComfyUI (base commit 3dd10a59c00248d00f0cb0ab794ff1bb9fb00a5f, v0.15.1):
- Added CacheDiT and torch.compile() acceleration support
- Added support for new models and workflows: SeedVR2, FlashVSR, Anima, FireRed-Image-Edit-1.1
- Introduced initial SYCL acceleration for ComfyUI-GGUF via omni_xpu_kernel (Q4_K, Q4_0, and Q8_0 optimized)
- Bug fixes:
- Fixed the IndexTTS-2 and Wan2.2-I2V-14B-multi-XPU workflows
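As background on those GGUF quant formats: Q8_0 stores weights in blocks of 32 signed 8-bit integers plus one scale per block, so dequantization is a single multiply per weight. A minimal pure-Python sketch of the idea (illustrative only; unrelated to the actual omni_xpu_kernel implementation):

```python
# Sketch of GGUF-style Q8_0 block quantization: 32 weights share one
# scale; quantized value q is round(w / scale), dequantized w' = scale * q.
BLOCK_SIZE = 32

def quantize_q8_0(values):
    """Quantize one block of 32 floats to (scale, [int8 quants])."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants

def dequantize_q8_0(scale, quants):
    """Recover approximate floats from one Q8_0 block."""
    return [scale * q for q in quants]

# Round-trip a sample block; the error per weight is at most scale / 2.
block = [0.5 * i - 8.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0(block)
restored = dequantize_q8_0(scale, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Q4_0 and Q4_K follow the same block-plus-scale pattern with 4-bit quants (Q4_K adds per-superblock scale hierarchies), which is why a small set of SYCL kernels can cover all three.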
- SGLD (base commit 0e53cee1f69beba0b1ae598e6fd5646fc52afb32):
- Code cleanup and refactoring
- Added FP8 precision support on XPU
llm-scaler-vllm beta release 0.14.0-b8
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.14.0-b8
- Offline Installer: 26.13.7.1 - Internal Only
Ingredients
| Ingredients | Version |
|---|---|
| Host OS | Ubuntu 24.04 Desktop/Server |
| vllm | 0.14.0 |
| PyTorch | 2.10 |
| OneAPI | 2025.3.2 |
| OneCCL | 15.7.8 |
| UMD Driver | 25.40.36300.8 |
| KMD Driver | 6.17.0-1007.7 |
| GuC Firmware | 70.55.3 |
| XPU Manager | 1.3.5 |
| Offline Installer | 26.13.7.1 |
What’s new
- vLLM:
- Upgraded vLLM to 0.14.0 and PyTorch to 2.10; uplifted oneAPI to 2025.3.2 (hotfix) with LTS support on the UR adapter v2; upgraded oneCCL to 2021.15.7.8.
- Included INT4 oneDNN optimizations, delivering up to 25% throughput improvement over the last release.
- Bug fixes.
- Added support for Qwen3-VL-Reranker-2B/8B
- Added support for Qwen3-VL-Embedding-2B/8B
- Added support for GLM-4.7-Flash
- Added support for Ministral models
- Added support for DeepSeek-OCR-2
- Added support for Qwen3-Coder-Next
- Fix InternVL issue
- Offline Installer:
- Added Ubuntu 24.04 support
- Added a script to configure PCIe switch downstream-port ACS
- Added a script for P2P issue diagnostics
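The shipped scripts aren't reproduced here, but the usual technique for enabling GPU peer-to-peer behind a PCIe switch is to clear the ACS control bits on the switch's downstream ports, so P2P traffic is not redirected through the root complex. A hand-rolled sketch of that idea, assuming pciutils is installed (the BDF address is a placeholder; the release's script may work differently):

```shell
# Show which devices advertise an Access Control Services capability.
sudo lspci -vvv | grep -i "Access Control Services"

# Clear the ACS control register (offset 0x6 within the ACS extended
# capability) on one downstream port. 0000:03:00.0 is a placeholder
# for your switch's downstream-port address.
sudo setpci -s 0000:03:00.0 ECAP_ACS+0x6.w=0000
```

Note this change does not persist across reboots and weakens IOMMU isolation between the affected devices, which is why it is typically wrapped in a script.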
llm-scaler-vllm PV release 1.3
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.3
What’s new
- vLLM:
- Upgraded vLLM to 0.11.1 and PyTorch to 2.9; upgraded oneAPI to 2025.2.2 (hotfix) and oneCCL to 2021.15.7.6.
- 8 new models supported: Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking, InternVL3.5-30B-A3B, DeepSeek-OCR, PaddleOCR-VL, Seed-OSS-36B-Instruct, Qwen3-30B-A3B-Instruct-2507, and openai/whisper-large-v3.
- Key bug fixes for timeout/accuracy issues found in long-running stress tests.
- Key bug fixes for a communication accuracy issue in long-run scenarios and a sub-communicator hang on the oneCCL side.
- vLLM 0.11.1 new features: CPU KV-cache offload, two additional speculative decoding methods (Medusa, suffix), experimental FP8 KV cache, and expert parallelism for TP+EP and DP+EP scenarios.
- Bug fixes.
- Supported sym_int4 for Qwen3-30B-A3B on TP 4/8.
- Supported sym_int4 for Qwen3-235B-A22B on TP 16.
- Added support for the PaddleOCR model.
- Added support for GLM-4.6v-Flash.
- Fixed crash errors with 2DP + 4TP configuration.
- Fixed abnormal output observed during JMeter stress testing.
- Fixed UR_ERROR_DEVICE_LOST errors triggered by frequent preemption under high load.
- Fixed output errors for InternVL-38B.
- Refined profile_run logic to provide more GPU blocks by default
llm-scaler-vllm beta release 0.11.1-b7
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.11.1-b7
What’s new
- vLLM:
- Upgraded vLLM to 0.11.1 and PyTorch to 2.9; upgraded oneAPI to 2025.2.2 (hotfix) and oneCCL to 2021.15.7.6.
- 8 new models supported: Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking, InternVL3.5-30B-A3B, DeepSeek-OCR, PaddleOCR-VL, Seed-OSS-36B-Instruct, Qwen3-30B-A3B-Instruct-2507, and openai/whisper-large-v3.
- Key bug fixes for timeout/accuracy issues found in long-running stress tests.
- Key bug fixes for a communication accuracy issue in long-run scenarios and a sub-communicator hang on the oneCCL side.
- vLLM 0.11.1 new features: CPU KV-cache offload, two additional speculative decoding methods (Medusa, suffix), experimental FP8 KV cache, and expert parallelism for TP+EP and DP+EP scenarios.
- Bug fixes.
- Supported sym_int4 for Qwen3-30B-A3B on TP 4/8.
- Supported sym_int4 for Qwen3-235B-A22B on TP 16.
- Added support for the PaddleOCR model.
- Added support for GLM-4.6v-Flash.
- Fixed crash errors with 2DP + 4TP configuration.
- Fixed abnormal output observed during JMeter stress testing.
- Fixed UR_ERROR_DEVICE_LOST errors triggered by frequent preemption under high load.
- Fixed output errors for InternVL-38B.
- Refined profile_run logic to provide more GPU blocks by default
llm-scaler-omni beta release 0.1.0-b5
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b5
What’s new
- Core Upgrades
- Upgraded to Python 3.12 and PyTorch 2.9 for improved performance and compatibility
- ComfyUI
- Fixed a stochastic rounding issue on XPU, resolving the LoRA black-screen output problem. LoRA workflows are now supported (e.g., Z-Image-Turbo, Qwen-Image, Qwen-Image-Edit).
- Added support for new models and workflows: Qwen-Image-Layered, Qwen-Image-Edit-2511, Qwen-Image-2512, HY-Motion, and more.
- Added support for ComfyUI-GGUF, enabling GGUF models (e.g., FLUX.2-dev Q4_0) with reduced VRAM usage.
- Fixed image format issue in the Hunyuan3D-2.1 workflow.
- Refined documentation for improved clarity.
- LTX2 support on XPU.
- Updated Windows environment setup script.
- SGLang Diffusion
- Added support for CacheDiT.
- Added Tensor Parallelism (TP) support for selected models with better performance (e.g., Z-Image-Turbo).
- Added SGLD ComfyUI custom node support, allowing SGLang Diffusion to serve as a backend for ComfyUI image generation workflows.
- Standalone Examples
- Added support for HY-WorldPlay.
- Added audio models to the standalone examples
llm-scaler-vllm PV release 1.2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.2
- Offline Installer: offline-installer:25.45.5.4
Ingredients
| Ingredients | Version |
|---|---|
| Host OS | Ubuntu 25.04 Desktop/Server |
| vllm | 0.10.2 |
| PyTorch | 2.8.0 |
| OneAPI | 2025.1.3-7 |
| OneCCL | 15.6.2 |
| UMD Driver | 25.40.35563.7 |
| KMD Driver | 6.14.0-1008-intel |
| GuC Firmware | 70.45.2 |
| XPU Manager | 1.3.3 |
| Offline Installer | 25.45.5.4 |
What’s new
- vLLM:
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 Support
- Enable whisper transcription models
- Fixed minicpmv4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for the decoding phase across all workloads (>10% e2e throughput gain on 10+ models across all input/output sequence lengths)
- gpt-oss 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations; output throughput: 2.6x e2e improvement for Qwen3-30B-A3B and 1.5x for DeepSeek-V2-Lite
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
- fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable improvement at small batch sizes
- Bug fixes
Known issue
- Crash during initialization with 2DP x 4TP configuration.
- Status: Scheduled to be fixed in release b7.
- Abnormal output (excessive "!!!") observed during JMeter stress testing.
- Status: Scheduled to be fixed in release b7.
- UR_ERROR_DEVICE_LOST occurs due to excessive preemption under high load.
- Description: Requests exceeding server capacity trigger frequent preemption, eventually leading to device loss.
- Workaround: Temporarily mitigate by increasing the number of GPU blocks (set a higher gpu_memory_utilization) or adjusting the --max-num-seqs parameter.
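As an illustration of that workaround, the two knobs correspond to standard vLLM launch flags (the model and values below are placeholders to tune per deployment):

```shell
# Raise gpu-memory-utilization to leave more room for KV-cache blocks,
# and cap the number of in-flight sequences to reduce preemption.
vllm serve Qwen/Qwen3-30B-A3B \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64
```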
- An abnormal gpu_blocks_num value causes performance degradation when running large batches with gpt-oss-120b.
- Description: The vLLM profile_run() logic causes the KV cache's gpu_blocks_num to decrease, leading to a performance drop.
- Workaround (hotfix): Temporarily update the logic of determine_available_memory() to allow a larger, more efficient KV cache.
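The arithmetic behind this hotfix is simple: the KV-cache budget is the utilization-scaled total GPU memory minus the peak allocation observed while profiling the model. A minimal standalone sketch (function name and figures are illustrative, not from the release):

```python
def available_kv_cache_memory(total_gpu_memory: int,
                              gpu_memory_utilization: float,
                              peak_memory: int) -> int:
    """KV-cache budget: utilization-scaled total minus profiling peak."""
    return int(total_gpu_memory * gpu_memory_utilization - peak_memory)

GIB = 1024 ** 3
# Placeholder numbers, not measured values: a 48 GiB device at 0.9
# utilization with a 30 GiB profiling peak leaves ~13.2 GiB for KV blocks.
kv_bytes = available_kv_cache_memory(48 * GIB, 0.9, 30 * GIB)
```

Allowing this value to grow (rather than shrinking gpu_blocks_num during profiling) is what recovers large-batch performance.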
file: /usr/local/lib/python3.12/dist-packages/vllm-0.10.3.dev0+g01efc7ef7.d20251125.xpu-py3.12-linux-x86_64.egg/vllm/v1/worker/xpu_worker.py
line: 83
# Change the entire function to:
```python
@torch.inference_mode()
def determine_available_memory(self) -> int:
    """Profiles the peak memory usage of the model to determine how many
    KV blocks may be allocated without OOMs.

    The engine will first conduct a profiling of the existing memory usage.
    Then, it calculates the maximum possible number of GPU and CPU blocks
    that can be allocated with the remaining free memory.

    .. tip::
        You may limit the usage of GPU memory
        by adjusting the `gpu_memory_utilization` parameter.
    """
    torch.xpu.empty_cache()
    torch.xpu.synchronize()
    torch.xpu.reset_peak_memory_stats()
    self.model_runner.profile_run()
    torch.xpu.synchronize()

    stats = torch.xpu.memory_stats()
    peak_allocated = stats.get("allocated_bytes.all.peak", 0)
    current_reserved = torch.xpu.memory_reserved()

    # Diagnostic values (in GiB) for logging/inspection.
    fragmentation_gb = (current_reserved - peak_allocated) / 1024**3
    peak_gb = peak_allocated / 1024**3
    reserved_gb = current_reserved / 1024**3
    model_memory_gb = self.model_runner.model_memory_usage / 1024**3

    # Budget the KV cache as the utilization-scaled total GPU memory
    # minus the peak memory observed during profiling.
    peak_memory = peak_allocated
    total_gpu_memory = torch.xpu.get_device_properties(
        self.local_rank).total_memory
    available_kv_cache_memory = (
        total_gpu_memory * self.cache_config.gpu_memory_utilization -
        peak_memory)
    return int(available_kv_cache_memory)
```
llm-scaler-omni beta release 0.1.0-b4
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b4
What’s new
- omni:
- Added SGLang Diffusion support, with a 10% performance improvement for ComfyUI in the single-card scenario
- Added ComfyUI workflows for Hunyuan-Video-1.5 (T2V, I2V, Multi-B60), Z-Image
llm-scaler-vllm beta release 0.10.2-b6
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.2-b6
What’s new
- vLLM:
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 Support
- Enable whisper transcription models
- Fixed minicpmv4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for the decoding phase across all workloads (>10% e2e throughput gain on 10+ models across all input/output sequence lengths)
- gpt-oss 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations; output throughput: 2.6x e2e improvement for Qwen3-30B-A3B and 1.5x for DeepSeek-V2-Lite
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
- fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable improvement at small batch sizes
- Bug fixes
llm-scaler-omni beta release 0.1.0-b3
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b3
What’s new
- omni:
- More workflows support:
- Hunyuan 3D 2.1
- Controlnet on SD3.5, FLUX.1, etc.
- Multi XPU support for Wan 2.2 I2V 14B rapid aio
- AnimateDiff Lightning
- Added Windows installation support