Releases: intel/llm-scaler
llm-scaler-vllm beta release 0.14.0-b8.1
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.14.0-b8.1
What’s new
- vLLM:
- Added support for Qwen3.5-27B, Qwen3.5-35B-A3B and Qwen3.5-122B-A10B (FP8/INT4 online quantization, GPTQ)
- Added support for Qwen3-ASR-1.7B
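For reference, online quantization in vLLM is normally requested at serve time rather than baked into the checkpoint. A hedged example (the model path is a placeholder from the list above, and exact flag support may vary by build):

```shell
# FP8 online quantization requested at serve time via vLLM's
# standard --quantization flag; the model path is illustrative.
vllm serve Qwen/Qwen3.5-27B --quantization fp8
```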
llm-scaler-omni beta release 0.1.0-b6
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b6
What’s new
- ComfyUI (base commit 3dd10a59c00248d00f0cb0ab794ff1bb9fb00a5f, v0.15.1):
- Added CacheDiT and torch.compile() acceleration support
- Added support for new models and workflows: SeedVR2, FlashVSR, Anima, FireRed-Image-Edit-1.1
- Introduced initial SYCL acceleration for ComfyUI-GGUF via omni_xpu_kernel (Q4_K, Q4_0, and Q8_0 optimized)
- Bug fixes:
- Fixed the IndexTTS-2 and Wan2.2-I2V-14B-multi-XPU workflows
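As background on those GGUF quant formats: Q8_0 stores weights in blocks of 32 signed 8-bit integers plus one scale per block, so dequantization is a single multiply per weight. A minimal pure-Python sketch of the idea (illustrative only; unrelated to the actual omni_xpu_kernel implementation):

```python
# Sketch of GGUF-style Q8_0 block quantization: 32 weights share one
# scale; quantized value q is round(w / scale), dequantized w' = scale * q.
BLOCK_SIZE = 32

def quantize_q8_0(values):
    """Quantize one block of 32 floats to (scale, [int8 quants])."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants

def dequantize_q8_0(scale, quants):
    """Recover approximate floats from one Q8_0 block."""
    return [scale * q for q in quants]

# Round-trip a sample block; the error per weight is at most scale / 2.
block = [0.5 * i - 8.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0(block)
restored = dequantize_q8_0(scale, quants)
max_err = max(abs(a - b) for a, b in zip(block, restored))
```

Q4_0 and Q4_K follow the same block-plus-scale pattern with 4-bit quants (Q4_K adds per-superblock scale hierarchies), which is why a small set of SYCL kernels can cover all three.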
- SGLD (base commit 0e53cee1f69beba0b1ae598e6fd5646fc52afb32):
- Code cleanup and refactoring
- Added FP8 precision support on XPU
llm-scaler-vllm beta release 0.14.0-b8
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.14.0-b8
- Offline Installer: 26.13.7.1 - Internal Only
Ingredients
| Ingredients | Version |
|---|---|
| Host OS | Ubuntu 24.04 Desktop/Server |
| vllm | 0.14.0 |
| PyTorch | 2.10 |
| OneAPI | 2025.3.2 |
| OneCCL | 15.7.8 |
| UMD Driver | 25.40.36300.8 |
| KMD Driver | 6.17.0-1007.7 |
| GuC Firmware | 70.55.3 |
| XPU Manager | 1.3.5 |
| Offline Installer | 26.13.7.1 |
What’s new
- vLLM:
- Upgraded vLLM to 0.14.0 and PyTorch to 2.10; uplifted oneAPI to 2025.3.2 (hotfix) with LTS support on the UR adapter v2; upgraded oneCCL to 2021.15.7.8.
- Included INT4 oneDNN optimizations, delivering up to 25% throughput improvement over the last release.
- Bug fixes.
- Added support for Qwen3-VL-Reranker-2B/8B
- Added support for Qwen3-VL-Embedding-2B/8B
- Added support for GLM-4.7-Flash
- Added support for Ministral models
- Added support for DeepSeek-OCR-2
- Added support for Qwen3-Coder-Next
- Fix InternVL issue
- Offline Installer:
- Added Ubuntu 24.04 support
- Added a script to configure PCIe switch downstream-port ACS
- Added a script for P2P issue diagnostics
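The shipped scripts aren't reproduced here, but the usual technique for enabling GPU peer-to-peer behind a PCIe switch is to clear the ACS control bits on the switch's downstream ports, so P2P traffic is not redirected through the root complex. A hand-rolled sketch of that idea, assuming pciutils is installed (the BDF address is a placeholder; the release's script may work differently):

```shell
# Show which devices advertise an Access Control Services capability.
sudo lspci -vvv | grep -i "Access Control Services"

# Clear the ACS control register (offset 0x6 within the ACS extended
# capability) on one downstream port. 0000:03:00.0 is a placeholder
# for your switch's downstream-port address.
sudo setpci -s 0000:03:00.0 ECAP_ACS+0x6.w=0000
```

Note this change does not persist across reboots and weakens IOMMU isolation between the affected devices, which is why it is typically wrapped in a script.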
llm-scaler-vllm PV release 1.3
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.3
What’s new
- vLLM:
- Upgraded vLLM to 0.11.1 and PyTorch to 2.9; upgraded oneAPI to 2025.2.2 (hotfix) and oneCCL to 2021.15.7.6.
- 8 new models supported: Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking, InternVL3.5-30B-A3B, DeepSeek-OCR, PaddleOCR-VL, Seed-OSS-36B-Instruct, Qwen3-30B-A3B-Instruct-2507, and openai/whisper-large-v3.
- Key bug fixes for timeout/accuracy issues found in long-running stress tests.
- Key bug fixes for a communication accuracy issue in long-run scenarios and a sub-communicator hang on the oneCCL side.
- vLLM 0.11.1 new features: CPU KV-cache offload, two additional speculative decoding methods (Medusa, suffix), experimental FP8 KV cache, and expert parallelism for TP+EP and DP+EP scenarios.
- Bug fixes.
- Supported sym_int4 for Qwen3-30B-A3B on TP 4/8.
- Supported sym_int4 for Qwen3-235B-A22B on TP 16.
- Added support for the PaddleOCR model.
- Added support for GLM-4.6v-Flash.
- Fixed crash errors with 2DP + 4TP configuration.
- Fixed abnormal output observed during JMeter stress testing.
- Fixed UR_ERROR_DEVICE_LOST errors triggered by frequent preemption under high load.
- Fixed output errors for InternVL-38B.
- Refined profile_run logic to provide more GPU blocks by default
llm-scaler-vllm beta release 0.11.1-b7
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.11.1-b7
What’s new
- vLLM:
- Upgraded vLLM to 0.11.1 and PyTorch to 2.9; upgraded oneAPI to 2025.2.2 (hotfix) and oneCCL to 2021.15.7.6.
- 8 new models supported: Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking, InternVL3.5-30B-A3B, DeepSeek-OCR, PaddleOCR-VL, Seed-OSS-36B-Instruct, Qwen3-30B-A3B-Instruct-2507, and openai/whisper-large-v3.
- Key bug fixes for timeout/accuracy issues found in long-running stress tests.
- Key bug fixes for a communication accuracy issue in long-run scenarios and a sub-communicator hang on the oneCCL side.
- vLLM 0.11.1 new features: CPU KV-cache offload, two additional speculative decoding methods (Medusa, suffix), experimental FP8 KV cache, and expert parallelism for TP+EP and DP+EP scenarios.
- Bug fixes.
- Supported sym_int4 for Qwen3-30B-A3B on TP 4/8.
- Supported sym_int4 for Qwen3-235B-A22B on TP 16.
- Added support for the PaddleOCR model.
- Added support for GLM-4.6v-Flash.
- Fixed crash errors with 2DP + 4TP configuration.
- Fixed abnormal output observed during JMeter stress testing.
- Fixed UR_ERROR_DEVICE_LOST errors triggered by frequent preemption under high load.
- Fixed output errors for InternVL-38B.
- Refined profile_run logic to provide more GPU blocks by default
llm-scaler-omni beta release 0.1.0-b5
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b5
What’s new
- Core Upgrades
- Upgraded to Python 3.12 and PyTorch 2.9 for improved performance and compatibility
- ComfyUI
- Fixed a stochastic rounding issue on XPU, resolving the LoRA black-screen output problem. LoRA workflows are now supported (e.g., Z-Image-Turbo, Qwen-Image, Qwen-Image-Edit).
- Added support for new models and workflows: Qwen-Image-Layered, Qwen-Image-Edit-2511, Qwen-Image-2512, HY-Motion, and more.
- Added support for ComfyUI-GGUF, enabling GGUF models (e.g., FLUX.2-dev Q4_0) with reduced VRAM usage.
- Fixed image format issue in the Hunyuan3D-2.1 workflow.
- Refined documentation for improved clarity.
- LTX2 support on XPU.
- Updated Windows environment setup script.
- SGLang Diffusion
- Added support for CacheDiT.
- Added Tensor Parallelism (TP) support for selected models with better performance (e.g., Z-Image-Turbo).
- Added SGLD ComfyUI custom node support, allowing SGLang Diffusion to serve as a backend for ComfyUI image generation workflows.
- Standalone Examples
- Added support for HY-WorldPlay.
- Added audio models to the standalone examples
llm-scaler-vllm PV release 1.2
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:1.2
- Offline Installer: offline-installer:25.45.5.4
Ingredients
| Ingredients | Version |
|---|---|
| Host OS | Ubuntu 25.04 Desktop/Server |
| vllm | 0.10.2 |
| PyTorch | 2.8.0 |
| OneAPI | 2025.1.3-7 |
| OneCCL | 15.6.2 |
| UMD Driver | 25.40.35563.7 |
| KMD Driver | 6.14.0-1008-intel |
| GuC Firmware | 70.45.2 |
| XPU Manager | 1.3.3 |
| Offline Installer | 25.45.5.4 |
What’s new
- vLLM:
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 Support
- Enable whisper transcription models
- Fixed minicpmv4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for the decoding phase across all workloads (>10% e2e throughput gain on 10+ models across all input/output sequence lengths)
- gpt-oss 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations; output throughput: 2.6x e2e improvement for Qwen3-30B-A3B and 1.5x for DeepSeek-V2-Lite
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
- fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable improvement at small batch sizes
- Bug fixes
Known issue
- Crash during initialization with 2DP x 4TP configuration.
- Status: Scheduled to be fixed in release b7.
- Abnormal output (excessive "!!!") observed during JMeter stress testing.
- Status: Scheduled to be fixed in release b7.
- UR_ERROR_DEVICE_LOST occurs due to excessive preemption under high load.
- Description: Requests exceeding server capacity trigger frequent preemption, eventually leading to device loss.
- Workaround: Temporarily mitigate by increasing the number of GPU blocks (set a higher gpu_memory_utilization) or adjusting the --max-num-seqs parameter.
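As an illustration of that workaround, the two knobs correspond to standard vLLM launch flags (the model and values below are placeholders to tune per deployment):

```shell
# Raise gpu-memory-utilization to leave more room for KV-cache blocks,
# and cap the number of in-flight sequences to reduce preemption.
vllm serve Qwen/Qwen3-30B-A3B \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64
```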
- An abnormal gpu_blocks_num value causes performance degradation when running large batches with gpt-oss-120b.
- Description: The vLLM profile_run() logic causes the KV cache's gpu_blocks_num to decrease, leading to a performance drop.
- Workaround (hotfix): Temporarily update the logic of determine_available_memory() to allow a larger, more efficient KV cache.
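The arithmetic behind this hotfix is simple: the KV-cache budget is the utilization-scaled total GPU memory minus the peak allocation observed while profiling the model. A minimal standalone sketch (function name and figures are illustrative, not from the release):

```python
def available_kv_cache_memory(total_gpu_memory: int,
                              gpu_memory_utilization: float,
                              peak_memory: int) -> int:
    """KV-cache budget: utilization-scaled total minus profiling peak."""
    return int(total_gpu_memory * gpu_memory_utilization - peak_memory)

GIB = 1024 ** 3
# Placeholder numbers, not measured values: a 48 GiB device at 0.9
# utilization with a 30 GiB profiling peak leaves ~13.2 GiB for KV blocks.
kv_bytes = available_kv_cache_memory(48 * GIB, 0.9, 30 * GIB)
```

Allowing this value to grow (rather than shrinking gpu_blocks_num during profiling) is what recovers large-batch performance.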
file: /usr/local/lib/python3.12/dist-packages/vllm-0.10.3.dev0+g01efc7ef7.d20251125.xpu-py3.12-linux-x86_64.egg/vllm/v1/worker/xpu_worker.py
line: 83
# Change the entire function to:
```python
@torch.inference_mode()
def determine_available_memory(self) -> int:
    """Profiles the peak memory usage of the model to determine how many
    KV blocks may be allocated without OOMs.

    The engine will first conduct a profiling of the existing memory usage.
    Then, it calculates the maximum possible number of GPU and CPU blocks
    that can be allocated with the remaining free memory.

    .. tip::
        You may limit the usage of GPU memory
        by adjusting the `gpu_memory_utilization` parameter.
    """
    torch.xpu.empty_cache()
    torch.xpu.synchronize()
    torch.xpu.reset_peak_memory_stats()
    self.model_runner.profile_run()
    torch.xpu.synchronize()

    stats = torch.xpu.memory_stats()
    peak_allocated = stats.get("allocated_bytes.all.peak", 0)
    current_reserved = torch.xpu.memory_reserved()

    # Diagnostic values (in GiB) for logging/inspection.
    fragmentation_gb = (current_reserved - peak_allocated) / 1024**3
    peak_gb = peak_allocated / 1024**3
    reserved_gb = current_reserved / 1024**3
    model_memory_gb = self.model_runner.model_memory_usage / 1024**3

    # Budget the KV cache as the utilization-scaled total GPU memory
    # minus the peak memory observed during profiling.
    peak_memory = peak_allocated
    total_gpu_memory = torch.xpu.get_device_properties(
        self.local_rank).total_memory
    available_kv_cache_memory = (
        total_gpu_memory * self.cache_config.gpu_memory_utilization -
        peak_memory)
    return int(available_kv_cache_memory)
```
llm-scaler-omni beta release 0.1.0-b4
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b4
What’s new
- omni:
- Added SGLang Diffusion support, with a 10% performance improvement for ComfyUI in the single-card scenario
- Added ComfyUI workflows for Hunyuan-Video-1.5 (T2V, I2V, Multi-B60), Z-Image
llm-scaler-vllm beta release 0.10.2-b6
Highlights
Resources
- Docker Image: intel/llm-scaler-vllm:0.10.2-b6
What’s new
- vLLM:
- MoE-Int4 support for Qwen3-30B-A3B
- Bpe-Qwen tokenizer support
- Enable Qwen3-VL Dense/MoE models
- Enable Qwen3-Omni models
- MinerU 2.5 Support
- Enable whisper transcription models
- Fixed minicpmv4.5 OOM issue and output error
- Enable ERNIE-4.5-vl models
- Enable Glyph based GLM-4.1V-9B-Base
- Attention kernel optimizations for the decoding phase across all workloads (>10% e2e throughput gain on 10+ models across all input/output sequence lengths)
- gpt-oss 20B and 120B support in MXFP4 with optimized performance
- MoE model optimizations; output throughput: 2.6x e2e improvement for Qwen3-30B-A3B and 1.5x for DeepSeek-V2-Lite
- New models: added 8 multi-modality models with image/video support
- vLLM 0.10.2 with new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
- fp16/bf16 GEMM optimizations for batch sizes 1-128, with notable improvement at small batch sizes
- Bug fixes
llm-scaler-omni beta release 0.1.0-b3
Highlights
Resources
- Docker Image: intel/llm-scaler-omni:0.1.0-b3
What’s new
- omni:
- More workflows support:
- Hunyuan 3D 2.1
- Controlnet on SD3.5, FLUX.1, etc.
- Multi XPU support for Wan 2.2 I2V 14B rapid aio
- AnimateDiff Lightning
- Added Windows installation support