Releases: intel/llm-scaler

llm-scaler-vllm beta release 0.14.0-b8.1

12 Mar 01:49
93507d2

What’s new

  • vLLM:

    • Added support for Qwen3.5-27B, Qwen3.5-35B-A3B and Qwen3.5-122B-A10B (FP8/INT4 online quantization, GPTQ)
    • Added support for Qwen3-ASR-1.7B
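As a rough sketch of the FP8 online quantization path above (the flag names follow upstream vLLM; the model ID and TP degree are illustrative assumptions, not tested configurations):

```shell
# Hedged example: serve a Qwen3.5 MoE model with FP8 online quantization.
# Model name and tensor-parallel degree are placeholders.
vllm serve Qwen/Qwen3.5-35B-A3B \
    --quantization fp8 \
    --tensor-parallel-size 4
```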

llm-scaler-omni beta release 0.1.0-b6

10 Mar 03:03
4e36eb3

Pre-release

What’s new

  • ComfyUI (base commit 3dd10a59c00248d00f0cb0ab794ff1bb9fb00a5f, v0.15.1)

    • Added CacheDiT and torch.compile() acceleration support
    • Added support for new models and workflows: SeedVR2, FlashVSR, Anima, FireRed-Image-Edit-1.1
    • Introduced initial SYCL acceleration for ComfyUI-GGUF via omni_xpu_kernel (Q4_K, Q4_0, and Q8_0 optimized)
    • Bug fixes:
      • Fixed the IndexTTS-2 and Wan2.2-I2V-14B multi-XPU workflows
  • SGLD (base commit 0e53cee1f69beba0b1ae598e6fd5646fc52afb32)

    • Code cleanup and refactoring
    • Added FP8 precision support on XPU

llm-scaler-vllm beta release 0.14.0-b8

02 Mar 07:12
4a65586

Pre-release

Ingredients

Ingredient            Version
Host OS               Ubuntu 24.04 Desktop/Server
vLLM                  0.14.0
PyTorch               2.10
oneAPI                2025.3.2
oneCCL                15.7.8
UMD Driver            25.40.36300.8
KMD Driver            6.17.0-1007.7
GuC Firmware          70.55.3
XPU Manager           1.3.5
Offline Installer     26.13.7.1

What’s new

  • vLLM:

    • Upgrades: vLLM to 0.14.0, PyTorch to 2.10, oneAPI to 2025.3.2 (hotfix) with LTS support on the UR adapter v2, and oneCCL to 2021.15.7.8.
    • INT4 oneDNN optimizations, yielding up to 25% higher throughput than the previous release.
    • Added support for Qwen3-VL-Reranker-2B/8B
    • Added support for Qwen3-VL-Embedding-2B/8B
    • Added support for GLM-4.7-Flash
    • Added support for Ministral models
    • Added support for DeepSeek-OCR-2
    • Added support for Qwen3-Coder-Next
    • Fixed an InternVL issue
    • Miscellaneous bug fixes
  • Offline Installer:

    • Added Ubuntu 24.04 support
    • Added a script to configure ACS on PCIe switch downstream ports
    • Added a script for diagnosing P2P issues
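The bundled ACS script itself is not shown here. As a general illustration only (not the shipped script), ACS on a downstream port can be located and its control bits cleared for P2P with pciutils; the BDF below is a placeholder:

```shell
# Hedged illustration, not the installer's script.
# Find ports exposing an ACS capability, then clear the ACS control word
# on one downstream port (the BDF 0000:16:01.0 is a placeholder).
sudo lspci -vvv | grep -i "Access Control Services"
sudo setpci -s 0000:16:01.0 ECAP_ACS+0x6.w=0000
```

Clearing ACS control bits allows P2P traffic to bypass the root complex; the change does not persist across reboots.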

llm-scaler-vllm PV release 1.3

30 Jan 06:59
6a86b24

What’s new

  • vLLM:

    • Upgrades: vLLM to 0.11.1, PyTorch to 2.9, oneAPI to 2025.2.2 (hotfix), and oneCCL to 2021.15.7.6.
    • 8 new models supported: Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking, InternVL3.5-30B-A3B, DeepSeek-OCR, PaddleOCR-VL, Seed-OSS-36B-Instruct, Qwen3-30B-A3B-Instruct-2507, and openai/whisper-large-v3.
    • Key bug fixes for timeout/accuracy issues found in long-running stress tests, for communication accuracy issues in long-run scenarios, and for a sub-communicator hang on the oneCCL side.
    • vLLM 0.11.1 new features: CPU KV cache offload, two additional speculative decoding methods (Medusa, suffix), experimental FP8 KV cache, and expert parallelism in TP+EP and DP+EP configurations.
    • Supported sym_int4 for Qwen3-30B-A3B on TP 4/8.
    • Supported sym_int4 for Qwen3-235B-A22B on TP 16.
    • Added support for the PaddleOCR model.
    • Added support for GLM-4.6v-Flash.
    • Fixed crash errors with the 2DP + 4TP configuration.
    • Fixed abnormal output observed during JMeter stress testing.
    • Fixed UR_ERROR_DEVICE_LOST errors triggered by frequent preemption under high load.
    • Fixed output errors for InternVL-38B.
    • Refined profile_run logic to provide more GPU blocks by default.
    • Miscellaneous bug fixes.
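One of the 0.11.1 features above, the experimental FP8 KV cache, is opt-in at serve time. A minimal sketch (flag names per upstream vLLM; the model ID and TP degree are illustrative placeholders):

```shell
# Hedged example: enable the experimental FP8 KV cache.
# Model and parallelism values are placeholders, not validated settings.
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 4
```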

llm-scaler-vllm beta release 0.11.1-b7

19 Jan 07:01
e09993b

Pre-release

What’s new

  • vLLM:

    • Upgrades: vLLM to 0.11.1, PyTorch to 2.9, oneAPI to 2025.2.2 (hotfix), and oneCCL to 2021.15.7.6.
    • 8 new models supported: Qwen3-Next-80B-A3B-Instruct, Qwen3-Next-80B-A3B-Thinking, InternVL3.5-30B-A3B, DeepSeek-OCR, PaddleOCR-VL, Seed-OSS-36B-Instruct, Qwen3-30B-A3B-Instruct-2507, and openai/whisper-large-v3.
    • Key bug fixes for timeout/accuracy issues found in long-running stress tests, for communication accuracy issues in long-run scenarios, and for a sub-communicator hang on the oneCCL side.
    • vLLM 0.11.1 new features: CPU KV cache offload, two additional speculative decoding methods (Medusa, suffix), experimental FP8 KV cache, and expert parallelism in TP+EP and DP+EP configurations.
    • Supported sym_int4 for Qwen3-30B-A3B on TP 4/8.
    • Supported sym_int4 for Qwen3-235B-A22B on TP 16.
    • Added support for the PaddleOCR model.
    • Added support for GLM-4.6v-Flash.
    • Fixed crash errors with the 2DP + 4TP configuration.
    • Fixed abnormal output observed during JMeter stress testing.
    • Fixed UR_ERROR_DEVICE_LOST errors triggered by frequent preemption under high load.
    • Fixed output errors for InternVL-38B.
    • Refined profile_run logic to provide more GPU blocks by default.
    • Miscellaneous bug fixes.

llm-scaler-omni beta release 0.1.0-b5

19 Jan 01:43
e09993b

Pre-release

What’s new

  • Core Upgrades
    • Upgraded to Python 3.12 and PyTorch 2.9 for improved performance and compatibility
  • ComfyUI
    • Fixed a stochastic rounding issue on XPU, resolving the LoRA black-screen output problem. LoRA workflows are now supported (e.g., Z-Image-Turbo, Qwen-Image, Qwen-Image-Edit).
    • Added support for new models and workflows: Qwen-Image-Layered, Qwen-Image-Edit-2511, Qwen-Image-2512, HY-Motion, and more.
    • Added support for ComfyUI-GGUF, enabling GGUF models (e.g., FLUX.2-dev Q4_0) with reduced VRAM usage.
    • Fixed image format issue in the Hunyuan3D-2.1 workflow.
    • Refined documentation for improved clarity.
    • Added LTX2 support on XPU.
    • Updated Windows environment setup script.
  • SGLang Diffusion
    • Added support for CacheDiT.
    • Added Tensor Parallelism (TP) support for selected models with better performance (e.g., Z-Image-Turbo).
    • Added SGLD ComfyUI custom node support, allowing SGLang Diffusion to serve as a backend for ComfyUI image generation workflows.
  • Standalone Examples
    • Added support for HY-WorldPlay.
    • Added audio models to the standalone examples.

llm-scaler-vllm PV release 1.2

11 Dec 08:24
426cf68

Ingredients

Ingredient            Version
Host OS               Ubuntu 25.04 Desktop/Server
vLLM                  0.10.2
PyTorch               2.8.0
oneAPI                2025.1.3-7
oneCCL                15.6.2
UMD Driver            25.40.35563.7
KMD Driver            6.14.0-1008-intel
GuC Firmware          70.45.2
XPU Manager           1.3.3
Offline Installer     25.45.5.4

What’s new

  • vLLM:
    • MoE-INT4 support for Qwen3-30B-A3B
    • bpe-qwen tokenizer support
    • Enabled Qwen3-VL dense/MoE models
    • Enabled Qwen3-Omni models
    • MinerU 2.5 support
    • Enabled Whisper transcription models
    • Fixed MiniCPM-V 4.5 OOM issue and output error
    • Enabled ERNIE-4.5-VL models
    • Enabled Glyph-based GLM-4.1V-9B-Base
    • Attention kernel optimizations for the decoding phase across all workloads (>10% end-to-end throughput on 10+ models across all input/output sequence lengths)
    • gpt-oss 20B and 120B support in MXFP4 with optimized performance
    • MoE model optimizations: Qwen3-30B-A3B 2.6x end-to-end output throughput improvement; DeepSeek-V2-Lite 1.5x improvement
    • New models: added 8 multi-modality models with image/video support
    • vLLM 0.10.2 new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
    • FP16/BF16 GEMM optimizations for batch sizes 1-128, with notable gains at small batch sizes
    • Bug fixes

Known issues

  • Crash during initialization with the 2DP x 4TP configuration.
    • Status: scheduled to be fixed in release b7.
  • Abnormal output (excessive "!!!") observed during JMeter stress testing.
    • Status: scheduled to be fixed in release b7.
  • UR_ERROR_DEVICE_LOST occurs due to excessive preemption under high load.
    • Description: requests exceeding server capacity trigger frequent preemption, eventually leading to device loss.
    • Workaround: temporarily mitigate by increasing the number of GPU blocks (set a higher gpu_memory_utilization) or adjusting the --max-num-seqs parameter.
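That workaround can be expressed as launch flags; a hedged sketch (the model name and values are illustrative placeholders, not tuned settings):

```shell
# Give the scheduler more KV blocks and cap in-flight sequences to
# reduce preemption pressure; both values are illustrative.
vllm serve openai/gpt-oss-120b \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 64
```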
  • An abnormal gpu_blocks_num value causes performance degradation when running large batches with gpt-oss-120b.
    • Description: the vLLM profile_run() logic decreases the KV cache's gpu_blocks_num, leading to a performance drop.
    • Workaround (hotfix): temporarily update the determine_available_memory() logic to allow a larger, more efficient KV cache.
file: /usr/local/lib/python3.12/dist-packages/vllm-0.10.3.dev0+g01efc7ef7.d20251125.xpu-py3.12-linux-x86_64.egg/vllm/v1/worker/xpu_worker.py
line: 83
# Change the entire function to:
```python
    @torch.inference_mode()
    def determine_available_memory(self) -> int:
        """Profiles the peak memory usage of the model to determine how many
        KV blocks may be allocated without OOMs.

        The engine first profiles the existing memory usage, then calculates
        the maximum possible number of GPU and CPU blocks that can be
        allocated with the remaining free memory.

        .. tip::
            You may limit GPU memory usage by adjusting the
            `gpu_memory_utilization` parameter.
        """
        torch.xpu.empty_cache()
        torch.xpu.synchronize()
        torch.xpu.reset_peak_memory_stats()

        self.model_runner.profile_run()
        torch.xpu.synchronize()

        stats = torch.xpu.memory_stats()
        peak_allocated = stats.get("allocated_bytes.all.peak", 0)

        # Diagnostic values; useful when logging the memory breakdown.
        current_reserved = torch.xpu.memory_reserved()
        fragmentation_gb = (current_reserved - peak_allocated) / 1024**3
        peak_gb = peak_allocated / 1024**3
        reserved_gb = current_reserved / 1024**3
        model_memory_gb = self.model_runner.model_memory_usage / 1024**3

        # Size the KV cache from peak *allocated* bytes rather than reserved
        # bytes, leaving more room for GPU blocks.
        total_gpu_memory = torch.xpu.get_device_properties(
            self.local_rank).total_memory
        available_kv_cache_memory = (
            total_gpu_memory * self.cache_config.gpu_memory_utilization -
            peak_allocated)

        return int(available_kv_cache_memory)
```
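The sizing rule in the hotfix reduces to simple arithmetic: the KV cache budget is the utilization-scaled total device memory minus the peak allocated bytes. A standalone sketch with illustrative (not measured) numbers; the ~2 MiB per-block figure is an assumption for the sake of the example:

```python
def available_kv_cache_memory(total_gpu_memory: int,
                              gpu_memory_utilization: float,
                              peak_allocated: int) -> int:
    """Mirror of the hotfix sizing rule: budget = util * total - peak."""
    return int(total_gpu_memory * gpu_memory_utilization - peak_allocated)

GiB = 1024**3
# Illustrative numbers: a 48 GiB device, 90% utilization, 20 GiB peak.
budget = available_kv_cache_memory(48 * GiB, 0.9, 20 * GiB)
# With an assumed ~2 MiB per KV block, the block count is a simple division.
num_blocks = budget // (2 * 1024**2)
```

A larger `budget` directly translates into more GPU blocks, which is why sizing from peak allocated (rather than reserved) bytes helps the gpt-oss-120b large-batch case.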

llm-scaler-omni beta release 0.1.0-b4

10 Dec 01:15
f0019a1

Pre-release

What’s new

  • omni:

    • Added SGLang Diffusion support, with a ~10% performance improvement over ComfyUI in single-card scenarios
    • Added ComfyUI workflows for Hunyuan-Video-1.5 (T2V, I2V, Multi-B60), Z-Image

llm-scaler-vllm beta release 0.10.2-b6

26 Nov 07:22

Pre-release

What’s new

  • vLLM:

    • MoE-INT4 support for Qwen3-30B-A3B
    • bpe-qwen tokenizer support
    • Enabled Qwen3-VL dense/MoE models
    • Enabled Qwen3-Omni models
    • MinerU 2.5 support
    • Enabled Whisper transcription models
    • Fixed MiniCPM-V 4.5 OOM issue and output error
    • Enabled ERNIE-4.5-VL models
    • Enabled Glyph-based GLM-4.1V-9B-Base
    • Attention kernel optimizations for the decoding phase across all workloads (>10% end-to-end throughput on 10+ models across all input/output sequence lengths)
    • gpt-oss 20B and 120B support in MXFP4 with optimized performance
    • MoE model optimizations: Qwen3-30B-A3B 2.6x end-to-end output throughput improvement; DeepSeek-V2-Lite 1.5x improvement
    • New models: added 8 multi-modality models with image/video support
    • vLLM 0.10.2 new features: P/D disaggregation (experimental), tool calling, reasoning output, and structured output
    • FP16/BF16 GEMM optimizations for batch sizes 1-128, with notable gains at small batch sizes
    • Bug fixes

llm-scaler-omni beta release 0.1.0-b3

19 Nov 02:47
5696df8

Pre-release

What’s new

  • omni:

    • More workflow support:
      • Hunyuan3D 2.1
      • ControlNet on SD3.5, FLUX.1, etc.
      • Multi-XPU support for Wan 2.2 I2V 14B rapid AIO
      • AnimateDiff Lightning
    • Added Windows installation support