vLLM-Omni Cookbook

A practical guide to vLLM-Omni with recipes, examples, and best practices for omni-modality inference and serving.

Overview

vLLM-Omni extends vLLM to support omni-modality model inference and serving. Whereas vLLM was designed for text-based autoregressive generation, vLLM-Omni adds:

Omni-modality Support: Text, image, video, and audio data processing
Non-Autoregressive Architectures: Diffusion Transformers (DiT) and other parallel generation models
Heterogeneous Outputs: From traditional text generation to multimodal outputs

This cookbook provides hands-on recipes for putting these capabilities to use.

Key Differences from vLLM

Feature	vLLM	vLLM-Omni
Modalities	Text	Text, Image, Video, Audio
Architectures	Autoregressive	AR + DiT + Parallel
Outputs	Text	Multimodal outputs
Use Cases	LLM serving	Omni-modality AI

Quick Start

Prerequisites

OS: Linux
Python: 3.12

Installation

```bash
# Create virtual environment with uv
uv venv --python 3.12 --seed
source .venv/bin/activate

# On CUDA
uv pip install vllm==0.15.0 --torch-backend=auto

# On ROCm
uv pip install vllm==0.15.0 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700

# Install vLLM-Omni from source
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
```
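After installation, a quick sanity check can confirm that the interpreter version matches the prerequisites and that both packages can be located. The sketch below is not part of the project: `check_environment` is a hypothetical helper, and it assumes the packages install under the import names `vllm` and `vllm_omni` (the latter matches the import used in the offline inference example).

```python
import importlib.util
import sys


def check_environment(required=("vllm", "vllm_omni"), python=(3, 12)):
    """Return a dict describing whether the prerequisites look satisfied.

    `required` and `python` are illustrative defaults taken from the
    prerequisites above; adjust them to your setup.
    """
    report = {
        "python_ok": sys.version_info[:2] == tuple(python),
        "missing": [],
    }
    for name in required:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            report["missing"].append(name)
    return report


if __name__ == "__main__":
    result = check_environment()
    print("Python version matches:", result["python_ok"])
    print("Missing modules:", ", ".join(result["missing"]) or "none")
```

The check only locates the modules without importing them, so it stays fast and does not touch the GPU.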

Offline Inference Example

```python
from vllm_omni.entrypoints.omni import Omni

omni = Omni(model="Tongyi-MAI/Z-Image-Turbo")
prompt = "a cup of coffee on the table"
outputs = omni.generate(prompt)
images = outputs[0].request_output[0].images
images[0].save("coffee.png")
```
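When a request returns more than one image, it can help to derive file names from the prompt rather than hard-coding them. The following is a minimal sketch under stated assumptions: `slugify` and `save_images` are hypothetical helpers, not vLLM-Omni APIs, and each image is assumed to expose a PIL-style `save(path)` method, as the `images[0].save(...)` call above suggests.

```python
import re
from pathlib import Path


def slugify(prompt, max_len=40):
    """Turn a prompt into a filesystem-friendly name (hypothetical helper)."""
    slug = re.sub(r"[^a-z0-9]+", "-", prompt.lower()).strip("-")
    return slug[:max_len].rstrip("-") or "image"


def save_images(images, prompt, out_dir="outputs"):
    """Save each image as <slug>-<index>.png and return the written paths.

    Assumes every item exposes a PIL-style `save(path)` method.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, image in enumerate(images):
        path = target / f"{slugify(prompt)}-{index}.png"
        image.save(path)
        paths.append(path)
    return paths
```

With the variables from the example above, `save_images(images, prompt)` would write files such as `outputs/a-cup-of-coffee-on-the-table-0.png`.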

Online Serving

```bash
# Start the server
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# Make a request
curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "a cup of coffee on the table"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 50,
      "guidance_scale": 4.0
    }
  }'
```
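The same request can be issued from Python with only the standard library. This is a sketch that mirrors the curl call above, not an official client: `build_request` and `send` are hypothetical helper names, the payload layout (generation options nested under `extra_body`) is copied from the curl example, and the response schema is not covered here, so the code simply returns the parsed JSON.

```python
import json
import urllib.request


def build_request(prompt, base_url="http://localhost:8091", **generation_params):
    """Build a urllib Request mirroring the curl call above."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        # Generation options are nested under "extra_body", as in the curl example.
        "extra_body": generation_params,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `send(build_request("a cup of coffee on the table", height=1024, width=1024, num_inference_steps=50, guidance_scale=4.0))` sends the same payload as the curl command. Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.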

Table of Contents

Category	Description	Status
00 - Quickstart	Get started with omni-modality inference	✅ Available
01 - Inference	Text, vision, audio generation & streaming	🚧 Planned
02 - Deployment	Production serving for omni-modality models	🚧 Planned
03 - Multimodal	Cross-modal applications and workflows	🚧 Planned
04 - DiT Models	Diffusion Transformers and parallel generation	🚧 Planned
05 - Best Practices	Security, monitoring, error handling	🚧 Planned
06 - Performance	Benchmarking and optimization strategies	🚧 Planned
07 - Troubleshooting	Common issues and solutions	🚧 Planned

See topics/index.md for a detailed table of contents.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to add new recipes, report issues, or improve existing content.
