A practical guide to vLLM-Omni with recipes, examples, and best practices for omni-modality inference and serving.
vLLM-Omni extends vLLM to support omni-modality model inference and serving. vLLM was originally designed for text-based autoregressive generation; vLLM-Omni adds:
- Omni-modality Support: Text, image, video, and audio data processing
- Non-Autoregressive Architectures: Diffusion Transformers (DiT) and other parallel generation models
- Heterogeneous Outputs: From traditional text generation to multimodal outputs
This cookbook provides hands-on recipes to help you leverage these extended capabilities.
| Feature | vLLM | vLLM-Omni |
|---|---|---|
| Modalities | Text | Text, Image, Video, Audio |
| Architectures | Autoregressive (AR) | AR, DiT, and other parallel generation models |
| Outputs | Text | Multimodal outputs |
| Use Cases | LLM serving | Omni-modality AI |
- OS: Linux
- Python: 3.12
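
The two prerequisites above can be checked before installing anything. The following is a minimal sketch using only the Python standard library; the helper name `check_prerequisites` is hypothetical, and treating exactly Python 3.12 as required is an assumption based on the list above, since other versions are not mentioned here.

```python
import platform
import sys


def check_prerequisites(system: str, version: tuple) -> list:
    """Return human-readable problems; an empty list means the host matches.

    Hypothetical helper: it only mirrors the two prerequisites listed above
    (Linux, Python 3.12) and is not part of vLLM-Omni.
    """
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS: {system} (expected Linux)")
    # Assumption: only the (major, minor) pair matters, and it must be 3.12.
    if tuple(version[:2]) != (3, 12):
        problems.append(
            f"unsupported Python: {version[0]}.{version[1]} (expected 3.12)"
        )
    return problems


if __name__ == "__main__":
    found = check_prerequisites(platform.system(), sys.version_info[:2])
    for problem in found:
        print(problem)
    if not found:
        print("prerequisites look OK")
```

Passing the OS name and version in as arguments, rather than reading them inside the function, keeps the check easy to exercise on any machine.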
```bash
# Create virtual environment with uv
uv venv --python 3.12 --seed
source .venv/bin/activate

# On CUDA
uv pip install vllm==0.15.0 --torch-backend=auto

# On ROCm
uv pip install vllm==0.15.0 --extra-index-url https://wheels.vllm.ai/rocm/0.15.0/rocm700

# Install vLLM-Omni from source
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .
```

```python
from vllm_omni.entrypoints.omni import Omni

# Load the model
omni = Omni(model="Tongyi-MAI/Z-Image-Turbo")

# Generate from a text prompt
prompt = "a cup of coffee on the table"
outputs = omni.generate(prompt)

# Save the first generated image
images = outputs[0].request_output[0].images
images[0].save("coffee.png")
```

```bash
# Start the server
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091
```

```bash
# Make a request
curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "a cup of coffee on the table"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 50,
      "guidance_scale": 4.0
    }
  }'
```

| Category | Description | Status |
|---|---|---|
| 00 - Quickstart | Get started with omni-modality inference | ✅ Available |
| 01 - Inference | Text, vision, audio generation & streaming | 🚧 Planned |
| 02 - Deployment | Production serving for omni-modality models | 🚧 Planned |
| 03 - Multimodal | Cross-modal applications and workflows | 🚧 Planned |
| 04 - DiT Models | Diffusion Transformers and parallel generation | 🚧 Planned |
| 05 - Best Practices | Security, monitoring, error handling | 🚧 Planned |
| 06 - Performance | Benchmarking and optimization strategies | 🚧 Planned |
| 07 - Troubleshooting | Common issues and solutions | 🚧 Planned |
See topics/index.md for a detailed table of contents.
We welcome contributions! Please see CONTRIBUTING.md for guidelines on how to add new recipes, report issues, or improve existing content.