inference: add CLI overrides for fps/total_pixels and vLLM memory knobs to prevent CUDA OOM in video runs #83
Conversation
Thanks! This is excellent information. This script was meant to be a starting example that you can copy and modify. We are looking into adding full-fledged scripts (e.g. offline batch inference, online server). For those, we will expose all config settings in the CLI via …
Thanks @spectralflight for the review and the clear guidance! I’ve updated the PR accordingly:
These overrides should still make it easier for users running on smaller GPUs to tune video workloads without modifying repo configs, while keeping Cosmos-Reason1’s defaults intact. Happy to adjust anything else if needed! Best Regards
Awesome, this will be very useful! Looks like the PR is failing linting. Could you please run (requires `just`): …
@spectralflight Thanks for your suggestion. The pre-commit checks were failing because …
@spectralflight @jingyijin2 Do you have any other models? I am now working on the Cosmos-Cookbook recipe. It would be great to merge this into the main repo to provide clear instructions for developers. Thanks and Best Regards.
Context
Running Cosmos-Reason1 VL on 24 GiB GPUs with video inputs can OOM during vLLM's profile run
inside Qwen2.5-VL's visual tower (see trace in qwen2_5_vl.py → _process_video_input → self.visual).
This happens before generation starts, because allocations scale with #frames × frame area × hidden size.
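As a rough, illustrative back-of-envelope estimate of that scaling (the patch size, temporal grouping, and hidden size below are assumptions based on the public Qwen2.5-VL ViT configuration, not values taken from this repo):

```python
# Rough activation-size estimate for the visual tower during the profile run.
# Assumptions (illustrative only): patch size 14, frames grouped in temporal
# pairs, ViT hidden size 1280, fp16 activations, 1280x720 frames, 2 fps, 60 s clip.
frames = 2 * 60                                        # sampled frames = fps * duration
h, w = 720, 1280                                       # frame resolution after resizing
patch = 14
tokens = (frames // 2) * (h // patch) * (w // patch)   # temporal pairs of frames
hidden = 1280
bytes_per_tensor = tokens * hidden * 2                 # fp16 = 2 bytes per element
print(f"{tokens:,} vision tokens ≈ {bytes_per_tensor / 2**30:.2f} GiB per activation tensor")
# Attention/MLP intermediates multiply this several times per layer, which is
# why lowering fps or total_pixels (fewer/smaller frames) reduces peak memory.
```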
What this PR changes (code)
The new CLI overrides let you adjust video sampling (fps, total_pixels) and leave headroom for the vision tower (gpu_memory_utilization, max_model_len) without touching repo configs.
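A minimal sketch of how such overrides can be wired, assuming an argparse-based entry point; the flag names, placeholder defaults, and checkpoint id are illustrative, not necessarily what this PR uses:

```python
import argparse

from vllm import LLM

parser = argparse.ArgumentParser()
# Video sampling knobs; None means "keep the value from the repo config".
parser.add_argument("--fps", type=float, default=None)
parser.add_argument("--total-pixels", type=int, default=None)
# vLLM memory knobs to leave headroom for the vision tower's profile run.
parser.add_argument("--gpu-memory-utilization", type=float, default=0.9)
parser.add_argument("--max-model-len", type=int, default=None)
args = parser.parse_args()

# Merge CLI values over the config defaults only when explicitly provided.
video_cfg = {"fps": 4, "total_pixels": 6_000_000}  # placeholder defaults
if args.fps is not None:
    video_cfg["fps"] = args.fps
if args.total_pixels is not None:
    video_cfg["total_pixels"] = args.total_pixels

llm = LLM(
    model="nvidia/Cosmos-Reason1-7B",  # illustrative checkpoint id
    gpu_memory_utilization=args.gpu_memory_utilization,
    max_model_len=args.max_model_len,
)
```

Defaulting the sampling flags to None keeps the repo config authoritative unless a value is explicitly passed on the command line.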
Not included (operational guidance only; no code/config committed)
fps: 1
total_pixels: 2_000_000
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
env=os.environ passed to the subprocess call that launches the script
How to test
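For example, one way to exercise the new flags together with the operational settings above; the script path, flag names, and values here are hypothetical:

```python
import os
import subprocess

# Allocator setting from the guidance above; inherited via env=os.environ.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

subprocess.run(
    [
        "python", "examples/video_inference.py",  # hypothetical script path
        "--fps", "1",
        "--total-pixels", "2000000",
        "--gpu-memory-utilization", "0.85",
        "--max-model-len", "8192",
    ],
    env=os.environ,
    check=True,
)
```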
Expected:
Model loads without OOM.
vLLM’s profile run succeeds.
Generation returns an answer and optional reasoning.
Rationale
The OOM occurs when the vision tower processes the sampled video frames. Providing lightweight knobs
(fps/total_pixels) plus vLLM headroom (gpu_memory_utilization/max_model_len) makes the pipeline usable
on common 24 GiB GPUs. Sometimes it is also necessary to reduce
fps and total_pixels in the configuration files.
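As a minimal sketch of where these sampling knobs plug in, assuming the qwen-vl-utils / Qwen2.5-VL-style video message format (the path and values are illustrative):

```python
from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "file:///path/to/clip.mp4",
            "fps": 1.0,                 # fewer sampled frames
            "total_pixels": 2_000_000,  # caps total visual area across frames
        },
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

# Frames are sampled and resized according to fps/total_pixels before they
# ever reach the vision tower, which is where the OOM was observed.
image_inputs, video_inputs = process_vision_info(messages)
```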