inference: add CLI overrides for fps/total_pixels and vLLM memory knobs to prevent CUDA OOM in video runs #83
Conversation
Thanks! This is excellent information. This script was meant to be a starting example that you can copy and modify. We are looking into adding full-fledged scripts (e.g. offline batch inference, online server). For those, we will expose all config settings in the CLI via …
Thanks @spectralflight for the review and the clear guidance! I’ve updated the PR accordingly:
These overrides should still make it easier for users running on smaller GPUs to tune video workloads without modifying repo configs, while keeping Cosmos-Reason1’s defaults intact. Happy to adjust anything else if needed! Best Regards
Awesome, this will be very useful! Looks like the PR is failing linting. Could you please run (requires `just`): …
@spectralflight Thanks for your suggestion. The pre-commit checks were failing because …
@spectralflight @jingyijin2 Do you have any other models? I am now working on the Cosmos-Cookbook recipe. It would be great to merge this into the main repo to provide clear instructions for developers. Thanks and Best Regards.
Context
Running Cosmos-Reason1 VL on 24 GiB GPUs with video inputs can OOM during vLLM's profile run
inside Qwen2.5-VL's visual tower (see trace in qwen2_5_vl.py → _process_video_input → self.visual).
This happens before generation starts, because allocations scale with #frames × frame area × hidden size.
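As a rough, illustrative back-of-envelope estimate of that scaling (the patch size, temporal grouping, and hidden size below are assumptions based on the public Qwen2.5-VL ViT configuration, not values taken from this repo):

```python
# Rough activation-size estimate for the visual tower during the profile run.
# Assumptions (illustrative only): patch size 14, frames grouped in temporal
# pairs, ViT hidden size 1280, fp16 activations, 1280x720 frames, 2 fps, 60 s clip.
frames = 2 * 60                                        # sampled frames = fps * duration
h, w = 720, 1280                                       # frame resolution after resizing
patch = 14
tokens = (frames // 2) * (h // patch) * (w // patch)   # temporal pairs of frames
hidden = 1280
bytes_per_tensor = tokens * hidden * 2                 # fp16 = 2 bytes per element
print(f"{tokens:,} vision tokens ≈ {bytes_per_tensor / 2**30:.2f} GiB per activation tensor")
# Attention/MLP intermediates multiply this several times per layer, which is
# why lowering fps or total_pixels (fewer/smaller frames) reduces peak memory.
```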
What this PR changes (code)
The new CLI overrides let you adjust video sampling (fps, total_pixels) and leave headroom for the vision tower (gpu_memory_utilization, max_model_len) without touching repo configs.
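A minimal sketch of how such overrides can be wired, assuming an argparse-based entry point; the flag names, placeholder defaults, and checkpoint id are illustrative, not necessarily what this PR uses:

```python
import argparse

from vllm import LLM

parser = argparse.ArgumentParser()
# Video sampling knobs; None means "keep the value from the repo config".
parser.add_argument("--fps", type=float, default=None)
parser.add_argument("--total-pixels", type=int, default=None)
# vLLM memory knobs to leave headroom for the vision tower's profile run.
parser.add_argument("--gpu-memory-utilization", type=float, default=0.9)
parser.add_argument("--max-model-len", type=int, default=None)
args = parser.parse_args()

# Merge CLI values over the config defaults only when explicitly provided.
video_cfg = {"fps": 4, "total_pixels": 6_000_000}  # placeholder defaults
if args.fps is not None:
    video_cfg["fps"] = args.fps
if args.total_pixels is not None:
    video_cfg["total_pixels"] = args.total_pixels

llm = LLM(
    model="nvidia/Cosmos-Reason1-7B",  # illustrative checkpoint id
    gpu_memory_utilization=args.gpu_memory_utilization,
    max_model_len=args.max_model_len,
)
```

Defaulting the sampling flags to None keeps the repo config authoritative unless a value is explicitly passed on the command line.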
Not included (operational guidance only; no code/config committed)
fps: 1
total_pixels: 2_000_000
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
env=os.environ passed to the subprocess call that launches the script
How to test
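For example, one way to exercise the new flags together with the operational settings above; the script path, flag names, and values here are hypothetical:

```python
import os
import subprocess

# Allocator setting from the guidance above; inherited via env=os.environ.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

subprocess.run(
    [
        "python", "examples/video_inference.py",  # hypothetical script path
        "--fps", "1",
        "--total-pixels", "2000000",
        "--gpu-memory-utilization", "0.85",
        "--max-model-len", "8192",
    ],
    env=os.environ,
    check=True,
)
```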
Expected:
Model loads without OOM.
vLLM’s profile run succeeds.
Generation returns an answer and optional reasoning.
Rationale
The OOM occurs when the vision tower processes the sampled video frames. Providing lightweight knobs
(fps/total_pixels) plus vLLM headroom (gpu_memory_utilization/max_model_len) makes the pipeline usable
on common 24 GiB GPUs. Sometimes it is also necessary to reduce
fps and total_pixels in the configuration files.
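As a minimal sketch of where these sampling knobs plug in, assuming the qwen-vl-utils / Qwen2.5-VL-style video message format (the path and values are illustrative):

```python
from qwen_vl_utils import process_vision_info

messages = [{
    "role": "user",
    "content": [
        {
            "type": "video",
            "video": "file:///path/to/clip.mp4",
            "fps": 1.0,                 # fewer sampled frames
            "total_pixels": 2_000_000,  # caps total visual area across frames
        },
        {"type": "text", "text": "Describe what happens in this video."},
    ],
}]

# Frames are sampled and resized according to fps/total_pixels before they
# ever reach the vision tower, which is where the OOM was observed.
image_inputs, video_inputs = process_vision_info(messages)
```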