
Fix OOM in CI by reducing image size of tiny Gemma3 model#5680

Open
albertvillanova wants to merge 1 commit into main from pfix-5207-tiny-gemma3

Conversation

@albertvillanova (Member) commented Apr 29, 2026

Fix OOM in CI by reducing image size of tiny Gemma3 model.

This PR introduces a targeted adjustment for the google/gemma-3-4b-it model in the generate_tiny_models script to address memory usage issues related to image processing.

Partial fix for:

Motivation

The tiny-Gemma3ForConditionalGeneration model was generated with the default SigLIP image size of 896×896, which produces 4,096 patches per image. During training, the vision encoder attention maps have shape [batch, heads, 4096, 4096], consuming ~1 GB per layer. With 2 vision layers and backpropagation, a single Gemma3 test consumes 5–7 GiB of GPU memory. Two such tests running concurrently on a 14.74 GiB GPU caused CUDA out-of-memory errors in all other parallel workers.
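The arithmetic above can be checked back-of-envelope. This is a minimal sketch, assuming the standard SigLIP patch size of 14, 16 attention heads, and fp32 activations (the head and dtype values are illustrative assumptions, not read from the Gemma3 config):

```python
def num_patches(image_size: int, patch_size: int = 14) -> int:
    """Patches per image for a ViT-style encoder (no class token)."""
    side = image_size // patch_size
    return side * side

def attn_map_bytes(patches: int, heads: int = 16, bytes_per_el: int = 4) -> int:
    """Memory for one layer's attention maps of shape [1, heads, patches, patches]."""
    return heads * patches * patches * bytes_per_el

old = num_patches(896)  # default SigLIP resolution -> 4096 patches
new = num_patches(224)  # overridden resolution     -> 256 patches
print(old, new)                        # 4096 256
print(attn_map_bytes(old) / 2**30)     # 1.0 GiB per layer
print(attn_map_bytes(new) / 2**30)     # 0.00390625 GiB (~4 MiB) per layer
```

Under these assumptions the per-layer attention maps shrink from ~1 GiB to ~4 MiB, consistent with the numbers quoted above.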

Solution

Override image_size=224 (256 patches) when generating the tiny Gemma3 model. This is consistent with mm_tokens_per_image=256 in the Gemma3 config: the projector's AvgPool2d gets kernel_size=1 (identity), which is architecturally valid. The processor's image processor size is updated to match so that test inputs are also resized to 224×224.

Changes

Model-specific configuration:

  • For the google/gemma-3-4b-it model, sets vision_config["image_size"] and processor.image_processor.size to 224×224 (instead of the default 896×896). This limits the number of image patches, makes the projector's average-pooling layer act as an identity, and reduces memory consumption during training.
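The override can be sketched as follows. This is a hypothetical simplification: the real change lives in scripts/generate_tiny_models.py, whose structure and helper names may differ.

```python
def apply_tiny_gemma3_override(vision_config: dict,
                               image_processor_size: dict,
                               model_id: str) -> None:
    """Shrink the SigLIP resolution for the tiny Gemma3 test model."""
    if model_id == "google/gemma-3-4b-it":
        vision_config["image_size"] = 224                 # default is 896
        image_processor_size.update(height=224, width=224)

# With patch_size=14 and mm_tokens_per_image=256 (values from the Gemma3
# config), 224 / 14 = 16 patches per side and sqrt(256) = 16 tokens per
# side, so the projector's AvgPool2d kernel is 16 // 16 = 1 (identity).
cfg = {"image_size": 896, "patch_size": 14}
size = {"height": 896, "width": 896}
apply_tiny_gemma3_override(cfg, size, "google/gemma-3-4b-it")
kernel = (cfg["image_size"] // cfg["patch_size"]) // int(256 ** 0.5)
print(cfg["image_size"], size, kernel)  # 224 {'height': 224, 'width': 224} 1
```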

Note

Low Risk
Model-generation script change scoped to a single model ID; it only adjusts test image resolution/config and shouldn’t affect runtime code paths beyond tiny model artifacts.

Overview
Reduces memory usage for the generated tiny Gemma3 vision-language test model by overriding SigLIP image resolution when model_id == "google/gemma-3-4b-it".

scripts/generate_tiny_models.py now sets vision_config["image_size"] = 224 and aligns processor.image_processor.size to 224×224, cutting patch count and preventing CI GPU OOMs during Gemma3 training/tests.

Reviewed by Cursor Bugbot for commit 15c5aff.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@qgallouedec (Member)

Thanks! Can you just confirm with a forward pass + GPU peak-memory measurement for old vs. new?
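One way to run the requested comparison is with PyTorch's CUDA allocator statistics. This is a minimal sketch using a stand-in module rather than the real tiny Gemma3 model (loading it needs network access); the measurement pattern is the same: reset the peak stats, run a forward pass, read the peak.

```python
import torch
import torch.nn as nn

def peak_forward_bytes(model: nn.Module, batch: torch.Tensor) -> int:
    """Peak CUDA memory (bytes) allocated during one forward pass."""
    if torch.cuda.is_available():
        model, batch = model.cuda(), batch.cuda()
        torch.cuda.reset_peak_memory_stats()
        with torch.no_grad():
            model(batch)
        return torch.cuda.max_memory_allocated()
    # CPU fallback: no allocator stats, so report parameter memory instead.
    with torch.no_grad():
        model(batch)
    return sum(p.numel() * p.element_size() for p in model.parameters())

toy = nn.Linear(64, 64)  # stand-in for the old/new tiny Gemma3 models
print(peak_forward_bytes(toy, torch.randn(2, 64)))
```

Running this once with the old 896×896 tiny model and once with the new 224×224 one would give the old-vs-new numbers asked for above.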

@albertvillanova (Member, Author)

Thanks for your sensible suggestion: the difference is small. I'll continue investigating...

@albertvillanova (Member, Author)

I think the intermediate_size should also be reduced: from 4304 to 32.
