Add Teacache Support for LongCat Image #1487
alex-jw-brooks wants to merge 16 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 003d3b9f61
```python
sp_size = module.parallel_config.sequence_parallel_size
get_forward_context().sequence_parallel_size = sp_size

hidden_states = module.x_embedder(hidden_states)
```
Preserve sequence-parallel sharding in LongCat extractor
In the SP case (sequence_parallel_size > 1), this code enables SP in the forward context but does not replicate the required LongCat preprocessing (chunking image hidden_states and RoPE by rank, as done in LongCatImageTransformer2DModel.forward). As a result, SP attention paths run on unsharded layouts, which yields invalid coefficient-collection behavior and can break distributed estimation runs.
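The chunk-by-rank preprocessing this suggestion refers to can be sketched generically. This is a minimal illustration only; the helper name `shard_for_sp` and the `[batch, seq_len, dim]` layout are assumptions for the example, not LongCat's actual code.

```python
import torch

def shard_for_sp(hidden_states: torch.Tensor, rank: int, sp_size: int) -> torch.Tensor:
    """Keep only this SP rank's contiguous slice of the sequence dimension.

    Hypothetical helper mirroring the idea of chunking image hidden_states
    by rank before SP attention; not the PR's actual implementation.
    """
    chunks = torch.chunk(hidden_states, sp_size, dim=1)
    return chunks[rank]

# Example: 2-way SP over an 8-token sequence.
x = torch.randn(1, 8, 4)
shard0 = shard_for_sp(x, rank=0, sp_size=2)
shard1 = shard_for_sp(x, rank=1, sp_size=2)
assert shard0.shape == (1, 4, 4) and shard1.shape == (1, 4, 4)
```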
Useful, but I think there are larger underlying problems with SP for this model at the moment (see #1556). I will investigate the fix for that as well, but I see the same error with and without TeaCache right now, so I'm open to any direction for how to handle it in this PR.
Force-pushed from d35e91d to b0dd147
```python
# Explicitly use inference mode to avoid gradients since we
# are not creating the pipeline through the model runner
with torch.inference_mode():
    self.pipeline.forward(req)
```
A few small fixes were needed in this script to avoid OOMs in my env from gradients, and to handle bf16 since the tensors can't be converted with .numpy() directly.
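The bf16 conversion issue mentioned here can be reproduced in isolation. A minimal sketch, assuming PyTorch, of the upcast-before-`.numpy()` workaround:

```python
import torch

# Tensor.numpy() raises on bfloat16 ("Got unsupported ScalarType BFloat16"),
# so upcast to float32 (and move to CPU) before converting.
t = torch.ones(3, dtype=torch.bfloat16)
arr = t.float().cpu().numpy()
assert arr.dtype.name == "float32"
```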
lishunyang12 left a comment
Left a couple comments. The extractor mostly mirrors the model forward correctly, but the first block runs twice on non-cached steps which seems unintentional.
```python
_, hs = first_block(
    hidden_states=hidden_states,
    encoder_hidden_states=encoder_hidden_states,
    temb=temb,
```
This runs first_block(...) to get the modulated input, but then run_transformer_blocks() below iterates over all module.transformer_blocks (including [0]) again. So block 0 gets executed twice on every non-cached step.
The other extractors (e.g., qwen) avoid this by extracting the modulated input with just the lightweight norm call (block.img_mod(temb) + block.img_norm1(...)) without running the full block forward. Could you do something similar here, or at least start run_transformer_blocks from module.transformer_blocks[1:]?
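The lightweight-extraction pattern described here can be sketched with a toy block. `TinyBlock`, `img_mod`, and `img_norm1` below are illustrative stand-ins mirroring the naming in the comment, not the real LongCat or Qwen modules:

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy stand-in for a transformer block with AdaLN-style modulation."""

    def __init__(self, dim: int):
        super().__init__()
        self.img_mod = nn.Linear(dim, 2 * dim)   # produces shift and scale
        self.img_norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Linear(dim, dim)           # rest of the (expensive) block

    def modulated_input(self, hidden_states, temb):
        # Only the cheap norm + modulation; skips attention / FFN, so the
        # full block forward never runs during coefficient extraction.
        shift, scale = self.img_mod(temb).chunk(2, dim=-1)
        return self.img_norm1(hidden_states) * (1 + scale) + shift

block = TinyBlock(dim=8)
hs = torch.randn(2, 5, 8)
temb = torch.randn(2, 1, 8)
mod = block.modulated_input(hs, temb)
assert mod.shape == hs.shape
```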
Good catch 😬 Thanks! Fixed the modulated input and reran the coefficient calculations
```python
pipeline.to(device)
return pipeline
```
Should this also be wrapped with set_default_torch_dtype(od_config.dtype) like BagelAdapter.load_pipeline was updated to do above?
I had actually added a set_default_torch_dtype around the call to load_pipeline on the adapter, instead of just putting it around that one line 🙂 The better way to do this is:
loader = DiffusersPipelineLoader(LoadConfig(), od_config=od_config)
return loader.load_model(od_config=od_config, load_device=device)
because load_model will handle the device placement, put the model in eval mode, and handle the dtypes from the diffusion config. Updated both to avoid managing default dtypes manually, and made sure the Bagel one still runs.
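For reference, a default-dtype guard like the one discussed can be sketched as a small context manager. This is a minimal reimplementation for illustration; vLLM's actual `set_default_torch_dtype` helper may differ:

```python
import contextlib
import torch

@contextlib.contextmanager
def set_default_torch_dtype(dtype: torch.dtype):
    """Temporarily change the default torch dtype, restoring it on exit."""
    old = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        torch.set_default_dtype(old)

# Tensors created inside the block pick up the requested dtype.
with set_default_torch_dtype(torch.bfloat16):
    x = torch.empty(4)
assert x.dtype == torch.bfloat16
assert torch.empty(1).dtype == torch.float32  # default restored
```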
Hi @alex-jw-brooks 👋 Checking in on the Teacache support for LongCat Image PR; it's been 12 days since the last update. Any progress? Thanks!
Hey @alex-jw-brooks, following up on the open threads from 2 weeks ago. The main concern is still the block 0 double execution and the modulated input extraction. Could you take a look?
Hey @hsliuustc0106 @lishunyang12, haven't forgotten about this PR, just paused it for a bit while fixing the sequence parallelism for this model to avoid copying things over here. I'll get back to it this afternoon and work on the comments, thanks for your patience 🙂
Force-pushed from 08396d7 to 53f39f2
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ |
| **LongCat-Image** | `meituan-longcat/LongCat-Image` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| **LongCat-Image-Edit** | `meituan-longcat/LongCat-Image-Edit` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
The sequence parallel columns were out of date for LongCat; it does support ring attention and Ulysses SP. Tested that TeaCache works with both types of SP as part of this PR.
Hey @lishunyang12 @hsliuustc0106, thanks for the reviews. Took another pass at this and added some additional info, since I hadn't originally tested image edit. Ready for another look when you've got the bandwidth 🙂
Force-pushed from 53f39f2 to 5284249
Thanks for the update. Will re-review this week.
lishunyang12 left a comment
The extractor looks correct now — norm1 is called on the pre-block hidden_states, and all blocks run in run_transformer_blocks(). Previous concern is addressed.
```python
    CacheContext with all information needed for generic caching
    """
    # TODO (Alex) - Refactor TeaCache extractors to more tightly integrate with .forward
    from diffusers.models.modeling_outputs import Transformer2DModelOutput
```
This mutates fwd_context.split_text_embed_in_sp but never restores it. If the forward context is reused across timesteps, this side effect persists. Worth a comment or a reset in postprocess().
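The restore-on-exit idea raised here can be expressed as a generic save/restore guard. This is a sketch of the pattern only, not code from the PR (which instead documents why no reset is needed for this model); the `Ctx` class and attribute usage are illustrative:

```python
import contextlib

@contextlib.contextmanager
def temporarily_set(obj, attr: str, value):
    """Set obj.attr for the duration of the block, then restore the old value."""
    old = getattr(obj, attr)
    setattr(obj, attr, value)
    try:
        yield
    finally:
        setattr(obj, attr, old)

class Ctx:
    """Toy stand-in for a forward context carrying a mutable flag."""
    split_text_embed_in_sp = True

ctx = Ctx()
with temporarily_set(ctx, "split_text_embed_in_sp", False):
    assert ctx.split_text_embed_in_sp is False
assert ctx.split_text_embed_in_sp is True  # side effect does not persist
```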
Sure, I added a comment. For this model we don't need to restore it, because it should never be True, but IMO we should just remove this attribute from the repo for now, since I don't think the True behavior is currently expected or even implemented. Will open a PR to discuss 🙂
Force-pushed from e485ebb to a25f48f
Signed-off-by: Alex Brooks <albrooks@redhat.com>
Force-pushed from adbe997 to 9c0fe26
@wtomin @hsliuustc0106, could you please review this PR? I've refactored a bit to share some of the code for the estimators, since LongCat and the new Stable Audio coefficients were basically the same, but it's ready for a look when you have a moment.
main (see fix).

Example Outputs
For both text to image and image edit, teacache is the left one.
For Image edit (using the coffee image above):
Speed Benchmarks
With a thresh of .2, the speedup is ~1.7x on an H100; didn't benchmark edit, but the speedup looked comparable when I ran a quick check. Here is the full script I used for testing, which can be used for reproduction for tti; it'll report the average speedup of 3 images.
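The average-speedup bookkeeping that the script reports can be sketched deterministically. The per-image timings below are made-up placeholders for illustration, not measured numbers from the PR:

```python
# Compute the average speedup from per-image wall-clock times
# (placeholder values; the real script times actual generations).
baseline_s = [10.2, 10.0, 9.8]   # without TeaCache, seconds per image
teacache_s = [6.0, 5.9, 5.8]     # with TeaCache, seconds per image

speedups = [b / t for b, t in zip(baseline_s, teacache_s)]
avg_speedup = sum(speedups) / len(speedups)
print(f"average speedup over {len(speedups)} images: {avg_speedup:.2f}x")
```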