[Feature][Diffusion] Plugin-based sparse attention interface for DiT modules #2231

zzhang-fr wants to merge 4 commits into vllm-project:main from
Conversation
Unified sparse attention framework with plugin-based backend discovery. External packages register via Python entry_points — no vendor backends in vllm-omni. Only install and configure.

Core:
- SparseAttentionBackend ABC, SparsePatternSpec, SparseAttention module
- Plugin registry: entry_points + explicit registration + class path
- DiTSparseAttentionAdapter with sparsity schedules
- DiffusionSparseAttnConfig + CLI (--sparse-attn-backend/topk/schedule)
- DenseFallbackBackend for test-only correctness comparison

Wan 2.2 integration:
- enable_sparse_attention() on WanTransformer3DModel
- Pipeline wiring from OmniDiffusionConfig
- No-op fallback when no backend installed

Contrib plugin packages:
- contrib/flashinfer-vllm-omni/ (FlashInfer BSR block-sparse)
- contrib/sparge-vllm-omni/ (SpargeAttn dynamic top-k)

Benchmark on A100 80GB (Wan 2.2, 480x832, 81f, 40 steps):
- dense: 785s (19.6s/step)
- topk=0.7: 763s (1.03x)
- topk=0.5: 677s (1.16x)
- topk=0.3: 590s (1.33x)

98 tests (97 passed, 1 skipped).

Signed-off-by: Zhen Zhang <zhen.zhang.fr@huawei.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 058884a921
```python
if self.sparse_attn is not None and attn_mask is None:
    hidden_states = self.sparse_attn(query, key, value)
```
Route sparse self-attention through SP/ring communication
This direct call bypasses the existing Attention wrapper, which is where sequence/ring-parallel pre/post communication is performed (strategy.pre_attention/post_attention in attention/layer.py). When sparse attention is enabled in distributed runs (e.g., SP active or ring_degree > 1), attention is computed on local shards only, so outputs can be wrong even though the dense path is correct. Sparse mode should either reuse the same parallel strategy path or be explicitly disabled for those configs.
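One way to act on this suggestion is a guard that only takes the sparse path when no distributed sharding is active. The sketch below uses illustrative names (`should_use_sparse`, the flag/degree parameters); it is not the repository's actual code.

```python
# Hypothetical guard: skip the sparse path whenever sequence-parallel or
# ring attention is active, because the sparse call bypasses the
# strategy.pre_attention/post_attention communication hooks.
def should_use_sparse(sparse_attn, attn_mask, sp_enabled: bool, ring_degree: int) -> bool:
    if sparse_attn is None or attn_mask is not None:
        return False
    if sp_enabled or ring_degree > 1:
        # Computing sparse attention on local shards only would produce
        # wrong outputs; fall back to the dense Attention wrapper, which
        # routes through the parallel strategy.
        return False
    return True
```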
```python
backend_cls = get_sparse_attn_backend(sparse_attn_config)
if backend_cls is None:
    from vllm_omni.diffusion.sparse_attn.backends.dense import DenseFallbackBackend

    backend_cls = DenseFallbackBackend
```
Honor 'none' backend by skipping dense fallback substitution
get_sparse_attn_backend documents/implements 'none' as a disable signal, but this branch converts any None backend into DenseFallbackBackend. Because the pipeline enables sparse attention whenever sparse_attn config exists, --sparse-attn-backend none still replaces Wan self-attention layers instead of leaving the native Attention path untouched. That makes the disable option ineffective and can unexpectedly change runtime behavior.
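This review point could be addressed along the following lines: distinguish "explicitly disabled" from "no backend resolved" before substituting the dense fallback. The sketch uses stand-in names (`resolve_backend`, a `lookup` callable) and is not the repository's actual code.

```python
class DenseFallbackBackend:  # stand-in for the PR's test-only fallback
    pass


def resolve_backend(backend_name: str, lookup):
    """Return None for an explicit 'none' so the caller skips layer
    replacement entirely, leaving the native Attention path untouched."""
    if backend_name == "none":
        return None
    backend_cls = lookup(backend_name)
    # Only substitute the dense fallback when a backend was requested
    # but could not be resolved.
    return backend_cls if backend_cls is not None else DenseFallbackBackend
```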
```python
meta = self._get_metadata(query, key, query.device)
return self._impl.forward(query, key, value, meta)
```
Apply denoising schedule when running SparseAttention
The CLI/config now accept sparse_attn.schedule, but this forward path always uses cached metadata from a fixed _pattern_spec and never performs step-aware updates. No production call site wires DiTSparseAttentionAdapter.begin_step, so schedule choices like conservative/aggressive currently have no effect during denoising. This silently breaks expected experiment behavior for users tuning schedule.
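Wiring the schedule could look roughly like this: a `begin_step()` hook that updates the sparsity ratio per denoising step and invalidates cached metadata when it changes. The schedule shapes, thresholds, and class below are assumptions based on this review comment, not the PR's implementation.

```python
# Hedged sketch of step-aware sparsity scheduling.
SCHEDULES = {
    # (fraction of steps kept at early_ratio, early_ratio, late_ratio)
    "conservative": (0.25, 1.0, 0.7),  # dense early, mildly sparse late
    "aggressive": (0.25, 0.7, 0.3),    # sparse almost throughout
}


class ScheduledSparseAttention:
    def __init__(self, schedule: str, num_steps: int, base_topk: float):
        self._sched = SCHEDULES.get(schedule)
        self._num_steps = num_steps
        self.topk_ratio = base_topk
        self._cached_meta = None  # rebuilt lazily by forward()

    def begin_step(self, step: int) -> None:
        if self._sched is None:
            return  # unknown schedule: keep the fixed base ratio
        frac, early, late = self._sched
        new_ratio = early if step < self._num_steps * frac else late
        if new_ratio != self.topk_ratio:
            self.topk_ratio = new_ratio
            self._cached_meta = None  # force metadata rebuild at new sparsity
```

The production call site would invoke `begin_step(step)` once per denoising iteration so that schedule choices actually take effect.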
```python
import torch

from vllm_omni.diffusion.sparse_attn.base import (
```
Hey @zzhang-fr, thanks for the PR - is there a reason this belongs in Omni and not in vLLM, which already has its own patterns for attention backends?
IMO we should minimize new attention types that are Omni specific, especially if they need changes on a per model basis, since it is hard to maintain, and keeping cross feature compatibility is already challenging
@alex-jw-brooks Thanks — I'll address both points together in my reply to your second comment below.
```toml
[build-system]
requires = ["setuptools>=64"]
build-backend = "setuptools.build_meta"
```
Similarly, I don't think it will be easy to maintain a contrib package with additional plugins; part of the reason to have plugins is that they are externally maintained to begin with, so in many ways, imo adding it directly here and integrating it into models defeats the purpose of it being a plugin.
Another way to approach enabling this could be to contribute to the plugin architecture in vLLM to allow this type of thing to be registered with no code changes in attention layers for individual models, so that it's usable if people choose to install it? cc @tzhouam @hsliuustc0106 @Gaohan123 in case any of you have thoughts
@alex-jw-brooks Thanks for the review and the thoughtful questions.
Why not vLLM's existing attention backend
DiT attention doesn't go through vLLM's attention stack at all — it runs entirely inside the
diffusion pipeline, which has no paged KV cache, no causal masking, and no chunked prefill.
Beyond the interface mismatch, sparse attention for diffusion has a lifecycle that has no
parallel in the LLM path: sparsity ratio adapts per denoising step, patterns are derived from
the H×W spatial layout of patches, and the same model may want dense attention in early steps
and aggressive sparsity later. None of this maps onto AttentionMetadata. This is documented
in the "Alternatives considered" section of RFC #2233 if you'd like the fuller reasoning.
Where you're right
Two of your points I agree with completely and will fix before merge:
The enable_sparse_attention() call in wan2_2_transformer.py is a per-model hook, which is
exactly the maintenance surface you're worried about. The better path is to wire this through
the existing pipeline hook system the same way cache_backend is applied today — a single call
in GPUWorker.load_model(), no model-specific code required.
The contrib/ directory should not be in this repo. It was included here to make the protocol
easier to review — having a concrete backend alongside the ABC lets reviewers validate that
the interface is actually usable. But they should live in separate external repos
(flashinfer-vllm-omni, sparge-vllm-omni) that install via entry_points, which is exactly
the plugin model this PR is designed around. I'll drop contrib/ from this PR.
Invitation to the RFC
The broader design question you're raising — where this belongs, whether contrib/ makes sense,
and how models should opt in — is exactly what RFC #2233 is meant to discuss. The Feedback
Period is open for two weeks and the first question listed is specifically about contrib/ and
separate repos. I'd really value your input there, especially given your concern about
cross-feature compatibility. cc @tzhouam @hsliuustc0106 @Gaohan123
I think you should wait for feedback from maintainers to discuss the value and maintainability of this feature before opening a PR. This is common practice when the attn backend is being altered. Otherwise, it is quite hard to be accepted.
I don't think we should introduce contrib/ directory and we have implemented the registration for Attention backend. This PR is introducing one independent registry for sparse attention. And it's trying to make sparse default when users have installed it. But in most of scenarios, sparse attention could lead to accuracy regression. So it's better to make it optional instead of default.
Could we make SparseAttentionBackend inherit AttentionBackend?
```python
# Sparse attention args
parser.add_argument(
    "--sparse-attn-backend",
```
`vllm-omni serve Wan2.2 --sparse-attn-backend spargeattn --sparse-attn-topk 0.5`

I don't see the necessity of adding `--sparse-attn-backend`; the previous design `vllm-omni serve Wan2.2 --attn-backend spargeattn` seems cleaner to me.
@SamitHuang I've addressed this in my reply to @gcanlin's comment.
The short version: removing --sparse-attn-backend and routing through --attn-backend is
cleaner, but introduces a cross-attention scoping problem that needs to be
resolved first. Details in that thread.
@gcanlin Thanks for the detailed review — these are all valid points and I want to engage with them carefully.

On inheriting AttentionBackend

This is architecturally sound and I've prototyped it. The inheritance chain works: `SparseAttentionBackend(AttentionBackend)`, `VIDEO_SPARSE_ATTN` becomes one line in `DiffusionAttentionBackendEnum`, and the independent sparse registry can be dropped.

However, inheriting introduces a new problem the current design avoids: `--attn-backend` is a global setting, and `get_attn_backend()` returns the same backend class for every attention layer. Cross-attention (Q from patch tokens, KV from text encoder) must remain fully dense. The current per-model `enable_sparse_attention()` hook exists precisely because it scopes sparsity to self-attention layers only.

The complexity is equivalent; it just moves from one place to another:

- Current PR: one `enable_sparse_attention()` call per model (gcanlin's concern)
- My proposal: per-layer backend selection (e.g. a `module_type` tag) inside the shared attention path

I'm happy to go either direction, but I'd like to resolve this in RFC #2233 before merging.
On "sparse should not be default"

Completely agreed. The `_auto_select()` behavior in `registry.py` is wrong — sparse attention should be strictly opt-in, never enabled just because a plugin happens to be installed.

On contrib/

Will remove from this PR. Reference implementations will move to separate repos.

Happy to hear from @tzhouam @hsliuustc0106 @SamitHuang on the module_type question.
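The per-layer scoping trade-off discussed in this thread could look roughly like the following. Every name here (`select_attn_backend`, the `module_type` values) is hypothetical; it only illustrates why a single global backend setting is insufficient.

```python
# If sparse backends were selected through a global --attn-backend, each
# attention layer would need to declare its role so cross-attention can
# stay dense while self-attention over video patches may go sparse.
def select_attn_backend(module_type: str, dense_cls: type, sparse_cls):
    if module_type == "cross" or sparse_cls is None:
        # Cross-attention (Q from patch tokens, KV from text encoder)
        # must remain fully dense.
        return dense_cls
    return sparse_cls  # self-attention layers may use the sparse backend
```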
Purpose
Build plugin-based sparse attention interface for DiT modules.
Resolves #1568 (attention backend extension discussion).
Design
Unified sparse attention framework with plugin-based backend discovery.
vllm-omni owns only the protocol layer and registry. All kernel implementations live in external pip-installable packages that register via Python `entry_points` — zero vendor backends in vllm-omni core.
Core (merged into vllm-omni)
- `SparseAttentionBackend` ABC, `SparsePatternSpec`, `SparseAttention` module
- Plugin registry: `entry_points` + explicit registration + class path
- `DiTSparseAttentionAdapter` with sparsity schedules
- `DiffusionSparseAttnConfig` + CLI (`--sparse-attn-backend` / `--sparse-attn-topk` / `--sparse-attn-schedule`)
- `DenseFallbackBackend` for correctness comparison in tests

Wan 2.2 integration (merged into vllm-omni)
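To make the protocol layer concrete, here is a minimal sketch of what the ABC and pattern spec might look like. The class names come from the PR, but every field and method signature below is an assumption for illustration, not the merged interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class SparsePatternSpec:
    # Assumed fields; the PR derives patterns from the HxW patch layout.
    topk_ratio: float = 1.0
    block_size: int = 64


class SparseAttentionBackend(ABC):
    """Protocol that every plugin backend implements (signatures assumed)."""

    name: str = "base"

    @abstractmethod
    def build_metadata(self, spec: SparsePatternSpec, seq_len: int):
        """Precompute backend-specific sparse metadata for a sequence."""

    @abstractmethod
    def forward(self, query, key, value, metadata):
        """Run sparse attention using previously built metadata."""
```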
- `enable_sparse_attention()` on `WanTransformer3DModel`
- Pipeline wiring from `OmniDiffusionConfig`
- No-op fallback when no backend installed

Contrib plugin packages (NOT merged — reference only)
`contrib/flashinfer-vllm-omni/` and `contrib/sparge-vllm-omni/` are included solely to demonstrate how external packages implement the interface. They serve as a concrete example for reviewers to validate that the ABC design is actually usable. Neither package will be merged into vllm-omni; they are intended to live in separate repositories (`flashinfer-vllm-omni`, `sparge-vllm-omni`) maintained outside this repo and installed independently via `pip install`.

A user who wants sparse attention installs a plugin package (e.g. `pip install sparge-vllm-omni`) and enables it at serve time (e.g. `vllm-omni serve Wan2.2 --sparse-attn-backend spargeattn --sparse-attn-topk 0.5`). No changes to vllm-omni core are needed to add a new backend.
Known Limitations (to be fixed before merge)
- Sparse self-attention bypasses SP/ring communication in multi-GPU runs — will add explicit SP/ring incompatibility check and fallback
- `--sparse-attn-backend none` incorrectly falls back to `DenseFallbackBackend` — will fix in `enable_sparse_attention()` guard
- `schedule` config is parsed but not yet wired to step-level `topk_ratio` updates — will be addressed in follow-up PR

Test Plan
Benchmark on A100 80GB (Wan 2.2, 480x832, 81f, 40 steps) with SpargeAttn via the `sparge-vllm-omni` entry_point plugin.

| Config | Total time | Speedup |
| --- | --- | --- |
| dense | 785s (19.6s/step) | 1.00x |
| topk=0.7 | 763s | 1.03x |
| topk=0.5 | 677s | 1.16x |
| topk=0.3 | 590s | 1.33x |

Test Result

98 tests (97 passed, 1 skipped).
Dense: https://github.com/user-attachments/assets/62a7abfd-758f-4ec0-958b-fa3de672f972
Topk 0.7 https://github.com/user-attachments/assets/a175f027-424b-413f-b6c4-0733c354476f
Topk 0.5 https://github.com/user-attachments/assets/016c8463-f0bc-4a69-a891-b07fbc784a68
Topk 0.3 https://github.com/user-attachments/assets/17d2d464-0205-4d2c-8683-bd2e768c11f3