Conversation
Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
What happens: has_initial_states_cpu is a bool tensor (from num_computed_tokens_p_cpu > 0), but it gets cast to int32 during the H2D copy. Later, ~has_initial_state performs a bitwise NOT on int32 values: ~0 = -1 (truthy), ~1 = -2 (truthy). As a result, initial_state[~has_initial_state, ...] = 0 zeroes all rows, including the ones that should keep their cached state. Impact: for prefix caching / chunked prefill, where some sequences have prior state (has_initial_state=True) and others don't, ALL initial states get wiped. First-request-only flows accidentally work because every value is already 0, so 0 → 0.
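A minimal sketch of the dtype pitfall described above (variable names here are illustrative):

```python
import torch

# Bool mask: True where the sequence already has cached initial state.
mask_bool = torch.tensor([True, False, True])

# Correct behavior: ~ on a bool tensor is a logical NOT.
print((~mask_bool).tolist())        # [False, True, False]

# Buggy path: an H2D copy that casts to int32 turns ~ into a bitwise NOT.
mask_int = mask_bool.to(torch.int32)
print((~mask_int).tolist())         # [-2, -1, -2]: every entry is nonzero

# So `initial_state[~mask, ...] = 0` no longer behaves like boolean masking
# and can clobber rows that should keep their cached state. A defensive fix
# is to restore the bool dtype before negating:
print((~mask_int.to(torch.bool)).tolist())   # [False, True, False]
```

The fix is to keep (or restore) the bool dtype across the H2D copy so ~ stays a logical NOT.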
🚧 CI Blocked: The main CI workflow was not started for the following reason: |
…fers across ranks (different PYTHONHASHSEED), so iterating finished_req_ids (a set[str]) to free/reassign compact GDN slots causes each rank to assign a different base_slot to the same request, corrupting GDN state across ranks. Signed-off-by: Libin Tang <litang@habana.ai>
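A hedged sketch of the deterministic-iteration fix this implies (function and variable names are illustrative, not the PR's API): sorting the finished ids before freeing makes every rank mutate its free-list in the same order.

```python
# Set iteration order depends on PYTHONHASHSEED, which can differ across
# ranks, so two ranks may push freed slots back in different orders and
# later pop different base_slot values for the same new request.

def free_finished_slots(finished_req_ids: set[str],
                        req_to_slot: dict[str, int],
                        free_list: list[int]) -> None:
    # Deterministic order: all ranks append freed slots in the same
    # sequence, keeping free-lists (and future assignments) in sync.
    for req_id in sorted(finished_req_ids):
        free_list.append(req_to_slot.pop(req_id))
```

Without the sorted(), the per-rank hash seed decides the iteration order, so the next pop() can hand the same request different base_slot values on different ranks.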
Added @torch._dynamo.disable to hpu_chunk_gated_delta_rule — prevents torch.compile from compiling the prefill chunk pipeline entry point. 2. hpu_model_runner.py (4 logical changes): a) GDN slot freeing — only free finished requests, not unscheduled ones. Signed-off-by: Libin Tang <litang@habana.ai>
1. hpu_gdn_pytorch.py — Fix torch.compile miscompilation
- Added _eager_output_accum() with @torch._dynamo.disable — isolates the 5D slice mutation (core_h[:, ci].add_(...)) to eager mode.
- Changed the loop in _hpu_chunk_gdr_phase_b_optimized to call _eager_output_accum instead of the inline .add_(); the state-update matmul remains compiled.
- Removed @torch._dynamo.disable from hpu_chunk_gated_delta_rule (the outer entry point no longer needs to be fully eager — only the problematic slice mutation does).
- Root cause: HPU torch.compile silently miscompiles any mutation of a 5D tensor slice indexed by a loop variable ([:, ci] with .add_(), slice assignment, or index_add_).

2. hpu_model_runner.py — Fix log format
- Changed free_list=%d to free_list_len=%d with len(...) (the code was passing a list to %d).
- Changed logger.debug to logger.info.

3. hpu_worker.py — Fix misleading log
- The "actual allocated num_blocks" log was reading kv_caches[0][0].shape[0], which is the compact GDN tensor's dim-0 (e.g., 770), not the ATN block count. Now scans all layers and reports the max (the ATN layer's actual block count).

Signed-off-by: Libin Tang <litang@habana.ai>
Signed-off-by: Libin Tang <libin.tang@intel.com>
…er batch. Signed-off-by: Libin Tang <litang@habana.ai>
2. Disable compact mode for PD. 3. Fix profiling override of max_num_seqs.
Signed-off-by: Jimin Ha <jimin.ha@intel.com>
It also fixes an accuracy issue.
✅ CI Passed: All checks passed successfully against the following vllm commit: |
Changes to reduce warmup time (for Qwen3-VL-30B-A3B-Instruct, roughly 3000 sec -> 500 sec), and a bug fix for dp=1 with a MoE model. Signed-off-by: Seunghyuk Park <separk@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Force-pushed beb2b6b to 6f0d189
✅ CI Passed: All checks passed successfully against the following vllm commit: |
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
…if 0. 2. Change the hpu_model_runner log for compact allocation from debug to info. Signed-off-by: Libin Tang <litang@habana.ai>
Signed-off-by: Libin Tang <litang@habana.ai>
Force-pushed a601cba to 9b191c9
This PR mainly contains the following changes.
Hybrid models (e.g., Qwen3.5) have ATN layers (KV cache grows with sequence length) and GDN layers (fixed-size state per request). Standard vLLM allocates thousands of block-sized slots for GDN — wasteful, since only one slot per request is needed. Compact mode allocates a small tensor (max_num_reqs × num_gdn_groups + 2 slots) and manages the slots via a free-list: pop on new request, push back on finish. State index = base_slot × num_gdn_groups + group_offset + 1. Compact mode is auto-enabled for GDN models, disabled when a kv transfer config is set, and does not support prefix caching.
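The slot scheme above can be sketched as a small pool (a hedged sketch; the class and method names CompactSlotPool, alloc, free, and state_index are illustrative, not the PR's actual API):

```python
class CompactSlotPool:
    def __init__(self, max_num_reqs: int, num_gdn_groups: int):
        self.num_gdn_groups = num_gdn_groups
        # dim-0 of the compact GDN state tensor: one group of entries per
        # request plus 2 extra slots (index 0 is skipped via the +1 below).
        self.num_slots = max_num_reqs * num_gdn_groups + 2
        # Free-list of base slots: pop on new request, push back on finish.
        self.free_list = list(range(max_num_reqs))
        self.req_to_base: dict[str, int] = {}

    def alloc(self, req_id: str) -> int:
        base = self.free_list.pop()
        self.req_to_base[req_id] = base
        return base

    def free(self, req_id: str) -> None:
        self.free_list.append(self.req_to_base.pop(req_id))

    def state_index(self, req_id: str, group_offset: int) -> int:
        # State index = base_slot * num_gdn_groups + group_offset + 1.
        return self.req_to_base[req_id] * self.num_gdn_groups + group_offset + 1
```

This keeps the GDN state tensor at a fixed, small size regardless of sequence length, since each request only ever occupies one base slot.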