
qwen35 compact mode #1235

Open
libinta wants to merge 94 commits into vllm-project:main from libinta:libint/qwen35_compact

Conversation

@libinta
Contributor

@libinta libinta commented Mar 24, 2026

This PR mainly makes the following changes:

  1. Implement compact mode for qwen35.
    Hybrid models (Qwen3.5) have ATN layers, whose KV cache grows with sequence length, and GDN layers, which keep a fixed-size state per request. Standard vLLM allocates thousands of block-sized slots for GDN, which is wasteful since only one slot per request is needed. Compact mode instead allocates a small tensor (max_num_reqs × num_gdn_groups + 2 slots) and manages the slots via a free list: pop a slot on a new request, push it back when the request finishes. State index = base_slot × num_gdn_groups + group_offset + 1. Compact mode is auto-enabled for GDN models, is disabled when a KV transfer config is set, and does not support prefix caching.
  2. Work around an accuracy issue introduced by torch.compile for a few ops in hpu_gdn_pytorch.
  3. Fix the multi-batch issue in causal_conv1d_pytorch.py.
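The slot lifecycle in item 1 can be sketched as a plain free-list pool. This is a minimal illustration of the scheme described above, not the PR's actual code; the class and method names (CompactSlotPool, alloc, free, state_index) are hypothetical:

```python
# Hypothetical sketch of compact-mode GDN slot management:
# one base slot per in-flight request, recycled through a free list.
class CompactSlotPool:
    def __init__(self, max_num_reqs: int, num_gdn_groups: int):
        self.num_gdn_groups = num_gdn_groups
        # One base slot per possible in-flight request.
        self.free_list = list(range(max_num_reqs))
        self.base_slot = {}  # req_id -> base slot

    def alloc(self, req_id: str) -> int:
        # Pop a slot when a new request arrives.
        slot = self.free_list.pop()
        self.base_slot[req_id] = slot
        return slot

    def free(self, req_id: str) -> None:
        # Push the slot back when the request finishes.
        self.free_list.append(self.base_slot.pop(req_id))

    def state_index(self, req_id: str, group_offset: int) -> int:
        # State index = base_slot * num_gdn_groups + group_offset + 1
        return self.base_slot[req_id] * self.num_gdn_groups + group_offset + 1
```

The point of the scheme: the GDN state tensor stays at a fixed, small size regardless of sequence length, because slots are keyed by request, not by token blocks.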

libinta and others added 30 commits March 3, 2026 15:02
Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
What happens: has_initial_states_cpu is a bool tensor (from num_computed_tokens_p_cpu > 0), but it gets cast to int32 during the H2D copy. Later, ~has_initial_state applies bitwise NOT to int32 values: ~0 = -1 (truthy), ~1 = -2 (truthy). So initial_state[~has_initial_state, ...] = 0 zeroes all rows, including the ones that should keep their cached state.

Impact: for prefix caching / chunked prefill where some sequences have prior state (has_initial_state=True) and others don't, ALL initial states get wiped. First-request-only flows accidentally work because every value goes 0 → 0.
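The failure mode can be reproduced with plain Python integers, since torch's `~` on an integer dtype has the same bitwise semantics. The helper names below are illustrative, not from the PR:

```python
# On a bool tensor, `~` acts as logical NOT, but after the bool -> int32
# cast during the H2D copy, `~` becomes bitwise NOT on integers.
def buggy_not_mask(has_initial_state):
    as_int32 = [int(v) for v in has_initial_state]  # the accidental cast
    inverted = [~v for v in as_int32]               # ~1 = -2, ~0 = -1
    # Both -2 and -1 are nonzero (truthy), so every row is selected
    # for zeroing, wiping cached state that should have been kept.
    return [bool(v) for v in inverted]

def correct_not_mask(has_initial_state):
    # Logical NOT on actual booleans selects only the stateless rows.
    return [not v for v in has_initial_state]
```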
@github-actions

🚧 CI Blocked

The main CI workflow was not started for the following reason:

This is a Draft PR. Please mark it as 'Ready for Review' to trigger the CI.


…fers

across ranks (different PYTHONHASHSEED), so iterating finished_req_ids (a set[str]) to free/reassign compact GDN slots causes each rank to assign a different base_slot to the same request, corrupting GDN state across ranks.

Signed-off-by: Libin Tang <litang@habana.ai>
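The fix implied by this commit message is to iterate the finished set in a deterministic order. A minimal sketch, assuming an illustrative helper name (free_and_reassign) and simplified data structures:

```python
# set iteration order depends on hash randomization (PYTHONHASHSEED),
# which can differ across ranks; sorted() gives every rank the same order.
def free_and_reassign(finished_req_ids: set, free_list: list, assigned: dict):
    for req_id in sorted(finished_req_ids):
        # Every rank now pushes slots back in the same order, so later
        # pops assign the same base_slot to the same request everywhere.
        free_list.append(assigned.pop(req_id))
```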

Added @torch._dynamo.disable to hpu_chunk_gated_delta_rule, preventing torch.compile from compiling the prefill chunk pipeline entry point.
2. hpu_model_runner.py (4 logical changes)
a) GDN slot freeing: only free finished requests, not unscheduled ones

Signed-off-by: Libin Tang <litang@habana.ai>

1. hpu_gdn_pytorch.py: Fix torch.compile miscompilation
Added _eager_output_accum() with @torch._dynamo.disable, isolating the 5D slice mutation (core_h[:, ci].add_(...)) to eager mode.
Changed the loop in _hpu_chunk_gdr_phase_b_optimized to call _eager_output_accum instead of the inline .add_(); the state-update matmul remains compiled.
Removed @torch._dynamo.disable from hpu_chunk_gated_delta_rule (the outer entry point no longer needs to be fully eager; only the problematic slice mutation does).
Root cause: HPU torch.compile silently miscompiles any mutation of a 5D tensor slice indexed by a loop variable ([:, ci] with .add_(), slice assignment, or index_add_).

2. hpu_model_runner.py: Fix log format
Changed free_list=%d to free_list_len=%d with len(...) (the code was passing a list to %d).
Changed logger.debug to logger.info.

3. hpu_worker.py: Fix misleading log
The "actual allocated num_blocks" log read kv_caches[0][0].shape[0], which is the compact GDN tensor's dim-0 (e.g., 770), not the ATN block count. It now scans all layers and reports the max (the ATN layer's actual block count).

Signed-off-by: Libin Tang <litang@habana.ai>

@libinta libinta marked this pull request as ready for review March 26, 2026 06:39
libinta and others added 7 commits March 26, 2026 07:08
Signed-off-by: Libin Tang <libin.tang@intel.com>
…er batch.

Signed-off-by: Libin Tang <litang@habana.ai>
2. Disable compact mode for PD
3. Fix profiling override of max_num_seqs
Signed-off-by: Jimin Ha <jimin.ha@intel.com>
@github-actions

✅ CI Passed

All checks passed successfully against the following vllm commit:
9ace378a63cac49129829ac30e9645bb8af4d2d5

Changes to reduce warmup time (for Qwen3-VL-30B-A3B-Instruct, roughly 3000 sec -> 500 sec), and a bug fix for dp=1 with a MoE model.

Signed-off-by: Seunghyuk Park <separk@habana.ai>
Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Signed-off-by: Seunghyuk Park <separk@habana.ai>
@shepark force-pushed the libint/qwen35_compact branch from beb2b6b to 6f0d189 on March 27, 2026 05:00

yeonsily and others added 3 commits March 27, 2026 17:17
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
…if 0

2. Change the hpu_model_runner log for compact allocation from debug to info.

Signed-off-by: Libin Tang <litang@habana.ai>
Signed-off-by: Libin Tang <litang@habana.ai>
@shepark force-pushed the libint/qwen35_compact branch from a601cba to 9b191c9 on March 29, 2026 22:25