Conversation
Signed-off-by: slokesha <slokeshappa@habana.ai>
Signed-off-by: Libin Tang <libin.tang@intel.com>
Signed-off-by: slokesha <slokeshappa@habana.ai>
What happens: has_initial_states_cpu is a bool tensor (from num_computed_tokens_p_cpu > 0), but it gets cast to int32 during the H2D copy. Later, ~has_initial_state performs a bitwise NOT on int32 values: ~0 = -1 (truthy), ~1 = -2 (truthy). As a result, initial_state[~has_initial_state, ...] = 0 zeroes all rows, including the ones that should keep their cached state. Impact: for prefix caching / chunked prefill, where some sequences have prior state (has_initial_state=True) and others don't, ALL initial states get wiped. First-request-only flows accidentally work because every value is already 0, so 0 → 0.
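A minimal sketch of the dtype pitfall described above (variable names here are illustrative):

```python
import torch

# Bool mask: True where the sequence already has cached initial state.
mask_bool = torch.tensor([True, False, True])

# Correct behavior: ~ on a bool tensor is a logical NOT.
print((~mask_bool).tolist())        # [False, True, False]

# Buggy path: an H2D copy that casts to int32 turns ~ into a bitwise NOT.
mask_int = mask_bool.to(torch.int32)
print((~mask_int).tolist())         # [-2, -1, -2]: every entry is nonzero

# So `initial_state[~mask, ...] = 0` no longer behaves like boolean masking
# and can clobber rows that should keep their cached state. A defensive fix
# is to restore the bool dtype before negating:
print((~mask_int.to(torch.bool)).tolist())   # [False, True, False]
```

The fix is to keep (or restore) the bool dtype across the H2D copy so ~ stays a logical NOT.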
🚧 CI Blocked: The main CI workflow was not started for the following reason: |
…fers across ranks (different PYTHONHASHSEED), so iterating finished_req_ids (a set[str]) to free/reassign compact GDN slots causes each rank to assign a different base_slot to the same request, corrupting GDN state across ranks. Signed-off-by: Libin Tang <litang@habana.ai>
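A hedged sketch of the deterministic-iteration fix this implies (function and variable names are illustrative, not the PR's API): sorting the finished ids before freeing makes every rank mutate its free-list in the same order.

```python
# Set iteration order depends on PYTHONHASHSEED, which can differ across
# ranks, so two ranks may push freed slots back in different orders and
# later pop different base_slot values for the same new request.

def free_finished_slots(finished_req_ids: set[str],
                        req_to_slot: dict[str, int],
                        free_list: list[int]) -> None:
    # Deterministic order: all ranks append freed slots in the same
    # sequence, keeping free-lists (and future assignments) in sync.
    for req_id in sorted(finished_req_ids):
        free_list.append(req_to_slot.pop(req_id))
```

Without the sorted(), the per-rank hash seed decides the iteration order, so the next pop() can hand the same request different base_slot values on different ranks.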
Added @torch._dynamo.disable to hpu_chunk_gated_delta_rule — prevents torch.compile from compiling the prefill chunk pipeline entry point. 2. hpu_model_runner.py (4 logical changes): a) GDN slot freeing — only free finished requests, not unscheduled ones. Signed-off-by: Libin Tang <litang@habana.ai>
1. hpu_gdn_pytorch.py — Fix torch.compile miscompilation
- Added _eager_output_accum() with @torch._dynamo.disable — isolates the 5D slice mutation (core_h[:, ci].add_(...)) to eager mode.
- Changed the loop in _hpu_chunk_gdr_phase_b_optimized to call _eager_output_accum instead of the inline .add_(); the state-update matmul remains compiled.
- Removed @torch._dynamo.disable from hpu_chunk_gated_delta_rule (the outer entry point no longer needs to be fully eager — only the problematic slice mutation does).
- Root cause: HPU torch.compile silently miscompiles any mutation of a 5D tensor slice indexed by a loop variable ([:, ci] with .add_(), slice assignment, or index_add_).

2. hpu_model_runner.py — Fix log format
- Changed free_list=%d to free_list_len=%d with len(...) (the code was passing a list to %d).
- Changed logger.debug to logger.info.

3. hpu_worker.py — Fix misleading log
- The "actual allocated num_blocks" log was reading kv_caches[0][0].shape[0], which is the compact GDN tensor's dim-0 (e.g., 770), not the ATN block count. Now scans all layers and reports the max (the ATN layer's actual block count).

Signed-off-by: Libin Tang <litang@habana.ai>
Signed-off-by: Libin Tang <libin.tang@intel.com>
…er batch. Signed-off-by: Libin Tang <litang@habana.ai>
2. Disable compact mode for PD. 3. Fix profiling override of max_num_seqs.
Signed-off-by: Jimin Ha <jimin.ha@intel.com>
It also fixes an accuracy issue.
✅ CI Passed: All checks passed successfully against the following vllm commit: |
Changes to reduce warmup time (for Qwen3-VL-30B-A3B-Instruct, roughly 3000 sec -> 500 sec), and a bug fix for dp=1 with a MoE model. Signed-off-by: Seunghyuk Park <separk@habana.ai> Co-authored-by: Michał Kuligowski <michal.kuligowski@intel.com>
Force-pushed beb2b6b to 6f0d189
✅ CI Passed: All checks passed successfully against the following vllm commit: |
Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
…if 0. 2. Change the hpu_model_runner log for compact allocation from debug to info. Signed-off-by: Libin Tang <litang@habana.ai>
Signed-off-by: Libin Tang <litang@habana.ai>
Force-pushed a601cba to 9b191c9
This PR mainly contains the following changes.
Hybrid models (e.g., Qwen3.5) have ATN layers (KV cache grows with sequence length) and GDN layers (fixed-size state per request). Standard vLLM allocates thousands of block-sized slots for GDN — wasteful, since only one slot per request is needed. Compact mode allocates a small tensor (max_num_reqs × num_gdn_groups + 2 slots) and manages the slots via a free-list: pop on new request, push back on finish. State index = base_slot × num_gdn_groups + group_offset + 1. Compact mode is auto-enabled for GDN models, disabled when a kv transfer config is set, and does not support prefix caching.
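The slot scheme above can be sketched as a small pool (a hedged sketch; the class and method names CompactSlotPool, alloc, free, and state_index are illustrative, not the PR's actual API):

```python
class CompactSlotPool:
    def __init__(self, max_num_reqs: int, num_gdn_groups: int):
        self.num_gdn_groups = num_gdn_groups
        # dim-0 of the compact GDN state tensor: one group of entries per
        # request plus 2 extra slots (index 0 is skipped via the +1 below).
        self.num_slots = max_num_reqs * num_gdn_groups + 2
        # Free-list of base slots: pop on new request, push back on finish.
        self.free_list = list(range(max_num_reqs))
        self.req_to_base: dict[str, int] = {}

    def alloc(self, req_id: str) -> int:
        base = self.free_list.pop()
        self.req_to_base[req_id] = base
        return base

    def free(self, req_id: str) -> None:
        self.free_list.append(self.req_to_base.pop(req_id))

    def state_index(self, req_id: str, group_offset: int) -> int:
        # State index = base_slot * num_gdn_groups + group_offset + 1.
        return self.req_to_base[req_id] * self.num_gdn_groups + group_offset + 1
```

This keeps the GDN state tensor at a fixed, small size regardless of sequence length, since each request only ever occupies one base slot.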