
Commit ca968e4

[NPUW] Fix multiple MatMul matching issue in NPUW LM head cutting (#32475)
### Details:

**Background:** The Eagle 3 pipeline adds a new output to the target model to expose the intermediate feature embeddings. The `cut_lm_head` function separates the vocabulary matrix (LM head) from LLM models for efficient inference. It needs to identify the correct `MatMul` operation among multiple candidates in the model graph.

**Problem:** When multiple `MatMul` operations match the pattern (common in LLMs), the callback executes multiple times, with each execution overwriting the previous result. Only the last matched `MatMul` is used, which is often not the actual vocabulary matrix.

**Solution:** Replaced `MatcherPass` with direct traversal and intelligent selection:
1. Collect all candidates instead of using the last match
2. Select the `MatMul` with the largest matrix size (vocabulary size heuristic)
3. Optimize traversal: iterate over `Result` nodes directly instead of all nodes

### Tickets:
- [*CVS-175198*](https://jira.devtools.intel.com/browse/CVS-175198)
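The "largest matrix" heuristic from the solution list can be illustrated with a minimal, self-contained sketch. It does not use OpenVINO types: `MatMulCandidate` and `select_lm_head` are hypothetical names introduced here for illustration only, and the weight dimensions are assumed to be known up front.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a MatMul candidate found in the graph.
// This is NOT an OpenVINO type; it only carries what the heuristic needs.
struct MatMulCandidate {
    std::string name;
    std::size_t weight_rows;
    std::size_t weight_cols;
};

// Returns the candidate with the largest weight matrix (rows * cols),
// or nullptr when there are no candidates. On ties the first one wins.
const MatMulCandidate* select_lm_head(const std::vector<MatMulCandidate>& candidates) {
    const MatMulCandidate* best = nullptr;
    std::size_t best_size = 0;
    for (const auto& c : candidates) {
        const std::size_t size = c.weight_rows * c.weight_cols;
        if (best == nullptr || size > best_size) {
            best = &c;
            best_size = size;
        }
    }
    return best;
}
```

The assumption behind the heuristic is that the vocabulary projection is typically the largest `MatMul` feeding an output, because the vocabulary size usually exceeds the hidden size. Collecting all candidates first and choosing once avoids the "last match wins" behavior described in the problem statement.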
1 parent 4832243 commit ca968e4

1 file changed

+8
-0
lines changed

src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model.cpp

Lines changed: 8 additions & 0 deletions
@@ -881,6 +881,14 @@ class CutLMHead : public ov::pass::MatcherPass {
 auto matched_matmul = std::static_pointer_cast<ov::op::v0::MatMul>(matched_node_matmul);
 auto matched_result = std::static_pointer_cast<ov::op::v0::Result>(matched_node_result);
 
+// Some LLMs add intermediate hidden state outputs that can interfere with LM head detection.
+// Skip Result nodes that were manually added (marked with "manually_added_output" in RT_INFO).
+// For example, Eagle-3 target/draft models add "last_hidden_state" output which should be skipped.
+const auto& rt_info = matched_result->get_rt_info();
+if (rt_info.count("manually_added_output")) {
+    return false;
+}
+
 // Cut point:
 auto matmul_first_source = matched_matmul->input(0).get_source_output();
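The skip logic added in this hunk can be sketched in isolation as follows. This is a simplified model, not the plugin code: `RtInfo`, `FakeResult`, and `should_skip_result` are hypothetical names, and the map's value type is reduced to `std::string` as an assumption to keep the example free of OpenVINO headers.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-in for a node's RT_INFO map.
// The value type is an assumption made for this sketch.
using RtInfo = std::map<std::string, std::string>;

// Hypothetical stand-in for a Result node; not an OpenVINO type.
struct FakeResult {
    std::string name;
    RtInfo rt_info;
};

// Mirrors the check in the diff: a Result carrying the
// "manually_added_output" key is not a candidate for LM head cutting.
bool should_skip_result(const FakeResult& result) {
    return result.rt_info.count("manually_added_output") > 0;
}
```

Only the presence of the key matters, not its value, which matches the `rt_info.count(...)` test in the diff. In this model, a `last_hidden_state` output tagged with the key would be skipped, while an untagged logits output would still be considered.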
