
[NPUW] Fix multiple MatMul matching issue in NPUW LM head cutting#32475

Merged
AsyaPronina merged 4 commits into openvinotoolkit:master from GuoliangShiIntel:fix-lm-head-multiple-matmul-matching
Dec 10, 2025

Conversation

@GuoliangShiIntel
Contributor

@GuoliangShiIntel GuoliangShiIntel commented Oct 20, 2025

Details:

Background:
The Eagle 3 pipeline adds a new output to the target model to obtain the intermediate feature embeddings.

The cut_lm_head function separates the vocabulary matrix (LM head) from LLM models for efficient inference. It needs to identify the correct MatMul operation among multiple candidates in the model graph.

Problem:
When multiple MatMul operations match the pattern (common in LLMs), the callback executes multiple times, with each execution overwriting the previous result. Only the last matched MatMul is used, often missing the actual vocabulary matrix.

Solution:
Replaced MatcherPass with direct traversal and intelligent selection:

  1. Collect all candidates instead of using last match
  2. Select MatMul with largest matrix size (vocabulary size heuristic)
  3. Optimize traversal - iterate Result nodes directly instead of all nodes
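
The steps above can be sketched with a simplified stand-in for the graph candidates (the struct and function names here are illustrative, not the actual NPUW code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a MatMul found while walking up from a Result node.
// In the real pass the dimension would come from the MatMul's weight input.
struct MatMulCandidate {
    std::string name;
    std::size_t out_dim;  // output dimension of the weight matrix
};

// Collect-then-select: instead of letting a matcher callback overwrite the
// previous match, keep every candidate and pick the one with the largest
// output dimension (the vocabulary-size heuristic).
const MatMulCandidate* select_lm_head(const std::vector<MatMulCandidate>& candidates) {
    const MatMulCandidate* best = nullptr;
    for (const auto& c : candidates) {
        if (best == nullptr || c.out_dim > best->out_dim) {
            best = &c;
        }
    }
    return best;
}
```

For example, given candidates with output dimensions 12288 (feature embeddings) and 151936 (vocabulary), the function selects the 151936 one.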

Tickets:

@GuoliangShiIntel GuoliangShiIntel requested review from a team as code owners October 20, 2025 05:19
@github-actions github-actions bot added the 'category: NPU' (OpenVINO NPU plugin) and 'category: NPUW' (NPUW plugin) labels Oct 20, 2025
@sys-openvino-ci sys-openvino-ci added the 'ExternalIntelPR' (External contributor from Intel) label Oct 20, 2025
@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 3 times, most recently from 9343a41 to 5efef96 on October 20, 2025 06:10
@GuoliangShiIntel
Contributor Author

@AsyaPronina Could you please take a look at this fix?

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 2 times, most recently from 485b91b to b1f74ce on October 21, 2025 06:29
@dmatveev dmatveev added this to the 2026.0 milestone Oct 31, 2025
@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch from b1f74ce to 596799e on November 17, 2025 09:55
@AsyaPronina
Contributor

build_jenkins

Contributor

@AsyaPronina AsyaPronina left a comment


Overall I like this approach to find the logits MatMul. However, in the context of Eagle-3 we can also differentiate the second features MatMul by just looking for a Concat node at the top:
image

  • The option implemented in the PR seems stable, as the vocabulary should have the maximum size over the other output MatMul-s, and it is not limited to only two MatMul outputs. However, additional checks should be done on the shapes of the outputs. Could you please provide a concrete example of the different shapes and values for size0 and size1 that this algorithm uses to make its final decision, so we can understand how stable it is?
  • The proposed approach with Concat can just check in the rewriter callback that if there is a Concat node above -> then do nothing. This will preserve the previous logic in the CutLMHead pattern rewriter and minimize changes. However, unlike the first option, it may be more tied to the Eagle-3 use case.

using Ref = std::reference_wrapper<CutContext>;
};

CutLMHead(CutContext::Ref cut_context) : m_cut_context(cut_context) {
Contributor


I think that here we face the same issue as commented in the another PR: #32891 (comment)

If we want to pass the source model into the constructor of the pattern matcher, we might need to use another OpenVINO Transformations API here, such as ModelPass. Example is here:
https://github.com/openvinotoolkit/openvino/blob/master/src/core/include/openvino/pass/sdpa_to_paged_attention.hpp
https://github.com/openvinotoolkit/openvino/blob/master/src/core/src/pass/sdpa_to_paged_attention.cpp

Please IM if you have questions!

Contributor Author


Good point, I have updated the code. Please take a look.

@GuoliangShiIntel
Contributor Author

GuoliangShiIntel commented Dec 2, 2025

Overall I like this approach to find the logits MatMul. However, in the context of Eagle-3 we can also differentiate the second features MatMul by just looking for a Concat node at the top: image

  • The option implemented in the PR seems stable, as the vocabulary should have the maximum size over the other output MatMul-s, and it is not limited to only two MatMul outputs. However, additional checks should be done on the shapes of the outputs. Could you please provide a concrete example of the different shapes and values for size0 and size1 that this algorithm uses to make its final decision, so we can understand how stable it is?
  • The proposed approach with Concat can just check in the rewriter callback that if there is a Concat node above -> then do nothing. This will preserve the previous logic in the CutLMHead pattern rewriter and minimize changes. However, unlike the first option, it may be more tied to the Eagle-3 use case.

@AsyaPronina Thanks for the proposal and analysis.

  • Here are the concrete shapes from the models:

Target model:
Original last result MatMul weights shape: [dictionary_size: 151936, token_embedding_size: 4096]
New output MatMul weights shape: [token_embedding_size: 4096, ffn_embedding_size: 12288]

Draft model:
Original last result MatMul weights shape: [dictionary_size: 32000, token_embedding_size: 4096]
New output MatMul weights shape: [token_embedding_size: 4096, ffn_embedding_size: 12288]

Based on typical LLM architecture constraints, the algorithm should reliably identify the vocabulary MatMul by selecting the one with the maximum output dimension, as the dictionary size is expected to be the largest dimension.

  • As you mentioned, this approach would be specific to the Eagle-3 use case. Additionally, there's another consideration: even within the Eagle-3 pipeline, the draft model also has a new output with the same issue but without a Concat op.
image

Based on this analysis, I think the current solution is more robust. However, I'm open to further discussion if you have additional concerns.
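
As a quick sanity check of the numbers above (compile-time assertions only; the constant names are just labels for the quoted dimensions):

```cpp
#include <cstddef>

// Output dimensions of the candidate MatMul weights quoted above.
constexpr std::size_t target_vocab = 151936;  // target model vocabulary
constexpr std::size_t draft_vocab = 32000;    // draft model vocabulary
constexpr std::size_t ffn_dim = 12288;        // new feature output (4096 * 3)

// The max-dimension heuristic is correct as long as the vocabulary
// dimension dominates the extra output's dimension in both models.
static_assert(target_vocab > ffn_dim, "target LM head is the largest output");
static_assert(draft_vocab > ffn_dim, "draft LM head is the largest output");
```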

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch from 596799e to 12f1b3f on December 2, 2025 02:24
@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Dec 2, 2025

  • As you mentioned, this approach would be specific to the Eagle-3 use case. Additionally, there's another consideration: even within the Eagle-3 pipeline, the draft model also has a new output with the same issue but without a Concat op.

We are cutting the LM head for performance benefits.
While the goal of this PR is to fix the Eagle 3 case functionally, it is an open question whether we should cut one head or all "heads".

@AsyaPronina
Contributor

build_jenkins

@AsyaPronina
Contributor

AsyaPronina commented Dec 2, 2025

Hello Dear @GuoliangShiIntel !
Thanks a lot for such a detailed answer and analysis. Indeed, the current approach should work, but we are afraid that the draft's vocab_size may not always exceed emb_size * 3, which can lead to wrong LM head detection.

Our current proposal (from @AlexanderKalistratov and @dmatveev) is:

  • For target model - to tag Result (or might be all?) Eagle-3 specific nodes in the target model via rt_info in OpenVINO GenAI pipeline. As these nodes are explicitly added in OpenVINO GenAI pipeline, this approach is robust. Then, in old CutLMHead transformation, in the rewriter, we can just check that if pattern was matched but there is an eagle3 tag in the Result node, then we won't cut this head and continue.
  • For draft model - forbid LM head cutting via passing "NPUW_LLM_SHARED_HEAD : NO" in OpenVINO GenAI pipeline for draft model.

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 2 times, most recently from efd95c8 to ccf7194 on December 3, 2025 02:55
@GuoliangShiIntel
Contributor Author

Hello Dear @GuoliangShiIntel ! Thanks a lot for such a detailed answer and analysis. Indeed, the current approach should work, but we are afraid that the draft's vocab_size may not always exceed emb_size * 3, which can lead to wrong LM head detection.

Our current proposal (from @AlexanderKalistratov and @dmatveev) is:

  • For target model - to tag Result (or might be all?) Eagle-3 specific nodes in the target model via rt_info in OpenVINO GenAI pipeline. As these nodes are explicitly added in OpenVINO GenAI pipeline, this approach is robust. Then, in old CutLMHead transformation, in the rewriter, we can just check that if pattern was matched but there is an eagle3 tag in the Result node, then we won't cut this head and continue.
  • For draft model - forbid LM head cutting via passing "NPUW_LLM_SHARED_HEAD : NO" in OpenVINO GenAI pipeline for draft model.

Hi @AsyaPronina, one reason I used size-based matching for LM_HEAD was to avoid making it specific to the Eagle-3 feature, though I agree it's not safe enough.

If you're comfortable with using RT_INFO to explicitly set the name, that works for me too. However, to avoid being too specific in such a common transformation, I introduced a generic hidden_output_name_key instead of explicitly marking it as Eagle-3-related.

The reason is that we're seeing (and will likely see more) pipelines with additional outputs—such as multimodal models. This way, we can reuse this key for future cases rather than creating Eagle-3-specific logic.

And for the draft model, this approach can also work for better performance by setting in the pipeline:

draft_model->set_rt_info("last_hidden_state", "hidden_output_name");
main_model->set_rt_info("last_hidden_state", "hidden_output_name");

What do you think?
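
A self-contained sketch of how the transformation side might consume such a key; the plain map stands in for the model's rt_info, and should_skip_result is a hypothetical helper, not an actual NPUW function:

```cpp
#include <map>
#include <set>
#include <string>

// Stand-in for the model-level rt_info. In OpenVINO this would come from
// model->get_rt_info(); a plain string map keeps the sketch self-contained.
using ModelRtInfo = std::map<std::string, std::string>;

// Returns true if the Result node (identified by its output tensor names)
// carries the tagged hidden-state name and must be skipped by LM head cutting.
bool should_skip_result(const ModelRtInfo& rt_info,
                        const std::set<std::string>& result_output_names) {
    auto it = rt_info.find("hidden_output_name");
    if (it == rt_info.end()) {
        return false;  // no tag set: normal LM head cutting applies
    }
    return result_output_names.count(it->second) > 0;
}
```

With the key set as in the snippet above, only the Result whose output is named "last_hidden_state" would be excluded from the cut.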

@AsyaPronina
Contributor

Hello Dear @GuoliangShiIntel ! Thanks a lot for such a detailed answer and analysis. Indeed, the current approach should work, but we are afraid that the draft's vocab_size may not always exceed emb_size * 3, which can lead to wrong LM head detection.
Our current proposal (from @AlexanderKalistratov and @dmatveev) is:

  • For target model - to tag Result (or might be all?) Eagle-3 specific nodes in the target model via rt_info in OpenVINO GenAI pipeline. As these nodes are explicitly added in OpenVINO GenAI pipeline, this approach is robust. Then, in old CutLMHead transformation, in the rewriter, we can just check that if pattern was matched but there is an eagle3 tag in the Result node, then we won't cut this head and continue.
  • For draft model - forbid LM head cutting via passing "NPUW_LLM_SHARED_HEAD : NO" in OpenVINO GenAI pipeline for draft model.

Hi @AsyaPronina, one reason I used size-based matching for LM_HEAD was to avoid making it specific to the Eagle-3 feature, though I agree it's not safe enough.

If you're comfortable with using RT_INFO to explicitly set the name, that works for me too. However, to avoid being too specific in such a common transformation, I introduced a generic hidden_output_name_key instead of explicitly marking it as Eagle-3-related.

The reason is that we're seeing (and will likely see more) pipelines with additional outputs—such as multimodal models. This way, we can reuse this key for future cases rather than creating Eagle-3-specific logic.

And for the draft model, this approach can also work for better performance by setting in the pipeline:

draft_model->set_rt_info("last_hidden_state", "hidden_output_name");
main_model->set_rt_info("last_hidden_state", "hidden_output_name");

What do you think?

Hello Dear @GuoliangShiIntel ! Thanks a lot for the quick update!
Personally, I like this approach a lot. However, let's hear @AlexanderKalistratov's and @dmatveev's opinions on it as well!

@AsyaPronina
Contributor

build_jenkins

Contributor

@dmatveev dmatveev left a comment


My recommendation here is not to introduce any changes if we can avoid doing so.

@GuoliangShiIntel would forcing the two-model pipeline solve the problem? Assuming that the NPUW-side SLICE_OUT is still applied where possible.

Comment on lines +887 to +888
if (model_rt_info.count(ov::npuw::LLMCompiledModel::hidden_output_name_key)) {
const auto& hidden_output_name =
Contributor


How does the rt_info get this key to begin with - who's adding it there?

This is a question to @AsyaPronina .

Contributor


OpenVINO GenAI pipeline for Eagle-3 is responsible for this

Contributor


Why does it do it via the model's own RT_INFO? I believe updating the node's RT_INFO should be enough, so you don't need that extra model_rt_info argument?

Here's an example of how we put this: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model_utils.cpp#L282

And here's an example of how we handle this:
https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model_utils.cpp#L282

Contributor Author


Good proposal. Question on where to set the node's rt_info:

  1. NPUW: No appropriate place found
  2. GenAI with string key: result_node->get_rt_info()["manually_added_output"] = true; - direct but uses hard-coded string
  3. NPUW with npuw::util::ManuallyAddedOutputAttr: Requires exposing public rt_info attribute in GenAI - unsure if architecturally sound

any comments?

@GuoliangShiIntel
Contributor Author

My recommendation here is not to introduce any changes if we can avoid doing so.

@GuoliangShiIntel would forcing the two-model pipeline solve the problem? Assuming that the NPUW-side SLICE_OUT is still applied where possible.

A two-model pipeline can work well without any code change, but it will affect memory usage; please find the data here.


Comment on lines +890 to +895
const auto& result_output_names = matched_result->output(0).get_names();
for (const auto& name : result_output_names) {
if (name == hidden_output_name) {
return false;
}
}
Contributor


So instead of this you can just check matched_result's RT_INFO to check if it was added manually, and skip the cut if that's the case.
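
The suggestion above, sketched with a plain map standing in for the node's rt_info (the key mirrors the hard-coded string discussed earlier; is_manually_added_output is a hypothetical helper, not actual plugin code):

```cpp
#include <map>
#include <string>

// Stand-in for node-level rt_info; in OpenVINO this would be the map
// returned by node->get_rt_info(), holding ov::Any values.
using NodeRtInfo = std::map<std::string, bool>;

// The rewriter callback can bail out early for outputs the GenAI pipeline
// tagged as manually added, leaving the original pattern logic untouched
// for the real LM head.
bool is_manually_added_output(const NodeRtInfo& rt_info) {
    auto it = rt_info.find("manually_added_output");
    return it != rt_info.end() && it->second;
}
```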

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 2 times, most recently from f97f20d to 5d7e933 on December 4, 2025 12:37
@dmatveev dmatveev enabled auto-merge December 5, 2025 00:00
@dmatveev
Contributor

dmatveev commented Dec 5, 2025

GenAI with string key: result_node->get_rt_info()["manually_added_output"] = true; - direct but uses hard-coded string

I think we can live with this, thanks @GuoliangShiIntel !

@dmatveev
Contributor

dmatveev commented Dec 5, 2025

build_jenkins

Contributor

@AsyaPronina AsyaPronina left a comment


Great effort @GuoliangShiIntel , thanks a lot!

@dmatveev
Contributor

dmatveev commented Dec 6, 2025

build_jenkins

@dmatveev dmatveev added this pull request to the merge queue Dec 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 8, 2025
@AsyaPronina AsyaPronina added this pull request to the merge queue Dec 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 9, 2025
@AsyaPronina AsyaPronina added this pull request to the merge queue Dec 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 9, 2025
@AsyaPronina AsyaPronina added this pull request to the merge queue Dec 10, 2025
Merged via the queue into openvinotoolkit:master with commit ca968e4 Dec 10, 2025
183 checks passed
Naseer-010 pushed a commit to Naseer-010/openvino that referenced this pull request Jan 4, 2026
…envinotoolkit#32475)

### Details:
**Background:**
The Eagle 3 pipeline adds a new output to the target model to obtain the
intermediate feature embeddings.

The `cut_lm_head` function separates the vocabulary matrix (LM head)
from LLM models for efficient inference. It needs to identify the
correct `MatMul` operation among multiple candidates in the model graph.

**Problem:**
When multiple `MatMul` operations match the pattern (common in LLMs),
the callback executes multiple times, with each execution overwriting
the previous result. Only the last matched `MatMul` is used, often
missing the actual vocabulary matrix.

**Solution:**
Replaced `MatcherPass` with direct traversal and intelligent selection:
1. Collect all candidates instead of using last match
2. Select MatMul with largest matrix size (vocabulary size heuristic)
3. Optimize traversal - iterate Result nodes directly instead of all
nodes


### Tickets:
 - [*CVS-175198*](https://jira.devtools.intel.com/browse/CVS-175198)