
[NPUW] Fix multiple MatMul matching issue in NPUW LM head cutting#32475

Merged
AsyaPronina merged 4 commits into openvinotoolkit:master from GuoliangShiIntel:fix-lm-head-multiple-matmul-matching
Dec 10, 2025

Conversation

@GuoliangShiIntel
Contributor

@GuoliangShiIntel GuoliangShiIntel commented Oct 20, 2025

Details:

Background:
The Eagle 3 pipeline adds a new output to the target model to obtain the intermediate feature embeddings.

The cut_lm_head function separates the vocabulary matrix (LM head) from LLM models for efficient inference. It needs to identify the correct MatMul operation among multiple candidates in the model graph.

Problem:
When multiple MatMul operations match the pattern (common in LLMs), the callback executes multiple times, with each execution overwriting the previous result. Only the last matched MatMul is used, often missing the actual vocabulary matrix.

Solution:
Replaced MatcherPass with direct traversal and intelligent selection:

  1. Collect all candidates instead of using last match
  2. Select MatMul with largest matrix size (vocabulary size heuristic)
  3. Optimize traversal - iterate Result nodes directly instead of all nodes
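
The steps above can be sketched with a simplified stand-in for the graph candidates (the struct and function names here are illustrative, not the actual NPUW code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a MatMul found while walking up from a Result node.
// In the real pass the dimension would come from the MatMul's weight input.
struct MatMulCandidate {
    std::string name;
    std::size_t out_dim;  // output dimension of the weight matrix
};

// Collect-then-select: instead of letting a matcher callback overwrite the
// previous match, keep every candidate and pick the one with the largest
// output dimension (the vocabulary-size heuristic).
const MatMulCandidate* select_lm_head(const std::vector<MatMulCandidate>& candidates) {
    const MatMulCandidate* best = nullptr;
    for (const auto& c : candidates) {
        if (best == nullptr || c.out_dim > best->out_dim) {
            best = &c;
        }
    }
    return best;
}
```

For example, given candidates with output dimensions 12288 (feature embeddings) and 151936 (vocabulary), the function selects the 151936 one.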

Tickets:

@GuoliangShiIntel GuoliangShiIntel requested review from a team as code owners October 20, 2025 05:19
@github-actions github-actions bot added the 'category: NPU' (OpenVINO NPU plugin) and 'category: NPUW' (NPUW plugin) labels Oct 20, 2025
@sys-openvino-ci sys-openvino-ci added the 'ExternalIntelPR' (External contributor from Intel) label Oct 20, 2025
@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 3 times, most recently from 9343a41 to 5efef96 on October 20, 2025 06:10
@GuoliangShiIntel
Contributor Author

@AsyaPronina Could you please take a look at this fix?

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 2 times, most recently from 485b91b to b1f74ce on October 21, 2025 06:29
@dmatveev dmatveev added this to the 2026.0 milestone Oct 31, 2025
@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch from b1f74ce to 596799e on November 17, 2025 09:55
@AsyaPronina
Contributor

build_jenkins

Contributor

@AsyaPronina AsyaPronina left a comment


Overall I like this approach to find the logits MatMul. However, in the context of Eagle-3 we can also differentiate the second features MatMul by just looking for a Concat node at the top:
image

  • The option implemented in the PR seems stable, as the vocabulary should have the maximum size over the other output MatMul-s, and it is not limited to only two MatMul outputs. However, additional checks should be done on the shapes of the outputs. Could you please provide a concrete example of the different shapes and values for size0 and size1 that this algorithm uses to make its final decision, so we can understand how stable it is?
  • The proposed approach with Concat can just check in the rewriter callback that if there is a Concat node above -> then do nothing. This will preserve the previous logic in the CutLMHead pattern rewriter and minimize changes. However, unlike the first option, it may be more tied to the Eagle-3 use case.

using Ref = std::reference_wrapper<CutContext>;
};

CutLMHead(CutContext::Ref cut_context) : m_cut_context(cut_context) {
Contributor


I think that here we face the same issue as commented in the another PR: #32891 (comment)

If we want to pass the source model into the constructor of the pattern matcher, we might need to use another OpenVINO Transformations API here, such as ModelPass. Example is here:
https://github.com/openvinotoolkit/openvino/blob/master/src/core/include/openvino/pass/sdpa_to_paged_attention.hpp
https://github.com/openvinotoolkit/openvino/blob/master/src/core/src/pass/sdpa_to_paged_attention.cpp

Please IM if you have questions!

Contributor Author


Good point, I have updated the code. Please take a look.

@GuoliangShiIntel
Contributor Author

GuoliangShiIntel commented Dec 2, 2025

Overall I like this approach to find the logits MatMul. However, in the context of Eagle-3 we can also differentiate the second features MatMul by just looking for a Concat node at the top: image

  • The option implemented in the PR seems stable, as the vocabulary should have the maximum size over the other output MatMul-s, and it is not limited to only two MatMul outputs. However, additional checks should be done on the shapes of the outputs. Could you please provide a concrete example of the different shapes and values for size0 and size1 that this algorithm uses to make its final decision, so we can understand how stable it is?
  • The proposed approach with Concat can just check in the rewriter callback that if there is a Concat node above -> then do nothing. This will preserve the previous logic in the CutLMHead pattern rewriter and minimize changes. However, unlike the first option, it may be more tied to the Eagle-3 use case.

@AsyaPronina Thanks for the proposal and analysis.

  • Here are the concrete shapes from the models:

Target model:
Original last result MatMul weights shape: [dictionary_size: 151936, token_embedding_size: 4096]
New output MatMul weights shape: [token_embedding_size: 4096, ffn_embedding_size: 12288]

Draft model:
Original last result MatMul weights shape: [dictionary_size: 32000, token_embedding_size: 4096]
New output MatMul weights shape: [token_embedding_size: 4096, ffn_embedding_size: 12288]

Based on typical LLM architecture constraints, the algorithm should reliably identify the vocabulary MatMul by selecting the one with the maximum output dimension, as the dictionary size is expected to be the largest dimension.

  • As you mentioned, this approach would be specific to the Eagle-3 use case. Additionally, there's another consideration: even within the Eagle-3 pipeline, the draft model also has a new output with the same issue but without a Concat op.
image

Based on this analysis, I think the current solution is more robust. However, I'm open to further discussion if you have additional concerns.
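
As a quick sanity check of the numbers above (compile-time assertions only; the constant names are just labels for the quoted dimensions):

```cpp
#include <cstddef>

// Output dimensions of the candidate MatMul weights quoted above.
constexpr std::size_t target_vocab = 151936;  // target model vocabulary
constexpr std::size_t draft_vocab = 32000;    // draft model vocabulary
constexpr std::size_t ffn_dim = 12288;        // new feature output (4096 * 3)

// The max-dimension heuristic is correct as long as the vocabulary
// dimension dominates the extra output's dimension in both models.
static_assert(target_vocab > ffn_dim, "target LM head is the largest output");
static_assert(draft_vocab > ffn_dim, "draft LM head is the largest output");
```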

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch from 596799e to 12f1b3f on December 2, 2025 02:24
@AlexanderKalistratov
Contributor

AlexanderKalistratov commented Dec 2, 2025

  • As you mentioned, this approach would be specific to the Eagle-3 use case. Additionally, there's another consideration: even within the Eagle-3 pipeline, the draft model also has a new output with the same issue but without a Concat op.

We are cutting the LM head for performance benefits.
While the goal of this PR is to fix the Eagle 3 case functionally, it is an open question whether we should cut one head or all "heads".

@AsyaPronina
Contributor

build_jenkins

@AsyaPronina
Contributor

AsyaPronina commented Dec 2, 2025

Hello Dear @GuoliangShiIntel !
Thanks a lot for such a detailed answer and analysis. Indeed, the current approach should work, but we are afraid that the draft's vocab_size may not always exceed emb_size * 3, which can lead to wrong LM head detection.

Our current proposal (from @AlexanderKalistratov and @dmatveev) is:

  • For target model - to tag Result (or might be all?) Eagle-3 specific nodes in the target model via rt_info in OpenVINO GenAI pipeline. As these nodes are explicitly added in OpenVINO GenAI pipeline, this approach is robust. Then, in old CutLMHead transformation, in the rewriter, we can just check that if pattern was matched but there is an eagle3 tag in the Result node, then we won't cut this head and continue.
  • For draft model - forbid LM head cutting via passing "NPUW_LLM_SHARED_HEAD : NO" in OpenVINO GenAI pipeline for draft model.

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 2 times, most recently from efd95c8 to ccf7194 on December 3, 2025 02:55
@GuoliangShiIntel
Contributor Author

Hello Dear @GuoliangShiIntel ! Thanks a lot for such a detailed answer and analysis. Indeed, the current approach should work, but we are afraid that the draft's vocab_size may not always exceed emb_size * 3, which can lead to wrong LM head detection.

Our current proposal (from @AlexanderKalistratov and @dmatveev) is:

  • For target model - to tag Result (or might be all?) Eagle-3 specific nodes in the target model via rt_info in OpenVINO GenAI pipeline. As these nodes are explicitly added in OpenVINO GenAI pipeline, this approach is robust. Then, in old CutLMHead transformation, in the rewriter, we can just check that if pattern was matched but there is an eagle3 tag in the Result node, then we won't cut this head and continue.
  • For draft model - forbid LM head cutting via passing "NPUW_LLM_SHARED_HEAD : NO" in OpenVINO GenAI pipeline for draft model.

Hi @AsyaPronina, one reason I used size-based matching for LM_HEAD was to avoid making it specific to the Eagle-3 feature, though I agree it's not safe enough.

If you're comfortable with using RT_INFO to explicitly set the name, that works for me too. However, to avoid being too specific in such a common transformation, I introduced a generic hidden_output_name_key instead of explicitly marking it as Eagle-3-related.

The reason is that we're seeing (and will likely see more) pipelines with additional outputs—such as multimodal models. This way, we can reuse this key for future cases rather than creating Eagle-3-specific logic.

And for the draft model, this approach can also work for better performance by setting in the pipeline:

draft_model->set_rt_info("last_hidden_state", "hidden_output_name");
main_model->set_rt_info("last_hidden_state", "hidden_output_name");

What do you think?
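
A self-contained sketch of how the transformation side might consume such a key; the plain map stands in for the model's rt_info, and should_skip_result is a hypothetical helper, not an actual NPUW function:

```cpp
#include <map>
#include <set>
#include <string>

// Stand-in for the model-level rt_info. In OpenVINO this would come from
// model->get_rt_info(); a plain string map keeps the sketch self-contained.
using ModelRtInfo = std::map<std::string, std::string>;

// Returns true if the Result node (identified by its output tensor names)
// carries the tagged hidden-state name and must be skipped by LM head cutting.
bool should_skip_result(const ModelRtInfo& rt_info,
                        const std::set<std::string>& result_output_names) {
    auto it = rt_info.find("hidden_output_name");
    if (it == rt_info.end()) {
        return false;  // no tag set: normal LM head cutting applies
    }
    return result_output_names.count(it->second) > 0;
}
```

With the key set as in the snippet above, only the Result whose output is named "last_hidden_state" would be excluded from the cut.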

@AsyaPronina
Contributor

Hello Dear @GuoliangShiIntel ! Thanks a lot for such a detailed answer and analysis. Indeed, the current approach should work, but we are afraid that the draft's vocab_size may not always exceed emb_size * 3, which can lead to wrong LM head detection.
Our current proposal (from @AlexanderKalistratov and @dmatveev) is:

  • For target model - to tag Result (or might be all?) Eagle-3 specific nodes in the target model via rt_info in OpenVINO GenAI pipeline. As these nodes are explicitly added in OpenVINO GenAI pipeline, this approach is robust. Then, in old CutLMHead transformation, in the rewriter, we can just check that if pattern was matched but there is an eagle3 tag in the Result node, then we won't cut this head and continue.
  • For draft model - forbid LM head cutting via passing "NPUW_LLM_SHARED_HEAD : NO" in OpenVINO GenAI pipeline for draft model.

Hi @AsyaPronina, one reason I used size-based matching for LM_HEAD was to avoid making it specific to the Eagle-3 feature, though I agree it's not safe enough.

If you're comfortable with using RT_INFO to explicitly set the name, that works for me too. However, to avoid being too specific in such a common transformation, I introduced a generic hidden_output_name_key instead of explicitly marking it as Eagle-3-related.

The reason is that we're seeing (and will likely see more) pipelines with additional outputs—such as multimodal models. This way, we can reuse this key for future cases rather than creating Eagle-3-specific logic.

And for the draft model, this approach can also work for better performance by setting in the pipeline:

draft_model->set_rt_info("last_hidden_state", "hidden_output_name");
main_model->set_rt_info("last_hidden_state", "hidden_output_name");

What do you think?

Hello Dear @GuoliangShiIntel ! Thanks a lot for the quick update!
Personally, I like this approach a lot. However, let's hear @AlexanderKalistratov's and @dmatveev's opinions on it as well!

@AsyaPronina
Contributor

build_jenkins

Contributor

@dmatveev dmatveev left a comment


My recommendation here is not to introduce any changes if we can avoid doing so.

@GuoliangShiIntel would forcing the two-model pipeline solve the problem? Assuming that the NPUW-side SLICE_OUT is still applied where possible.

Comment on lines +887 to +888
if (model_rt_info.count(ov::npuw::LLMCompiledModel::hidden_output_name_key)) {
const auto& hidden_output_name =
Contributor


How does the rt_info get this key to begin with - who's adding it there?

This is a question to @AsyaPronina .

Contributor


OpenVINO GenAI pipeline for Eagle-3 is responsible for this

Contributor


Why does it do it via the model's own RT_INFO? I believe updating the node's RT_INFO should be enough, so you don't need that extra model_rt_info argument?

Here's an example of how we put this: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model_utils.cpp#L282

And here's an example of how we handle this:
https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model_utils.cpp#L282

Contributor Author


Good proposal. Question on where to set the node's rt_info:

  1. NPUW: No appropriate place found
  2. GenAI with string key: result_node->get_rt_info()["manually_added_output"] = true; - direct but uses hard-coded string
  3. NPUW with npuw::util::ManuallyAddedOutputAttr: Requires exposing public rt_info attribute in GenAI - unsure if architecturally sound

any comments?

@GuoliangShiIntel
Contributor Author

My recommendation here is not to introduce any changes if we can avoid doing so.

@GuoliangShiIntel would forcing the two-model pipeline solve the problem? Assuming that the NPUW-side SLICE_OUT is still applied where possible.

A two-model pipeline can work well without any code change, but it will affect memory usage; please find the data here.


Comment on lines +890 to +895
const auto& result_output_names = matched_result->output(0).get_names();
for (const auto& name : result_output_names) {
if (name == hidden_output_name) {
return false;
}
}
Contributor


So instead of this you can just check matched_result's RT_INFO to check if it was added manually, and skip the cut if that's the case.
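
The suggestion above, sketched with a plain map standing in for the node's rt_info (the key mirrors the hard-coded string discussed earlier; is_manually_added_output is a hypothetical helper, not actual plugin code):

```cpp
#include <map>
#include <string>

// Stand-in for node-level rt_info; in OpenVINO this would be the map
// returned by node->get_rt_info(), holding ov::Any values.
using NodeRtInfo = std::map<std::string, bool>;

// The rewriter callback can bail out early for outputs the GenAI pipeline
// tagged as manually added, leaving the original pattern logic untouched
// for the real LM head.
bool is_manually_added_output(const NodeRtInfo& rt_info) {
    auto it = rt_info.find("manually_added_output");
    return it != rt_info.end() && it->second;
}
```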

@GuoliangShiIntel GuoliangShiIntel force-pushed the fix-lm-head-multiple-matmul-matching branch 2 times, most recently from f97f20d to 5d7e933 on December 4, 2025 12:37
@dmatveev dmatveev enabled auto-merge December 5, 2025 00:00
@dmatveev
Contributor

dmatveev commented Dec 5, 2025

GenAI with string key: result_node->get_rt_info()["manually_added_output"] = true; - direct but uses hard-coded string

I think we can live with this, thanks @GuoliangShiIntel !

@dmatveev
Contributor

dmatveev commented Dec 5, 2025

build_jenkins

Contributor

@AsyaPronina AsyaPronina left a comment


Great effort @GuoliangShiIntel , thanks a lot!

@dmatveev
Contributor

dmatveev commented Dec 6, 2025

build_jenkins

@dmatveev dmatveev added this pull request to the merge queue Dec 8, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 8, 2025
@AsyaPronina AsyaPronina added this pull request to the merge queue Dec 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 9, 2025
@AsyaPronina AsyaPronina added this pull request to the merge queue Dec 9, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 9, 2025
@AsyaPronina AsyaPronina added this pull request to the merge queue Dec 10, 2025
Merged via the queue into openvinotoolkit:master with commit ca968e4 Dec 10, 2025
183 checks passed
Naseer-010 pushed a commit to Naseer-010/openvino that referenced this pull request Jan 4, 2026
…envinotoolkit#32475)

### Details:
**Background:**
The Eagle 3 pipeline adds a new output to the target model to obtain the
intermediate feature embeddings.

The `cut_lm_head` function separates the vocabulary matrix (LM head)
from LLM models for efficient inference. It needs to identify the
correct `MatMul` operation among multiple candidates in the model graph.

**Problem:**
When multiple `MatMul` operations match the pattern (common in LLMs),
the callback executes multiple times, with each execution overwriting
the previous result. Only the last matched `MatMul` is used, often
missing the actual vocabulary matrix.

**Solution:**
Replaced `MatcherPass` with direct traversal and intelligent selection:
1. Collect all candidates instead of using last match
2. Select MatMul with largest matrix size (vocabulary size heuristic)
3. Optimize traversal - iterate Result nodes directly instead of all
nodes


### Tickets:
 - [*CVS-175198*](https://jira.devtools.intel.com/browse/CVS-175198)