[NPUW] Fix multiple MatMul matching issue in NPUW LM head cutting#32475
Conversation
Force-pushed 9343a41 to 5efef96
@AsyaPronina Could you please take a look at this fix?
Force-pushed 485b91b to b1f74ce
Force-pushed b1f74ce to 596799e
build_jenkins
Overall I like this approach to find the logits MatMul. However, in the context of Eagle-3 we can also differentiate the second features MatMul by just looking for a Concat node at the top:

- The option implemented in the PR seems stable, as the vocabulary should have the maximum size among the output MatMuls, and it is not limited to only two MatMul outputs. However, additional checks should be done on the output shapes. Could you please provide a concrete example of the different shapes and the `size0` and `size1` values this algorithm uses to make its final decision, so we can understand how stable it is?
- The proposed approach with Concat can simply check in the rewriter callback whether there is a Concat node above, and if so do nothing. This preserves the previous logic in the `CutLMHead` pattern rewriter and minimizes changes. However, unlike the first option, it may be more tied to the Eagle-3 use case.
    using Ref = std::reference_wrapper<CutContext>;
    };

    CutLMHead(CutContext::Ref cut_context) : m_cut_context(cut_context) {
I think that here we face the same issue as commented on in another PR: #32891 (comment)
If we want to pass the source model into the constructor of the pattern matcher, we might need to use another OpenVINO Transformations API here, such as ModelPass. Examples are here:
https://github.com/openvinotoolkit/openvino/blob/master/src/core/include/openvino/pass/sdpa_to_paged_attention.hpp
https://github.com/openvinotoolkit/openvino/blob/master/src/core/src/pass/sdpa_to_paged_attention.cpp
Please IM me if you have questions!
Good point, I have updated the code. Please take a look.
@AsyaPronina Thanks for the proposal and analysis.
Target model:
Draft model:
Based on typical LLM architecture constraints, the algorithm should reliably identify the vocabulary MatMul.
Based on this analysis, I think the current solution is more robust. However, I'm open to further discussion if you have additional concerns.
Force-pushed 596799e to 12f1b3f
We are cutting the LM head for performance benefits.

build_jenkins
Hello dear @GuoliangShiIntel! Our current proposal (from @AlexanderKalistratov and @dmatveev) is:
Force-pushed efd95c8 to ccf7194
Hi @AsyaPronina, one reason I used size-based matching for LM_HEAD was to avoid making it specific to the Eagle-3 feature, though I agree it's not safe enough. If you're comfortable with using RT_INFO to explicitly set the name, that works for me too. However, to avoid being too specific in such a common transformation, I introduced a generic rt_info key. The reason is that we're seeing (and will likely see more) pipelines with additional outputs, such as multimodal models. This way, we can reuse this key for future cases rather than creating Eagle-3-specific logic. For the draft model, this approach can also improve performance by setting the key in the pipeline. What do you think?
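To illustrate the generic-key idea, here is a minimal self-contained sketch using a plain `std::map` in place of OpenVINO's per-model rt_info (`ov::RTMap`). The key string and helper names below are hypothetical; the PR itself exposes a similar key as `ov::npuw::LLMCompiledModel::hidden_output_name_key`.

```cpp
#include <map>
#include <string>

// Mock of a model's rt_info map; real code would use
// model->get_rt_info() (an ov::AnyMap) instead.
using RtInfo = std::map<std::string, std::string>;

// Hypothetical generic key naming the extra (non-logits) output.
static const char* kHiddenOutputKey = "npuw:hidden_output_name";

// Pipeline side: record which output tensor name was added manually.
inline void mark_hidden_output(RtInfo& model_rt_info, const std::string& name) {
    model_rt_info[kHiddenOutputKey] = name;
}

// Plugin side: query whether a given result tensor name was marked,
// so the LM-head cut can leave it alone.
inline bool is_hidden_output(const RtInfo& model_rt_info, const std::string& name) {
    auto it = model_rt_info.find(kHiddenOutputKey);
    return it != model_rt_info.end() && it->second == name;
}
```

The design point being debated in this thread is only *where* this key lives: on the model's rt_info (as sketched here) or on the individual Result node's rt_info.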
Hello dear @GuoliangShiIntel! Thanks a lot for the quick update!

build_jenkins
dmatveev
left a comment
My recommendation here is not to introduce any changes if we can avoid doing so.
@GuoliangShiIntel would forcing the two-model pipeline solve the problem? Assuming that the NPUW-side SLICE_OUT is still applied where possible.
    if (model_rt_info.count(ov::npuw::LLMCompiledModel::hidden_output_name_key)) {
        const auto& hidden_output_name =
How does the rt_info get this key to begin with - who's adding it there?
This is a question to @AsyaPronina .
The OpenVINO GenAI pipeline for Eagle-3 is responsible for this.
Why does it do it in the model's own RT_INFO? I believe updating the node's RT_INFO should be enough, so you don't need that extra model_rt_info argument?
Here's an example of how we put this: https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model_utils.cpp#L282
And here's an example of how we handle this:
https://github.com/openvinotoolkit/openvino/blob/master/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model_utils.cpp#L282
Good proposal. A question on where to set the node's rt_info:
- NPUW: no appropriate place found
- GenAI with a string key: result_node->get_rt_info()["manually_added_output"] = true; - direct, but uses a hard-coded string
- NPUW with npuw::util::ManuallyAddedOutputAttr: requires exposing a public rt_info attribute in GenAI - unsure if architecturally sound

Any comments?
The two-model pipeline can work well without any code change.
    const auto& result_output_names = matched_result->output(0).get_names();
    for (const auto& name : result_output_names) {
        if (name == hidden_output_name) {
            return false;
        }
    }
So instead of this, you can just check matched_result's RT_INFO to see if it was added manually, and skip the cut if that's the case.
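A minimal sketch of that callback check, with a plain `std::map` standing in for the node's rt_info; the key name "manually_added_output" comes from the string-key option discussed in this thread, and `should_apply_cut` is an illustrative helper, not the PR's actual function:

```cpp
#include <map>
#include <string>

// Mock node rt_info; real code would consult matched_result->get_rt_info().
using RtInfo = std::map<std::string, bool>;

// Rewriter-callback sketch: apply the LM-head cut only when the matched
// Result node does NOT carry the "manually added" marker.
inline bool should_apply_cut(const RtInfo& result_rt_info) {
    auto it = result_rt_info.find("manually_added_output");
    return it == result_rt_info.end() || !it->second;
}
```

Returning false from the matcher callback (as in the quoted snippet above) is then equivalent to `should_apply_cut` evaluating to false for a marked node, without needing to compare output tensor names.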
Force-pushed f97f20d to 5d7e933
I think we can live with this, thanks @GuoliangShiIntel!

build_jenkins
AsyaPronina
left a comment
Great effort @GuoliangShiIntel , thanks a lot!
build_jenkins
…envinotoolkit#32475)

### Details:

**Background:**
The Eagle-3 pipeline adds a new output to the target model to get the intermediate feature embeddings. The `cut_lm_head` function separates the vocabulary matrix (LM head) from LLM models for efficient inference. It needs to identify the correct `MatMul` operation among multiple candidates in the model graph.

**Problem:**
When multiple `MatMul` operations match the pattern (common in LLMs), the callback executes multiple times, with each execution overwriting the previous result. Only the last matched `MatMul` is used, often missing the actual vocabulary matrix.

**Solution:**
Replaced `MatcherPass` with direct traversal and intelligent selection:
1. Collect all candidates instead of using only the last match
2. Select the MatMul with the largest matrix size (vocabulary-size heuristic)
3. Optimize traversal: iterate Result nodes directly instead of all nodes

### Tickets:
- [*CVS-175198*](https://jira.devtools.intel.com/browse/CVS-175198)
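The three solution steps could be sketched as follows. `MockNode` and `select_lm_head` are illustrative stand-ins for the real traversal over the producers of the model's Result nodes, not the PR's actual code:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Minimal mock: one entry per Result node, describing the op that feeds it.
struct MockNode {
    std::string type;           // e.g. "MatMul", "Concat"
    uint64_t weight_elems = 0;  // weight element count for MatMul producers
};

// Sketch of the fixed selection: walk only the Result producers (step 3),
// collect every MatMul candidate (step 1), and keep the one with the
// largest weight, i.e. the vocabulary-size heuristic (step 2), instead of
// whichever candidate happened to match last.
inline int select_lm_head(const std::vector<MockNode>& result_producers) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(result_producers.size()); ++i) {
        const auto& n = result_producers[i];
        if (n.type != "MatMul") {
            continue;
        }
        if (best < 0 || n.weight_elems > result_producers[best].weight_elems) {
            best = i;
        }
    }
    return best;  // index of the presumed logits MatMul, or -1 if none
}
```

Because all candidates are compared in one pass, an Eagle-3 target model with both a logits MatMul and a features MatMul feeding separate Results yields a deterministic choice regardless of graph iteration order.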