Merged
@@ -1204,6 +1204,10 @@ ov::AnyMap get_default_generate_config(const std::optional<NPUDesc>& npudesc,
if (hint == ::intel_npu::npuw::llm::GenerateHint::FAST_COMPILE) {
config.emplace("NPUW_UNFOLD_IREQS", "YES");
}
// Specify NPUW DQ if Compiler DQ is not enabled
if (!npudesc.has_value() || !npudesc->compiler_dq) {
config.emplace("NPUW_DQ", "YES");
}
Comment on lines +1207 to +1210
Contributor

@dmatveev dmatveev Jan 15, 2026

This change certainly brings back the missing behavior, but after reviewing the history thoroughly I am not sure the OLD behavior was correct.

The OLD behavior was first introduced here: #28343

The logic we bring back is: "use NPUW_DQ if the compiler DQ is not present". But, if I remember correctly, NPUW_DQ is the compiler DQ. They come as a pair, so one can't substitute for the other. The idea is that to make the compiler DQ work, we sometimes need to transform the model in a certain way. If the compiler DQ isn't available, as in older drivers, we need to transform the model even more (the FULL NPUW-side DQ).

UPD: NPUW_DQ_FULL is on by default, so enabling NPUW_DQ here gives us NPUW_DQ_FULL automatically. It is obscure but seems to work (see below).

Mnemonics in the property description confirm this:

Looking at the default values - https://github.com/openvinotoolkit/openvino/blob/2025.4.0/src/plugins/intel_npu/src/al/include/intel_npu/config/npuw.hpp#L111

  • NPUW_DQ is false (probably a rudiment)
  • NPUW_DQ_FULL is true
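To make the effect of these defaults concrete, here is a minimal sketch, assuming a plain `std::map` in place of `ov::AnyMap`. The names `Config`, `flag_enabled` and `generate_config_sketch` are hypothetical and do not exist in the plugin; only the two default values and the restored condition are taken from this PR.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the plugin's config map (the real code uses ov::AnyMap).
using Config = std::map<std::string, std::string>;

// Hypothetical helper: an explicit "YES"/"NO" entry wins, otherwise fall back
// to the defaults quoted above (NPUW_DQ off, NPUW_DQ_FULL on).
inline bool flag_enabled(const Config& config, const std::string& key) {
    static const std::map<std::string, bool> defaults = {
        {"NPUW_DQ", false},
        {"NPUW_DQ_FULL", true},
    };
    const auto it = config.find(key);
    if (it != config.end()) {
        return it->second == "YES";
    }
    const auto def = defaults.find(key);
    return def != defaults.end() && def->second;
}

// Mirrors only the condition restored by this PR: request NPUW_DQ when the
// compiler does not report DQ support. Everything else is omitted.
inline Config generate_config_sketch(bool has_npudesc, bool compiler_dq) {
    Config config;
    if (!has_npudesc || !compiler_dq) {
        config.emplace("NPUW_DQ", "YES");
    }
    return config;
}
```

Under these assumptions, `generate_config_sketch(true, false)` yields a config where both `NPUW_DQ` and `NPUW_DQ_FULL` read as enabled, even though only `NPUW_DQ` was set explicitly.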

Now looking into the configuration building:

This logic seems to be good for the moment. Remember this is the baseline common configuration that is used as a basis for the prefill & generate stages.

But later, when we refine the PREFILL model config, we do something obscure: https://github.com/openvinotoolkit/openvino/blob/2025.4.0/src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model.cpp#L1171 - strangely enough, this change was introduced by the same original commit 57025dc

UPD2: The obscurity is deciphered above.

The old logic (in red) seems to make more sense than the new one (in green):

[Image: diff screenshot, old logic in red and new logic in green]

Previously (red), we set DQ_FULL to NO (to avoid the full transformation) if and only if the compiler supported DQ; that made sense. Now (green) this condition is reversed: when compiler DQ is not present, we set NPUW_DQ (instead of NPUW_DQ_FULL, which is supposed to handle this case). That looked like a clear miss, but setting NPUW_DQ also includes NPUW_DQ_FULL, as that one wasn't disabled.

Same thing happened for the GENERATE model - we didn't find the capability but we still set _DQ (not _DQ_FULL that is supposed to be there).

Initially, we only had NPUW_DQ, which did the full transformation. Later, when compiler-side DQ came in, we provided the past behavior under NPUW_DQ_FULL, and used NPUW_DQ to do the compiler-friendly transformation (only impacting group-quantized models). I believe the combination of this rename & some "refactoring" in the original commit caused the confusion (UPD2).

TL;DR: the old behavior is restored, but the old behavior is sus

UPD: More archeology

  1. NPUW_DQ was the die hard one in the beginning: NPUW: Introduce DQ #26362 (did the full transformation to the model)
  2. NPUW_DQ_FULL was introduced later as an early return in the die-hard NPUW_DQ path: [NPUW] Introduce DQ_FULL property #27678

DQ and DQ_FULL are not inverses of each other. DQ_FULL only takes effect when it is ON while DQ is also ON.
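The relationship between the two flags can be sketched as a small truth table. This is only an illustration of the reading above, not the plugin's code: `DqMode` and `resolve_dq_mode` are hypothetical names, and the mapping assumes NPUW_DQ_FULL acts purely as a switch inside the NPUW_DQ path.

```cpp
// Hypothetical summary of which transformation ends up being applied.
enum class DqMode {
    Off,               // no DQ transformation at all
    CompilerFriendly,  // lighter transformation that relies on compiler-side DQ
    Full               // the original, full NPUW-side transformation
};

// Assumption: NPUW_DQ gates everything, and NPUW_DQ_FULL only selects
// between the full and the compiler-friendly variants once NPUW_DQ is ON.
inline DqMode resolve_dq_mode(bool npuw_dq, bool npuw_dq_full) {
    if (!npuw_dq) {
        return DqMode::Off;  // NPUW_DQ_FULL alone has no effect
    }
    return npuw_dq_full ? DqMode::Full : DqMode::CompilerFriendly;
}
```

With the default NPUW_DQ_FULL = true, turning on NPUW_DQ alone resolves to `DqMode::Full`, which is why the restored code still ends up on the full path when compiler DQ is absent.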

So here comes UPD2;

// We don't need slice out for kv cache model, especially for speculative decoding which need
// to generate more than 1 token for each inference
config.erase("NPUW_SLICE_OUT");