NPUW: Abstract away fail-safety from the core #34927
dmatveev wants to merge 6 commits into openvinotoolkit:master from
Conversation

dmatveev left a comment:

Self-review part 1
```cpp
if (devices.size() == 1u) {
    auto compiled_model = factory(devices.front());
    OPENVINO_ASSERT(compiled_model != nullptr,
                    "Failsafe factory returned null compiled model for device ",
                    devices.front());
    return compiled_model;
}
```
Where did this change go?
```cpp
if (recompiled) {
    *recompiled = device_before != m_npuw_model->submodel_device(id);
}
```
Probably we shouldn't care about this at all now. Should `recompiled` be removed as an argument?
```cpp
LOG_BLOCK();
unsafe_run_this_prep_next(idx, next_prepared);
job_done = true;
failover = failover || (m_subrequest_devices[real_idx] != m_npuw_model->submodel_device(real_idx));
```
Does this mean that if failover happened once and we've got this difference, every subsequent inference will hit this path and recreate the subrequests again?
Also, should the requests be recreated at all, given that they execute via the fail-safe model?
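If repeated recreation does need to be avoided, one way to sketch it (hypothetical `FailoverTracker`, not code from this PR) is to latch the first observed device change per subgraph, so later inferences skip the compare-and-recreate path:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: remember the first device divergence per subgraph so
// subsequent inferences do not recreate the subrequests again. All names here
// are illustrative, not from the PR.
class FailoverTracker {
public:
    explicit FailoverTracker(std::size_t num_subgraphs) : m_handled(num_subgraphs, false) {}

    // Returns true only the first time the request's device diverges from the
    // model's device for this subgraph index.
    bool needs_recreate(std::size_t idx, const std::string& request_dev, const std::string& model_dev) {
        if (m_handled[idx] || request_dev == model_dev) {
            return false;
        }
        m_handled[idx] = true;
        return true;
    }

private:
    std::vector<bool> m_handled;
};
```

Whether such a latch belongs in the request at all (versus the fail-safe model handling it internally) is exactly the question the comment raises.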
Restore the single-device fast path in the failsafe factory, remove stale infer-request recreation plumbing, and align failsafe unit tests with the public factory behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```cpp
const bool allow_runtime_failover = m_cfg.get<::intel_npu::NPUW_FALLBACK_EXEC>();
for (auto iter = m_compiled_submodels[id].device_it; iter != m_dev_list.cend(); ++iter) {
```
The device iterator shouldn't be present here anymore, right? It is only the Failsafe model that knows there are multiple devices to compile for.
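A minimal sketch of the encapsulation suggested here (all names hypothetical, `CompiledStub` stands in for the real compiled-model type): the failsafe wrapper owns the device list and walks it itself, so callers never hold a device iterator:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a compiled model; the real type lives in ov::npuw.
struct CompiledStub {};

// Hypothetical failsafe wrapper: it alone knows there are multiple candidate
// devices, and exposes only the one that actually compiled.
class FailsafeModel {
public:
    using Factory = std::function<std::shared_ptr<CompiledStub>(const std::string&)>;

    FailsafeModel(std::vector<std::string> devices, const Factory& factory)
        : m_devices(std::move(devices)) {
        for (m_active = 0; m_active < m_devices.size(); ++m_active) {
            if ((m_model = factory(m_devices[m_active]))) {
                return;  // first device that compiles wins
            }
        }
        throw std::runtime_error("No device could compile the model");
    }

    const std::string& active_device_name() const { return m_devices[m_active]; }

private:
    std::vector<std::string> m_devices;
    std::size_t m_active = 0;
    std::shared_ptr<CompiledStub> m_model;
};
```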
```cpp
auto active_device_it = std::find(m_dev_list.cbegin(), m_dev_list.cend(), wrapped_failsafe->active_device_name());
OPENVINO_ASSERT(active_device_it != m_dev_list.cend(), "Failsafe selected an unknown device");
```
Let's get rid of `device_it` if possible.
```cpp
failover_happened |= recompiled;
const auto device_before = m_npuw_model->submodel_device(i);
auto rqs = create_infer_requests(i, is_piped ? 2 : 1);
failover_happened |= device_before != m_npuw_model->submodel_device(i);
```
Do we really need to know here that a failover happened?
```cpp
m_subrequests[real_idx] = new_rqs.at(0);
if (is_piped) {
    m_funcall_pipeline[real_idx].subrequest = new_rqs.at(1);
    if (m_subrequest_devices[real_idx] == active_device) {
```
Why should we track subrequest devices? Should we get rid of that as well?
```cpp
if (proto_comp_model_desc.host_flash_attention) {
    setup_hfa_infer_requests(real_idx, is_piped, /* is_recreate */ true, /* enable_hfa_optimizations */ true);
}
```
Make the HFA requests also work in the fail-safe manner. Remove this code if it is no longer needed.
```cpp
// Feeding the global Parameters is now part of the common
// execution pipeline: see how it is done in
// `unsafe_run_this_prep_next()`. Now we only need to bind
// the subrequest's outputs to global Results, if relevant.
bind_global_results(idx);

// Another infer request may have already failed over this shared compiled model.
refresh_failover_side_resources(idx);
```
Why is it called here? Nothing has happened here yet.
```cpp
const bool device_changed = m_subrequest_devices[real_idx] != m_npuw_model->submodel_device(real_idx);
failover = failover || device_changed;
if (device_changed) {
    refresh_failover_side_resources(idx);
}
```
With failover in the inference chaining test in place, no actions should be taken here?
Coordinate failover across main and auxiliary compiled models so infer requests no longer refresh side resources explicitly. Keep the single-device fast path intact and let auxiliary helpers self-heal on device changes where needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```cpp
if (processing_mode == MoEProcessingMode::EXPERT_ITERATIVE && m_resources.device_name != active_device) {
    LOG_INFO("Reallocating MoE iterative output buffer for failover to " << active_device << "...");
    LOG_BLOCK();

    const size_t active_experts = m_config.num_active_experts;
    const size_t embed_dim = m_config.expert_hidden_dim;
    ov::Shape buffer_shape = {active_experts, 1, input_token_count, embed_dim};
    auto any_compiled_model = m_config.compiled_models.begin()->second;
    auto output_element_type = any_compiled_model->outputs()[0].get_element_type();
    m_resources.expert_output_accumulator = m_allocator(output_element_type, buffer_shape, active_device);
    m_resources.device_name = active_device;
}
```
Let's avoid this knowledge here. It should be transparent, shouldn't it? Let's assume we'll start optimizing those memory hops later if we really need to.
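One transparent shape this could take (hypothetical helper, not code from the PR; `Tensor`/`TensorPtr` are simplified stand-ins): a buffer that lazily reallocates itself when requested on a different device, so the MoE path never checks devices explicitly:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Simplified stand-in tensor: just remembers where it lives.
struct Tensor {
    std::string device;
};
using TensorPtr = std::shared_ptr<Tensor>;

// Hypothetical device-following buffer: callers ask for the tensor on the
// currently active device; reallocation on failover happens inside, invisibly.
class DeviceFollowingBuffer {
public:
    using Allocator = std::function<TensorPtr(const std::string&)>;

    explicit DeviceFollowingBuffer(Allocator alloc) : m_alloc(std::move(alloc)) {}

    TensorPtr get(const std::string& active_device) {
        if (!m_tensor || m_tensor->device != active_device) {
            m_tensor = m_alloc(active_device);  // lazy (re)allocation on device change
        }
        return m_tensor;
    }

private:
    Allocator m_alloc;
    TensorPtr m_tensor;
};
```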
```cpp
// Shape: [num_active_experts, 1, input_token_count, expert_hidden_dim]
// Accumulates expert outputs before final reduction
TensorPtr expert_output_accumulator;
std::string device_name;
```
Let's remove it here for now. But let's add a comment to the surrounding code that with a failsafe model in place, the memory management should be done smarter (right now this code is aware of the device the model is compiled for).
```cpp
m_profile["compile/" + device + profile_suffix].record([&]() {
    compiled_model = compile_submodel(model, device);
});
```
With parallel compilation... can there be a race?
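If `m_profile` is a plain map keyed by device, concurrent `operator[]` calls from parallel compile workers would be a data race. A minimal sketch of one fix (hypothetical `ProfileStore`, not the PR's actual profiling code): serialize only the map update, keeping the long-running compilation outside the lock:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical thread-safe profile store: the (long) body runs unlocked and
// concurrently; only the shared map update is serialized.
class ProfileStore {
public:
    void record(const std::string& key, const std::function<void()>& body) {
        const auto start = std::chrono::steady_clock::now();
        body();  // e.g. compile_submodel(model, device) -- may run in parallel
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::lock_guard<std::mutex> guard(m_mutex);
        m_us[key] += us;
        ++m_calls[key];
    }

    std::size_t calls(const std::string& key) {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_calls[key];
    }

private:
    std::mutex m_mutex;
    std::map<std::string, long long> m_us;
    std::map<std::string, std::size_t> m_calls;
};
```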
```diff
 const bool enable_hfa_optimizations =
     std::dynamic_pointer_cast<ov::npuw::failsafe::CompiledModel>(proto_comp_model_desc.compiled_model._ptr) ==
     nullptr;
 setup_hfa_infer_requests(real_idx,
                          is_piped,
                          /* is_recreate */ false,
-                         /* enable_hfa_optimizations */ true);
+                         enable_hfa_optimizations);
```
Why can't HFA optimizations work with the failsafe model?
```cpp
    // with the default (to be accessed next) one.
    std::swap(m_subrequests[real_idx], m_funcall_pipeline[real_idx].subrequest);
}
failover = failover || (active_device_before != m_npuw_model->submodel_device(real_idx));
```
Why should we track the failover here?
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tickets:
AI Assistance: