NPUW: Abstract away fail-safety from the core #34927

Draft
dmatveev wants to merge 6 commits into openvinotoolkit:master from dmatveev:dm/npuw-failsafe-models-switch

Conversation

@dmatveev (Contributor)

Tickets:

AI Assistance:

  • AI assistance used: yes
  • Generated according to spec

@dmatveev dmatveev added this to the 2026.2 milestone Mar 25, 2026
@github-actions github-actions bot added category: NPU OpenVINO NPU plugin category: NPUW NPUW plugin labels Mar 25, 2026
@dmatveev left a comment

Self-review part 1

Comment on lines -56 to -62
if (devices.size() == 1u) {
auto compiled_model = factory(devices.front());
OPENVINO_ASSERT(compiled_model != nullptr,
"Failsafe factory returned null compiled model for device ",
devices.front());
return compiled_model;
}
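For reference, the single-device fast path this hunk removed can be sketched as follows. All type and function names here are illustrative stand-ins, not the real NPUW API:

```cpp
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for the real NPUW types, for illustration only.
struct CompiledModel {
    std::string device;
};
using ModelPtr = std::shared_ptr<CompiledModel>;
using Factory = std::function<ModelPtr(const std::string&)>;

// Try each device in order. With a single device there is nothing to
// fail over to, so the factory result is returned directly - the fast
// path this review comment asks about.
ModelPtr compile_failsafe(const std::vector<std::string>& devices, const Factory& factory) {
    if (devices.size() == 1u) {
        auto compiled = factory(devices.front());
        if (compiled == nullptr) {
            throw std::runtime_error("Factory returned null for " + devices.front());
        }
        return compiled;  // fast path: no failover bookkeeping needed
    }
    for (const auto& device : devices) {
        try {
            if (auto compiled = factory(device)) {
                return compiled;  // first device that compiles wins
            }
        } catch (const std::exception&) {
            // Compilation failed on this device - try the next one.
        }
    }
    throw std::runtime_error("All devices failed to compile the model");
}
```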

Where did this change go?

Comment on lines +53 to +55
if (recompiled) {
*recompiled = device_before != m_npuw_model->submodel_device(id);
}

Probably we shouldn't care about this at all now. Should recompiled be removed as an argument?

LOG_BLOCK();
unsafe_run_this_prep_next(idx, next_prepared);
job_done = true;
failover = failover || (m_subrequest_devices[real_idx] != m_npuw_model->submodel_device(real_idx));

Does this mean that if failover happened once and we've got this difference, every subsequent inference will hit this path and recreate the subrequests again?
Also, should the requests be recreated at all, given that they already execute via the fail-safe model?
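The before/after device comparison questioned in this and several other hunks reduces to the following sketch (all names are hypothetical stand-ins):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Minimal sketch of the device-change check repeated across these hunks:
// snapshot the submodel's device before running, compare afterwards to
// decide whether a failover happened.
struct NpuwModel {
    std::vector<std::string> devices;
    const std::string& submodel_device(std::size_t id) const { return devices[id]; }
    void fail_over(std::size_t id, const std::string& to) { devices[id] = to; }
};

bool run_and_detect_failover(NpuwModel& model, std::size_t id, bool simulate_failure) {
    const std::string device_before = model.submodel_device(id);
    if (simulate_failure) {
        model.fail_over(id, "CPU");  // stand-in for the real failover path
    }
    // Fires only on the inference where the device actually switched; a
    // stale cached copy (like m_subrequest_devices) would keep firing.
    return device_before != model.submodel_device(id);
}
```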

Restore the single-device fast path in the failsafe factory, remove stale infer-request recreation plumbing, and align failsafe unit tests with the public factory behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dmatveev left a comment

Self-review 2

Comment on lines +1761 to 1762
const bool allow_runtime_failover = m_cfg.get<::intel_npu::NPUW_FALLBACK_EXEC>();
for (auto iter = m_compiled_submodels[id].device_it; iter != m_dev_list.cend(); ++iter) {

The device iterator shouldn't be present here anymore, right? Only the failsafe model knows there are multiple devices to compile for.

Comment on lines +1792 to +1793
auto active_device_it = std::find(m_dev_list.cbegin(), m_dev_list.cend(), wrapped_failsafe->active_device_name());
OPENVINO_ASSERT(active_device_it != m_dev_list.cend(), "Failsafe selected an unknown device");

Let's get rid of device_it if possible.

failover_happened |= recompiled;
const auto device_before = m_npuw_model->submodel_device(i);
auto rqs = create_infer_requests(i, is_piped ? 2 : 1);
failover_happened |= device_before != m_npuw_model->submodel_device(i);

Do we really need to know here that a failover happened?

m_subrequests[real_idx] = new_rqs.at(0);
if (is_piped) {
m_funcall_pipeline[real_idx].subrequest = new_rqs.at(1);
if (m_subrequest_devices[real_idx] == active_device) {

Why should we track subrequest devices? Should we get rid of that as well?

Comment on lines +1037 to +1039
if (proto_comp_model_desc.host_flash_attention) {
setup_hfa_infer_requests(real_idx, is_piped, /* is_recreate */ true, /* enable_hfa_optimizations */ true);
}

Make the HFA requests also work in the fail-safe manner. Remove this code if it is no longer needed.

Comment on lines +1320 to +1322

// Feeding the global Parameters is now part of the common
// execution pipeline: See how it is done in
// `unsafe_run_this_prep_next()`. Now we only need to bind
// the subrequest' outputs to global Results, if relevant.
bind_global_results(idx);
// Another infer request may have already failed over this shared compiled model.
refresh_failover_side_resources(idx);

Why is it called here? Nothing has happened at this point yet.

Comment on lines +1339 to +1343
const bool device_changed = m_subrequest_devices[real_idx] != m_npuw_model->submodel_device(real_idx);
failover = failover || device_changed;
if (device_changed) {
refresh_failover_side_resources(idx);
}

With the failover in the inference chaining test in place, should any action be taken here at all?

dmatveev and others added 2 commits March 25, 2026 18:28
Coordinate failover across main and auxiliary compiled models so infer requests no longer refresh side resources explicitly. Keep the single-device fast path intact and let auxiliary helpers self-heal on device changes where needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dmatveev left a comment

Self-review 3

Comment on lines +213 to +224
if (processing_mode == MoEProcessingMode::EXPERT_ITERATIVE && m_resources.device_name != active_device) {
LOG_INFO("Reallocating MoE iterative output buffer for failover to " << active_device << "...");
LOG_BLOCK();

const size_t active_experts = m_config.num_active_experts;
const size_t embed_dim = m_config.expert_hidden_dim;
ov::Shape buffer_shape = {active_experts, 1, input_token_count, embed_dim};
auto any_compiled_model = m_config.compiled_models.begin()->second;
auto output_element_type = any_compiled_model->outputs()[0].get_element_type();
m_resources.expert_output_accumulator = m_allocator(output_element_type, buffer_shape, active_device);
m_resources.device_name = active_device;
}

Let's avoid this knowledge here. It should be transparent, shouldn't it? Let's assume we'll start optimizing those memory hops later if we really need to.
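The device-aware buffer management being flagged here boils down to this lazy-reallocation pattern; the type and member names are stand-ins for the real code:

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Sketch of the lazy-reallocation pattern in the hunk above: a buffer is
// tagged with a device name and recreated only when the active device changes.
struct Buffer {
    std::string device;
    std::size_t bytes;
};

struct SideResources {
    std::shared_ptr<Buffer> expert_output_accumulator;
    std::string device_name;
    int realloc_count = 0;

    void ensure_on(const std::string& active_device, std::size_t bytes) {
        if (device_name != active_device) {
            // This device awareness is exactly what the review wants removed:
            // with a failsafe model, the hop should be transparent to this code.
            expert_output_accumulator = std::make_shared<Buffer>(Buffer{active_device, bytes});
            device_name = active_device;
            ++realloc_count;
        }
    }
};
```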

// Shape: [num_active_experts, 1, input_token_count, expert_hidden_dim]
// Accumulates expert outputs before final reduction
TensorPtr expert_output_accumulator;
std::string device_name;

Let's remove it here for now, but let's add a comment to the surrounding code that with a failsafe model in place, memory management should be done smarter (right now this code is aware of the device the model is compiled for).

Comment on lines +1804 to +1806
m_profile["compile/" + device + profile_suffix].record([&]() {
compiled_model = compile_submodel(model, device);
});

With parallel compilation, can there be a race?
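To make the concern concrete: if two submodels compile in parallel and both write into a shared profile map, the map insert itself is a data race unless serialized. A sketch of a lock-protected recorder, with hypothetical names:

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Sketch only: operator[] on a shared std::map from multiple compilation
// threads is a data race; a mutex around the map access serializes it.
struct Profile {
    std::map<std::string, double> records;
    std::mutex mu;

    void record(const std::string& key, const std::function<void()>& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();  // the timed work itself runs outside the lock
        const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
        std::lock_guard<std::mutex> lock(mu);  // without this, concurrent inserts race
        records[key] += elapsed.count();
    }
};
```

Whether the real m_profile path is actually reachable from multiple threads is exactly what the question above asks.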

Comment on lines +370 to +376
const bool enable_hfa_optimizations =
std::dynamic_pointer_cast<ov::npuw::failsafe::CompiledModel>(proto_comp_model_desc.compiled_model._ptr) ==
nullptr;
setup_hfa_infer_requests(real_idx,
                         is_piped,
                         /* is_recreate */ false,
                         enable_hfa_optimizations);

Why can't HFA optimizations work with the failsafe model?
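The check in the hunk above amounts to a runtime type test; a minimal sketch, where the classes are stand-ins for the real ICompiledModel and failsafe wrapper types:

```cpp
#include <memory>

// Sketch of the type check above: HFA optimizations stay enabled only
// when the compiled model is NOT the failsafe wrapper, detected via
// dynamic_pointer_cast against a polymorphic base.
struct ICompiledModel {
    virtual ~ICompiledModel() = default;
};
struct PlainCompiledModel : ICompiledModel {};
struct FailsafeCompiledModel : ICompiledModel {};

bool hfa_optimizations_allowed(const std::shared_ptr<ICompiledModel>& model) {
    return std::dynamic_pointer_cast<FailsafeCompiledModel>(model) == nullptr;
}
```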

// with the default (to be accessed next) one.
std::swap(m_subrequests[real_idx], m_funcall_pipeline[real_idx].subrequest);
}
failover = failover || (active_device_before != m_npuw_model->submodel_device(real_idx));

Why should we track the failover here?

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>