NPUW: Abstract away fail-safety from the core #34927
dmatveev wants to merge 6 commits into openvinotoolkit:master from
Conversation

dmatveev left a comment:

Self-review part 1
```cpp
if (devices.size() == 1u) {
    auto compiled_model = factory(devices.front());
    OPENVINO_ASSERT(compiled_model != nullptr,
                    "Failsafe factory returned null compiled model for device ",
                    devices.front());
    return compiled_model;
}
```
Where did this change go?
```cpp
if (recompiled) {
    *recompiled = device_before != m_npuw_model->submodel_device(id);
}
```
Probably we shouldn't care about this at all now. Should `recompiled` be removed as an argument?
```cpp
LOG_BLOCK();
unsafe_run_this_prep_next(idx, next_prepared);
job_done = true;
failover = failover || (m_subrequest_devices[real_idx] != m_npuw_model->submodel_device(real_idx));
```
Does this mean that if failover happened once and we've got this difference, every subsequent inference will hit this path and recreate the subrequests again?
Also, should the requests be recreated at all, given that they execute via the fail-safe model?
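If repeated recreation does need to be avoided, one way to sketch it (hypothetical `FailoverTracker`, not code from this PR) is to latch the first observed device change per subgraph, so later inferences skip the compare-and-recreate path:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: remember the first device divergence per subgraph so
// subsequent inferences do not recreate the subrequests again. All names here
// are illustrative, not from the PR.
class FailoverTracker {
public:
    explicit FailoverTracker(std::size_t num_subgraphs) : m_handled(num_subgraphs, false) {}

    // Returns true only the first time the request's device diverges from the
    // model's device for this subgraph index.
    bool needs_recreate(std::size_t idx, const std::string& request_dev, const std::string& model_dev) {
        if (m_handled[idx] || request_dev == model_dev) {
            return false;
        }
        m_handled[idx] = true;
        return true;
    }

private:
    std::vector<bool> m_handled;
};
```

Whether such a latch belongs in the request at all (versus the fail-safe model handling it internally) is exactly the question the comment raises.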
Restore the single-device fast path in the failsafe factory, remove stale infer-request recreation plumbing, and align failsafe unit tests with the public factory behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```cpp
const bool allow_runtime_failover = m_cfg.get<::intel_npu::NPUW_FALLBACK_EXEC>();
for (auto iter = m_compiled_submodels[id].device_it; iter != m_dev_list.cend(); ++iter) {
```
The device iterator shouldn't be present here anymore, right? It is only the Failsafe model that knows there are multiple devices to compile for.
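A minimal sketch of the encapsulation suggested here (all names hypothetical, `CompiledStub` stands in for the real compiled-model type): the failsafe wrapper owns the device list and walks it itself, so callers never hold a device iterator:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a compiled model; the real type lives in ov::npuw.
struct CompiledStub {};

// Hypothetical failsafe wrapper: it alone knows there are multiple candidate
// devices, and exposes only the one that actually compiled.
class FailsafeModel {
public:
    using Factory = std::function<std::shared_ptr<CompiledStub>(const std::string&)>;

    FailsafeModel(std::vector<std::string> devices, const Factory& factory)
        : m_devices(std::move(devices)) {
        for (m_active = 0; m_active < m_devices.size(); ++m_active) {
            if ((m_model = factory(m_devices[m_active]))) {
                return;  // first device that compiles wins
            }
        }
        throw std::runtime_error("No device could compile the model");
    }

    const std::string& active_device_name() const { return m_devices[m_active]; }

private:
    std::vector<std::string> m_devices;
    std::size_t m_active = 0;
    std::shared_ptr<CompiledStub> m_model;
};
```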
```cpp
auto active_device_it = std::find(m_dev_list.cbegin(), m_dev_list.cend(), wrapped_failsafe->active_device_name());
OPENVINO_ASSERT(active_device_it != m_dev_list.cend(), "Failsafe selected an unknown device");
```
Let's get rid of `device_it` if possible.
```cpp
failover_happened |= recompiled;
const auto device_before = m_npuw_model->submodel_device(i);
auto rqs = create_infer_requests(i, is_piped ? 2 : 1);
failover_happened |= device_before != m_npuw_model->submodel_device(i);
```
Do we really need to know here that a failover happened?
```cpp
m_subrequests[real_idx] = new_rqs.at(0);
if (is_piped) {
    m_funcall_pipeline[real_idx].subrequest = new_rqs.at(1);
    if (m_subrequest_devices[real_idx] == active_device) {
```
Why should we track subrequest devices? Should we get rid of that as well?
```cpp
if (proto_comp_model_desc.host_flash_attention) {
    setup_hfa_infer_requests(real_idx, is_piped, /* is_recreate */ true, /* enable_hfa_optimizations */ true);
}
```
Make the HFA requests also work in the fail-safe manner. Remove this code if it is no longer needed.
```cpp
// Feeding the global Parameters is now part of the common
// execution pipeline: see how it is done in
// `unsafe_run_this_prep_next()`. Now we only need to bind
// the subrequest's outputs to global Results, if relevant.
bind_global_results(idx);

// Another infer request may have already failed over this shared compiled model.
refresh_failover_side_resources(idx);
```
Why is it called here? Nothing has happened here yet.
```cpp
const bool device_changed = m_subrequest_devices[real_idx] != m_npuw_model->submodel_device(real_idx);
failover = failover || device_changed;
if (device_changed) {
    refresh_failover_side_resources(idx);
}
```
With failover in the inference chaining test in place, no actions should be taken here?
Coordinate failover across main and auxiliary compiled models so infer requests no longer refresh side resources explicitly. Keep the single-device fast path intact and let auxiliary helpers self-heal on device changes where needed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
```cpp
if (processing_mode == MoEProcessingMode::EXPERT_ITERATIVE && m_resources.device_name != active_device) {
    LOG_INFO("Reallocating MoE iterative output buffer for failover to " << active_device << "...");
    LOG_BLOCK();

    const size_t active_experts = m_config.num_active_experts;
    const size_t embed_dim = m_config.expert_hidden_dim;
    ov::Shape buffer_shape = {active_experts, 1, input_token_count, embed_dim};
    auto any_compiled_model = m_config.compiled_models.begin()->second;
    auto output_element_type = any_compiled_model->outputs()[0].get_element_type();
    m_resources.expert_output_accumulator = m_allocator(output_element_type, buffer_shape, active_device);
    m_resources.device_name = active_device;
}
```
Let's avoid this knowledge here. It should be transparent, shouldn't it? Let's assume we'll start optimizing those memory hops later if we really need to.
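One transparent shape this could take (hypothetical helper, not code from the PR; `Tensor`/`TensorPtr` are simplified stand-ins): a buffer that lazily reallocates itself when requested on a different device, so the MoE path never checks devices explicitly:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>

// Simplified stand-in tensor: just remembers where it lives.
struct Tensor {
    std::string device;
};
using TensorPtr = std::shared_ptr<Tensor>;

// Hypothetical device-following buffer: callers ask for the tensor on the
// currently active device; reallocation on failover happens inside, invisibly.
class DeviceFollowingBuffer {
public:
    using Allocator = std::function<TensorPtr(const std::string&)>;

    explicit DeviceFollowingBuffer(Allocator alloc) : m_alloc(std::move(alloc)) {}

    TensorPtr get(const std::string& active_device) {
        if (!m_tensor || m_tensor->device != active_device) {
            m_tensor = m_alloc(active_device);  // lazy (re)allocation on device change
        }
        return m_tensor;
    }

private:
    Allocator m_alloc;
    TensorPtr m_tensor;
};
```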
```cpp
// Shape: [num_active_experts, 1, input_token_count, expert_hidden_dim]
// Accumulates expert outputs before final reduction
TensorPtr expert_output_accumulator;
std::string device_name;
```
Let's remove it here for now. But let's add a comment to the surrounding code that with a failsafe model in place, the memory management should be done smarter (right now this code is aware of the device the model is compiled for).
```cpp
m_profile["compile/" + device + profile_suffix].record([&]() {
    compiled_model = compile_submodel(model, device);
});
```
With parallel compilation... can there be a race?
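If `m_profile` is a plain map keyed by device, concurrent `operator[]` calls from parallel compile workers would be a data race. A minimal sketch of one fix (hypothetical `ProfileStore`, not the PR's actual profiling code): serialize only the map update, keeping the long-running compilation outside the lock:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical thread-safe profile store: the (long) body runs unlocked and
// concurrently; only the shared map update is serialized.
class ProfileStore {
public:
    void record(const std::string& key, const std::function<void()>& body) {
        const auto start = std::chrono::steady_clock::now();
        body();  // e.g. compile_submodel(model, device) -- may run in parallel
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                            std::chrono::steady_clock::now() - start).count();
        std::lock_guard<std::mutex> guard(m_mutex);
        m_us[key] += us;
        ++m_calls[key];
    }

    std::size_t calls(const std::string& key) {
        std::lock_guard<std::mutex> guard(m_mutex);
        return m_calls[key];
    }

private:
    std::mutex m_mutex;
    std::map<std::string, long long> m_us;
    std::map<std::string, std::size_t> m_calls;
};
```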
```diff
 const bool enable_hfa_optimizations =
     std::dynamic_pointer_cast<ov::npuw::failsafe::CompiledModel>(proto_comp_model_desc.compiled_model._ptr) ==
     nullptr;
 setup_hfa_infer_requests(real_idx,
                          is_piped,
                          /* is_recreate */ false,
-                         /* enable_hfa_optimizations */ true);
+                         enable_hfa_optimizations);
```
Why can't HFA optimizations work with the failsafe model?
```cpp
    // with the default (to be accessed next) one.
    std::swap(m_subrequests[real_idx], m_funcall_pipeline[real_idx].subrequest);
}
failover = failover || (active_device_before != m_npuw_model->submodel_device(real_idx));
```
Why should we track the failover here?
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Tickets:
AI Assistance: