[GPU] model loading latency optimization #34057
riverlijunjie wants to merge 6 commits into openvinotoolkit:master from
Conversation
```cpp
#endif

bool load_direct(std::istream& stream, void* buffer, size_t size) {
#ifdef __linux__
```
Instead of ifdef, use separate files for Windows/Linux, etc.
These utils could be common utils for improved reads.
1. Unified Shared Memory (usm_host): eliminates the hidden implicit memory copies done by the GPU driver and allows the GPU to DMA directly from system host memory.
2. L3 cache-friendly block size: use small, finely tuned chunk sizes (e.g., 4 MB) instead of massive blocks.
3. Low-level system I/O: bypass the overhead and locking of C++ standard streams (std::istream) in favor of direct binary file reading interfaces, reducing user-space buffer copies and kernel-space copying overhead.
```cpp
// Pass the cached blob file path to plugins that support it (e.g. GPU plugin)
// so they can use optimized parallel I/O to read weights directly from the blob file
if (!cacheContent.m_blob_id.empty() && util::contains(plugin.get_property(ov::supported_properties),
```
Device-specific properties should be avoided in the core logic.
The cache entry is managed by the cache manager, and there should not be any logic here that adds such a property or bypasses what the cache manager has opened. Using a hardcoded path is also not correct.
The proper solution is to open the stream (fast version) or the (mmap) version, which allows better parallel reads.
Initially I also hoped to do so, but ifstream cannot meet the requirements of parallel reads, because each thread needs to seek to a different offset to read data. Could you give a test sample of such a parallel read with a stream?
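To illustrate the constraint in this question: a parallel read over streams is possible if every task owns its own `std::ifstream`, so concurrent `seekg()` calls never collide on one shared cursor. The sketch below is a hypothetical example (names and chunking are not from this PR), not a claim about what the cache manager currently supports.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

// Illustrative sketch only: each task opens a private std::ifstream, so
// per-task seekg() positions are independent. A single shared stream cannot
// do this, because tellg()/seekg() mutate one shared cursor.
bool parallel_ifstream_read(const std::string& path, void* buffer, size_t size, size_t base_offset) {
    const size_t chunk = 4 * 1024 * 1024;
    char* dst = static_cast<char*>(buffer);
    std::vector<std::future<bool>> tasks;
    for (size_t done = 0; done < size; done += chunk) {
        const size_t len = std::min(chunk, size - done);
        tasks.emplace_back(std::async(std::launch::async, [=] {
            std::ifstream ifs(path, std::ios::binary);  // private stream per task
            if (!ifs.is_open())
                return false;
            ifs.seekg(static_cast<std::streamoff>(base_offset + done), std::ios::beg);
            ifs.read(dst + done, static_cast<std::streamsize>(len));
            // report short or failed reads so the caller can fall back
            return ifs.good() && ifs.gcount() == static_cast<std::streamsize>(len);
        }));
    }
    bool ok = true;
    for (auto& t : tasks)
        ok = t.get() && ok;
    return ok;
}
```

The trade-off is one `open()` per chunk; the per-fd variant with `pread()` avoids that but is not expressible through the `std::istream` interface.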
```cpp
OV_CONFIG_RELEASE_OPTION(ov::internal, value_cache_quant_mode, ov::internal::CacheQuantMode::BY_TOKEN, "AUTO or BY_CHANNEL or BY_TOKEN")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, mem_pool_util_threshold, 0.5, "Minimum utilization threshold (0.0~1.0) for reusable memory in the pool")
OV_CONFIG_RELEASE_OPTION(ov, enable_weightless, false, "Enable/Disable weightless blob")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, cached_blob_path, "", "Path to the cached blob file used during cache loading for optimized parallel I/O")
```
The property should not be introduced.
Managing cache entries is a core responsibility, and it should not be bypassed.
Could you suggest a better way to pass the cache blob path to the GPU plugin for parallel reads?
@riverlijunjie
As discussed offline.
The core cache manager opens the cache in two ways, depending on the mmap flag.
With mmap disabled, the cache is opened as a stream and forwarded to the plugin. In this case, if there were a custom stream that hides the parallel read, the plugin could use it with the benefit of faster reads, and it would then work for all plugins.
With mmap enabled, the blob is opened as an ov::Tensor view on the mmap'd file. In this case the plugin (GPU) should have more native support to use the tensor and read data as from a buffer (with a parallel option) instead of wrapping it in a stream. Reading should then be faster, and the mmap flag will not be bypassed by a custom GPU property.
```cpp
}

#ifdef _WIN32
bool ov::util::read_binary_file_parallel(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
```
Move this implementation to a dedicated file for Windows under the os folder.
```cpp
const std::wstring& wpath = path.native();

HANDLE hFile = CreateFileW(wpath.c_str(),
```
Suggested change:
```diff
-    const std::wstring& wpath = path.native();
-    HANDLE hFile = CreateFileW(wpath.c_str(),
+    HANDLE hFile = CreateFileW(path.c_str(),
```
```cpp
    return false;

// Safety check: File size
LARGE_INTEGER fileSize;
```
Suggested change:
```diff
-    LARGE_INTEGER fileSize;
+    LARGE_INTEGER file_size;
```
Use snake_case for variables.
sungeunk left a comment
LGTM. This change reduces the model loading time on PTLH.
Ran 4 executions after removing the cache files.
- Master: R1 4.657s -> R4 1.133s
- PR: R1 4.681s -> R4 0.441s
```cpp
allocation_type _allocation_type = allocation_type::unknown;
ib >> make_data(&_allocation_type, sizeof(_allocation_type));
// std::cout << "load weights: allocation_type = " << static_cast<int>(_allocation_type) << ", weights_path = " << weights_path << std::endl;
```
Pull request overview
This PR optimizes Intel GPU cached-model (.blob) loading latency by enabling parallel file I/O for large weight blocks and plumbing a cached-blob-path property from Core → GPU plugin to let the plugin read weight payloads directly from the cache file.
Changes:
- Add a new GPU plugin property (GPU_CACHED_BLOB_PATH) and expose it as a supported property.
- Pass the cache .blob path from CoreImpl::load_model_from_cache into the plugin config when supported.
- Introduce ov::util::read_binary_file_parallel(...) and use it in GPU data::load_weights() for large reads (with header-offset compensation logic).
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/plugins/intel_gpu/src/plugin/plugin.cpp | Exposes cached_blob_path in GPU supported properties. |
| src/plugins/intel_gpu/src/graph/program.cpp | Passes cached blob path into data::load_weights() during program load. |
| src/plugins/intel_gpu/include/intel_gpu/runtime/options.inl | Declares ov::intel_gpu::cached_blob_path option. |
| src/plugins/intel_gpu/include/intel_gpu/runtime/internal_properties.hpp | Defines GPU_CACHED_BLOB_PATH property. |
| src/plugins/intel_gpu/include/intel_gpu/runtime/file_util.hpp | Fixes typo in comment (“throw”). |
| src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp | Implements fast path using parallel file reads for weight loading and adds ITT scopes. |
| src/plugins/intel_gpu/include/intel_gpu/graph/serialization/binary_buffer.hpp | Adds get_stream() accessor and updates read assertion message. |
| src/inference/src/dev/core_impl.cpp | Injects blob path into plugin config when supported. |
| src/common/util/src/file_util.cpp | Fixes load_binary() and implements read_binary_file_parallel() for Windows/Linux. |
| src/common/util/include/openvino/util/file_util.hpp | Declares read_binary_file_parallel() API. |
| src/common/util/CMakeLists.txt | Links Threads library for the new parallel I/O implementation. |
```cpp
auto cur_offset = ib.get_stream().tellg();

// Auto-detect header offset compensation for path-based loading
// This applies to both Windows and Linux Parallel loaders which open by path
size_t offset_compensation = 0;

// Save current position
auto restore_pos = ib.get_stream().tellg();
ib.get_stream().seekg(0, std::ios::end);
auto stream_end = (size_t)ib.get_stream().tellg();
ib.get_stream().seekg(restore_pos, std::ios::beg);

int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
size_t physical_size = (phys_size >= 0) ? static_cast<size_t>(phys_size) : 0;

if (physical_size > stream_end) {
    offset_compensation = physical_size - stream_end;
}

used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                   mem->buffer_ptr(),
                                                   data_size,
                                                   (size_t)cur_offset + offset_compensation);
if (used_fast_io) {
    ib.get_stream().seekg(data_size, std::ios::cur);
```
[HIGH] Fast-I/O offset computation casts tellg() results directly to size_t and uses them for file offsets ((size_t)cur_offset + offset_compensation). If tellg() fails it returns -1, which becomes a huge size_t and can drive out-of-bounds reads. Please guard fast-I/O with checks that tellg()/seekg() succeed and positions are non-negative (otherwise disable fast-I/O and use the stream path).
Suggested replacement for the block above:
```cpp
auto& stream = ib.get_stream();
// Get current offset and validate it
std::streampos cur_offset_pos = stream.tellg();
bool can_use_fast_io = stream.good() && (cur_offset_pos != std::streampos(-1));
// Auto-detect header offset compensation for path-based loading
// This applies to both Windows and Linux Parallel loaders which open by path
size_t offset_compensation = 0;
// Save current position
std::streampos restore_pos = cur_offset_pos;
if (can_use_fast_io) {
    stream.seekg(0, std::ios::end);
    std::streampos stream_end_pos = stream.tellg();
    can_use_fast_io = stream.good() && (stream_end_pos != std::streampos(-1));
    // Restore original position
    stream.seekg(restore_pos, std::ios::beg);
    can_use_fast_io = can_use_fast_io && stream.good();
    if (can_use_fast_io) {
        int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
        if (phys_size >= 0) {
            size_t physical_size = static_cast<size_t>(phys_size);
            size_t stream_end = static_cast<size_t>(static_cast<std::streamoff>(stream_end_pos));
            if (physical_size > stream_end) {
                offset_compensation = physical_size - stream_end;
            }
            size_t cur_offset = static_cast<size_t>(static_cast<std::streamoff>(cur_offset_pos));
            size_t file_offset = cur_offset + offset_compensation;
            used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                               mem->buffer_ptr(),
                                                               data_size,
                                                               file_offset);
        } else {
            can_use_fast_io = false;
        }
    }
}
if (used_fast_io) {
    stream.seekg(static_cast<std::streamoff>(data_size), std::ios::cur);
```
```diff
 auto const read_size = _stream.rdbuf()->sgetn(reinterpret_cast<char*>(data), size);
 OPENVINO_ASSERT(read_size == size,
-                "[GPU] Failed to read " + std::to_string(size) + " bytes from stream! Read " + std::to_string(read_size));
+                "[GPU] Failed to read " + std::to_string(size) + " bytes to stream! Read " + std::to_string(read_size));
```
Assertion message says "Failed to read ... bytes to stream" but this function reads from the input stream. Please change wording back to "from stream" to avoid confusing diagnostics when cache import fails.
Suggested change:
```diff
-                "[GPU] Failed to read " + std::to_string(size) + " bytes to stream! Read " + std::to_string(read_size));
+                "[GPU] Failed to read " + std::to_string(size) + " bytes from stream! Read " + std::to_string(read_size));
```
```cpp
    futures.emplace_back(std::async(std::launch::async, [path, file_offset, ptr, read_size] {
        std::ifstream t_ifs(path, std::ios::binary);
        if (t_ifs.is_open()) {
            t_ifs.seekg(file_offset, std::ios::beg);
            t_ifs.read(static_cast<char*>(ptr), read_size);
        }
    }));

    current_offset += read_size;
}

for (auto& f : futures) {
    f.get();
}
return true; // Simplified error handling for parallel ifstream
}
```
[BLOCKER] read_binary_file_parallel on non-Windows always returns true and doesn’t validate that each thread successfully opened/seeked/read the requested bytes. This can silently return partially/uninitialized data and cause model corruption. Please propagate per-thread read/seek failures via an atomic status, validate file size vs offset+size, and return false if any chunk read fails (and consider a single-thread fallback when parallel read fails).
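One possible shape for the requested error handling (a sketch with hypothetical names, not the PR's code): validate the requested range against the file size up front, then let every chunk report open/seek/read failures through a shared `std::atomic<bool>` so the function can return `false` if any chunk fell short.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <future>
#include <string>
#include <system_error>
#include <vector>

// Sketch of the requested error propagation: reject impossible ranges before
// spawning tasks, and record any short or failed chunk read in `ok`.
bool read_parallel_checked(const std::string& path, void* buffer, size_t size, size_t offset) {
    std::error_code ec;
    const auto file_len = std::filesystem::file_size(path, ec);
    if (ec || offset + size > file_len)
        return false;  // range exceeds the file, or the file is unreadable
    std::atomic<bool> ok{true};
    char* dst = static_cast<char*>(buffer);
    const size_t chunk = 4 * 1024 * 1024;
    std::vector<std::future<void>> tasks;
    for (size_t done = 0; done < size; done += chunk) {
        const size_t len = std::min(chunk, size - done);
        tasks.emplace_back(std::async(std::launch::async, [=, &ok] {
            std::ifstream ifs(path, std::ios::binary);
            ifs.seekg(static_cast<std::streamoff>(offset + done), std::ios::beg);
            ifs.read(dst + done, static_cast<std::streamsize>(len));
            if (!ifs.good() || ifs.gcount() != static_cast<std::streamsize>(len))
                ok.store(false, std::memory_order_relaxed);  // failed open/seek/short read
        }));
    }
    for (auto& t : tasks)
        t.get();
    return ok.load();
}
```

A caller could then fall back to the single-threaded stream path whenever this returns `false`.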
```cpp
while (remaining_size > 0 && overall_status) {
    DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));

    OVERLAPPED ov = {0};
    ov.Offset = static_cast<DWORD>(current_file_offset & 0xFFFFFFFF);
    ov.OffsetHigh = static_cast<DWORD>((current_file_offset >> 32) & 0xFFFFFFFF);

    DWORD bytesRead = 0;
    if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, &ov) || bytesRead != to_read) {
        if (GetLastError() != ERROR_IO_PENDING) {
            overall_status = false;
            break;
        }
```
[BLOCKER] Windows implementation: on ReadFile failure you treat ERROR_IO_PENDING as non-fatal, but the handles are opened without FILE_FLAG_OVERLAPPED and there is no GetOverlappedResult/event wait. If ERROR_IO_PENDING ever occurs, bytesRead can remain 0, causing an infinite loop or silent short read. Suggestion: either open with FILE_FLAG_OVERLAPPED and properly wait for completion, or (simpler) treat any ReadFile failure / short read as an immediate error and return false.
Suggested replacement for the block above:
```cpp
LARGE_INTEGER li;
li.QuadPart = static_cast<LONGLONG>(file_offset);
if (!SetFilePointerEx(t_hFile, li, nullptr, FILE_BEGIN)) {
    overall_status = false;
    CloseHandle(t_hFile);
    return;
}
while (remaining_size > 0 && overall_status) {
    DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));
    DWORD bytesRead = 0;
    if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, nullptr) || bytesRead != to_read) {
        overall_status = false;
        break;
```
```diff
 if (buf_flag) {
-    ib >> make_data(_buf1.data(), copy_size);
+    if (can_use_fast_io) {
+        ov::util::read_binary_file_parallel(file_path, _buf1, copy_size, file_base_offset + dst_offset);
+    } else {
+        ib >> make_data(_buf1, copy_size);
+    }
     if (ev2 != nullptr) {
         ev2->wait();
         ev2 = nullptr;
     }
-    ev1 = mem->copy_from(strm, _buf1.data(), src_offset, dst_offset, copy_size, is_blocking);
+    ev1 = mem->copy_from(strm, *buf1_mem, src_offset, dst_offset, copy_size, is_blocking);
 } else {
-    ib >> make_data(_buf2.data(), copy_size);
+    if (can_use_fast_io) {
+        ov::util::read_binary_file_parallel(file_path, _buf2, copy_size, file_base_offset + dst_offset);
+    } else {
+        ib >> make_data(_buf2, copy_size);
+    }
```
[HIGH] In the non-host-accessible path, when can_use_fast_io is true the return value from ov::util::read_binary_file_parallel(...) is ignored. If the parallel read fails, _buf1/_buf2 may contain stale/uninitialized bytes and will still be copied into GPU memory. Please check the boolean result per chunk (or once for the whole region), and fall back to stream-based ib >> make_data(...) or throw on failure before issuing mem->copy_from.
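The per-chunk check-and-fallback this comment asks for could take roughly the following shape (a generic sketch; the function and parameter names are hypothetical): attempt the fast read first, and only skip the stream path when it reports success, so the staging buffer never holds stale bytes.

```cpp
#include <functional>

// Illustrative helper: run the fast parallel read and, if it reports
// failure, re-read the same range through the stream reader instead.
// Returns false only when both paths fail (caller may then throw).
bool read_chunk_with_fallback(const std::function<bool()>& fast_read,
                              const std::function<bool()>& stream_read) {
    if (fast_read())
        return true;        // fast path filled the buffer completely
    return stream_read();   // otherwise fall back to the stream path
}
```

In the PR's code the two callables would wrap `ov::util::read_binary_file_parallel(...)` and `ib >> make_data(...)` for the same chunk, with a throw before `mem->copy_from` when both fail.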