
[GPU] model loading latency optimization #34057

Open

riverlijunjie wants to merge 6 commits into openvinotoolkit:master from riverlijunjie:river/cache_loading_opt

Conversation

riverlijunjie (Contributor) commented Feb 11, 2026

Details:

  • Goal: Maximize IO throughput when loading large OpenVINO GPU model caches (Blobs) on NVMe SSDs.
  • Bottleneck: The original implementation used a single-threaded std::istream (read/sgetn). Due to standard-library double buffering and CPU memory-copy overhead, throughput was capped at ~1GB/s, failing to saturate modern NVMe hardware (3.5GB/s+).
  • Solutions:

    • Linux Optimization (Zero-Copy): O_DIRECT (Direct IO). Extracted the underlying file descriptor (FD) from std::filebuf, bypassed the stream, and used pread to read directly from disk into the user-space buffer. (Discarded because its performance was not as good as Parallel IO.)
    • Linux/Windows Optimization (Parallel IO): Implemented a custom parallel file loader. It splits the load task into 4KB-aligned chunks processed by concurrent threads (see the sketch after this list).
    • Resolving Data Corruption ("Garbled Data"): Parallel loading initially produced incorrect/corrupted weight values, so an Automatic Header Detection mechanism was implemented to compensate for the blob header offset.
  • Results:

    • Correctness: Data verification passed; physical and logical offsets are now perfectly synchronized.
    • Performance: up to a 2x throughput increase, effectively utilizing the parallel capabilities of modern NVMe drives.
  • Todo list:

    • Cache model support
    • To verify on Windows
    • To verify on Linux
    • To verify on dGPU
    • Weightless support (Will do in another PR)
    • Normal loading support (Will do in another PR)
  • Test result:

[screenshot: test results]
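
As a reference for the Parallel IO approach above, a minimal sketch (not the PR's actual implementation) of splitting a read into aligned chunks, assuming Linux pread; chunk size and error handling are simplified:

// Sketch only: split a file read into aligned chunks, one pread per worker.
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

bool parallel_read_sketch(const char* path, void* dst, size_t size, size_t file_offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;
    const size_t chunk = 4 * 1024 * 1024;  // placeholder: tune per drive; keep 4KB-aligned
    std::atomic<bool> ok{true};
    std::vector<std::thread> workers;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t len = std::min(chunk, size - off);
        workers.emplace_back([&ok, fd, dst, off, len, file_offset] {
            char* p = static_cast<char*>(dst) + off;
            size_t done = 0;
            while (done < len) {  // pread takes an explicit offset: no shared stream position
                ssize_t n = ::pread(fd, p + done, len - done, static_cast<off_t>(file_offset + off + done));
                if (n <= 0) { ok = false; return; }
                done += static_cast<size_t>(n);
            }
        });
    }
    for (auto& t : workers)
        t.join();
    ::close(fd);
    return ok;
}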

Tickets:

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Feb 11, 2026
@p-durandin p-durandin added this to the 2026.1 milestone Feb 11, 2026
#endif

bool load_direct(std::istream& stream, void* buffer, size_t size) {
#ifdef __linux__
Contributor:

Instead of ifdef, use separate files for windows/linux etc.
These utils could be common utils to improve reads.

@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Feb 16, 2026
@riverlijunjie riverlijunjie marked this pull request as ready for review February 28, 2026 00:38
@riverlijunjie riverlijunjie requested review from a team as code owners February 28, 2026 00:38
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Feb 28, 2026
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 7ef90f1 to bc5e0dc Compare February 28, 2026 14:46
@riverlijunjie riverlijunjie requested a review from praasz March 3, 2026 01:08
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 0931e3c to cae70ee Compare March 3, 2026 06:35
   1. Use Unified Shared Memory (usm_host) to eliminate the hidden implicit memory copies done by the GPU driver and allow the GPU to DMA directly from system host memory.
   2. L3 Cache-Friendly Block Size: Use small, finely-tuned chunk sizes (e.g., 4MB) instead of massive blocks.
   3. Low-Level System I/O: Bypass the overhead and locking of C++ standard streams (std::istream) in favor of direct binary file reading interfaces, reducing user-space buffer copies and kernel-space copying overhead (see the sketch below).
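
A generic sketch of how these three points combine, with hypothetical read_chunk/copy_to_device stubs (not OpenVINO or OpenCL APIs): two host buffers alternate, so the disk read of one chunk overlaps the device copy of the previous one.

// Sketch: double-buffered read -> device-copy pipeline (hypothetical helpers).
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Stubs standing in for a real file read and a real host->device copy.
void read_chunk(char*, size_t, size_t) { /* stub: real code would pread/ReadFile here */ }
void copy_to_device(const char*, size_t, size_t) { /* stub: real code would enqueue an H2D copy */ }

void pipelined_load(size_t total_size) {
    const size_t chunk = 4 * 1024 * 1024;  // small block keeps working set cache-friendly
    std::vector<char> bufs[2] = {std::vector<char>(chunk), std::vector<char>(chunk)};
    std::future<void> copy_done;
    for (size_t off = 0, i = 0; off < total_size; off += chunk, ++i) {
        const size_t len = std::min(chunk, total_size - off);
        char* cur = bufs[i % 2].data();
        read_chunk(cur, off, len);            // fill one buffer from disk
        if (copy_done.valid())
            copy_done.wait();                 // the other buffer's copy must finish first
        copy_done = std::async(std::launch::async,
                               [cur, off, len] { copy_to_device(cur, off, len); });
    }
    if (copy_done.valid())
        copy_done.wait();
}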
@praasz (Contributor) left a comment:

The improvement in read speed is a good direction, but the integration with core must be corrected, as it introduces a kind of bypass of the main logic.


// Pass the cached blob file path to plugins that support it (e.g. GPU plugin)
// so they can use optimized parallel I/O to read weights directly from the blob file
if (!cacheContent.m_blob_id.empty() && util::contains(plugin.get_property(ov::supported_properties),
Contributor:

Changes to the core logic should be avoided, especially for device specific properties.
The cache entry is managed by the cache manager, and there should not be any logic here that adds such a property or bypasses what the cache manager opened. Also, using a hardcoded path is not correct.

The proper solution is to open the stream (fast version) or the (mmap) version, which allows better parallel reads.

Contributor Author (riverlijunjie):

Early on, I also hoped to do so, but ifstream cannot meet the requirement of parallel reads, since each thread needs to seek to a different offset to read data. Could you give a test sample of such a parallel read with a stream?
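
For reference, a minimal sketch of the kind of per-thread-stream parallel read under discussion: each task opens its own std::ifstream and seeks to its own offset, so no stream state is shared (illustrative only).

// Sketch: parallel read where each task owns a private std::ifstream.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

bool parallel_ifstream_read(const std::string& path, char* dst, size_t size, size_t base_offset) {
    const size_t chunk = 4 * 1024 * 1024;  // placeholder chunk size
    std::vector<std::future<bool>> parts;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t len = std::min(chunk, size - off);
        parts.emplace_back(std::async(std::launch::async, [&path, dst, off, len, base_offset] {
            std::ifstream ifs(path, std::ios::binary);  // one stream per task
            if (!ifs.is_open())
                return false;
            ifs.seekg(static_cast<std::streamoff>(base_offset + off));
            ifs.read(dst + off, static_cast<std::streamsize>(len));
            return static_cast<size_t>(ifs.gcount()) == len;  // exact byte count required
        }));
    }
    bool ok = true;
    for (auto& p : parts)
        ok = p.get() && ok;
    return ok;
}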

OV_CONFIG_RELEASE_OPTION(ov::internal, value_cache_quant_mode, ov::internal::CacheQuantMode::BY_TOKEN, "AUTO or BY_CHANNEL or BY_TOKEN")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, mem_pool_util_threshold, 0.5, "Minimum utilization threshold (0.0~1.0) for reusable memory in the pool")
OV_CONFIG_RELEASE_OPTION(ov, enable_weightless, false, "Enable/Disable weightless blob")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, cached_blob_path, "", "Path to the cached blob file used during cache loading for optimized parallel I/O")
Contributor:

The property should not be introduced.
Managing cache entries is a core responsibility, and it should not be bypassed.

Contributor Author (riverlijunjie):

Could you suggest a better way to pass the cache blob path to the GPU plugin for parallel reads?

Contributor:

@riverlijunjie
As discussed offline.
The core cache manager opens the cache in two ways, depending on the mmap flag.
With mmap disabled, the cache is opened as a stream and forwarded to the plugin. In this case, if there were a custom stream that hides the parallel read, the plugin could use it and benefit from the faster read, and it would work for all plugins.
With mmap enabled, the blob is opened as an ov::Tensor view on the mmap'd file. In this case the plugin (GPU) should have more native support to use the tensor and read the data as from a buffer (with a parallel option) instead of wrapping it in a stream. Reading should then be faster, and the mmap flag will not be bypassed by a custom GPU property.
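
A minimal sketch of the mmap variant described here, assuming the mapped blob is already exposed as a plain host buffer (for example, via the tensor's data pointer); names are illustrative:

// Sketch: parallel copy out of an already-mapped buffer (disjoint ranges);
// page faults pull the file contents in on demand.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

void parallel_copy_from_mapped(const char* mapped, char* dst, size_t size) {
    const size_t chunk = 4 * 1024 * 1024;  // placeholder chunk size
    std::vector<std::future<void>> parts;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t len = std::min(chunk, size - off);
        parts.emplace_back(std::async(std::launch::async,
                                      [mapped, dst, off, len] { std::memcpy(dst + off, mapped + off, len); }));
    }
    for (auto& p : parts)
        p.get();
}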

Contributor Author (riverlijunjie):

Please see the solution in PR #34679.

}

#ifdef _WIN32
bool ov::util::read_binary_file_parallel(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
Contributor:

Move this implementation to a dedicated file for Windows under the os folder.

Comment on lines +276 to +278
const std::wstring& wpath = path.native();

HANDLE hFile = CreateFileW(wpath.c_str(),
Contributor:

Suggested change:
-    const std::wstring& wpath = path.native();
-    HANDLE hFile = CreateFileW(wpath.c_str(),
+    HANDLE hFile = CreateFileW(path.c_str(),

return false;

// Safety check: File size
LARGE_INTEGER fileSize;
Contributor:

Suggested change:
-    LARGE_INTEGER fileSize;
+    LARGE_INTEGER file_size;

Use snake_case for variables

@sungeunk (Contributor) left a comment:

LGTM. This change reduces the model loading time on PTLH.
Ran 4 executions after removing the cache files.

  • Master: R1 4.657s -> R4 1.133s
  • PR: R1 4.681s -> R4 0.441s


allocation_type _allocation_type = allocation_type::unknown;
ib >> make_data(&_allocation_type, sizeof(_allocation_type));
// std::cout << "load weights: allocation_type = " << static_cast<int>(_allocation_type) << ", weights_path = " << weights_path << std::endl;
Contributor:

Commented out code.

riverlijunjie (Contributor Author):

Thanks @praasz for the suggestion of hiding the parallel IO optimization inside ifstream, which makes it possible for every plugin to benefit without extra work. I have created a new PR for it: #34057

Copilot AI left a comment:

Pull request overview

This PR optimizes Intel GPU cached-model (.blob) loading latency by enabling parallel file I/O for large weight blocks and plumbing a cached-blob-path property from Core → GPU plugin to let the plugin read weight payloads directly from the cache file.

Changes:

  • Add a new GPU plugin property (GPU_CACHED_BLOB_PATH) and expose it as a supported property.
  • Pass the cache .blob path from CoreImpl::load_model_from_cache into plugin config when supported.
  • Introduce ov::util::read_binary_file_parallel(...) and use it in GPU data::load_weights() for large reads (with header-offset compensation logic).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/plugins/intel_gpu/src/plugin/plugin.cpp: Exposes cached_blob_path in GPU supported properties.
  • src/plugins/intel_gpu/src/graph/program.cpp: Passes the cached blob path into data::load_weights() during program load.
  • src/plugins/intel_gpu/include/intel_gpu/runtime/options.inl: Declares the ov::intel_gpu::cached_blob_path option.
  • src/plugins/intel_gpu/include/intel_gpu/runtime/internal_properties.hpp: Defines the GPU_CACHED_BLOB_PATH property.
  • src/plugins/intel_gpu/include/intel_gpu/runtime/file_util.hpp: Fixes a typo in a comment ("throw").
  • src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp: Implements the fast path using parallel file reads for weight loading and adds ITT scopes.
  • src/plugins/intel_gpu/include/intel_gpu/graph/serialization/binary_buffer.hpp: Adds a get_stream() accessor and updates the read assertion message.
  • src/inference/src/dev/core_impl.cpp: Injects the blob path into the plugin config when supported.
  • src/common/util/src/file_util.cpp: Fixes load_binary() and implements read_binary_file_parallel() for Windows/Linux.
  • src/common/util/include/openvino/util/file_util.hpp: Declares the read_binary_file_parallel() API.
  • src/common/util/CMakeLists.txt: Links the Threads library for the new parallel I/O implementation.

Comment on lines +429 to +453
    auto cur_offset = ib.get_stream().tellg();

    // Auto-detect header offset compensation for path-based loading
    // This applies to both Windows and Linux Parallel loaders which open by path
    size_t offset_compensation = 0;

    // Save current position
    auto restore_pos = ib.get_stream().tellg();
    ib.get_stream().seekg(0, std::ios::end);
    auto stream_end = (size_t)ib.get_stream().tellg();
    ib.get_stream().seekg(restore_pos, std::ios::beg);

    int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
    size_t physical_size = (phys_size >= 0) ? static_cast<size_t>(phys_size) : 0;

    if (physical_size > stream_end) {
        offset_compensation = physical_size - stream_end;
    }

    used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                       mem->buffer_ptr(),
                                                       data_size,
                                                       (size_t)cur_offset + offset_compensation);
    if (used_fast_io) {
        ib.get_stream().seekg(data_size, std::ios::cur);
Copilot AI commented Mar 20, 2026:

[HIGH] Fast-I/O offset computation casts tellg() results directly to size_t and uses them for file offsets ((size_t)cur_offset + offset_compensation). If tellg() fails it returns -1, which becomes a huge size_t and can drive out-of-bounds reads. Please guard fast-I/O with checks that tellg()/seekg() succeed and positions are non-negative (otherwise disable fast-I/O and use the stream path).

Suggested change:

Original:
    auto cur_offset = ib.get_stream().tellg();

    // Auto-detect header offset compensation for path-based loading
    // This applies to both Windows and Linux Parallel loaders which open by path
    size_t offset_compensation = 0;

    // Save current position
    auto restore_pos = ib.get_stream().tellg();
    ib.get_stream().seekg(0, std::ios::end);
    auto stream_end = (size_t)ib.get_stream().tellg();
    ib.get_stream().seekg(restore_pos, std::ios::beg);

    int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
    size_t physical_size = (phys_size >= 0) ? static_cast<size_t>(phys_size) : 0;

    if (physical_size > stream_end) {
        offset_compensation = physical_size - stream_end;
    }

    used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                       mem->buffer_ptr(),
                                                       data_size,
                                                       (size_t)cur_offset + offset_compensation);
    if (used_fast_io) {
        ib.get_stream().seekg(data_size, std::ios::cur);

Suggested:
    auto& stream = ib.get_stream();
    // Get current offset and validate it
    std::streampos cur_offset_pos = stream.tellg();
    bool can_use_fast_io = stream.good() && (cur_offset_pos != std::streampos(-1));

    // Auto-detect header offset compensation for path-based loading
    // This applies to both Windows and Linux Parallel loaders which open by path
    size_t offset_compensation = 0;

    // Save current position
    std::streampos restore_pos = cur_offset_pos;
    if (can_use_fast_io) {
        stream.seekg(0, std::ios::end);
        std::streampos stream_end_pos = stream.tellg();
        can_use_fast_io = stream.good() && (stream_end_pos != std::streampos(-1));
        // Restore original position
        stream.seekg(restore_pos, std::ios::beg);
        can_use_fast_io = can_use_fast_io && stream.good();
        if (can_use_fast_io) {
            int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
            if (phys_size >= 0) {
                size_t physical_size = static_cast<size_t>(phys_size);
                size_t stream_end = static_cast<size_t>(static_cast<std::streamoff>(stream_end_pos));
                if (physical_size > stream_end) {
                    offset_compensation = physical_size - stream_end;
                }
                size_t cur_offset = static_cast<size_t>(static_cast<std::streamoff>(cur_offset_pos));
                size_t file_offset = cur_offset + offset_compensation;
                used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                                   mem->buffer_ptr(),
                                                                   data_size,
                                                                   file_offset);
            } else {
                can_use_fast_io = false;
            }
        }
    }
    if (used_fast_io) {
        stream.seekg(static_cast<std::streamoff>(data_size), std::ios::cur);

    auto const read_size = _stream.rdbuf()->sgetn(reinterpret_cast<char*>(data), size);
    OPENVINO_ASSERT(read_size == size,
-                   "[GPU] Failed to read " + std::to_string(size) + " bytes from stream! Read " + std::to_string(read_size));
+                   "[GPU] Failed to read " + std::to_string(size) + " bytes to stream! Read " + std::to_string(read_size));
Copilot AI commented Mar 20, 2026:

Assertion message says "Failed to read ... bytes to stream" but this function reads from the input stream. Please change wording back to "from stream" to avoid confusing diagnostics when cache import fails.

Suggested change:
-    "[GPU] Failed to read " + std::to_string(size) + " bytes to stream! Read " + std::to_string(read_size));
+    "[GPU] Failed to read " + std::to_string(size) + " bytes from stream! Read " + std::to_string(read_size));

Comment on lines +433 to +448
    futures.emplace_back(std::async(std::launch::async, [path, file_offset, ptr, read_size] {
        std::ifstream t_ifs(path, std::ios::binary);
        if (t_ifs.is_open()) {
            t_ifs.seekg(file_offset, std::ios::beg);
            t_ifs.read(static_cast<char*>(ptr), read_size);
        }
    }));

    current_offset += read_size;
}

for (auto& f : futures) {
    f.get();
}
return true; // Simplified error handling for parallel ifstream
Copilot AI commented Mar 20, 2026:

[BLOCKER] read_binary_file_parallel on non-Windows always returns true and doesn’t validate that each thread successfully opened/seeked/read the requested bytes. This can silently return partially/uninitialized data and cause model corruption. Please propagate per-thread read/seek failures via an atomic status, validate file size vs offset+size, and return false if any chunk read fails (and consider a single-thread fallback when parallel read fails).
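
A minimal sketch of the requested status propagation, reusing the chunk-loop structure from the diff above (variable names mirror the diff; the exact integration is illustrative):

// Sketch: per-thread read failures propagated via an atomic flag, so partial
// or failed chunk reads make the whole call return false.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <future>
#include <vector>

bool read_parallel_checked(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
    const size_t chunk_size = 4 * 1024 * 1024;  // placeholder
    std::atomic<bool> ok{true};
    std::vector<std::future<void>> futures;
    size_t current_offset = 0;
    size_t remaining_size = size;
    while (remaining_size > 0) {
        const size_t read_size = std::min(remaining_size, chunk_size);
        char* ptr = static_cast<char*>(buffer) + current_offset;
        const size_t file_offset = offset + current_offset;
        futures.emplace_back(std::async(std::launch::async, [&path, &ok, ptr, read_size, file_offset] {
            std::ifstream t_ifs(path, std::ios::binary);
            if (!t_ifs.is_open()) {
                ok = false;  // report open failure instead of silently skipping the chunk
                return;
            }
            t_ifs.seekg(static_cast<std::streamoff>(file_offset), std::ios::beg);
            t_ifs.read(ptr, static_cast<std::streamsize>(read_size));
            // a short read would leave stale bytes in the destination buffer
            if (static_cast<size_t>(t_ifs.gcount()) != read_size)
                ok = false;
        }));
        current_offset += read_size;
        remaining_size -= read_size;
    }
    for (auto& f : futures)
        f.get();
    return ok;  // caller can fall back to a single-threaded read on false
}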

Comment on lines +375 to +387
    while (remaining_size > 0 && overall_status) {
        DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));

        OVERLAPPED ov = {0};
        ov.Offset = static_cast<DWORD>(current_file_offset & 0xFFFFFFFF);
        ov.OffsetHigh = static_cast<DWORD>((current_file_offset >> 32) & 0xFFFFFFFF);

        DWORD bytesRead = 0;
        if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, &ov) || bytesRead != to_read) {
            if (GetLastError() != ERROR_IO_PENDING) {
                overall_status = false;
                break;
            }
Copilot AI commented Mar 20, 2026:

[BLOCKER] Windows implementation: on ReadFile failure you treat ERROR_IO_PENDING as non-fatal, but the handles are opened without FILE_FLAG_OVERLAPPED and there is no GetOverlappedResult/event wait. If ERROR_IO_PENDING ever occurs, bytesRead can remain 0, causing an infinite loop or silent short read. Suggestion: either open with FILE_FLAG_OVERLAPPED and properly wait for completion, or (simpler) treat any ReadFile failure / short read as an immediate error and return false.

Suggested change:

Original:
    while (remaining_size > 0 && overall_status) {
        DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));

        OVERLAPPED ov = {0};
        ov.Offset = static_cast<DWORD>(current_file_offset & 0xFFFFFFFF);
        ov.OffsetHigh = static_cast<DWORD>((current_file_offset >> 32) & 0xFFFFFFFF);

        DWORD bytesRead = 0;
        if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, &ov) || bytesRead != to_read) {
            if (GetLastError() != ERROR_IO_PENDING) {
                overall_status = false;
                break;
            }

Suggested:
    LARGE_INTEGER li;
    li.QuadPart = static_cast<LONGLONG>(file_offset);
    if (!SetFilePointerEx(t_hFile, li, nullptr, FILE_BEGIN)) {
        overall_status = false;
        CloseHandle(t_hFile);
        return;
    }
    while (remaining_size > 0 && overall_status) {
        DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));
        DWORD bytesRead = 0;
        if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, nullptr) || bytesRead != to_read) {
            overall_status = false;
            break;

Comment on lines 507 to +523
     if (buf_flag) {
-        ib >> make_data(_buf1.data(), copy_size);
+        if (can_use_fast_io) {
+            ov::util::read_binary_file_parallel(file_path, _buf1, copy_size, file_base_offset + dst_offset);
+        } else {
+            ib >> make_data(_buf1, copy_size);
+        }
         if (ev2 != nullptr) {
             ev2->wait();
             ev2 = nullptr;
         }
-        ev1 = mem->copy_from(strm, _buf1.data(), src_offset, dst_offset, copy_size, is_blocking);
+        ev1 = mem->copy_from(strm, *buf1_mem, src_offset, dst_offset, copy_size, is_blocking);
     } else {
-        ib >> make_data(_buf2.data(), copy_size);
+        if (can_use_fast_io) {
+            ov::util::read_binary_file_parallel(file_path, _buf2, copy_size, file_base_offset + dst_offset);
+        } else {
+            ib >> make_data(_buf2, copy_size);
+        }
Copilot AI commented Mar 20, 2026:

[HIGH] In the non-host-accessible path, when can_use_fast_io is true the return value from ov::util::read_binary_file_parallel(...) is ignored. If the parallel read fails, _buf1/_buf2 may contain stale/uninitialized bytes and will still be copied into GPU memory. Please check the boolean result per chunk (or once for the whole region), and fall back to stream-based ib >> make_data(...) or throw on failure before issuing mem->copy_from.
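
A short sketch of the requested check-and-fallback, using the names from the diff above (illustrative; stream position handling is elided):

// Sketch: only trust the fast path if it reports success; otherwise fall back
// to the stream read so the buffer never holds stale bytes.
if (!can_use_fast_io ||
    !ov::util::read_binary_file_parallel(file_path, _buf1, copy_size, file_base_offset + dst_offset)) {
    ib >> make_data(_buf1, copy_size);  // stream-based fallback
}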

@praasz praasz modified the milestones: 2026.1, 2026.2 Mar 20, 2026

Labels

  • category: build (OpenVINO cmake script / infra)
  • category: Core (OpenVINO Core, aka ngraph)
  • category: GPU (OpenVINO GPU plugin)
  • category: inference (OpenVINO Runtime library - Inference)


8 participants