
[GPU] model loading latency optimization #34057

Open

riverlijunjie wants to merge 6 commits into openvinotoolkit:master from riverlijunjie:river/cache_loading_opt

Conversation

riverlijunjie (Contributor) commented Feb 11, 2026

Details:

  • Goal: Maximize IO throughput when loading large OpenVINO GPU model caches (Blobs) on NVMe SSDs.
  • Bottleneck: The original implementation used a single-threaded std::istream (read/sgetn). Due to standard-library double buffering and CPU memory-copy overhead, throughput was capped at ~1GB/s, failing to saturate modern NVMe hardware (3.5GB/s+).
  • Solutions:

    • Linux Optimization (Zero-Copy): O_DIRECT (Direct IO). Extracted the underlying file descriptor (FD) from std::filebuf, bypassed the stream, and used pread to read directly from disk into the user-space buffer. (Discarded because its performance was not as good as Parallel IO.)
    • Linux/Windows Optimization (Parallel IO): Implemented a custom parallel file loader. It splits the load task into 4KB-aligned chunks processed by concurrent threads (see the sketch after this list).
    • Resolving Data Corruption ("Garbled Data"): Parallel loading initially produced incorrect/corrupted weight values, so an Automatic Header Detection mechanism was implemented to compensate for the blob header offset.
  • Results:

    • Correctness: Data verification passed; physical and logical offsets are now perfectly synchronized.
    • Performance: up to a 2x throughput increase, effectively utilizing the parallel capabilities of modern NVMe drives.
  • Todo list:

    • Cache model support
    • To verify on Windows
    • To verify on Linux
    • To verify on dGPU
    • Weightless support (Will do in another PR)
    • Normal loading support (Will do in another PR)
  • Test result:

[screenshot: test results]
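
As a reference for the Parallel IO approach above, a minimal sketch (not the PR's actual implementation) of splitting a read into aligned chunks, assuming Linux pread; chunk size and error handling are simplified:

// Sketch only: split a file read into aligned chunks, one pread per worker.
#include <fcntl.h>
#include <unistd.h>
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

bool parallel_read_sketch(const char* path, void* dst, size_t size, size_t file_offset) {
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return false;
    const size_t chunk = 4 * 1024 * 1024;  // placeholder: tune per drive; keep 4KB-aligned
    std::atomic<bool> ok{true};
    std::vector<std::thread> workers;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t len = std::min(chunk, size - off);
        workers.emplace_back([&ok, fd, dst, off, len, file_offset] {
            char* p = static_cast<char*>(dst) + off;
            size_t done = 0;
            while (done < len) {  // pread takes an explicit offset: no shared stream position
                ssize_t n = ::pread(fd, p + done, len - done, static_cast<off_t>(file_offset + off + done));
                if (n <= 0) { ok = false; return; }
                done += static_cast<size_t>(n);
            }
        });
    }
    for (auto& t : workers)
        t.join();
    ::close(fd);
    return ok;
}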

Tickets:

@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Feb 11, 2026
@p-durandin p-durandin added this to the 2026.1 milestone Feb 11, 2026
#endif

bool load_direct(std::istream& stream, void* buffer, size_t size) {
#ifdef __linux__
Contributor:

Instead of ifdef, use separate files for windows/linux etc.
These utils could be common utils to improve reads.

@github-actions github-actions bot added the category: Core OpenVINO Core (aka ngraph) label Feb 16, 2026
@riverlijunjie riverlijunjie marked this pull request as ready for review February 28, 2026 00:38
@riverlijunjie riverlijunjie requested review from a team as code owners February 28, 2026 00:38
@github-actions github-actions bot added the category: build OpenVINO cmake script / infra label Feb 28, 2026
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 7ef90f1 to bc5e0dc Compare February 28, 2026 14:46
@riverlijunjie riverlijunjie requested a review from praasz March 3, 2026 01:08
@riverlijunjie riverlijunjie force-pushed the river/cache_loading_opt branch from 0931e3c to cae70ee Compare March 3, 2026 06:35
   1. Use Unified Shared Memory (usm_host) to eliminate the hidden implicit memory copies done by the GPU driver and allow the GPU to DMA directly from system host memory.
   2. L3 Cache-Friendly Block Size: Use small, finely-tuned chunk sizes (e.g., 4MB) instead of massive blocks.
   3. Low-Level System I/O: Bypass the overhead and locking of C++ standard streams (std::istream) in favor of direct binary file reading interfaces, reducing user-space buffer copies and kernel-space copying overhead (see the sketch below).
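
A generic sketch of how these three points combine, with hypothetical read_chunk/copy_to_device stubs (not OpenVINO or OpenCL APIs): two host buffers alternate, so the disk read of one chunk overlaps the device copy of the previous one.

// Sketch: double-buffered read -> device-copy pipeline (hypothetical helpers).
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Stubs standing in for a real file read and a real host->device copy.
void read_chunk(char*, size_t, size_t) { /* stub: real code would pread/ReadFile here */ }
void copy_to_device(const char*, size_t, size_t) { /* stub: real code would enqueue an H2D copy */ }

void pipelined_load(size_t total_size) {
    const size_t chunk = 4 * 1024 * 1024;  // small block keeps working set cache-friendly
    std::vector<char> bufs[2] = {std::vector<char>(chunk), std::vector<char>(chunk)};
    std::future<void> copy_done;
    for (size_t off = 0, i = 0; off < total_size; off += chunk, ++i) {
        const size_t len = std::min(chunk, total_size - off);
        char* cur = bufs[i % 2].data();
        read_chunk(cur, off, len);            // fill one buffer from disk
        if (copy_done.valid())
            copy_done.wait();                 // the other buffer's copy must finish first
        copy_done = std::async(std::launch::async,
                               [cur, off, len] { copy_to_device(cur, off, len); });
    }
    if (copy_done.valid())
        copy_done.wait();
}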
@praasz (Contributor) left a comment:

The improvement in read speed is a good direction, but the integration with core must be corrected, as it introduces a kind of bypass of the main logic.


// Pass the cached blob file path to plugins that support it (e.g. GPU plugin)
// so they can use optimized parallel I/O to read weights directly from the blob file
if (!cacheContent.m_blob_id.empty() && util::contains(plugin.get_property(ov::supported_properties),
Contributor:

Changes to the core logic should be avoided, especially for device specific properties.
The cache entry is managed by the cache manager, and there should not be any logic here that adds such a property or bypasses what the cache manager opened. Also, using a hardcoded path is not correct.

The proper solution is to open the stream (fast version) or the (mmap) version, which allows better parallel reads.

Contributor Author (riverlijunjie):

Early on, I also hoped to do so, but ifstream cannot meet the requirement of parallel reads, since each thread needs to seek to a different offset to read data. Could you give a test sample of such a parallel read with a stream?
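
For reference, a minimal sketch of the kind of per-thread-stream parallel read under discussion: each task opens its own std::ifstream and seeks to its own offset, so no stream state is shared (illustrative only).

// Sketch: parallel read where each task owns a private std::ifstream.
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <future>
#include <string>
#include <vector>

bool parallel_ifstream_read(const std::string& path, char* dst, size_t size, size_t base_offset) {
    const size_t chunk = 4 * 1024 * 1024;  // placeholder chunk size
    std::vector<std::future<bool>> parts;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t len = std::min(chunk, size - off);
        parts.emplace_back(std::async(std::launch::async, [&path, dst, off, len, base_offset] {
            std::ifstream ifs(path, std::ios::binary);  // one stream per task
            if (!ifs.is_open())
                return false;
            ifs.seekg(static_cast<std::streamoff>(base_offset + off));
            ifs.read(dst + off, static_cast<std::streamsize>(len));
            return static_cast<size_t>(ifs.gcount()) == len;  // exact byte count required
        }));
    }
    bool ok = true;
    for (auto& p : parts)
        ok = p.get() && ok;
    return ok;
}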

OV_CONFIG_RELEASE_OPTION(ov::internal, value_cache_quant_mode, ov::internal::CacheQuantMode::BY_TOKEN, "AUTO or BY_CHANNEL or BY_TOKEN")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, mem_pool_util_threshold, 0.5, "Minimum utilization threshold (0.0~1.0) for reusable memory in the pool")
OV_CONFIG_RELEASE_OPTION(ov, enable_weightless, false, "Enable/Disable weightless blob")
OV_CONFIG_RELEASE_OPTION(ov::intel_gpu, cached_blob_path, "", "Path to the cached blob file used during cache loading for optimized parallel I/O")
Contributor:

The property should not be introduced.
Managing cache entries is a core responsibility, and it should not be bypassed.

Contributor Author (riverlijunjie):

Could you suggest a better way to pass the cache blob path to the GPU plugin for parallel reads?

Contributor:

@riverlijunjie
As discussed offline.
The core cache manager opens the cache in two ways, depending on the mmap flag.
With mmap disabled, the cache is opened as a stream and forwarded to the plugin. In this case, if there were a custom stream that hides the parallel read, the plugin could use it and benefit from the faster read, and it would work for all plugins.
With mmap enabled, the blob is opened as an ov::Tensor view on the mmap'd file. In this case the plugin (GPU) should have more native support to use the tensor and read the data as from a buffer (with a parallel option) instead of wrapping it in a stream. Reading should then be faster, and the mmap flag will not be bypassed by a custom GPU property.
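
A minimal sketch of the mmap variant described here, assuming the mapped blob is already exposed as a plain host buffer (for example, via the tensor's data pointer); names are illustrative:

// Sketch: parallel copy out of an already-mapped buffer (disjoint ranges);
// page faults pull the file contents in on demand.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <future>
#include <vector>

void parallel_copy_from_mapped(const char* mapped, char* dst, size_t size) {
    const size_t chunk = 4 * 1024 * 1024;  // placeholder chunk size
    std::vector<std::future<void>> parts;
    for (size_t off = 0; off < size; off += chunk) {
        const size_t len = std::min(chunk, size - off);
        parts.emplace_back(std::async(std::launch::async,
                                      [mapped, dst, off, len] { std::memcpy(dst + off, mapped + off, len); }));
    }
    for (auto& p : parts)
        p.get();
}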

Contributor Author (riverlijunjie):

Please see the solution in PR #34679.

}

#ifdef _WIN32
bool ov::util::read_binary_file_parallel(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
Contributor:

Move this implementation to a dedicated file for Windows under the os folder.

Comment on lines +276 to +278
const std::wstring& wpath = path.native();

HANDLE hFile = CreateFileW(wpath.c_str(),
Contributor:

Suggested change:
-    const std::wstring& wpath = path.native();
-    HANDLE hFile = CreateFileW(wpath.c_str(),
+    HANDLE hFile = CreateFileW(path.c_str(),

return false;

// Safety check: File size
LARGE_INTEGER fileSize;
Contributor:

Suggested change:
-    LARGE_INTEGER fileSize;
+    LARGE_INTEGER file_size;

Use snake_case for variables

@sungeunk (Contributor) left a comment:

LGTM. This change reduces the model loading time on PTLH.
Ran 4 executions after removing the cache files.

  • Master: R1 4.657s -> R4 1.133s
  • PR: R1 4.681s -> R4 0.441s


allocation_type _allocation_type = allocation_type::unknown;
ib >> make_data(&_allocation_type, sizeof(_allocation_type));
// std::cout << "load weights: allocation_type = " << static_cast<int>(_allocation_type) << ", weights_path = " << weights_path << std::endl;
Contributor:

Commented out code.

riverlijunjie (Contributor Author):

Thanks @praasz for the suggestion of hiding the parallel IO optimization inside ifstream, which makes it possible for every plugin to benefit without extra work. I have created a new PR for it: #34057

Copilot AI left a comment:

Pull request overview

This PR optimizes Intel GPU cached-model (.blob) loading latency by enabling parallel file I/O for large weight blocks and plumbing a cached-blob-path property from Core → GPU plugin to let the plugin read weight payloads directly from the cache file.

Changes:

  • Add a new GPU plugin property (GPU_CACHED_BLOB_PATH) and expose it as a supported property.
  • Pass the cache .blob path from CoreImpl::load_model_from_cache into plugin config when supported.
  • Introduce ov::util::read_binary_file_parallel(...) and use it in GPU data::load_weights() for large reads (with header-offset compensation logic).

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Summary per file:

  • src/plugins/intel_gpu/src/plugin/plugin.cpp: Exposes cached_blob_path in GPU supported properties.
  • src/plugins/intel_gpu/src/graph/program.cpp: Passes the cached blob path into data::load_weights() during program load.
  • src/plugins/intel_gpu/include/intel_gpu/runtime/options.inl: Declares the ov::intel_gpu::cached_blob_path option.
  • src/plugins/intel_gpu/include/intel_gpu/runtime/internal_properties.hpp: Defines the GPU_CACHED_BLOB_PATH property.
  • src/plugins/intel_gpu/include/intel_gpu/runtime/file_util.hpp: Fixes a typo in a comment ("throw").
  • src/plugins/intel_gpu/include/intel_gpu/primitives/data.hpp: Implements the fast path using parallel file reads for weight loading and adds ITT scopes.
  • src/plugins/intel_gpu/include/intel_gpu/graph/serialization/binary_buffer.hpp: Adds a get_stream() accessor and updates the read assertion message.
  • src/inference/src/dev/core_impl.cpp: Injects the blob path into the plugin config when supported.
  • src/common/util/src/file_util.cpp: Fixes load_binary() and implements read_binary_file_parallel() for Windows/Linux.
  • src/common/util/include/openvino/util/file_util.hpp: Declares the read_binary_file_parallel() API.
  • src/common/util/CMakeLists.txt: Links the Threads library for the new parallel I/O implementation.

Comment on lines +429 to +453
    auto cur_offset = ib.get_stream().tellg();

    // Auto-detect header offset compensation for path-based loading
    // This applies to both Windows and Linux Parallel loaders which open by path
    size_t offset_compensation = 0;

    // Save current position
    auto restore_pos = ib.get_stream().tellg();
    ib.get_stream().seekg(0, std::ios::end);
    auto stream_end = (size_t)ib.get_stream().tellg();
    ib.get_stream().seekg(restore_pos, std::ios::beg);

    int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
    size_t physical_size = (phys_size >= 0) ? static_cast<size_t>(phys_size) : 0;

    if (physical_size > stream_end) {
        offset_compensation = physical_size - stream_end;
    }

    used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                       mem->buffer_ptr(),
                                                       data_size,
                                                       (size_t)cur_offset + offset_compensation);
    if (used_fast_io) {
        ib.get_stream().seekg(data_size, std::ios::cur);
Copilot AI commented Mar 20, 2026:

[HIGH] Fast-I/O offset computation casts tellg() results directly to size_t and uses them for file offsets ((size_t)cur_offset + offset_compensation). If tellg() fails it returns -1, which becomes a huge size_t and can drive out-of-bounds reads. Please guard fast-I/O with checks that tellg()/seekg() succeed and positions are non-negative (otherwise disable fast-I/O and use the stream path).

Suggested change:

Original:
    auto cur_offset = ib.get_stream().tellg();

    // Auto-detect header offset compensation for path-based loading
    // This applies to both Windows and Linux Parallel loaders which open by path
    size_t offset_compensation = 0;

    // Save current position
    auto restore_pos = ib.get_stream().tellg();
    ib.get_stream().seekg(0, std::ios::end);
    auto stream_end = (size_t)ib.get_stream().tellg();
    ib.get_stream().seekg(restore_pos, std::ios::beg);

    int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
    size_t physical_size = (phys_size >= 0) ? static_cast<size_t>(phys_size) : 0;

    if (physical_size > stream_end) {
        offset_compensation = physical_size - stream_end;
    }

    used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                       mem->buffer_ptr(),
                                                       data_size,
                                                       (size_t)cur_offset + offset_compensation);
    if (used_fast_io) {
        ib.get_stream().seekg(data_size, std::ios::cur);

Suggested:
    auto& stream = ib.get_stream();
    // Get current offset and validate it
    std::streampos cur_offset_pos = stream.tellg();
    bool can_use_fast_io = stream.good() && (cur_offset_pos != std::streampos(-1));

    // Auto-detect header offset compensation for path-based loading
    // This applies to both Windows and Linux Parallel loaders which open by path
    size_t offset_compensation = 0;

    // Save current position
    std::streampos restore_pos = cur_offset_pos;
    if (can_use_fast_io) {
        stream.seekg(0, std::ios::end);
        std::streampos stream_end_pos = stream.tellg();
        can_use_fast_io = stream.good() && (stream_end_pos != std::streampos(-1));
        // Restore original position
        stream.seekg(restore_pos, std::ios::beg);
        can_use_fast_io = can_use_fast_io && stream.good();
        if (can_use_fast_io) {
            int64_t phys_size = ov::util::file_size(ov::util::make_path(weights_path));
            if (phys_size >= 0) {
                size_t physical_size = static_cast<size_t>(phys_size);
                size_t stream_end = static_cast<size_t>(static_cast<std::streamoff>(stream_end_pos));
                if (physical_size > stream_end) {
                    offset_compensation = physical_size - stream_end;
                }
                size_t cur_offset = static_cast<size_t>(static_cast<std::streamoff>(cur_offset_pos));
                size_t file_offset = cur_offset + offset_compensation;
                used_fast_io = ov::util::read_binary_file_parallel(ov::util::make_path(weights_path),
                                                                   mem->buffer_ptr(),
                                                                   data_size,
                                                                   file_offset);
            } else {
                can_use_fast_io = false;
            }
        }
    }
    if (used_fast_io) {
        stream.seekg(static_cast<std::streamoff>(data_size), std::ios::cur);

    auto const read_size = _stream.rdbuf()->sgetn(reinterpret_cast<char*>(data), size);
    OPENVINO_ASSERT(read_size == size,
-                   "[GPU] Failed to read " + std::to_string(size) + " bytes from stream! Read " + std::to_string(read_size));
+                   "[GPU] Failed to read " + std::to_string(size) + " bytes to stream! Read " + std::to_string(read_size));
Copilot AI commented Mar 20, 2026:

Assertion message says "Failed to read ... bytes to stream" but this function reads from the input stream. Please change wording back to "from stream" to avoid confusing diagnostics when cache import fails.

Suggested change:
-    "[GPU] Failed to read " + std::to_string(size) + " bytes to stream! Read " + std::to_string(read_size));
+    "[GPU] Failed to read " + std::to_string(size) + " bytes from stream! Read " + std::to_string(read_size));

Comment on lines +433 to +448
    futures.emplace_back(std::async(std::launch::async, [path, file_offset, ptr, read_size] {
        std::ifstream t_ifs(path, std::ios::binary);
        if (t_ifs.is_open()) {
            t_ifs.seekg(file_offset, std::ios::beg);
            t_ifs.read(static_cast<char*>(ptr), read_size);
        }
    }));

    current_offset += read_size;
}

for (auto& f : futures) {
    f.get();
}
return true; // Simplified error handling for parallel ifstream
Copilot AI commented Mar 20, 2026:

[BLOCKER] read_binary_file_parallel on non-Windows always returns true and doesn’t validate that each thread successfully opened/seeked/read the requested bytes. This can silently return partially/uninitialized data and cause model corruption. Please propagate per-thread read/seek failures via an atomic status, validate file size vs offset+size, and return false if any chunk read fails (and consider a single-thread fallback when parallel read fails).
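
A minimal sketch of the requested status propagation, reusing the chunk-loop structure from the diff above (variable names mirror the diff; the exact integration is illustrative):

// Sketch: per-thread read failures propagated via an atomic flag, so partial
// or failed chunk reads make the whole call return false.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <future>
#include <vector>

bool read_parallel_checked(const std::filesystem::path& path, void* buffer, size_t size, size_t offset) {
    const size_t chunk_size = 4 * 1024 * 1024;  // placeholder
    std::atomic<bool> ok{true};
    std::vector<std::future<void>> futures;
    size_t current_offset = 0;
    size_t remaining_size = size;
    while (remaining_size > 0) {
        const size_t read_size = std::min(remaining_size, chunk_size);
        char* ptr = static_cast<char*>(buffer) + current_offset;
        const size_t file_offset = offset + current_offset;
        futures.emplace_back(std::async(std::launch::async, [&path, &ok, ptr, read_size, file_offset] {
            std::ifstream t_ifs(path, std::ios::binary);
            if (!t_ifs.is_open()) {
                ok = false;  // report open failure instead of silently skipping the chunk
                return;
            }
            t_ifs.seekg(static_cast<std::streamoff>(file_offset), std::ios::beg);
            t_ifs.read(ptr, static_cast<std::streamsize>(read_size));
            // a short read would leave stale bytes in the destination buffer
            if (static_cast<size_t>(t_ifs.gcount()) != read_size)
                ok = false;
        }));
        current_offset += read_size;
        remaining_size -= read_size;
    }
    for (auto& f : futures)
        f.get();
    return ok;  // caller can fall back to a single-threaded read on false
}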

Comment on lines +375 to +387
    while (remaining_size > 0 && overall_status) {
        DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));

        OVERLAPPED ov = {0};
        ov.Offset = static_cast<DWORD>(current_file_offset & 0xFFFFFFFF);
        ov.OffsetHigh = static_cast<DWORD>((current_file_offset >> 32) & 0xFFFFFFFF);

        DWORD bytesRead = 0;
        if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, &ov) || bytesRead != to_read) {
            if (GetLastError() != ERROR_IO_PENDING) {
                overall_status = false;
                break;
            }
Copilot AI commented Mar 20, 2026:

[BLOCKER] Windows implementation: on ReadFile failure you treat ERROR_IO_PENDING as non-fatal, but the handles are opened without FILE_FLAG_OVERLAPPED and there is no GetOverlappedResult/event wait. If ERROR_IO_PENDING ever occurs, bytesRead can remain 0, causing an infinite loop or silent short read. Suggestion: either open with FILE_FLAG_OVERLAPPED and properly wait for completion, or (simpler) treat any ReadFile failure / short read as an immediate error and return false.

Suggested change:

Original:
    while (remaining_size > 0 && overall_status) {
        DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));

        OVERLAPPED ov = {0};
        ov.Offset = static_cast<DWORD>(current_file_offset & 0xFFFFFFFF);
        ov.OffsetHigh = static_cast<DWORD>((current_file_offset >> 32) & 0xFFFFFFFF);

        DWORD bytesRead = 0;
        if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, &ov) || bytesRead != to_read) {
            if (GetLastError() != ERROR_IO_PENDING) {
                overall_status = false;
                break;
            }

Suggested:
    LARGE_INTEGER li;
    li.QuadPart = static_cast<LONGLONG>(file_offset);
    if (!SetFilePointerEx(t_hFile, li, nullptr, FILE_BEGIN)) {
        overall_status = false;
        CloseHandle(t_hFile);
        return;
    }
    while (remaining_size > 0 && overall_status) {
        DWORD to_read = static_cast<DWORD>(std::min(remaining_size, static_cast<size_t>(UINT_MAX - 1024)));
        DWORD bytesRead = 0;
        if (!ReadFile(t_hFile, current_ptr, to_read, &bytesRead, nullptr) || bytesRead != to_read) {
            overall_status = false;
            break;

Comment on lines 507 to +523
     if (buf_flag) {
-        ib >> make_data(_buf1.data(), copy_size);
+        if (can_use_fast_io) {
+            ov::util::read_binary_file_parallel(file_path, _buf1, copy_size, file_base_offset + dst_offset);
+        } else {
+            ib >> make_data(_buf1, copy_size);
+        }
         if (ev2 != nullptr) {
             ev2->wait();
             ev2 = nullptr;
         }
-        ev1 = mem->copy_from(strm, _buf1.data(), src_offset, dst_offset, copy_size, is_blocking);
+        ev1 = mem->copy_from(strm, *buf1_mem, src_offset, dst_offset, copy_size, is_blocking);
     } else {
-        ib >> make_data(_buf2.data(), copy_size);
+        if (can_use_fast_io) {
+            ov::util::read_binary_file_parallel(file_path, _buf2, copy_size, file_base_offset + dst_offset);
+        } else {
+            ib >> make_data(_buf2, copy_size);
+        }
Copilot AI commented Mar 20, 2026:

[HIGH] In the non-host-accessible path, when can_use_fast_io is true the return value from ov::util::read_binary_file_parallel(...) is ignored. If the parallel read fails, _buf1/_buf2 may contain stale/uninitialized bytes and will still be copied into GPU memory. Please check the boolean result per chunk (or once for the whole region), and fall back to stream-based ib >> make_data(...) or throw on failure before issuing mem->copy_from.
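
A short sketch of the requested check-and-fallback, using the names from the diff above (illustrative; stream position handling is elided):

// Sketch: only trust the fast path if it reports success; otherwise fall back
// to the stream read so the buffer never holds stale bytes.
if (!can_use_fast_io ||
    !ov::util::read_binary_file_parallel(file_path, _buf1, copy_size, file_base_offset + dst_offset)) {
    ib >> make_data(_buf1, copy_size);  // stream-based fallback
}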

@praasz praasz modified the milestones: 2026.1, 2026.2 Mar 20, 2026

Labels

  • category: build (OpenVINO cmake script / infra)
  • category: Core (OpenVINO Core, aka ngraph)
  • category: GPU (OpenVINO GPU plugin)
  • category: inference (OpenVINO Runtime library - Inference)


8 participants