[GPU] Remove Unnecessary Convert Before Permute#33088
Closed
kwieloch-intel wants to merge 38 commits into openvinotoolkit:master
Description: If the Permute (tile) node supports FP16 input, there is no need to convert data to FP32 before the permute operation. This change removes redundant type conversions, improving graph execution efficiency. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
Lyamin-Roman
reviewed
Dec 3, 2025
src/plugins/intel_gpu/src/graph/graph_optimizer/reorder_transfer.cpp
…ansfer Description: Added usage of `data_type_traits::size_of`. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Introduced a condition that preserves the existing behavior for constant reorders, while enforcing the data size check for dynamic reorders only. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Removed an unintended blank line that was mistakenly added below the recently modified condition in the previous commit. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Added a test verifying that constant reorder nodes ignore the input/output size check and a test verifying that dynamic reorder nodes apply it. These tests cover the cases input_size <, >, and == output_size, ensuring the robustness of the reorder_transfer optimization logic. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
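The constant/dynamic distinction described in these commits can be sketched in isolation. This is a minimal illustration, not the code from reorder_transfer.cpp: `DataType`, `ReorderInfo`, `size_of`, and `passes_size_check` are hypothetical stand-ins (the plugin itself uses `data_type_traits::size_of`), and the direction of the comparison (transfer only when the reorder widens the element type, e.g. FP16 to FP32) is an assumption inferred from the PR's goal.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the plugin's data type enum.
enum class DataType { u8, f16, f32 };

// Bytes per element; plays the role of data_type_traits::size_of.
inline std::size_t size_of(DataType t) {
    switch (t) {
        case DataType::u8:  return 1;
        case DataType::f16: return 2;
        case DataType::f32: return 4;
    }
    throw std::invalid_argument("unknown data type");
}

// Hypothetical summary of a type-conversion reorder.
struct ReorderInfo {
    DataType input;
    DataType output;
    bool is_constant;  // true when the reorder operates on constant data
};

// Constant reorders keep the previous behavior and skip the size check.
// Dynamic reorders pass only when the output element is wider than the
// input element (assumed direction, see the note above).
inline bool passes_size_check(const ReorderInfo& r) {
    if (r.is_constant)
        return true;
    return size_of(r.input) < size_of(r.output);
}

// Usage example covering the <, >, and == cases for a dynamic reorder.
inline void size_check_demo() {
    assert(passes_size_check(ReorderInfo{DataType::f16, DataType::f32, false}));
    assert(!passes_size_check(ReorderInfo{DataType::f32, DataType::f16, false}));
    assert(!passes_size_check(ReorderInfo{DataType::f16, DataType::f16, false}));
}
```

Under this assumption, moving a widening conversion behind the permute lets the permute work on the narrower type, which is the saving the PR is after.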
Contributor
no perf issue from dgpu daily test for static shape
Description: Added missing test logic comparing cases with and without the graph optimizer, as well as with increasing and decreasing data types. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Lyamin-Roman
approved these changes
Dec 5, 2025
Contributor
no issue from dgpu static-shape daily test
p-durandin
approved these changes
Dec 8, 2025
Contributor
build_jenkins
Contributor
Author
Please don't merge yet, I'm working on an additional accuracy fix.
…timizer
Description:
The permute order condition ({0,3,1,2}) restricts the optimization to the exact byxf→bfyx pattern. Without this condition, the convolution's output format (f16:byxf) is incorrect; before the 'convert' node was removed, the convolution output was f16:b_fs_yx_fsv16, which was correct. The added condition limits the optimizer's reach but ensures correct results.
Keywords:
Graph Optimization
Testing:
Manually verified
Ticket:
CVS-175920
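The exact-order restriction described above amounts to a plain vector comparison. The sketch below is illustrative only: the function name `is_restricted_permute_order` is hypothetical, and representing the order as `std::vector<uint16_t>` is an assumption about how the order is stored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns true only for the exact order {0, 3, 1, 2}, the single pattern
// the commit message says the optimization is limited to.
inline bool is_restricted_permute_order(const std::vector<std::uint16_t>& order) {
    static const std::vector<std::uint16_t> expected{0, 3, 1, 2};
    return order == expected;
}

// Usage example: any other order leaves the graph untouched.
inline void permute_order_demo() {
    assert(is_restricted_permute_order({0, 3, 1, 2}));
    assert(!is_restricted_permute_order({0, 2, 3, 1}));
}
```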
Description: The optimization handles cases where a permute is followed by a reorder that only changes the data type. If the permute can produce the output data type directly, the reorder is fused into the permute. The pass should run after remove_redundant_reorders, because some reorders might be removed there, leaving the permute directly connected to another node. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
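As a rough illustration of the fusion rule in this commit, the sketch below models a permute and its following reorder as two small structs. Everything here is hypothetical: the `Permute` and `Reorder` types, the `supported_output_types` set, and the signature of `try_fuse_reorder_to_permute` are invented for the example and do not mirror the plugin's node classes.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for the plugin's data type enum.
enum class DataType { u8, f16, f32 };

// Hypothetical permute node: its current output type and the output
// types its kernel is assumed to be able to write directly.
struct Permute {
    DataType output_type;
    std::set<DataType> supported_output_types;
};

// Hypothetical reorder node that follows the permute.
struct Reorder {
    DataType output_type;
    bool type_conversion_only;  // same format, only the data type changes
};

// Returns true when the reorder can be dropped; in that case the permute
// takes over the reorder's output type. Otherwise nothing is modified.
inline bool try_fuse_reorder_to_permute(Permute& permute, const Reorder& reorder) {
    if (!reorder.type_conversion_only)
        return false;
    if (permute.supported_output_types.count(reorder.output_type) == 0)
        return false;
    permute.output_type = reorder.output_type;
    return true;
}

// Usage example: an f16 permute absorbs a following f16->f32 reorder.
inline void fuse_demo() {
    Permute p{DataType::f16, {DataType::f16, DataType::f32}};
    Reorder r{DataType::f32, true};
    assert(try_fuse_reorder_to_permute(p, r));
    assert(p.output_type == DataType::f32);
}
```

In a real graph the caller would also have to rewire the reorder's users to the permute and remove the reorder node; that bookkeeping is omitted here.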
Description: Extracted optimization logic into a named lambda function (try_fuse_reorder_to_permute) for consistency with other local optimization functions. Moved the update_implementations condition outside the loop to skip the entire optimization pass when not enabled, improving efficiency. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Added a simple test for the new permute-reorder fusion optimization. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Name changed, because the previous one was confusing. Now we clearly distinguish between nodes and primitives. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Improved set_primitive_output_data_type safety by adding copy-on-write for shared primitives, early return for redundant changes, and explicit bounds checking for output data types. Ensured graph consistency by recalculating output layouts after the modification. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
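The three safety measures named in this commit (copy-on-write, early return, bounds checking) can be sketched with a `std::shared_ptr`. This is a simplified model, not the plugin's `program_node`: the `Primitive` and `Node` types are hypothetical, invalid input is reported with standard exceptions as an assumption, and the layout recalculation mentioned in the commit is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the plugin's data type enum.
enum class DataType { f16, f32 };

// Hypothetical primitive descriptor that several nodes may share.
struct Primitive {
    std::vector<DataType> output_data_types;
};

// Hypothetical node holding a possibly shared descriptor.
struct Node {
    std::shared_ptr<Primitive> desc;

    void set_primitive_output_data_type(DataType dt, std::size_t idx = 0) {
        if (!desc)
            throw std::logic_error("null primitive descriptor");
        if (idx >= desc->output_data_types.size())
            throw std::out_of_range("output data type index out of range");
        if (desc->output_data_types[idx] == dt)
            return;  // early return: nothing to change
        if (desc.use_count() > 1)
            desc = std::make_shared<Primitive>(*desc);  // copy-on-write
        desc->output_data_types[idx] = dt;
    }
};

// Usage example: changing one node must not affect a node sharing the descriptor.
inline void cow_demo() {
    auto shared = std::make_shared<Primitive>(Primitive{{DataType::f16}});
    Node a{shared};
    Node b{shared};
    a.set_primitive_output_data_type(DataType::f32);
    assert(a.desc != b.desc);
    assert(b.desc->output_data_types[0] == DataType::f16);
}
```

The copy happens only when the descriptor is actually shared and the value actually changes, so the common case stays allocation-free.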
…transfer Description: Because is_type_conversion_only() throws an exception for some nodes, the call is wrapped in a try-catch block. This should fix the Python tests in "tensorflow_tests/test_tf_Conv2DBackpropInput.py". Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: The convert-node optimization opened up new execution paths for permute kernels. The CI test `fusings_gpu/permute_quant_u8` started failing due to an incorrect cast of uchar4 to float4 in the permute_f_y_axes.cl kernel. This fix resolves the issue. Keywords: Performance, Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Update the test to check if the "permute" node is present, as it may be optimized out. If the node is missing, the test now exits early to avoid exceptions and unnecessary assertions. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Added a check in remove_redundant_reorders::run to skip redundant reorder removal when the dependency node is a constant. This ensures correct handling of constant data and avoids potential incorrect optimizations. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
…tion-based execution Description: Refactored the logic for identifying simple type conversion reorders by explicitly checking for valid output layouts before calling is_type_conversion_only(). This removes the need for exception handling and simplifies the code by eliminating the try-catch block and additional variable. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
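The refactoring in this commit replaces exception handling with a precondition check. The sketch below shows the pattern with a hypothetical `ReorderNode`; the `output_layout_valid` flag, the `Layout` struct, and the throwing behavior of this toy `is_type_conversion_only()` are assumptions made to reproduce the situation, not the plugin's actual implementation.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical layout: a format id and a data type id.
struct Layout {
    int format;
    int data_type;
};

// Hypothetical reorder node.
struct ReorderNode {
    bool output_layout_valid;
    Layout input;
    Layout output;

    // Mimics a predicate that throws when the output layout is not yet known.
    bool is_type_conversion_only() const {
        if (!output_layout_valid)
            throw std::runtime_error("output layout is not calculated");
        return input.format == output.format && input.data_type != output.data_type;
    }
};

// Guarded call: short-circuit evaluation ensures the throwing predicate
// runs only when its precondition holds, so no try-catch is needed.
inline bool is_simple_type_conversion(const ReorderNode& node) {
    return node.output_layout_valid && node.is_type_conversion_only();
}

// Usage example: a node without a valid layout is skipped, not thrown on.
inline void guard_demo() {
    ReorderNode pending{false, {0, 1}, {0, 2}};
    assert(!is_simple_type_conversion(pending));
    ReorderNode ready{true, {0, 1}, {0, 2}};
    assert(is_simple_type_conversion(ready));
}
```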
Do not optimize out reorder ops before permute nodes that have fused primitives, as changing the output data type/layout may affect fusion. Restored the related test to its original form, ensuring the reorders are not optimized out when fusion is present. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: program_node::set_primitive_output_data_type now throws exceptions for a null descriptor and an out-of-range index instead of silently returning or resizing. This prevents silent failures. Minor formatting cleanup in reorder_transfer::run. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Description: Replaced manual exceptions with OPENVINO_ASSERT to validate null primitive descriptors and output data type index bounds in program_node. This aligns error handling with current OpenVINO standards. Keywords: Graph Optimization Testing: Manually verified Ticket: CVS-175920
Contributor
This PR will be closed in a week because of 2 weeks of no activity.
Contributor
@kwieloch-intel is any continuation expected?
[GPU] Remove Unnecessary Conversion Before Permute
Description:
When the Permute Tile node supports FP16 input, converting data to FP32 prior to the permute operation is redundant. This update eliminates the unnecessary conversion node, resulting in improved performance.
Comprehensive performance results are documented in CVS-175920.
Implementation Details:
Reproduction Steps and Snapshot:
A detailed description is available at the end of the description section in the JIRA ticket: CVS-175920.
Graph Visualization:
BEFORE
flowchart LR
    A[Upsample resample_ref] -->|FP16| B[Convert reorder_data_fast_b1]
    B -->|FP32| C[Resize permute_tile_8x8_4x4]
    C -->|FP32| D[Output]
AFTER
flowchart LR
    E[Upsample resample_ref] -->|FP16| F[Resize permute_tile_8x8_4x4]
    F -->|FP32| G[Output]
Checklist:
Tickets: