
[GPU] Remove Unnecessary Convert Before Permute#33088

Closed
kwieloch-intel wants to merge 38 commits into openvinotoolkit:master from
kwieloch-intel:permute_tile_convert_fix

Conversation

@kwieloch-intel
Contributor

@kwieloch-intel kwieloch-intel commented Dec 1, 2025

[GPU] Remove Unnecessary Conversion Before Permute

Description:

When the Permute Tile node supports FP16 input, converting data to FP32 prior to the permute operation is redundant. This update eliminates the unnecessary conversion node, resulting in improved performance.

Comprehensive performance results are documented in CVS-175920.

Implementation Details:

  • The graph optimizer conditions within reorder_transfer have been updated to perform conversion from a lower to a higher data type only, or to skip conversion entirely if the transpose supports the required output data type.
  • The existing behavior for conversions from higher to lower data types is preserved to minimize the amount of data processed in transpose.
  • This change enhances performance, as demonstrated by the results attached to CVS-175920.
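One reading of the conditions above, as a minimal self-contained sketch. It is not the plugin's code: the `DataType` enum, the `size_of` helper (standing in for `data_type_traits::size_of`), `ConvertPlacement`, and `place_convert` are hypothetical names, and the exact comparison used in `reorder_transfer` is an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-in for the plugin's data type enum.
enum class DataType { u8, f16, f32 };

// Element size in bytes; plays the role of data_type_traits::size_of here.
constexpr std::size_t size_of(DataType t) {
    switch (t) {
        case DataType::u8:  return 1;
        case DataType::f16: return 2;
        case DataType::f32: return 4;
    }
    return 0;  // unreachable for valid enum values
}

// Where the type conversion ends up relative to the permute (hypothetical).
enum class ConvertPlacement { keep_before_permute, move_after_permute, drop };

constexpr ConvertPlacement place_convert(DataType in, DataType out,
                                         bool permute_emits_out_type) {
    // Narrowing (or same-size) conversions stay in front of the permute,
    // so the permute processes as little data as possible.
    if (size_of(in) >= size_of(out)) {
        return ConvertPlacement::keep_before_permute;
    }
    // Widening: drop the convert if the permute can write the wider type
    // itself; otherwise convert only after the permute.
    return permute_emits_out_type ? ConvertPlacement::drop
                                  : ConvertPlacement::move_after_permute;
}
```

For the FP16 to FP32 case in the graphs of this PR, `place_convert(DataType::f16, DataType::f32, true)` yields `drop`, which corresponds to removing the convert node.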

Reproduction Steps and Snapshot:

A detailed description is available at the end of the description section in the JIRA ticket: CVS-175920.

Graph Visualization:

BEFORE
flowchart LR
    A[Upsample resample_ref] -->|FP16| B[Convert reorder_data_fast_b1]
    B -->|FP32| C[Resize permute_tile_8x8_4x4]
    C -->|FP32| D[Output]
AFTER
flowchart LR
    E[Upsample resample_ref] -->|FP16| F[Resize permute_tile_8x8_4x4]
    F -->|FP32| G[Output]

Checklist:

  • Is this a proper fix?
  • Have you included a test case for this fix, if necessary?
  • Have you reviewed existing tests that could be extended to cover this scenario?

Tickets:

Description:
If the Permute (tile) node supports FP16 input, there is no need to convert data to FP32 before the permute operation. This change removes redundant type conversions, improving graph execution efficiency.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
@github-actions github-actions bot added the category: GPU OpenVINO GPU plugin label Dec 1, 2025
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Dec 1, 2025
…ansfer

Description:
Added usage of `data_type_traits::size_of`.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Introduced a condition that preserves the existing behavior for constant reorders, while enforcing the data size check for dynamic reorders only.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
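The constant/dynamic distinction in the commit above might look roughly like the following. This is a sketch under assumptions: `ReorderInfo` and `eligible_for_transfer` are hypothetical names, and the strict `<` comparison is a guess at the size check rather than the confirmed condition.

```cpp
#include <cstddef>

// Hypothetical summary of a reorder node; not a plugin type.
struct ReorderInfo {
    bool is_constant;               // reorder applied to constant data
    std::size_t input_elem_bytes;   // element size of the input data type
    std::size_t output_elem_bytes;  // element size of the output data type
};

bool eligible_for_transfer(const ReorderInfo& r) {
    // Constant reorders keep the previous behavior: no size check at all.
    if (r.is_constant) {
        return true;
    }
    // Dynamic (non-constant) reorders must widen the data type
    // (assumed comparison).
    return r.input_elem_bytes < r.output_elem_bytes;
}
```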
Description:
Removed an unintended blank line that was mistakenly added below the recently modified condition in the previous commit.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Added a test verifying that constant reorder nodes ignore the input/output size check and a test verifying that dynamic reorder nodes apply it.
These tests cover cases for input_size <, >, and == output_size, ensuring robustness of reorder_transfer optimization logic.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
@isanghao
Contributor

isanghao commented Dec 5, 2025

no perf issue from dgpu daily test for static shape

Description:
Added missing test logic comparing cases with and without the graph optimizer, as well as with increasing and decreasing data types.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
@Lyamin-Roman Lyamin-Roman marked this pull request as ready for review December 5, 2025 17:33
@Lyamin-Roman Lyamin-Roman requested review from a team as code owners December 5, 2025 17:33
@isanghao
Contributor

isanghao commented Dec 8, 2025

no issue from dgpu static-shape daily test

@p-durandin
Contributor

build_jenkins

@p-durandin p-durandin added this to the 2026.0 milestone Dec 8, 2025
@kwieloch-intel kwieloch-intel marked this pull request as draft December 8, 2025 13:38
@kwieloch-intel
Contributor Author

Please don't merge yet, I'm working on an additional accuracy fix.

kwieloch-intel and others added 6 commits December 11, 2025 03:43
…timizer

Description:
The permute order ({0,3,1,2}) condition restricts the optimization to the exact byxf→bfyx pattern. Without this condition, the output format of the convolution (f16:byxf) is incorrect. Before the 'convert' node was removed, the output of the convolution was f16:b_fs_yx_fsv16, which was correct. The added condition narrows the optimization but keeps the results correct.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
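A small sketch of the order check described in the commit above. The helper names are hypothetical, and the convention that output axis `i` takes input axis `order[i]` is an assumption about how the order is interpreted.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// True only for the exact order {0,3,1,2} (hypothetical helper).
bool is_byxf_to_bfyx_order(const std::vector<std::uint16_t>& order) {
    return order == std::vector<std::uint16_t>{0, 3, 1, 2};
}

// Applies a permute order to axis labels, assuming output axis i is taken
// from input axis order[i]. Throws std::out_of_range on a bad index.
std::string apply_order(const std::string& axes,
                        const std::vector<std::uint16_t>& order) {
    std::string out;
    for (std::uint16_t idx : order) {
        out.push_back(axes.at(idx));
    }
    return out;
}
```

Under that convention, `apply_order("byxf", {0, 3, 1, 2})` produces the label string `bfyx`, which is why this particular order identifies the byxf→bfyx pattern.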
@e-ddykim e-ddykim marked this pull request as ready for review January 13, 2026 15:37
Description:
The optimization handles cases where a permute is followed by a reorder that only changes the data type. If the permute supports the output data type directly, the reorder can be fused into the permute. The pass should run after remove_redundant_reorders, because some reorders may be removed there, leaving the permute directly connected to another node.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
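The fusion described in the commit above can be illustrated on a toy linear chain of nodes. Everything here is a simplified model: `Node`, `Kind`, `permute_supports_output`, and `fuse_reorders_into_permutes` are hypothetical, and the set of supported output types is an assumption.

```cpp
#include <cstddef>
#include <vector>

enum class DataType { u8, f16, f32 };
enum class Kind { permute, reorder, other };

// Toy node in a linear chain; not a plugin type.
struct Node {
    Kind kind;
    DataType in;
    DataType out;
    bool type_conversion_only;  // meaningful for reorders only
};

// Assumption: the permute kernel can write f16 or f32 output directly.
bool permute_supports_output(DataType dt) {
    return dt == DataType::f16 || dt == DataType::f32;
}

// Folds "permute -> type-only reorder" pairs into the permute.
// Returns the number of reorders removed.
std::size_t fuse_reorders_into_permutes(std::vector<Node>& chain) {
    std::size_t fused = 0;
    std::size_t i = 0;
    while (i + 1 < chain.size()) {
        Node& p = chain[i];
        const Node& r = chain[i + 1];
        if (p.kind == Kind::permute && r.kind == Kind::reorder &&
            r.type_conversion_only && permute_supports_output(r.out)) {
            p.out = r.out;  // the permute now emits the reorder's type
            chain.erase(chain.begin() + static_cast<std::ptrdiff_t>(i + 1));
            ++fused;        // stay on i: another reorder may follow
        } else {
            ++i;
        }
    }
    return fused;
}
```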
Description:
Extracted optimization logic into a named lambda function (try_fuse_reorder_to_permute) for consistency with other local optimization functions. Moved the update_implementations condition outside the loop to skip the entire optimization pass when not enabled, improving efficiency.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Added a simple test for the newly introduced optimization that fuses a permute with a reorder.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Renamed because the previous name was confusing. The new name clearly distinguishes between nodes and primitives.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Improved set_primitive_output_data_type safety by adding copy-on-write for shared primitives, early return for redundant changes, and explicit bounds checking for output data types. Ensured graph consistency by recalculating output layouts after the modification.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
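A minimal sketch of the copy-on-write, early-return, and bounds-checking ideas from the commit above. `PrimitiveDesc` and `ProgramNode` are hypothetical stand-ins, not the plugin's classes. The layout recalculation mentioned in the commit is omitted, and invalid input throws here, in line with what a later commit in this PR describes.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <vector>

enum class DataType { u8, f16, f32 };

// Hypothetical primitive descriptor that several nodes may share.
struct PrimitiveDesc {
    std::vector<DataType> output_data_types;
};

struct ProgramNode {
    std::shared_ptr<PrimitiveDesc> desc;

    // Returns true if the descriptor was modified.
    bool set_output_data_type(DataType dt, std::size_t idx = 0) {
        if (!desc) {
            throw std::logic_error("null primitive descriptor");
        }
        if (idx >= desc->output_data_types.size()) {
            throw std::out_of_range("output index out of range");
        }
        if (desc->output_data_types[idx] == dt) {
            return false;  // early return: nothing to change
        }
        if (desc.use_count() > 1) {
            // Copy-on-write: other owners keep the original descriptor.
            desc = std::make_shared<PrimitiveDesc>(*desc);
        }
        desc->output_data_types[idx] = dt;
        return true;
    }
};
```

The `use_count()` test is only reliable in single-threaded code, which is the assumption made in this sketch.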
…transfer

Description:
Because is_type_conversion_only() throws an exception for some nodes, the call is wrapped in a try-catch block. This should fix the Python "tensorflow_tests/test_tf_Conv2DBackpropInput.py" tests.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Optimizations to the convert node have opened up new execution paths for permute kernels. The CI test `fusings_gpu/permute_quant_u8` started failing because of an incorrect cast from uchar4 to float4 in the permute_f_y_axes.cl kernel. This fix resolves the issue.

Keywords:
Performance, Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Update the test to check if the "permute" node is present, as it may be optimized out. If the node is missing, the test now exits early to avoid exceptions and unnecessary assertions.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Added a check in remove_redundant_reorders::run to skip redundant reorder removal when the dependency node is a constant. This ensures correct handling of constant data and avoids potential incorrect optimizations.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
…tion-based execution

Description:
Refactored the logic for identifying simple type conversion reorders by explicitly checking for valid output layouts before calling is_type_conversion_only(). This removes the need for exception handling and simplifies the code by eliminating the try-catch block and additional variable.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
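The guard-before-call pattern from the commit above, as a hedged sketch. `Layout`, `ReorderNode`, and `is_simple_type_conversion` are hypothetical, and the toy `is_type_conversion_only()` only mimics the behavior of throwing when the output layout is not valid.

```cpp
#include <stdexcept>
#include <string>

enum class DataType { u8, f16, f32 };

// Toy layout: a data type plus a format name such as "bfyx".
struct Layout {
    DataType data_type;
    std::string format;
};

struct ReorderNode {
    Layout input;
    Layout output;
    bool output_valid;  // false until the output layout has been computed

    // Mimics a check that throws when the output layout is not valid.
    bool is_type_conversion_only() const {
        if (!output_valid) {
            throw std::runtime_error("output layout is not valid");
        }
        return input.format == output.format &&
               input.data_type != output.data_type;
    }
};

// Checking validity first means the throwing path is never reached,
// so no try-catch block is needed at the call site.
bool is_simple_type_conversion(const ReorderNode& node) {
    return node.output_valid && node.is_type_conversion_only();
}
```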
Description:
Do not optimize out reorder ops before permute nodes that have fused primitives, because changing the output data type or layout may affect fusion. Restored the related tests to their original form, ensuring these ops are not optimized out when fusion is present.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
program_node::set_primitive_output_data_type now throws an exception for a null descriptor or an out-of-range index instead of silently returning or resizing. This prevents silent failures. Also includes minor formatting cleanup in reorder_transfer::run.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
Description:
Replaced manual exceptions with OPENVINO_ASSERT to validate null primitive descriptors and output data type index bounds in program_node. This aligns error handling with current OpenVINO standards.

Keywords:
Graph Optimization

Testing:
Manually verified

Ticket:
CVS-175920
@github-actions
Contributor

github-actions bot commented Mar 1, 2026

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Mar 1, 2026
@kwieloch-intel kwieloch-intel marked this pull request as draft March 4, 2026 11:05
@p-durandin
Contributor

> Please don't merge yet, I'm working on an additional accuracy fix.

@kwieloch-intel is any continuation expected?


7 participants