[LinalgExt] Implement direct vectorization for im2col op (#23855)
Force-pushed from 155af4b to 35a5c1e.
This PR will depend on #23859 landing in order to avoid performance regressions, but the implementation in this PR will stay the same. I will share benchmark results once I have them. My first run failed for some reason.
Force-pushed from 35a5c1e to 12e6f6c.
…uctionsOptimization (#23947)

The pass was bailing out on vector.transfer_read ops with non-identity permutation maps (e.g., 1D reads from a 4D memref). After #23855, we will frequently see 1D reads, which need to be supported here. Ideally, we would do something like what is done in #23859, but that approach is causing performance regressions that are difficult to deal with. For now, this provides a solution for the new mask types we will be seeing.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
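A hedged sketch of the kind of read the commit message describes (shapes and SSA names are illustrative, not taken from the PR): a 1D `vector.transfer_read` from a 4D memref carries a permutation map that projects away the three outer dimensions, so it is not the identity map.

```mlir
// 1D read from a 4D memref. The map (d0, d1, d2, d3) -> (d3) reads only
// along the innermost dimension, which is what the pass previously
// bailed out on.
%v = vector.transfer_read %src[%i, %j, %k, %l], %pad
       {permutation_map = affine_map<(d0, d1, d2, d3) -> (d3)>,
        in_bounds = [true]}
       : memref<2x4x8x16xf32>, vector<16xf32>
```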
Force-pushed from 1ab1043 to 53f0f34.
hanhanW left a comment:
This is a very large PR. I wonder if we should break it into at least two PRs:
- padding semantics
- vectorization
I'm not sure I follow exactly what you mean. Is your point that the op-specific tests (like map_store, transfer_gather, etc.) should be moved into masked_configured/masked_inferred/unmasked?
Each file tests a different configuration. E.g., the _unmasked tests are mainly for cases where input_vector_size is empty, masked_configured is mainly for cases that have an input_vector_size derived from configurations, etc. This one belongs in _unmasked IMO.
I see, thanks for the context. I'll move it. I think map_store should also be moved there, which I can do in another PR.
This is already split up into as small of a chunk as seems reasonable to me. This is part of a larger refactor to the IGEMM path in order to enable additional flattening, and each piece needs to land without performance regressions. Going through the process of splitting out parts of the implementation, re-reviewing the split-up changes, and testing/debugging on our full suite of convolutions is time consuming, and I've already spent a great deal of time on that to get to the split that I have so far. The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do. One option would be to just add the fields to the im2col op but not use them, although I think that doesn't provide much benefit in terms of review (the padding PR would be very small, and it provides useful context when looking at it together with this PR anyway), and will more likely just slow down the process.

I agree that we should try to keep well-scoped PRs to help make review easier, but sometimes splitting up implementation can just slow down the process, which I believe is the case here. I feel that I've already spent too much time spinning on debugging/fixing random performance regressions that happen at intermediate stages of the split but not in the final state, and I don't want to keep wasting time on things like that.

If folks really need this PR to be split up further, then I can do it, but IMO it isn't really worth it.
My "skim-through read" is that there are changes in both tiling and vectorization. The vectorization is new, and the tiling implementation already exists today. So my intuition is that you can add padding semantics with the tiling changes, and then we can add vectorization support later on. I thought that you could always isolate the vectorization changes from this PR, and the rest could go in a single PR. Am I missing something?
> The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do.
I don't follow the performance regression issue. You add padding semantics to the op, but it does not change the input, and the folding does not exist yet? In that case, adding the semantics (with the existing interface implementation) does not hurt performance, right?
Let me try to split it up. I think that should be fine, since it will effectively be a non-functional change. My concern is about adding the im2col(pad) canonicalizers, which would require me to actually support decomposition with padding in the intermediate state, which would probably just cause performance regressions. I can leave decomposition with padding unimplemented in the intermediate state, though.

On a more meta side note: I think my view on PR reviews has changed somewhat with AI-assisted coding and review, so that's probably contributing to my feeling here. Having the changes together makes it easier to navigate the full picture, and AI helps us filter through large changes more easily. It makes it easier to review large PRs, and we typically do end up implementing large, unsplit prototypes before splitting them up into reviewable pieces. My opinion is increasingly that the splitting-up part of the process matters less, because having all the context in one place/PR can actually make it easier to use agents to understand and review large changes. But we are in a transition phase now, and it takes time to adapt to new workflows, so I still understand the need to split things up when reviewers request it. I will try splitting this one up further.
I agree that it is good to have a single PR for the full picture, as long as it is split into several commits. If we are going to do that, we may want to flip the merge guidance from "squash and merge" to "merge a chain of commits". You still want smaller pieces (e.g., several commits) when you want people to review the code.

The other benefit of having several PRs is that you can easily revert a problematic one if it breaks something. You can argue that the breakage would be caught in a single big PR, but that's not guaranteed, because some failures only happen on post-submit or in downstream projects. It is much easier to manage the state if there are several commits.

AI is a good tool, and it may change our mental model. I'm actually happy that we are growing with these tools. I was going to offer to review a single PR or part of it, but I was in a meeting. I won't insist here, as it does not bother me much. But I'd like to share the pros of breaking up PRs/commits above.
Force-pushed from 53f0f34 to b57354f.
Padding + tiling implementation PR is here: #23950
Add optional padding fields to the im2col op:
- input_pad_low/high: per-input-dimension padding amounts
- output_pad_low/high: per-output-dimension padding amounts
- pad_value: the value for out-of-bounds positions

This includes:
- Op definition with custom printer/parser for optional padding
- Padding accessors, setters, and verifier checks
- Output padding propagation in the tiling interface
- Decomposition bail-out for padded im2col ops (full padding support via direct vectorization comes in a follow-up)
- Roundtrip, verifier, and tiling tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
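For illustration, a padded im2col might print along these lines. This is a sketch under assumptions: the exact spelling of the new fields comes from the custom printer/parser in this PR, so the padding syntax below is hypothetical; the existing `iree_linalg_ext.im2col` fields (strides, dilations, kernel_size, offsets, positions) follow the op's current form, and shapes are made up for a 3x3 NHWC convolution.

```mlir
%cst = arith.constant 0.0 : f16
// Hypothetical printed form: input_pad_* has one entry per input dim
// (NHWC), output_pad_* one entry per output dim (B, M, K). Padding the
// 32x32 spatial dims by 1 on each side gives 34x34 windows, so a 3x3
// stride-1 kernel yields M = 32*32 = 1024 and K = 3*3*640 = 5760.
%0 = iree_linalg_ext.im2col
       strides = [1, 1] dilations = [1, 1] kernel_size = [3, 3]
       m_offset = [0] * [1] k_offset = [0] * [1]
       batch_pos = [0] m_pos = [1, 2] k_pos = [3]
       input_pad_low = [0, 1, 1, 0] input_pad_high = [0, 1, 1, 0]
       output_pad_low = [0, 0, 0] output_pad_high = [0, 0, 0]
       pad_value(%cst : f16)
       ins(%input : tensor<2x32x32x640xf16>)
       outs(%out : tensor<2x1024x5760xf16>) -> tensor<2x1024x5760xf16>
```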
Add direct vectorization for im2col ops via VectorizableOpInterface, and complete the padding support introduced in the previous commit.

This includes:
- Im2colUtils (computeIm2colSourceIndices, computeIm2colValidSize, chooseDimToVectorize) shared between vectorization and decomposition
- Im2colOpVectorizationModel implementing VectorizableOpInterface
- Full decomposition refactor: 1D slices, tensor.pad for the padded path, read offset clamping to [0, dimSize-1]
- Canonicalization patterns: FoldInputPadIntoIm2col, FoldOutputPadIntoIm2col
- Pass reordering: DecomposeIm2col after GenericVectorization (vectorize first, decompose as fallback)
- subOfrs utility
- Tests for vectorization, decomposition, and canonicalization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Force-pushed from b57354f to 2ec6ebc.

Implements direct vectorization for im2col, and completes the support for padding on the im2col op. The padding attributes are used to compute the read mask when vectorizing. In the old path, we would have separate padding on the input and on the result of the im2col, and we would try to compose those pads into a single masked read. This is fragile and difficult for cases where the im2col result dims don't map well to the input dims. With this direct vectorization approach, we can compute the mask from the input and result padding simultaneously. This will make flattening the spatial dimensions of convolutions possible.
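As a rough sketch of the shape this takes, not the exact IR the PR emits: each 1D read of the input is guarded by a mask whose valid-lane count is derived from the padding, so out-of-bounds lanes take the pad value. All SSA names here (`%valid`, `%pad_value`, the index operands) are illustrative assumptions.

```mlir
// %valid = number of in-bounds elements for this 1D slice, computed
// from the input/output padding (e.g., clamped against dimSize).
// Masked-off lanes of the transfer_read produce %pad_value.
%mask = vector.create_mask %valid : vector<16xi1>
%v = vector.transfer_read %input[%b, %h, %w, %c], %pad_value, %mask
       {in_bounds = [true]} : tensor<2x32x32x640xf16>, vector<16xf16>
```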
Performance results:
These runs were taken on different commits, but they are functionally the same (just some cleanup differences). I am only able to reproduce 3 of the regressions locally, and most of the improvers (~35 of them with 10-40% speedup) are real. There seems to have been some noise in the runs, but overall there is a good perf improvement.