[LinalgExt] Implement direct vectorization for im2col op (#23855)
Force-pushed from 155af4b to 35a5c1e.
This PR will depend on #23859 landing in order to avoid performance regressions, but the implementation in this PR will stay the same. I will share benchmark results once I have them. My first run failed for some reason.
Force-pushed from 35a5c1e to 12e6f6c.
…uctionsOptimization (#23947)

The pass was bailing out on vector.transfer_read ops with non-identity permutation maps (e.g., 1D reads from a 4D memref). After #23855, we will frequently see 1D reads, which need to be supported here. Ideally, we would do something like what is done in #23859, but that approach is causing performance regressions that are difficult to deal with. For now, this provides a solution for the new mask types we will be seeing.

Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
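A hedged sketch of the kind of read the commit message describes (shapes and SSA names are illustrative, not taken from the PR): a 1D `vector.transfer_read` from a 4D memref carries a permutation map that projects away the three outer dimensions, so it is not the identity map.

```mlir
// 1D read from a 4D memref. The map (d0, d1, d2, d3) -> (d3) reads only
// along the innermost dimension, which is what the pass previously
// bailed out on.
%v = vector.transfer_read %src[%i, %j, %k, %l], %pad
       {permutation_map = affine_map<(d0, d1, d2, d3) -> (d3)>,
        in_bounds = [true]}
       : memref<2x4x8x16xf32>, vector<16xf32>
```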
Force-pushed from 1ab1043 to 53f0f34.
hanhanW left a comment:
This is a very large PR. I wonder if we should break it into at least two PRs:
- padding semantics
- vectorization
I'm not sure I follow exactly what you mean. Is your point that the op-specific tests (like map_store, transfer_gather, etc.) should be moved into masked_configured/masked_inferred/unmasked?
Each file tests a different configuration. E.g., the _unmasked tests are mainly for cases where input_vector_size is empty, masked_configured is mainly for cases that have an input_vector_size derived from configurations, etc. This one belongs in _unmasked IMO.
I see, thanks for the context. I'll move it. I think map_store should also be moved there, which I can do in another PR.
This is already split up into as small of a chunk as seems reasonable to me. This is part of a larger refactor to the IGEMM path in order to enable additional flattening, and each piece needs to land without performance regressions. Going through the process of splitting out parts of the implementation, re-reviewing the split-up changes, and testing/debugging on our full suite of convolutions is time consuming, and I've already spent a great deal of time on that to get to the split that I have so far. The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do. One option would be to just add the fields to the im2col op but not use them, although I think that doesn't provide much benefit in terms of review (the padding PR would be very small, and it provides useful context when looking at it together with this PR anyway), and will more likely just slow down the process.

I agree that we should try to keep well-scoped PRs to help make review easier, but sometimes splitting up implementation can just slow down the process, which I believe is the case here. I feel that I've already spent too much time spinning on debugging/fixing random performance regressions that happen at intermediate stages of the split but not in the final state, and I don't want to keep wasting time on things like that.

If folks really need this PR to be split up further, then I can do it, but IMO it isn't really worth it.
My "skim-through read" is that there are changes in both tiling and vectorization. The vectorization is new, and the tiling implementation already exists today. So my intuition is that you can add padding semantics with the tiling changes, and then we can add vectorization support later on. I thought that you could always isolate the vectorization changes from this PR, and the rest could go in a single PR. Am I missing something?
> The issue is that the vectorization requires the padding semantics to be meaningful, and adding padding semantics without the vectorization and making that work with purely the decomposition without causing any performance regressions will take more time to do.
I don't follow the performance regression issue. You add padding semantics to the op, but it does not change the input, and the folding does not exist yet? In that case, adding the semantics (with the existing interface implementation) does not hurt performance, right?
Let me try to split it up. I think that should be fine, since it will effectively be a non-functional change. My concern is about adding the im2col(pad) canonicalizers, which would require me to actually support decomposition with padding in the intermediate state, which would probably just cause performance regressions. I can leave decomposition with padding unimplemented in the intermediate state, though.

On a more meta side note: I think my view on PR reviews has changed somewhat with AI-assisted coding and review, so that's probably contributing to my feeling here. Having the changes together makes it easier to navigate the full picture, and AI helps us filter through large changes more easily. It makes it easier to review large PRs, and we typically do end up implementing large, unsplit prototypes before splitting them up into reviewable pieces. My opinion is increasingly that the splitting-up part of the process matters less, because having all the context in one place/PR can actually make it easier to use agents to understand and review large changes. But we are in a transition phase now, and it takes time to adapt to new workflows, so I still understand the need to split things up when reviewers request it. I will try splitting this one up further.
I agree that it is good to have a single PR for the full picture, as long as it is split into several commits. If we are going to do that, we may want to flip the merge guidance from "squash and merge" to "merge a chain of commits". You still want smaller pieces (e.g., several commits) when you want people to review the code.

The other benefit of having several PRs is that you can easily revert a problematic one if it breaks something. You can argue that the breakage would be caught in a single big PR, but that's not guaranteed, because some failures only happen on post-submit or in downstream projects. It is much easier to manage the state if there are several commits.

AI is a good tool, and it may change our mental model. I'm actually happy that we are growing with these tools. I was going to offer to review a single PR or part of it, but I was in a meeting. I won't insist here, as it does not bother me much. But I'd like to share the pros of breaking up PRs/commits above.
Force-pushed from 53f0f34 to b57354f.
Padding + tiling implementation PR is here: #23950
Add optional padding fields to the im2col op:
- input_pad_low/high: per-input-dimension padding amounts
- output_pad_low/high: per-output-dimension padding amounts
- pad_value: the value for out-of-bounds positions

This includes:
- Op definition with custom printer/parser for optional padding
- Padding accessors, setters, and verifier checks
- Output padding propagation in the tiling interface
- Decomposition bail-out for padded im2col ops (full padding support via direct vectorization comes in a follow-up)
- Roundtrip, verifier, and tiling tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
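For illustration, a padded im2col might print along these lines. This is a sketch under assumptions: the exact spelling of the new fields comes from the custom printer/parser in this PR, so the padding syntax below is hypothetical; the existing `iree_linalg_ext.im2col` fields (strides, dilations, kernel_size, offsets, positions) follow the op's current form, and shapes are made up for a 3x3 NHWC convolution.

```mlir
%cst = arith.constant 0.0 : f16
// Hypothetical printed form: input_pad_* has one entry per input dim
// (NHWC), output_pad_* one entry per output dim (B, M, K). Padding the
// 32x32 spatial dims by 1 on each side gives 34x34 windows, so a 3x3
// stride-1 kernel yields M = 32*32 = 1024 and K = 3*3*640 = 5760.
%0 = iree_linalg_ext.im2col
       strides = [1, 1] dilations = [1, 1] kernel_size = [3, 3]
       m_offset = [0] * [1] k_offset = [0] * [1]
       batch_pos = [0] m_pos = [1, 2] k_pos = [3]
       input_pad_low = [0, 1, 1, 0] input_pad_high = [0, 1, 1, 0]
       output_pad_low = [0, 0, 0] output_pad_high = [0, 0, 0]
       pad_value(%cst : f16)
       ins(%input : tensor<2x32x32x640xf16>)
       outs(%out : tensor<2x1024x5760xf16>) -> tensor<2x1024x5760xf16>
```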
Add direct vectorization for im2col ops via VectorizableOpInterface, and complete the padding support introduced in the previous commit.

This includes:
- Im2colUtils (computeIm2colSourceIndices, computeIm2colValidSize, chooseDimToVectorize) shared between vectorization and decomposition
- Im2colOpVectorizationModel implementing VectorizableOpInterface
- Full decomposition refactor: 1D slices, tensor.pad for the padded path, read offset clamping to [0, dimSize-1]
- Canonicalization patterns: FoldInputPadIntoIm2col, FoldOutputPadIntoIm2col
- Pass reordering: DecomposeIm2col after GenericVectorization (vectorize first, decompose as fallback)
- subOfrs utility
- Tests for vectorization, decomposition, and canonicalization

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Max Dawkins <max.dawkins@gmail.com>
Force-pushed from b57354f to 2ec6ebc.

Implements direct vectorization for im2col, and completes the support for padding on the im2col op. The padding attributes are used to compute the read mask when vectorizing. In the old path, we would have separate padding on the input and on the result of the im2col, and we would try to compose those pads into a single masked read. This is fragile and difficult for cases where the im2col result dims don't map well to the input dims. With this direct vectorization approach, we can compute the mask from the input and result padding simultaneously. This will make flattening the spatial dimensions of convolutions possible.
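As a rough sketch of the shape this takes, not the exact IR the PR emits: each 1D read of the input is guarded by a mask whose valid-lane count is derived from the padding, so out-of-bounds lanes take the pad value. All SSA names here (`%valid`, `%pad_value`, the index operands) are illustrative assumptions.

```mlir
// %valid = number of in-bounds elements for this 1D slice, computed
// from the input/output padding (e.g., clamped against dimSize).
// Masked-off lanes of the transfer_read produce %pad_value.
%mask = vector.create_mask %valid : vector<16xi1>
%v = vector.transfer_read %input[%b, %h, %w, %c], %pad_value, %mask
       {in_bounds = [true]} : tensor<2x32x32x640xf16>, vector<16xf16>
```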
Performance results:
These runs were taken on different commits, but they are functionally the same (just some cleanup differences). I am only able to reproduce 3 of the regressions locally, and most of the improvers (~35 of them with 10-40% speedup) are real. There seems to have been some noise in the runs, but overall there is a good perf improvement.