[Codegen] Masked vectorization of inner_tiled #23935
sommerlukas wants to merge 5 commits into iree-org:main
Conversation
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
```cpp
// Pack/unpack: look up tile sizes in the unpacked domain and transform
// to the vectorization domain via mapPackSourceToDest.
// For pack, the source lattice holds unpacked tile sizes.
// For unpack, the result lattice holds unpacked tile sizes.
```
I think this needs a bit more context. What is the iteration space of pack/unpack?
I've simplified the code and clarified the comment.
```cpp
rewriter, loc, operand, readShape, padValue,
/*useInBoundsInsteadOfMasking=*/!needsMasking);
```
Where is the padValue coming from, btw? Using a padValue of zero is not always correct. If needed, we should plumb the padding value through the intrinsic first; otherwise we will special-case for matmul (my understanding is that inner_tiled is more general).
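To make the reviewer's concern concrete: a zero padding value is only safe when the combining operation treats zero as its identity. The sketch below is illustrative Python, not code from the PR; `masked_reduce` is a hypothetical helper modeling a masked read feeding a reduction.

```python
# Illustrative sketch: why a padValue of zero is not always correct.
# Padding masked-off lanes with 0 is safe for additive reductions
# (0 is the identity of +), but corrupts e.g. a max-reduction over
# negative values, whose identity would be -inf.

def masked_reduce(values, mask, combine, pad_value, init):
    """Reduce `values`, replacing masked-off lanes with `pad_value`."""
    acc = init
    for v, m in zip(values, mask):
        acc = combine(acc, v if m else pad_value)
    return acc

values = [-3.0, -1.0, -7.0, 99.0]   # last lane is out-of-bounds garbage
mask = [True, True, True, False]

# Sum: zero padding matches an exact reduction over the valid lanes.
assert masked_reduce(values, mask, lambda a, b: a + b, 0.0, 0.0) == -11.0

# Max: zero padding is wrong (yields 0.0 instead of -1.0) ...
assert masked_reduce(values, mask, max, 0.0, float("-inf")) == 0.0
# ... while padding with the identity of max gives the right answer.
assert masked_reduce(values, mask, max, float("-inf"), float("-inf")) == -1.0
```

This is why a zero padding value effectively special-cases additive contractions like matmul.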
There is a long-standing comment about this where padValue is constructed:
iree/compiler/src/iree/compiler/Codegen/Interfaces/VectorizableOpInterface.cpp
Lines 1074 to 1077 in ffcbc3b
I agree with you (and the comment) that this isn't ideal, but I considered changing it out of scope for this PR. Happy to address it in a follow-up, though.
I think it's a bit unfortunate that we aren't using linalg.pack to seed tile sizes as well. Are we still taking the max of possible tile sizes in the analysis? Ideally, linalg.pack/linalg.unpack also give the information that you should use some specific tile sizes, and the padding can be done just via selects.
No, we aren't taking the maximum of multiple values; we're only tracking a single tile size value per dimension and go to an overdefined/top state if different tile sizes meet/join. I had considered seeding the vector tile size analysis from:

```mlir
%pack_7 = linalg.pack %35 padding_value(%cst : f16) inner_dims_pos = [2, 1] inner_tiles = [32, 16] into %38 :
    tensor<1x?x64xf16> -> tensor<1x?x2x32x16xf16>
```

While the …. In this case, seeding the analysis from ….

The vectorization treats those operations individually, so we have to insert masked reads/writes:

```mlir
// Vectorization of linalg.pack
%64 = vector.create_mask %c1, %35, %c64 : vector<1x64x64xi1>
%65 = vector.transfer_read %57[%c0, %c0, %c0], %cst_4, %64 ...
%66 = vector.shape_cast %65 : vector<1x64x64xf16> to vector<1x4x16x2x32xf16>
%67 = vector.transpose %66, [0, 1, 3, 4, 2] : vector<1x4x16x2x32xf16> to vector<1x4x2x32x16xf16>
%68 = vector.create_mask %c1, %62, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%69 = vector.transfer_write %67, %63[%c0, %c0, %c0, %c0, %c0], %68 ...

// Vectorization of inner_tiled
%78 = vector.create_mask %c1, %62, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%79 = vector.transfer_read %69[%c0, %c0, %c0, %c0, %c0], %cst_4, %78 ...
...
%82 = iree_codegen.inner_tiled ins(%25, %79) outs(%81)
```

The …:

```mlir
%cst_3 = arith.constant dense<0.000000e+00> : vector<1x4x2x32x16xf16>
%cst_4 = arith.constant dense<0.000000e+00> : vector<1x64x64xf16>
...
// linalg.pack
%37 = arith.select %28, %34, %cst_4 : vector<1x64x64xi1>, vector<1x64x64xf16>
%38 = vector.shape_cast %37 : vector<1x64x64xf16> to vector<1x4x16x2x32xf16>
%39 = vector.transpose %38, [0, 1, 3, 4, 2] : vector<1x4x16x2x32xf16> to vector<1x4x2x32x16xf16>
// inner_tiled
%40 = vector.create_mask %c1, %36, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%45 = arith.select %40, %39, %cst_3 : vector<1x4x2x32x16xi1>, vector<1x4x2x32x16xf16>
%47 = iree_codegen.inner_tiled ins(%19, %45) outs(%46) ...
```

If we wanted to vectorize ….
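The arith.select pattern in the IR above can be modeled lane-wise: each lane keeps the loaded value where the mask is set and takes the splat padding constant elsewhere. A minimal Python model of that semantics on a 1-D vector (illustrative only; `create_mask` and `select` here are toy stand-ins for the MLIR ops):

```python
# Toy model of vector.create_mask + arith.select on a 1-D vector.

def create_mask(bound, length):
    """Lane i is active iff i < bound, like vector.create_mask."""
    return [i < bound for i in range(length)]

def select(mask, vec, pad_splat):
    """Lane-wise select, like arith.select with a splat padding constant."""
    return [v if m else pad_splat for m, v in zip(mask, vec)]

vec = [1.0, 2.0, 3.0, 4.0]
mask = create_mask(2, 4)          # only the first two lanes are valid
assert mask == [True, True, False, False]
assert select(mask, vec, 0.0) == [1.0, 2.0, 0.0, 0.0]
```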
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
hanhanW left a comment
It looks like two PRs? One for the inner_tiled vectorization improvement, and the other for pack/unpack vector size propagation? (I haven't reviewed the latter part yet.)
```cpp
// If vector sizes are provided (from tile size analysis or config),
// dynamic outer shapes are fine — they'll be masked during vectorization.
if (!vectorSizes.empty()) {
  return true;
}
```
Similar to the upstream method, you should check that the vector sizes are greater than the dim sizes if they are static.
I'm now using the upstream isValidMaskedInputVector to check that.
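The shape/vector-size compatibility rule under discussion can be summarized as: every static dimension must fit into its vector size, while dynamic dimensions are always maskable. The following is a simplified Python model of such a check, not the upstream isValidMaskedInputVector implementation; `DYNAMIC` here is an illustrative stand-in for MLIR's dynamic-dimension marker.

```python
# Simplified model of an isValidMaskedInputVector-style check:
# every static dimension must fit into its vector size; dynamic
# dimensions (modeled as None) can always be masked.

DYNAMIC = None

def is_valid_masked_input(shape, vector_sizes):
    if len(shape) != len(vector_sizes):
        return False
    return all(dim is DYNAMIC or dim <= vs
               for dim, vs in zip(shape, vector_sizes))

assert is_valid_masked_input([3, 5], [4, 8])         # 3x5 masked as 4x8: ok
assert is_valid_masked_input([DYNAMIC, 64], [4, 64]) # dynamic dim: ok
assert not is_valid_masked_input([9, 5], [4, 8])     # 9 > 4: cannot mask down
```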
```cpp
// Determine whether we need masking: vectorSizes present and any operand
// has dynamic outer dimensions.
bool needsMasking =
    !vectorSizes.empty() && llvm::any_of(argTypes, [](ShapedType st) {
      return !st.hasStaticShape();
    });
```
I thought needsMasking should just check whether vectorSizes is empty or not. A static shape should still be maskable: think of a 3x5 tensor that you may want to mask with 4x8. The vectorSizes control the behavior, IMO.
IMO, you should construct the vectorSizes when they are empty. Then the read/write ops get their shape from vectorSizes. This is how it's done upstream when we worked on vectorization. E.g., …
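The reviewer's 3x5-in-4x8 example works out as follows: the mask marks which lanes of the 4x8 vector correspond to real tensor elements, and the out-of-bounds lanes are padded. A toy 2-D model (illustrative Python, mirroring what a 2-D vector.create_mask produces):

```python
# Illustrative 2-D mask for reading a 3x5 tensor as a 4x8 vector:
# lane (i, j) is active iff i < 3 and j < 5.

def create_mask_2d(bounds, sizes):
    b0, b1 = bounds
    s0, s1 = sizes
    return [[i < b0 and j < b1 for j in range(s1)] for i in range(s0)]

mask = create_mask_2d((3, 5), (4, 8))
assert sum(sum(row) for row in mask) == 15   # 3 * 5 active lanes
assert mask[2][4]                            # last in-bounds element
assert not mask[3][0]                        # padded row
assert not mask[0][5]                        # padded column
```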
Thanks, I've made the code similar to upstream, so that vector tile sizes greater than the static shapes also trigger masking.
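The revised condition described in this thread — mask when vector sizes are given and the shape either has a dynamic dimension or a static dimension smaller than its vector size — can be sketched like this (illustrative Python; `needs_masking` and `DYNAMIC` are hypothetical names, not the PR's C++ code):

```python
# Sketch of the revised masking condition: masking is needed when
# vector sizes are given and the shape either has a dynamic dimension
# or a static dimension smaller than its vector size.

DYNAMIC = None

def needs_masking(shape, vector_sizes):
    if not vector_sizes:
        return False
    return any(dim is DYNAMIC or dim < vs
               for dim, vs in zip(shape, vector_sizes))

assert needs_masking([DYNAMIC, 64], [4, 64])   # dynamic outer dim
assert needs_masking([3, 5], [4, 8])           # static but smaller: still mask
assert not needs_masking([4, 8], [4, 8])       # exact fit: no mask needed
assert not needs_masking([4, 8], [])           # no vector sizes provided
```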
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
The two changes belong together. The …
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
So far, vectorization of `inner_tiled` did not support dynamic tensor shapes. This PR enables vectorization of `inner_tiled` with dynamic tensor shapes in the outer dimensions through masking, for the case that vector sizes are provided.

The two main changes in this PR are:

- Add support for the `inner_tiled`, `linalg.pack` and `linalg.unpack` operations to the vector tile size analysis. The vector tile sizes computed by this analysis can serve as vector sizes for the vectorization.
- Vectorize `inner_tiled` with dynamic tensor sizes, assuming the provided vector sizes and using masked reads/writes to mask out OOB values.

This is part of #23415.
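The tile size analysis mentioned here tracks a single tile size per dimension and goes to an overdefined/top state when conflicting sizes join, as described in the review discussion. A toy model of that per-dimension lattice join (illustrative Python; names and representation are assumptions, not the PR's implementation):

```python
# Toy per-dimension tile-size lattice: UNINITIALIZED (bottom) joins
# with anything; equal concrete tile sizes join to themselves; two
# different concrete tile sizes join to OVERDEFINED (top).

UNINITIALIZED = "bottom"
OVERDEFINED = "top"

def join(a, b):
    if a == UNINITIALIZED:
        return b
    if b == UNINITIALIZED:
        return a
    if a == b:
        return a
    return OVERDEFINED

assert join(UNINITIALIZED, 32) == 32
assert join(32, 32) == 32
assert join(32, 16) == OVERDEFINED          # conflicting tile sizes
assert join(OVERDEFINED, 32) == OVERDEFINED # top absorbs everything
```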
Assisted-by: Claude Code