[Codegen] Masked vectorization of inner_tiled #23935

Open
sommerlukas wants to merge 5 commits into iree-org:main from sommerlukas:inner-tiled-masked-vectorization

Conversation

@sommerlukas
Contributor

Previously, vectorization of inner_tiled did not support dynamic tensor shapes. This PR enables vectorization of inner_tiled with dynamic tensor shapes in the outer dimensions through masking, for the case where vector sizes are provided.

The two main changes in this PR are:

  • Add support for the inner_tiled, linalg.pack, and linalg.unpack operations to the vector tile size analysis. The vector tile sizes computed by this analysis can serve as vector sizes for vectorization.
  • Vectorize inner_tiled with dynamic tensor sizes, assuming the provided vector sizes and using masked reads/writes to mask out out-of-bounds (OOB) values.
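As a sketch of the masking idea (hypothetical Python, not the actual implementation): a lane of the fixed-size vector is in-bounds iff its index is below the runtime extent in every dimension, mirroring the semantics of vector.create_mask.

```python
# Hypothetical sketch of vector.create_mask semantics: a lane is
# in-bounds iff its index is below the runtime size in every dimension.
# Names and shapes are illustrative, not taken from the implementation.
from itertools import product

def create_mask(vector_sizes, runtime_sizes):
    """Return {index_tuple: bool} over the static vector shape."""
    mask = {}
    for idx in product(*(range(s) for s in vector_sizes)):
        mask[idx] = all(i < d for i, d in zip(idx, runtime_sizes))
    return mask

# A 1x64x64 vector masked to runtime extents 1x40x64: lanes beyond the
# dynamic middle dimension are masked out and read as the pad value.
mask = create_mask((1, 64, 64), (1, 40, 64))
```

A masked transfer_read then substitutes the padding value for every lane where the mask is false, so OOB lanes never observe garbage memory.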

This is part of #23415.

Assisted-by: Claude Code

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Comment on lines +669 to +672
// Pack/unpack: look up tile sizes in the unpacked domain and transform
// to the vectorization domain via mapPackSourceToDest.
// For pack, the source lattice holds unpacked tile sizes.
// For unpack, the result lattice holds unpacked tile sizes.
Contributor

I think this needs a bit more context. What is the iteration space of pack/unpack?

Contributor Author

I've simplified the code and clarified the comment.

Comment on lines +1123 to +1124
rewriter, loc, operand, readShape, padValue,
/*useInBoundsInsteadOfMasking=*/!needsMasking);
Contributor

Where is the padValue coming from btw? Using a padValue of zero is not always correct. If needed, we should plumb padding value through the intrinsic first, otherwise we will special case for matmul (my understanding is inner_tiled is more general).

Contributor Author

There is a long-standing comment about this where padValue is constructed:

// Construct the zero padding value for each operand. Ideally, we'd need the
// InnerTile interface to return the padding value to use. If it is not
// provided, ub::Poison is a better choice. Zero was chosen because the op
// was designed for matmul, and zero padding is the most common case.

I agree with you (and the comment) that this isn't ideal, but I considered changing it out of scope for this PR. Happy to address it in a follow-up, though.

@Groverkss
Contributor

I think it's a bit unfortunate that we aren't using linalg.pack to seed tile sizes as well. Are we still taking the max of possible tile sizes in the analysis? Ideally, linalg.pack/linalg.unpack also give information that you should use some specific tile sizes and the padding can be done just via selects.

@sommerlukas
Contributor Author

I think it's a bit unfortunate that we aren't using linalg.pack to seed tile sizes as well. Are we still taking the max of possible tile sizes in the analysis?

No, we aren't taking the maximum of multiple values; we only track a single tile size value per dimension, going to an overdefined/top state when different tile sizes meet/join.
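The meet/join behavior described above can be sketched as a standard flat lattice (hypothetical Python; the names are illustrative and not the analysis's actual API):

```python
# Flat lattice per dimension: UNINITIALIZED (bottom) -> concrete tile
# size -> OVERDEFINED (top). Joining two different concrete sizes goes
# straight to top; there is no "take the max" behavior.
BOTTOM = "uninitialized"
TOP = "overdefined"

def join(a, b):
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    if a == b:
        return a
    return TOP  # conflicting tile sizes: no single value survives

# Seeding from linalg.pack's inner_tiles (e.g. 16) would conflict with
# the exact size from to_layout (e.g. 64) and destroy the information.
assert join(16, 16) == 16
assert join(BOTTOM, 64) == 64
assert join(16, 64) == TOP
```

This is why seeding from linalg.pack's inner_tiles would be harmful: any disagreement with the exact to_layout sizes collapses the lattice to top.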

I had considered seeding the vector tile size analysis from linalg.pack, but they don't really provide much additional information. In our masked/dynamic case, they look something like this:

%pack_7 = linalg.pack %35 padding_value(%cst : f16) inner_dims_pos = [2, 1] inner_tiles = [32, 16] into %38 : 
  tensor<1x?x64xf16> -> tensor<1x?x2x32x16xf16>

While the inner_tiles tell us that the tensor is at least 16x32, they don't tell us the overall size. On the other hand, the to_layout operation on the operand tells us the exact size of the overall tensor (1x4x2x16x32 in this case).

In this case, seeding the analysis from linalg.pack would be actively harmful. If we backpropagated the inner_tiles sizes 16x32 from the linalg.pack to its operands, they would meet with the 64x64 sizes from the to_layout, leading to the overdefined state, and we would lose the tile size altogether.

Ideally, linalg.pack/linalg.unpack also give information that you should use some specific tile sizes and the padding can be done just via selects.

The vectorization treats those operations individually, so we have to insert masked transfer_read/write for each of the operations.

// Vectorization of linalg.pack
%64 = vector.create_mask %c1, %35, %c64 : vector<1x64x64xi1>
%65 = vector.transfer_read %57[%c0, %c0, %c0], %cst_4, %64 ...
%66 = vector.shape_cast %65 : vector<1x64x64xf16> to vector<1x4x16x2x32xf16>
%67 = vector.transpose %66, [0, 1, 3, 4, 2] : vector<1x4x16x2x32xf16> to vector<1x4x2x32x16xf16>
%68 = vector.create_mask %c1, %62, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%69 = vector.transfer_write %67, %63[%c0, %c0, %c0, %c0, %c0], %68 ...
// Vectorization of inner_tiled
%78 = vector.create_mask %c1, %62, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%79 = vector.transfer_read %69[%c0, %c0, %c0, %c0, %c0], %cst_4, %78 ...
...
%82 = iree_codegen.inner_tiled ins(%25, %79) outs(%81)

The OptimizeTensorInsertExtractSlicesPass, however, has patterns to eliminate the transfer_read -> transfer_write pairs and to materialize the masking as arith.select. After running that pass, we end up with this IR:

%cst_3 = arith.constant dense<0.000000e+00> : vector<1x4x2x32x16xf16>
%cst_4 = arith.constant dense<0.000000e+00> : vector<1x64x64xf16>
...
// linalg.pack
%37 = arith.select %28, %34, %cst_4 : vector<1x64x64xi1>, vector<1x64x64xf16>
%38 = vector.shape_cast %37 : vector<1x64x64xf16> to vector<1x4x16x2x32xf16>
%39 = vector.transpose %38, [0, 1, 3, 4, 2] : vector<1x4x16x2x32xf16> to vector<1x4x2x32x16xf16>
// inner_tiled
%40 = vector.create_mask %c1, %36, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%45 = arith.select %40, %39, %cst_3 : vector<1x4x2x32x16xi1>, vector<1x4x2x32x16xf16>
%47 = iree_codegen.inner_tiled ins(%19, %45) outs(%46) ...
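The arith.select materialization above amounts to an elementwise blend between the loaded value and the padding value (a hypothetical Python sketch of the per-lane semantics, not the actual pass):

```python
# Per-lane semantics of the masked select: keep the loaded value where
# the mask is set, otherwise substitute the padding value (zero here,
# matching the %cst splat constants in the IR above).
def masked_select(mask, values, pad=0.0):
    return [v if m else pad for m, v in zip(mask, values)]

assert masked_select([True, True, False], [1.0, 2.0, 3.0]) == [1.0, 2.0, 0.0]
```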

If we wanted to vectorize inner_tiled together with the packing/unpacking, we would need to change the pattern to handle linalg.pack on its operands (and linalg.unpack on the result) or switch GenericVectorization to be a dialect conversion, where we could use the remapped operands provided by the ConversionPattern.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas sommerlukas requested a review from Groverkss March 27, 2026 08:06
Contributor

@hanhanW left a comment

It looks like two PRs? One for inner_tiled vectorization improvement and the other is pack/unpack vector size propagation? (I haven't reviewed the latter part yet)

Comment on lines +1058 to +1062
// If vector sizes are provided (from tile size analysis or config),
// dynamic outer shapes are fine — they'll be masked during vectorization.
if (!vectorSizes.empty()) {
return true;
}
Contributor

Similar to the upstream method, you should check that the vector sizes are greater than the dim sizes if they are static.

Contributor Author

I'm now using the upstream isValidMaskedInputVector to check that.

Comment on lines +1081 to +1086
// Determine whether we need masking: vectorSizes present and any operand
// has dynamic outer dimensions.
bool needsMasking =
!vectorSizes.empty() && llvm::any_of(argTypes, [](ShapedType st) {
return !st.hasStaticShape();
});
Contributor

I thought needsMasking should just check whether vectorSizes is empty or not. A static shape should be maskable too: say you have a 3x5 tensor and you want to mask it with 4x8. The vectorSizes control the behavior, IMO.

IMO, you should construct the vectorSizes when they are empty; then the read/write ops get their shape from vectorSizes. This is how it's done upstream, e.g.:

https://github.com/llvm/llvm-project/blob/00aebbff71ff4e348538708064ba2e033ccd6b2a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp#L1883-L1893
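The decision the reviewer describes can be sketched as follows (hypothetical Python; the real check lives in C++ and, per the follow-up, reuses upstream helpers): masking is driven by the provided vector sizes, and a static dimension still needs a mask whenever the chosen vector size exceeds it.

```python
DYNAMIC = None  # stand-in for a dimension unknown at compile time

def needs_masking(vector_sizes, dim_sizes):
    """A dimension needs masking if it is dynamic, or if the chosen
    vector size is larger than its static size."""
    return any(d is DYNAMIC or v > d
               for v, d in zip(vector_sizes, dim_sizes))

# The reviewer's example: a static 3x5 tensor masked with 4x8 vectors.
assert needs_masking([4, 8], [3, 5])
# Exact static match: no masking needed.
assert not needs_masking([4, 8], [4, 8])
# A dynamic outer dim always requires masking.
assert needs_masking([4, 8], [DYNAMIC, 8])
```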

Contributor Author

Thanks, I've made the code similar to upstream, so that vector tile sizes greater than the static shapes also trigger masking.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas
Contributor Author

It looks like two PRs? One for inner_tiled vectorization improvement and the other is pack/unpack vector size propagation? (I haven't reviewed the latter part yet)

The two changes belong together. The to_layout that provides the seed for the tile size analysis is applied to the operands/result of pack/unpack, so we need the vector tile size propagation for pack/unpack to determine the vector tile sizes used for vectorization of inner_tiled.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas sommerlukas requested a review from hanhanW March 30, 2026 16:37