[Codegen] Masked vectorization of inner_tiled #23935

Open
sommerlukas wants to merge 5 commits into iree-org:main from sommerlukas:inner-tiled-masked-vectorization

Conversation

@sommerlukas
Contributor

Previously, vectorization of inner_tiled did not support dynamic tensor shapes. This PR enables vectorization of inner_tiled with dynamic tensor shapes in the outer dimensions through masking, for the case where vector sizes are provided.

The two main changes in this PR are:

  • Add support for the inner_tiled, linalg.pack, and linalg.unpack operations to the vector tile size analysis. The vector tile sizes computed by this analysis can serve as vector sizes for vectorization.
  • Vectorize inner_tiled with dynamic tensor sizes, assuming the provided vector sizes and using masked reads/writes to mask out out-of-bounds (OOB) values.
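As a sketch of the masking idea (hypothetical Python, not the actual implementation): a lane of the fixed-size vector is in-bounds iff its index is below the runtime extent in every dimension, mirroring the semantics of vector.create_mask.

```python
# Hypothetical sketch of vector.create_mask semantics: a lane is
# in-bounds iff its index is below the runtime size in every dimension.
# Names and shapes are illustrative, not taken from the implementation.
from itertools import product

def create_mask(vector_sizes, runtime_sizes):
    """Return {index_tuple: bool} over the static vector shape."""
    mask = {}
    for idx in product(*(range(s) for s in vector_sizes)):
        mask[idx] = all(i < d for i, d in zip(idx, runtime_sizes))
    return mask

# A 1x64x64 vector masked to runtime extents 1x40x64: lanes beyond the
# dynamic middle dimension are masked out and read as the pad value.
mask = create_mask((1, 64, 64), (1, 40, 64))
```

A masked transfer_read then substitutes the padding value for every lane where the mask is false, so OOB lanes never observe garbage memory.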

This is part of #23415.

Assisted-by: Claude Code

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
Comment on lines +669 to +672
// Pack/unpack: look up tile sizes in the unpacked domain and transform
// to the vectorization domain via mapPackSourceToDest.
// For pack, the source lattice holds unpacked tile sizes.
// For unpack, the result lattice holds unpacked tile sizes.
Contributor

I think this needs a bit more context. What is the iteration space of pack/unpack?

Contributor Author

I've simplified the code and clarified the comment.

Comment on lines +1123 to +1124
rewriter, loc, operand, readShape, padValue,
/*useInBoundsInsteadOfMasking=*/!needsMasking);
Contributor

Where is the padValue coming from btw? Using a padValue of zero is not always correct. If needed, we should plumb padding value through the intrinsic first, otherwise we will special case for matmul (my understanding is inner_tiled is more general).

Contributor Author

There is a long-standing comment about this where padValue is constructed:

// Construct the zero padding value for each operand. Ideally, we'd need the
// InnerTile interface to return the padding value to use. If it is not
// provided, ub::Poison is a better choice. Zero was chosen because the op
// was designed for matmul, and zero padding is the most common case.

I agree with you (and the comment) that this isn't ideal, but I considered changing it out of scope for this PR. Happy to address it in a follow-up, though.

@Groverkss
Contributor

I think it's a bit unfortunate that we aren't using linalg.pack to seed tile sizes as well. Are we still taking the max of possible tile sizes in the analysis? Ideally, linalg.pack/linalg.unpack also give information that you should use some specific tile sizes and the padding can be done just via selects.

@sommerlukas
Contributor Author

I think it's a bit unfortunate that we aren't using linalg.pack to seed tile sizes as well. Are we still taking the max of possible tile sizes in the analysis?

No, we aren't taking the maximum of multiple values; we only track a single tile size value per dimension, going to an overdefined/top state when different tile sizes meet/join.
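The meet/join behavior described above can be sketched as a standard flat lattice (hypothetical Python; the names are illustrative and not the analysis's actual API):

```python
# Flat lattice per dimension: UNINITIALIZED (bottom) -> concrete tile
# size -> OVERDEFINED (top). Joining two different concrete sizes goes
# straight to top; there is no "take the max" behavior.
BOTTOM = "uninitialized"
TOP = "overdefined"

def join(a, b):
    if a == BOTTOM:
        return b
    if b == BOTTOM:
        return a
    if a == b:
        return a
    return TOP  # conflicting tile sizes: no single value survives

# Seeding from linalg.pack's inner_tiles (e.g. 16) would conflict with
# the exact size from to_layout (e.g. 64) and destroy the information.
assert join(16, 16) == 16
assert join(BOTTOM, 64) == 64
assert join(16, 64) == TOP
```

This is why seeding from linalg.pack's inner_tiles would be harmful: any disagreement with the exact to_layout sizes collapses the lattice to top.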

I had considered seeding the vector tile size analysis from linalg.pack, but they don't really provide much additional information. In our masked/dynamic case, they look something like this:

%pack_7 = linalg.pack %35 padding_value(%cst : f16) inner_dims_pos = [2, 1] inner_tiles = [32, 16] into %38 : 
  tensor<1x?x64xf16> -> tensor<1x?x2x32x16xf16>

While the inner_tiles tell us that the tensor is at least 16x32, they don't tell us the overall size. On the other hand, the to_layout operation on the operand tells us the exact size of the overall tensor (1x4x2x16x32 in this case).

In this case, seeding the analysis from linalg.pack would be actively harmful. If we backpropagated the inner_tiles sizes 16x32 from the linalg.pack to its operands, they would meet with the 64x64 sizes from the to_layout, leading to the overdefined state, and we would lose the tile size altogether.

Ideally, linalg.pack/linalg.unpack also give information that you should use some specific tile sizes and the padding can be done just via selects.

The vectorization treats those operations individually, so we have to insert masked transfer_read/write for each of the operations.

// Vectorization of linalg.pack
%64 = vector.create_mask %c1, %35, %c64 : vector<1x64x64xi1>
%65 = vector.transfer_read %57[%c0, %c0, %c0], %cst_4, %64 ...
%66 = vector.shape_cast %65 : vector<1x64x64xf16> to vector<1x4x16x2x32xf16>
%67 = vector.transpose %66, [0, 1, 3, 4, 2] : vector<1x4x16x2x32xf16> to vector<1x4x2x32x16xf16>
%68 = vector.create_mask %c1, %62, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%69 = vector.transfer_write %67, %63[%c0, %c0, %c0, %c0, %c0], %68 ...
// Vectorization of inner_tiled
%78 = vector.create_mask %c1, %62, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%79 = vector.transfer_read %69[%c0, %c0, %c0, %c0, %c0], %cst_4, %78 ...
...
%82 = iree_codegen.inner_tiled ins(%25, %79) outs(%81)

The OptimizeTensorInsertExtractSlicesPass, however, has patterns to eliminate the transfer_read -> transfer_write pairs and to materialize the masking as arith.select. After running that pass, we end up with this IR:

%cst_3 = arith.constant dense<0.000000e+00> : vector<1x4x2x32x16xf16>
%cst_4 = arith.constant dense<0.000000e+00> : vector<1x64x64xf16>
...
// linalg.pack
%37 = arith.select %28, %34, %cst_4 : vector<1x64x64xi1>, vector<1x64x64xf16>
%38 = vector.shape_cast %37 : vector<1x64x64xf16> to vector<1x4x16x2x32xf16>
%39 = vector.transpose %38, [0, 1, 3, 4, 2] : vector<1x4x16x2x32xf16> to vector<1x4x2x32x16xf16>
// inner_tiled
%40 = vector.create_mask %c1, %36, %c2, %c32, %c16 : vector<1x4x2x32x16xi1>
%45 = arith.select %40, %39, %cst_3 : vector<1x4x2x32x16xi1>, vector<1x4x2x32x16xf16>
%47 = iree_codegen.inner_tiled ins(%19, %45) outs(%46) ...
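The arith.select materialization above amounts to an elementwise blend between the loaded value and the padding value (a hypothetical Python sketch of the per-lane semantics, not the actual pass):

```python
# Per-lane semantics of the masked select: keep the loaded value where
# the mask is set, otherwise substitute the padding value (zero here,
# matching the %cst splat constants in the IR above).
def masked_select(mask, values, pad=0.0):
    return [v if m else pad for m, v in zip(mask, values)]

assert masked_select([True, True, False], [1.0, 2.0, 3.0]) == [1.0, 2.0, 0.0]
```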

If we wanted to vectorize inner_tiled together with the packing/unpacking, we would need to change the pattern to handle linalg.pack on its operands (and linalg.unpack on the result) or switch GenericVectorization to be a dialect conversion, where we could use the remapped operands provided by the ConversionPattern.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas sommerlukas requested a review from Groverkss March 27, 2026 08:06
Contributor

@hanhanW left a comment

It looks like two PRs? One for inner_tiled vectorization improvement and the other is pack/unpack vector size propagation? (I haven't reviewed the latter part yet)

Comment on lines +1058 to +1062
// If vector sizes are provided (from tile size analysis or config),
// dynamic outer shapes are fine — they'll be masked during vectorization.
if (!vectorSizes.empty()) {
return true;
}
Contributor

Similar to the upstream method, you should check that the vector sizes are greater than the dim sizes if they are static.

Contributor Author

I'm now using the upstream isValidMaskedInputVector to check that.

Comment on lines +1081 to +1086
// Determine whether we need masking: vectorSizes present and any operand
// has dynamic outer dimensions.
bool needsMasking =
!vectorSizes.empty() && llvm::any_of(argTypes, [](ShapedType st) {
return !st.hasStaticShape();
});
Contributor

I thought needsMasking should just check whether vectorSizes is empty or not. A static shape should be maskable too: say you have a 3x5 tensor and you want to mask it with 4x8. The vectorSizes control the behavior, IMO.

IMO, you should construct the vectorSizes when they are empty; then the read/write ops get their shape from vectorSizes. This is how it's done upstream, e.g.:

https://github.com/llvm/llvm-project/blob/00aebbff71ff4e348538708064ba2e033ccd6b2a/mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp#L1883-L1893
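The decision the reviewer describes can be sketched as follows (hypothetical Python; the real check lives in C++ and, per the follow-up, reuses upstream helpers): masking is driven by the provided vector sizes, and a static dimension still needs a mask whenever the chosen vector size exceeds it.

```python
DYNAMIC = None  # stand-in for a dimension unknown at compile time

def needs_masking(vector_sizes, dim_sizes):
    """A dimension needs masking if it is dynamic, or if the chosen
    vector size is larger than its static size."""
    return any(d is DYNAMIC or v > d
               for v, d in zip(vector_sizes, dim_sizes))

# The reviewer's example: a static 3x5 tensor masked with 4x8 vectors.
assert needs_masking([4, 8], [3, 5])
# Exact static match: no masking needed.
assert not needs_masking([4, 8], [4, 8])
# A dynamic outer dim always requires masking.
assert needs_masking([4, 8], [DYNAMIC, 8])
```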

Contributor Author

Thanks, I've made the code similar to upstream, so that vector tile sizes greater than the static shapes also trigger masking.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas
Contributor Author

It looks like two PRs? One for inner_tiled vectorization improvement and the other is pack/unpack vector size propagation? (I haven't reviewed the latter part yet)

The two changes belong together. The to_layout that provides the seed for the tile size analysis is applied to the operands/result of pack/unpack, so we need the vector tile size propagation for pack/unpack to determine the vector tile sizes used for vectorization of inner_tiled.

Signed-off-by: Lukas Sommer <lukas.sommer@amd.com>
@sommerlukas sommerlukas requested a review from hanhanW March 30, 2026 16:37