[Codegen] Enable XOR swizzle for non-scaled BF16 matmul with DMA by Yu-Zhewen · Pull Request #23932 · iree-org/iree

Yu-Zhewen · 2026-03-25T23:00:40Z

Results: `transposed_rhs` (A[M,K] × B[N,K]) — LHS + RHS swizzled

| Shape | Intrinsic | tile_k | No Swizzle Bank Conflicts | `xor_shuffle<128,8>` Bank Conflicts | `xor_shuffle<64,8>` Bank Conflicts | No Swizzle Time | `xor_shuffle<128,8>` Time | `xor_shuffle<64,8>` Time |
|---|---|---|---|---|---|---|---|---|
| 512 | 16x16x32 | 4 | 7.00 | 0.00 | 1.00 | 0.063ms | 0.058ms | 0.055ms |
| 1024 | 16x16x32 | 4 | 7.00 | 0.00 | 1.00 | 0.068ms | 0.062ms | 0.057ms |
| 2048 | 16x16x32 | 1 | 1.00 | 1.00 | 0.00 | 0.096ms | 0.092ms | 0.086ms |
| 4096 | 16x16x32 | 1 | 1.00 | 1.00 | 0.00 | 0.222ms | 0.214ms | 0.210ms |
| 8192 | 16x16x32 | 1 | 1.00 | 1.00 | 0.00 | 1.61ms | 1.61ms | 1.47ms |
| 16384 | 32x32x16 | 2 | 3.00 | 0.00 | 1.00 | 10.0ms | 9.35ms | 9.42ms |
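For readers unfamiliar with the technique behind the bank-conflict columns, here is a minimal sketch of a generic XOR swizzle on element offsets. It assumes the two parameters of `xor_shuffle<128,8>` are a row width and an access width, both in elements. `xorSwizzleOffset` is a hypothetical helper written for illustration; it is not taken from the IREE sources, and the formula used by the actual resolution pass may differ.

```cpp
#include <cstdint>

// Hypothetical helper (not IREE's implementation): generic XOR swizzle.
// The buffer is viewed as rows of RowWidth elements, and each row is split
// into groups of AccessWidth contiguous elements. The group index is XORed
// with the row index so that accesses to the same group in consecutive rows
// land in different positions within the row.
template <int64_t RowWidth, int64_t AccessWidth>
constexpr int64_t xorSwizzleOffset(int64_t offset) {
  static_assert(RowWidth % AccessWidth == 0,
                "row width must be a multiple of the access width");
  constexpr int64_t numGroups = RowWidth / AccessWidth;
  static_assert((numGroups & (numGroups - 1)) == 0,
                "group count must be a power of two so the XOR stays in range");

  const int64_t row = offset / RowWidth;
  const int64_t col = offset % RowWidth;
  const int64_t group = col / AccessWidth;  // which access-width chunk
  const int64_t lane = col % AccessWidth;   // position inside the chunk
  const int64_t swizzledGroup = group ^ (row % numGroups);
  return row * RowWidth + swizzledGroup * AccessWidth + lane;
}
```

Under these assumptions, row 0 is left unchanged, while `xorSwizzleOffset<128, 8>(128)` (row 1, group 0) maps to 136 (row 1, group 1). Elements inside one access-width chunk stay contiguous, so vector accesses of that width remain valid.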

320 product shapes

Geometric mean speedup vs no-swizzle baseline:

| Config | Shapes Compiled | Geomean Speedup |
|---|---|---|
| `xor_shuffle<128,8>` | 320/320 | +6.5% |
| `xor_shuffle<64,8>` | 320/320 | +5.5% |
| Oracle (best per shape) | 320/320 | +8.4% |

The oracle picks the best config per shape and reaches +8.4%, which is 1.9 percentage points of headroom over the best single config. However, it is difficult to distill the per-shape choice into a simple compile-time heuristic, so we default to `xor_shuffle<128,8>`, which gives the best single-config geomean.
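As an illustration of how the geomean and oracle rows can be derived from per-shape timings, here is a minimal sketch. The names `ShapeTiming`, `Summary`, `geomeanSpeedup`, and `summarize` are hypothetical. It assumes speedup is defined as baseline time divided by config time, and that the oracle chooses only between the two swizzle configs; the PR does not spell out either detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Runtimes in milliseconds for one shape under each configuration.
struct ShapeTiming {
  double noSwizzleMs;
  double swizzle128Ms;  // xor_shuffle<128,8>
  double swizzle64Ms;   // xor_shuffle<64,8>
};

// Geomean speedups as ratios, e.g. 1.065 corresponds to +6.5%.
struct Summary {
  double swizzle128;
  double swizzle64;
  double oracle;
};

// Geometric mean computed in log space to avoid overflow on long products.
double geomeanSpeedup(const std::vector<double> &speedups) {
  assert(!speedups.empty() && "need at least one shape");
  double logSum = 0.0;
  for (double s : speedups)
    logSum += std::log(s);
  return std::exp(logSum / static_cast<double>(speedups.size()));
}

Summary summarize(const std::vector<ShapeTiming> &timings) {
  std::vector<double> s128, s64, best;
  for (const ShapeTiming &t : timings) {
    // Assumption: speedup = baseline time / config time.
    const double a = t.noSwizzleMs / t.swizzle128Ms;
    const double b = t.noSwizzleMs / t.swizzle64Ms;
    s128.push_back(a);
    s64.push_back(b);
    // Assumption: the oracle picks the better of the two swizzle configs.
    best.push_back(std::max(a, b));
  }
  return {geomeanSpeedup(s128), geomeanSpeedup(s64), geomeanSpeedup(best)};
}
```

By construction the oracle geomean is at least as large as either single-config geomean, which is why the gap between the two measures the most a perfect per-shape heuristic could gain.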

Depends on #23807

Fixes: #23901

Assisted-by: Cursor (Claude)

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

krzysz00

It might be a good idea to move the pipeliner changes into their own patch and add a test for them?

krzysz00 · 2026-03-26T18:05:46Z

compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/ROCDLPrefetchSharedMemoryCopy.cpp

+      hint.getResult().setType(hint.getOperand().getType());
+    }
+  });
+  forOp->walk([](memref::ExpandShapeOp expandOp) {


This looks suspicious. Can you limit this to just expand_shape users of the swizzle hint? Or just uses of the swizzle hint in general?

krzysz00 · 2026-03-26T18:06:56Z

compiler/src/iree/compiler/Codegen/LLVMGPU/Utils/ROCDLPrefetchSharedMemoryCopy.cpp

+
+    // Insert collapse_shape -> swizzle_hint -> expand_shape.
+    auto insertSwizzleHint = [&](Value value) {
+      auto collapse = memref::CollapseShapeOp::create(builder, loc, *flatType,


... I hope this folds away

Yes, the chain effectively folds away once the swizzle_hint is resolved.

krzysz00 · 2026-03-26T18:57:11Z

Also, re multi-buffer, I wonder if we could just do a createFlattenSwizzleHintAllocsPass after pipelining and keep the swizzle hint outside of the loop

Yu-Zhewen · 2026-03-27T11:32:59Z

> Also, re multi-buffer, I wonder if we could just do a createFlattenSwizzleHintAllocsPass after pipelining and keep the swizzle hint outside of the loop

I think both FlattenSwizzleHintAllocsPass and ResolveSwizzleHintsPass currently operate only on direct users of the hint/alloc. Neither propagates through the multi-buffered loop bodies.

Yu-Zhewen · 2026-03-27T12:08:53Z

> It might be a good idea to move the pipeliner changes into their own patch and add a test for them?

Split the pipeliner changes out into #23945.

Transposed operands produce sub-access-width vector loads (e.g. vector<1xbf16>) that are incompatible with the XOR swizzle resolution pass. Only apply swizzle when the reduction dimension is innermost (contiguous reads).

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
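The condition in this commit message can be pictured with a small sketch. It models an operand by the loop dimensions that index it, outermost first, and checks whether the reduction dimension K comes last. `LoopDim` and `reductionIsInnermost` are hypothetical names for illustration only; the real check in the PR works on the compiler's own operand representation, which is not shown in this thread.

```cpp
#include <array>
#include <cstddef>

// Loop dimensions of a matmul: M and N are parallel, K is the reduction.
enum class LoopDim { M, N, K };

// Hypothetical predicate: returns true when the last (innermost, contiguous)
// dimension of the operand is the reduction dimension K.
template <std::size_t Rank>
constexpr bool reductionIsInnermost(
    const std::array<LoopDim, Rank> &operandDims) {
  static_assert(Rank > 0, "operand must have at least one dimension");
  return operandDims[Rank - 1] == LoopDim::K;
}
```

In this model, A[M,K] and B[N,K] from the `transposed_rhs` case both qualify, while an operand laid out as B[K,N] does not and would be left unswizzled.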


regular matmul

aaa2e4b

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

Muzammiluddin-Syed-ECE self-requested a review March 26, 2026 17:33

krzysz00 reviewed Mar 26, 2026

Yu-Zhewen added 2 commits March 27, 2026 14:12

add lit tests for swizzle config

6d36505

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

Yu-Zhewen force-pushed the dma_swizzle branch from 7ece59c to 6d36505 Compare March 27, 2026 14:14

Relax identity layout check in ResolveSwizzleHints verification

79f6059

After multi-buffering, allocations get strided layouts. The identity layout check would reject these valid swizzle hints.

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

Yu-Zhewen wants to merge 4 commits into iree-org:users/lialan/swizzle_dma from Yu-Zhewen:dma_swizzle
