[Codegen] Enable XOR swizzle for non-scaled BF16 matmul with DMA #23932

Draft
Yu-Zhewen wants to merge 4 commits into iree-org:users/lialan/swizzle_dma from Yu-Zhewen:dma_swizzle


Yu-Zhewen commented Mar 25, 2026

Results: transposed_rhs (A[M,K] × B[N,K]) — LHS + RHS swizzled

| Shape | Intrinsic | tile_k | No Swizzle Bank Conflicts | `xor_shuffle<128,8>` Bank Conflicts | `xor_shuffle<64,8>` Bank Conflicts | No Swizzle Time | `xor_shuffle<128,8>` Time | `xor_shuffle<64,8>` Time |
|---|---|---|---|---|---|---|---|---|
| 512 | 16x16x32 | 4 | 7.00 | 0.00 | 1.00 | 0.063ms | 0.058ms | 0.055ms |
| 1024 | 16x16x32 | 4 | 7.00 | 0.00 | 1.00 | 0.068ms | 0.062ms | 0.057ms |
| 2048 | 16x16x32 | 1 | 1.00 | 1.00 | 0.00 | 0.096ms | 0.092ms | 0.086ms |
| 4096 | 16x16x32 | 1 | 1.00 | 1.00 | 0.00 | 0.222ms | 0.214ms | 0.210ms |
| 8192 | 16x16x32 | 1 | 1.00 | 1.00 | 0.00 | 1.61ms | 1.61ms | 1.47ms |
| 16384 | 32x32x16 | 2 | 3.00 | 0.00 | 1.00 | 10.0ms | 9.35ms | 9.42ms |
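
For readers unfamiliar with the technique, here is a minimal sketch of how an XOR swizzle can spread column-aligned accesses across banks. It is a toy model, not the IREE implementation: it assumes the two `xor_shuffle` parameters are the row width and the access width in elements, and `swizzledOffset`, `maxThreadsPerBank`, and the bank parameters are hypothetical names and values chosen for illustration. The numbers it produces are not comparable to the bank-conflict metric in the table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical XOR swizzle over a row-major tile. Columns are grouped into
// chunks of `accessWidth` elements, and the chunk index is XORed with the row
// index so that consecutive rows place the same logical chunk at different
// physical positions. Returns the swizzled linear element offset.
int64_t swizzledOffset(int64_t row, int64_t col, int64_t rowWidth,
                       int64_t accessWidth) {
  assert(rowWidth % accessWidth == 0 && "row must hold whole chunks");
  int64_t chunksPerRow = rowWidth / accessWidth;
  // A power-of-two chunk count keeps the XOR result inside the row.
  assert((chunksPerRow & (chunksPerRow - 1)) == 0);
  int64_t chunk = col / accessWidth;
  int64_t within = col % accessWidth;
  int64_t swizzledChunk = chunk ^ (row % chunksPerRow);
  return row * rowWidth + swizzledChunk * accessWidth + within;
}

// Simplified bank model (an assumption, not a hardware description): thread t
// reads one `accessWidth`-element chunk at (row t, column 0). Returns the
// largest number of threads that touch any single bank.
int maxThreadsPerBank(int numThreads, int64_t rowWidth, int64_t accessWidth,
                      int64_t elemBytes, int64_t numBanks, int64_t bankBytes,
                      bool swizzle) {
  assert((accessWidth * elemBytes) % bankBytes == 0);
  std::vector<int> hits(numBanks, 0);
  int64_t banksPerAccess = (accessWidth * elemBytes) / bankBytes;
  for (int t = 0; t < numThreads; ++t) {
    int64_t offset = swizzle ? swizzledOffset(t, 0, rowWidth, accessWidth)
                             : t * rowWidth;
    int64_t firstBank = (offset * elemBytes) / bankBytes;
    for (int64_t b = 0; b < banksPerAccess; ++b)
      ++hits[(firstBank + b) % numBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

In this model, with 16 threads, 128 two-byte elements per row, and 64 four-byte banks, `maxThreadsPerBank` reports 16 without the swizzle and 1 with it, because every unswizzled row starts at the same bank.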

320 product shapes

Geometric mean speedup vs no-swizzle baseline:

| Config | Shapes Compiled | Geomean Speedup |
|---|---|---|
| `xor_shuffle<128,8>` | 320/320 | +6.5% |
| `xor_shuffle<64,8>` | 320/320 | +5.5% |
| Oracle (best per shape) | 320/320 | +8.4% |

The oracle picks the best config per shape, which shows +1.9% additional headroom over the best single config (+8.4% vs. +6.5%). However, it is difficult to distill the per-shape choice into a simple compile-time heuristic, so we default to `<128,8>`, which gives the best single-config geomean.
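
As a rough illustration of how such numbers can be computed, the sketch below takes per-shape times and returns a geomean speedup and an oracle. It assumes speedup is defined as baseline time divided by candidate time; the PR does not state the exact formula, and `geomeanSpeedup` and `oracleTimes` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean of per-shape speedups, with speedup = baseline / candidate
// (an assumed definition). A return value of 1.065 corresponds to "+6.5%".
double geomeanSpeedup(const std::vector<double> &baselineMs,
                      const std::vector<double> &candidateMs) {
  assert(!baselineMs.empty() && baselineMs.size() == candidateMs.size());
  double logSum = 0.0;
  for (std::size_t i = 0; i < baselineMs.size(); ++i)
    logSum += std::log(baselineMs[i] / candidateMs[i]);
  return std::exp(logSum / static_cast<double>(baselineMs.size()));
}

// Oracle: for each shape, keep the faster of the two configs.
std::vector<double> oracleTimes(const std::vector<double> &a,
                                const std::vector<double> &b) {
  assert(a.size() == b.size());
  std::vector<double> best(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    best[i] = std::min(a[i], b[i]);
  return best;
}
```

Because the oracle takes the per-shape minimum, its geomean can never fall below that of either single config, which is why it serves as an upper bound on what a per-shape heuristic could recover.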

Depends on #23807

Fixes: #23901

Assisted-by: Cursor (Claude)

Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

krzysz00 left a comment

It might be a good idea to move the pipeliner changes into their own patch and add a test for them?

hint.getResult().setType(hint.getOperand().getType());
}
});
forOp->walk([](memref::ExpandShapeOp expandOp) {
Contributor

This looks suspicious. Can you limit this to just expand_shape users of the swizzle hint? Or just uses of the swizzle hint in general?

Contributor Author

Done


// Insert collapse_shape -> swizzle_hint -> expand_shape.
auto insertSwizzleHint = [&](Value value) {
auto collapse = memref::CollapseShapeOp::create(builder, loc, *flatType,
Contributor

... I hope this folds away

Contributor Author

Yes, the chain effectively folds away once the swizzle_hint is resolved.

@krzysz00
Contributor

Also, re multi-buffer, I wonder if we could just do a createFlattenSwizzleHintAllocsPass after pipelining and keep the swizzle hint outside of the loop

@Yu-Zhewen
Contributor Author

> Also, re multi-buffer, I wonder if we could just do a createFlattenSwizzleHintAllocsPass after pipelining and keep the swizzle hint outside of the loop

I think both FlattenSwizzleHintAllocsPass and ResolveSwizzleHintsPass currently operate only on direct users of the hint/alloc. Neither propagates through the loop bodies after multi-buffering.

@Yu-Zhewen
Contributor Author

> It might be a good idea to move the pipeliner changes into their own patch and add a test for them?

Split the pipeliner changes out into #23945.

Transposed operands produce sub-access-width vector loads
(e.g. vector<1xbf16>) that are incompatible with the XOR swizzle
resolution pass. Only apply swizzle when the reduction dimension
is innermost (contiguous reads).

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

After multi-buffering, allocations get strided layouts. The identity
layout check would reject these valid swizzle hints.

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
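
The last commit message does not say what replaces the identity-layout check. One plausible relaxation, shown purely as an illustration and not necessarily what the patch does, is to compare the strides against the canonical row-major strides while ignoring the base offset. A multi-buffered slice can carry a layout such as `strided<[128, 1], offset: ?>`, which is not the identity layout even though its strides are contiguous. `hasRowMajorStrides` is a hypothetical helper written in plain C++ so that it stands alone without MLIR headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true when `strides` equal the canonical row-major strides for
// `shape`. The base offset is deliberately not an input, so a buffer that
// starts at a nonzero or dynamic offset still passes.
bool hasRowMajorStrides(const std::vector<int64_t> &shape,
                        const std::vector<int64_t> &strides) {
  assert(shape.size() == strides.size());
  int64_t expected = 1;
  // Walk from the innermost dimension outward, accumulating the expected
  // stride as the product of the inner dimension sizes.
  for (std::size_t i = shape.size(); i-- > 0;) {
    if (strides[i] != expected)
      return false;
    expected *= shape[i];
  }
  return true;
}
```

For a 64x128 slice, strides `{128, 1}` pass this check and padded strides such as `{256, 1}` do not.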