[Codegen] Enable XOR swizzle for non-scaled BF16 matmul with DMA #23932
Yu-Zhewen wants to merge 4 commits into iree-org:users/lialan/swizzle_dma
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
    hint.getResult().setType(hint.getOperand().getType());
  }
});
forOp->walk([](memref::ExpandShapeOp expandOp) {
This looks suspicious. Can you limit this to just expand_shape users of the swizzle hint? Or just uses of the swizzle hint in general?
// Insert collapse_shape -> swizzle_hint -> expand_shape.
auto insertSwizzleHint = [&](Value value) {
  auto collapse = memref::CollapseShapeOp::create(builder, loc, *flatType,
... I hope this folds away
Yes, the chain effectively folds away once the swizzle_hint is resolved.
Also, re multi-buffer, I wonder if we could just do a createFlattenSwizzleHintAllocsPass after pipelining and keep the swizzle hint outside of the loop
I think both
Split the pipeliner changes to #23945
Transposed operands produce sub-access-width vector loads (e.g. vector<1xbf16>) that are incompatible with the XOR swizzle resolution pass. Only apply swizzle when the reduction dimension is innermost (contiguous reads).
Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>

After multi-buffering, allocations get strided layouts. The identity layout check would reject these valid swizzle hints.
Made-with: Cursor
Signed-off-by: Yu-Zhewen <zhewenyu@amd.com>
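
As background for the XOR swizzle discussed in this PR, the index remapping can be sketched roughly as follows. This is a generic illustration of the technique, not IREE's implementation: the function name `xorSwizzle` and the reading of the two parameters as a row width and an access width (both in elements) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Generic XOR swizzle over a flat element index (illustrative only).
// rowWidth:    elements per swizzle row (assumed meaning of the first parameter).
// accessWidth: elements per contiguous access group (assumed meaning of the second).
inline int64_t xorSwizzle(int64_t index, int64_t rowWidth, int64_t accessWidth) {
  assert(index >= 0 && rowWidth > 0 && accessWidth > 0);
  assert(rowWidth % accessWidth == 0);
  const int64_t groupsPerRow = rowWidth / accessWidth;
  // XOR only stays within the row when the group count is a power of two.
  assert((groupsPerRow & (groupsPerRow - 1)) == 0);

  const int64_t row = index / rowWidth;      // which row the element is in
  const int64_t col = index % rowWidth;      // position within the row
  const int64_t group = col / accessWidth;   // which access group within the row
  const int64_t lane = col % accessWidth;    // offset inside the access group

  // Permute whole access groups; elements inside a group stay contiguous.
  const int64_t swizzledGroup = group ^ (row % groupsPerRow);
  return row * rowWidth + swizzledGroup * accessWidth + lane;
}
```

Under these assumptions, with rowWidth = 128 and accessWidth = 8, element 128 (row 1, group 0) maps to 136, so consecutive rows start at different 8-element groups. Applying the mapping twice returns the original index, because the row is unchanged and the XOR cancels.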
Results:

transposed_rhs (A[M,K] × B[N,K]), LHS + RHS swizzled, 320 product shapes
[Table: bank conflicts and time for xor_shuffle<128,8> and xor_shuffle<64,8>; values not preserved]

Geometric mean speedup vs no-swizzle baseline:
[Table: xor_shuffle<128,8> and xor_shuffle<64,8>; values not preserved]

The oracle picks the best config per shape, showing +1.9% additional headroom. However, it is difficult to distill this into a simple compile-time heuristic. We default to <128,8> as it gives the best single-config geomean.

Depends on #23807
Fixes: #23901
Assisted-by: Cursor (Claude)