Open
Description
When XOR swizzle is enabled alongside DMA, ROCDLPrefetchSharedMemoryPass currently fails to apply double-buffering.
With swizzle enabled, the alloc becomes flat 1D and swizzle_hint + expand_shape sit between the alloc and the K-loop:
%alloc = memref.alloc() : memref<4096xbf16, #gpu.address_space<workgroup>>
// swizzle_hint + expand_shape sit OUTSIDE the K-loop
%hint = iree_codegen.swizzle_hint %alloc[#iree_codegen.xor_shuffle<128, 8>]
    : memref<4096xbf16, #gpu.address_space<workgroup>>
%expand = memref.expand_shape %hint [[0, 1]] output_shape [128, 32]
    : memref<4096xbf16, ...> into memref<128x32xbf16, ...>
scf.for %iv = %c0 to %c16 step %c4 { // K-loop
  %sub = memref.subview %expand[%off, 0] ...
  amdgpu.gather_to_lds ..., %sub // DMA write
  vector.transfer_read %expand ... // MMA read
}
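memref::multiBuffer produces memref<2x4096xbf16> and replaces uses of the alloc with per-iteration subviews of type memref<4096xbf16, strided<[1], offset: ?>>. swizzle_hint, however, requires a 1D non-strided memref, so it cannot take those subviews as its operand. As a result, double-buffering silently fails.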
Should we:
(a) Clone SwizzleHintOp into the K-loop so it operates on per-iteration subviews, or
(b) Change multi-buffering to produce a flat 1D alloc (memref<8192xbf16> instead of memref<2x4096xbf16>) so swizzle can stay outside the loop?
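To make the two layouts concrete, here is a minimal Python sketch of which buffer each K-loop iteration would touch. It assumes the slot is chosen as ((iv - lb) / step) mod factor, which is my understanding of how multi-buffering rotates buffers, not something stated above. The helper names buffer_slot and flat_offset are hypothetical and exist only for illustration. Under option (b) the slot becomes a plain element offset into a flat memref<8192xbf16>.

```python
# Illustrative arithmetic only; not IREE or MLIR API.
# Assumption: the slot rotates as ((iv - lb) // step) % factor.

def buffer_slot(iv: int, lb: int, step: int, factor: int) -> int:
    """Index of the buffer used by loop iteration `iv`."""
    return ((iv - lb) // step) % factor


def flat_offset(iv: int, lb: int, step: int, factor: int, elems: int) -> int:
    """Element offset of that buffer inside a flat 1D allocation (option b)."""
    return buffer_slot(iv, lb, step, factor) * elems


# Values taken from the IR in this issue: scf.for %c0 to %c16 step %c4,
# double-buffering (factor 2), 4096 bf16 elements per buffer.
LB, UB, STEP = 0, 16, 4
FACTOR = 2
ELEMS_PER_BUFFER = 4096

schedule = [
    (
        iv,
        buffer_slot(iv, LB, STEP, FACTOR),
        flat_offset(iv, LB, STEP, FACTOR, ELEMS_PER_BUFFER),
    )
    for iv in range(LB, UB, STEP)
]

for iv, slot, offset in schedule:
    print(f"iv={iv:2d}  slot={slot}  flat offset={offset}")
```

With memref<2x4096xbf16> the slot selects a row, which yields the dynamically offset subview that swizzle_hint rejects. With a flat memref<8192xbf16> the same slot only shifts the starting element. Whether the XOR swizzle pattern stays valid across that shift would still need to be checked.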