Required prerequisites
Motivation
Currently, SM100 TCGEN5MMA supports the SS (shared-shared) variant via tcgen5mma_gemm_ss
in gemm_sm100.h, but the TS (TMEM-shared) variant — where operand A is read from Tensor
Memory and operand B from shared memory — is not yet implemented.
There is an existing TODO at src/tl_templates/cuda/gemm_sm100.h:419:
// TODO (lei): Implement gemm_ts
The low-level PTX intrinsics for tcgen05mma_ts are already fully implemented in
src/tl_templates/cuda/instruction/tcgen05mma.h (lines 26-143), covering F16/BF16, TF32,
INT8/UINT8, and FP8 (E4M3/E5M2) data types. The backend lowering in src/op/gemm.cc
(line 462) also already routes shared.tmem A-scope to tl::tcgen5mma_gemm_ts.
What's missing is the upper-level C++ template wrapper (tcgen5mma_gemm_ts) that bridges
the gap between the lowering logic and the PTX intrinsics — analogous to what
tcgen5mma_gemm_ss does for the SS variant.
The TS variant is important because it enables reading the A operand directly from Tensor
Memory, which can reduce shared memory pressure and enable new dataflow patterns (e.g.,
chaining the output of one MMA as the input of the next without going through shared memory).
Solution
Implement tcgen5mma_gemm_ts in src/tl_templates/cuda/gemm_sm100.h, following the same
pattern as the existing tcgen5mma_gemm_ss. The key differences are:
- Operand A comes from TMEM (a uint32_t address) instead of SMEM (a uint64_t descriptor)
- The underlying intrinsic uses tcgen05.mma ... [tmem_c], [tmem_a], desc_b instead of tcgen05.mma ... [tmem_c], desc_a, desc_b
- Corresponding DispatchInstruction specializations and GemmTensorOp::body_ts need to be added
A matching CuTe MMA_Atom (similar to SM100_MMA_F16BF16_WS_SS but for the TS variant) may
also be needed.
I'd like to work on this feature. Before starting, I want to confirm:
- Is anyone currently working on this?
- Are there any specific design considerations or constraints I should be aware of?
- Should the TS variant also support the .ws (warp-specialized) mode, or is the non-ws variant sufficient as a first step?
I have access to Blackwell hardware (with tcgen05 MMA support) for development and testing.
Alternatives
Users can currently work around this by manually copying data from TMEM to shared memory
before calling tcgen5mma_gemm_ss, but this introduces unnecessary data movement overhead
and defeats the purpose of the Tensor Memory architecture.
Additional context
Related code locations:
- TODO: src/tl_templates/cuda/gemm_sm100.h:419-424
- TS intrinsics: src/tl_templates/cuda/instruction/tcgen05mma.h:26-143
- SS reference implementation: src/tl_templates/cuda/gemm_sm100.h:339-406 (GemmTensorOp::body_ss)
- Lowering logic: src/op/gemm.cc:461-462 (already routes shared.tmem scope to gemm_ts)
- Python tileop: tilelang/tileop/gemm/gemm_tcgen05.py