
[Feature Request] Implement gemm_ts wrapper for SM100 TCGEN5MMA (A from TMEM, B from SMEM) #1862

@Hale423

Description


Required prerequisites

  • I have searched the Issue Tracker to confirm this hasn't already been reported (and will comment there if it has).

Motivation

Currently, SM100 TCGEN5MMA supports the SS (shared-shared) variant via tcgen5mma_gemm_ss
in gemm_sm100.h, but the TS (TMEM-shared) variant — where operand A is read from Tensor
Memory and operand B from shared memory — is not yet implemented.

There is an existing TODO at src/tl_templates/cuda/gemm_sm100.h:419:

// TODO (lei): Implement gemm_ts

The low-level PTX intrinsics for tcgen05mma_ts are already fully implemented in
src/tl_templates/cuda/instruction/tcgen05mma.h (lines 26-143), covering F16/BF16, TF32,
INT8/UINT8, and FP8 (E4M3/E5M2) data types. The backend lowering in src/op/gemm.cc
(line 462) also already routes shared.tmem A-scope to tl::tcgen5mma_gemm_ts.

What's missing is the upper-level C++ template wrapper (tcgen5mma_gemm_ts) that bridges
the gap between the lowering logic and the PTX intrinsics — analogous to what
tcgen5mma_gemm_ss does for the SS variant.

The TS variant is important because it enables reading the A operand directly from Tensor
Memory, which can reduce shared memory pressure and enable new dataflow patterns (e.g.,
chaining the output of one MMA as the input of the next without going through shared memory).

Solution

Implement tcgen5mma_gemm_ts in src/tl_templates/cuda/gemm_sm100.h, following the same
pattern as the existing tcgen5mma_gemm_ss. The key differences are:

  1. Operand A comes from TMEM (a uint32_t TMEM address) instead of SMEM (a uint64_t shared-memory descriptor).
  2. The underlying intrinsic has the form tcgen05.mma ... [tmem_c], [tmem_a], desc_b instead of
    tcgen05.mma ... [tmem_c], desc_a, desc_b.
  3. Corresponding DispatchInstruction specializations and a GemmTensorOp::body_ts method need to be added.

A matching CuTe MMA_Atom (similar to SM100_MMA_F16BF16_WS_SS but for the TS variant) may
also be needed.
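To make the proposed shape concrete, here is a minimal sketch of what the wrapper could look like, mirroring tcgen5mma_gemm_ss but taking a TMEM address for operand A. The template parameter names, function signature, and exact PTX operand order below are illustrative assumptions on my part, not the final design:

```cuda
#include <cstdint>

namespace tl {

// Hypothetical sketch -- names and parameters are assumptions, not the
// actual TileLang API. Mirrors the SS wrapper, except A is addressed by a
// uint32_t TMEM address rather than a uint64_t SMEM descriptor.
template <int M, int N, int K, bool trans_B,
          typename A_type, typename B_type, typename C_type>
TL_DEVICE void tcgen5mma_gemm_ts(uint32_t tmem_a,  // A operand: TMEM address
                                 uint64_t desc_b,  // B operand: SMEM descriptor
                                 uint32_t tmem_c,  // accumulator: TMEM address
                                 bool scale_d) {
  // Dispatch on A_type/B_type to the tcgen05mma_ts intrinsics that already
  // exist in src/tl_templates/cuda/instruction/tcgen05mma.h, analogous to
  // how body_ss dispatches to the SS intrinsics.
  //
  // The underlying instruction shapes differ roughly as (illustrative):
  //   TS: tcgen05.mma ... [tmem_c], [tmem_a], desc_b, idesc, scale_d;
  //   SS: tcgen05.mma ... [tmem_c], desc_a,   desc_b, idesc, scale_d;
}

} // namespace tl
```

A DispatchInstruction specialization selecting the right tcgen05mma_ts intrinsic per dtype, plus a GemmTensorOp::body_ts calling this wrapper, would complete the bridge to the existing lowering in src/op/gemm.cc.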

I'd like to work on this feature. Before starting, I want to confirm:

  1. Is anyone currently working on this?
  2. Are there any specific design considerations or constraints I should be aware of?
  3. Should the TS variant also support the .ws (warp-specialized) mode, or is the
    non-ws variant sufficient as a first step?

I have access to Blackwell hardware (with tcgen05 MMA support) for development and testing.

Alternatives

Users can currently work around this by manually copying data from TMEM to shared memory
before calling tcgen5mma_gemm_ss, but this introduces unnecessary data movement overhead
and defeats the purpose of the Tensor Memory architecture.

Additional context

Related code locations:

  • TODO: src/tl_templates/cuda/gemm_sm100.h:419-424
  • TS intrinsics: src/tl_templates/cuda/instruction/tcgen05mma.h:26-143
  • SS reference implementation: src/tl_templates/cuda/gemm_sm100.h:339-406 (GemmTensorOp::body_ss)
  • Lowering logic: src/op/gemm.cc:461-462 (already routes shared.tmem scope to gemm_ts)
  • Python tileop: tilelang/tileop/gemm/gemm_tcgen05.py

Metadata


Assignees

No one assigned

    Labels

    enhancement (New feature or request)


    Milestone

    No milestone


    Development

    No branches or pull requests
