Required prerequisites
Motivation
Currently, SM100 TCGEN5MMA supports the SS (shared-shared) variant via tcgen5mma_gemm_ss
in gemm_sm100.h, but the TS (TMEM-shared) variant — where operand A is read from Tensor
Memory and operand B from shared memory — is not yet implemented.
There is an existing TODO at src/tl_templates/cuda/gemm_sm100.h:419:
// TODO (lei): Implement gemm_ts
The low-level PTX intrinsics for tcgen05mma_ts are already fully implemented in
src/tl_templates/cuda/instruction/tcgen05mma.h (lines 26-143), covering F16/BF16, TF32,
INT8/UINT8, and FP8 (E4M3/E5M2) data types. The backend lowering in src/op/gemm.cc
(line 462) also already routes shared.tmem A-scope to tl::tcgen5mma_gemm_ts.
What's missing is the upper-level C++ template wrapper (tcgen5mma_gemm_ts) that bridges
the gap between the lowering logic and the PTX intrinsics — analogous to what
tcgen5mma_gemm_ss does for the SS variant.
The TS variant is important because it enables reading the A operand directly from Tensor
Memory, which can reduce shared memory pressure and enable new dataflow patterns (e.g.,
chaining the output of one MMA as the input of the next without going through shared memory).
Solution
Implement tcgen5mma_gemm_ts in src/tl_templates/cuda/gemm_sm100.h, following the same
pattern as the existing tcgen5mma_gemm_ss. The key differences are:
- Operand A comes from TMEM (a uint32_t address) instead of SMEM (a uint64_t descriptor)
- The underlying intrinsic uses tcgen05.mma ... [tmem_c], [tmem_a], desc_b instead of tcgen05.mma ... [tmem_c], desc_a, desc_b
- Corresponding DispatchInstruction specializations and GemmTensorOp::body_ts need to be added
A matching CuTe MMA_Atom (similar to SM100_MMA_F16BF16_WS_SS but for the TS variant) may
also be needed.
I'd like to work on this feature. Before starting, I want to confirm:
- Is anyone currently working on this?
- Are there any specific design considerations or constraints I should be aware of?
- Should the TS variant also support the .ws (warp-specialized) mode, or is the non-ws variant sufficient as a first step?
I have access to Blackwell hardware (with tcgen05 MMA support) for development and testing.
Alternatives
Users can currently work around this by manually copying data from TMEM to shared memory
before calling tcgen5mma_gemm_ss, but this introduces unnecessary data movement overhead
and defeats the purpose of the Tensor Memory architecture.
Additional context
Related code locations:
- TODO: src/tl_templates/cuda/gemm_sm100.h:419-424
- TS intrinsics: src/tl_templates/cuda/instruction/tcgen05mma.h:26-143
- SS reference implementation: src/tl_templates/cuda/gemm_sm100.h:339-406 (GemmTensorOp::body_ss)
- Lowering logic: src/op/gemm.cc:461-462 (already routes shared.tmem scope to gemm_ts)
- Python tileop: tilelang/tileop/gemm/gemm_tcgen05.py