The CUTLASS notes series begins with a minimal GEMM implementation and gradually expands to cover CuTe, other CUTLASS components, and features of newer architectures such as Hopper and Blackwell, with the goal of ultimately building a high-performance fused GEMM operator.

    git clone https://github.com/ArthurinRUC/cutlass-notes.git
    make update # clone cutlass

All example code in this GitHub repository can be compiled and run by simply executing the Python script. For example:

    cd 01-minimal-gemm
    python minimal_gemm.py

| Notes | Summary | Links |
|---|---|---|
| 00-Intro | Brief introduction to CUTLASS | intro |
| 01-minimal-gemm | - Introduces CuTe fundamentals<br>- Implements 16x8x8 GEMM kernel using single MMA instruction from scratch<br>- Python kernel invocation, precision validation & performance benchmarking<br>- Profiling with Nsight Compute (ncu) | minimal-gemm |
| 02-mixed-precision-gemm | - Implements mixed-precision GEMM supporting varying input/output/accumulation precisions<br>- Explores technical details for numerical precision conversion within kernels<br>- Demonstrates custom FP8 GEMM kernel implementation via PTX instructions (for CUTLASS-unsupported MMA ops) | mixed-precision-gemm |
| 03-tiled-mma | - Introduces the key conceptual model of GEMM operator: Three-Level Tiling<br>- Details the implementation of Tiled MMA operations in CUTLASS CuTe<br>- Explains the usage and semantics of various parameters in the Tiled MMA API<br>- Extends the GEMM kernel from single instruction to single tile operation | tiled-mma |
| 04-tiled-copy | - Explains the core principles of CuTe TiledCopy and its role in data movement between global and shared memory<br>- Describes the API parameters and semantics of TiledCopy<br>- Demonstrates how to implement data copying at the Tile level<br>- Introduces foundational knowledge of GPU global memory access characteristics | tiled-copy |
| 05-block-mma | - Extends Tiled MMA to the Block level for larger-scale GEMM computations<br>- Explains how multiple Tiled MMA operations are combined within a thread block<br>- Describes the tiling and coordination of TiledCopy and TiledMMA at the Block level<br>- Illustrates the hierarchical dataflow from global memory to shared memory to registers for Block-level MMA | block-mma |
| 06-block-copy | Coming soon | Stay tuned |
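
As a rough illustration of the idea running through notes 01 and 03, the sketch below builds a larger GEMM out of 16x8x8 tile products (the MMA shape named in note 01) and validates the result against a naive reference, much like the precision check the notes mention. It is plain Python and an assumption-laden stand-in, not code from this repository or from CUTLASS: the names `naive_gemm`, `mma_16x8x8`, `tiled_gemm`, and `max_abs_diff` are hypothetical, and the loops only mimic the arithmetic of a tile, not how threads or registers are used on the GPU.

```python
# Illustrative only: a plain-Python stand-in for a 16x8x8 MMA tile and a tiled GEMM.
# All function names are hypothetical; none come from the repository or from CUTLASS.

MMA_M, MMA_N, MMA_K = 16, 8, 8  # tile shape named in note 01


def naive_gemm(a, b):
    """Reference C = A @ B for row-major lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def mma_16x8x8(a, b, c, m0, n0, k0):
    """Accumulate one 16x8x8 tile product into c, in place."""
    for i in range(m0, m0 + MMA_M):
        for j in range(n0, n0 + MMA_N):
            acc = c[i][j]
            for p in range(k0, k0 + MMA_K):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc


def tiled_gemm(a, b):
    """Compute C = A @ B by looping over 16x8x8 tiles."""
    m, k, n = len(a), len(b), len(b[0])
    # This sketch assumes the problem size is a multiple of the tile shape.
    assert m % MMA_M == 0 and n % MMA_N == 0 and k % MMA_K == 0
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, MMA_M):
        for n0 in range(0, n, MMA_N):
            for k0 in range(0, k, MMA_K):
                mma_16x8x8(a, b, c, m0, n0, k0)
    return c


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


if __name__ == "__main__":
    # Small integer-valued inputs keep the comparison free of rounding effects.
    a = [[float((i + 2 * p) % 5) for p in range(16)] for i in range(32)]
    b = [[float((3 * p + j) % 7) for j in range(16)] for p in range(16)]
    print("max abs diff:", max_abs_diff(tiled_gemm(a, b), naive_gemm(a, b)))
```

With real floating-point inputs the two results can differ slightly because the accumulation order changes, which is why a validation step of this kind would normally compare against a tolerance rather than test for exact equality.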

This project is licensed under the MIT License - see the LICENSE file for details.