ArthurinRUC/cutlass-notes

CUTLASS Notes

This series of CUTLASS notes begins with a minimal GEMM implementation and gradually expands to cover CuTe, other CUTLASS components, and features of newer architectures such as Hopper and Blackwell, working toward a high-performance fused GEMM operator.

Usage

git clone https://github.com/ArthurinRUC/cutlass-notes.git
cd cutlass-notes
make update  # clone cutlass

Run sample code

Every example in this repository can be compiled and run by executing the Python script in its directory. For example:

cd 01-minimal-gemm
python minimal_gemm.py

Note list

| Notes | Summary | Links |
| --- | --- | --- |
| 00-Intro | Brief introduction to CUTLASS | intro |
| 01-minimal-gemm | - Introduces CuTe fundamentals<br>- Implements a 16x8x8 GEMM kernel from scratch using a single MMA instruction<br>- Python kernel invocation, precision validation, and performance benchmarking<br>- Profiling with Nsight Compute (ncu) | minimal-gemm |
| 02-mixed-precision-gemm | - Implements mixed-precision GEMM supporting varying input, output, and accumulation precisions<br>- Explores the technical details of numerical precision conversion within kernels<br>- Demonstrates a custom FP8 GEMM kernel implemented via PTX instructions (for MMA ops that CUTLASS does not support) | mixed-precision-gemm |
| 03-tiled-mma | - Introduces the key conceptual model of the GEMM operator: Three-Level Tiling<br>- Details the implementation of Tiled MMA operations in CUTLASS CuTe<br>- Explains the usage and semantics of the parameters in the Tiled MMA API<br>- Extends the GEMM kernel from a single instruction to a single tile operation | tiled-mma |
| 04-tiled-copy | - Explains the core principles of CuTe TiledCopy and its role in data movement between global and shared memory<br>- Describes the API parameters and semantics of TiledCopy<br>- Demonstrates how to implement data copying at the Tile level<br>- Introduces foundational knowledge of GPU global memory access characteristics | tiled-copy |
| 05-block-mma | - Extends Tiled MMA to the Block level for larger-scale GEMM computations<br>- Explains how multiple Tiled MMA operations are combined within a thread block<br>- Describes the tiling and coordination of TiledCopy and TiledMMA at the Block level<br>- Illustrates the hierarchical dataflow from global memory to shared memory to registers for Block-level MMA | block-mma |
| 06-block-copy | Coming soon | Stay tuned |
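
To give a feel for the progression in notes 01, 03, and 05 (single MMA instruction, then a tile, then a block), here is a minimal CPU-side sketch in pure Python. It is not code from this repository. Only the 16x8x8 instruction shape comes from the list above; the tile and block sizes, the function names, and the block / tile / instruction split of the three levels are illustrative assumptions. Integer inputs are used so the result can be compared exactly against a naive reference.

```python
# Sketch only: tile sizes and function names are illustrative assumptions,
# not taken from the repository's kernels.
import random

MMA_M, MMA_N, MMA_K = 16, 8, 8           # shape of one MMA instruction (note 01)
TILE_M, TILE_N = 32, 16                  # hypothetical tiled-MMA tile: 2x2 atoms
BLOCK_M, BLOCK_N, BLOCK_K = 64, 32, 16   # hypothetical thread-block tile


def mma_atom(A, B, C, m0, n0, k0):
    """Emulate one 16x8x8 MMA: C[m0:m0+16][n0:n0+8] += A_tile @ B_tile."""
    for i in range(m0, m0 + MMA_M):
        for j in range(n0, n0 + MMA_N):
            acc = C[i][j]
            for k in range(k0, k0 + MMA_K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def tiled_gemm(A, B, M, N, K):
    """Compute C = A @ B with block -> tile -> atom loops.

    Dimensions must be multiples of the block tile sizes.
    """
    assert M % BLOCK_M == 0 and N % BLOCK_N == 0 and K % BLOCK_K == 0
    C = [[0] * N for _ in range(M)]
    # Level 1: block tiles (on a GPU, one thread block per tile).
    for bm in range(0, M, BLOCK_M):
        for bn in range(0, N, BLOCK_N):
            # Main loop: walk K in block-sized slices.
            for bk in range(0, K, BLOCK_K):
                # Level 2: tiled-MMA tiles inside the block tile.
                for tm in range(bm, bm + BLOCK_M, TILE_M):
                    for tn in range(bn, bn + BLOCK_N, TILE_N):
                        # Level 3: individual MMA atoms inside the tile.
                        for am in range(tm, tm + TILE_M, MMA_M):
                            for an in range(tn, tn + TILE_N, MMA_N):
                                for ak in range(bk, bk + BLOCK_K, MMA_K):
                                    mma_atom(A, B, C, am, an, ak)
    return C


def reference_gemm(A, B, M, N, K):
    """Naive triple-loop GEMM used as the ground truth."""
    return [[sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
            for i in range(M)]


if __name__ == "__main__":
    rng = random.Random(0)
    M, N, K = 64, 32, 32
    A = [[rng.randint(-3, 3) for _ in range(K)] for _ in range(M)]
    B = [[rng.randint(-3, 3) for _ in range(N)] for _ in range(K)]
    assert tiled_gemm(A, B, M, N, K) == reference_gemm(A, B, M, N, K)
    print("tiled result matches reference")
```

On a GPU the two outer loops would map to the grid of thread blocks and the inner loops to warps and MMA instructions, with TiledCopy handling the data movement that plain list indexing hides here.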

License

This project is licensed under the MIT License - see the LICENSE file for details.
