CUTLASS Notes

The CUTLASS Notes series begins with a minimal GEMM implementation and gradually expands to cover CuTe, various CUTLASS components, and features of newer architectures such as Hopper and Blackwell, ultimately building a high-performance fused GEMM operator.

Usage

git clone https://github.com/ArthurinRUC/cutlass-notes.git
cd cutlass-notes
make update  # clone CUTLASS

Run sample code

Each example in this repository can be compiled and run by executing its Python script. For example:

cd 01-minimal-gemm
python minimal_gemm.py

Note list

| Notes | Summary | Links |
| --- | --- | --- |
| 00-Intro | Brief introduction to CUTLASS | intro |
| 01-minimal-gemm | - Introduces CuTe fundamentals<br>- Implements a 16x8x8 GEMM kernel from scratch using a single MMA instruction<br>- Python kernel invocation, precision validation, and performance benchmarking<br>- Profiling with Nsight Compute (ncu) | minimal-gemm |
| 02-mixed-precision-gemm | - Implements a mixed-precision GEMM supporting different input, output, and accumulation precisions<br>- Explores the technical details of numerical precision conversion within kernels<br>- Demonstrates a custom FP8 GEMM kernel implemented via PTX instructions (for MMA ops that CUTLASS does not support) | mixed-precision-gemm |
| 03-tiled-mma | - Introduces the key conceptual model of the GEMM operator: Three-Level Tiling<br>- Details the implementation of Tiled MMA operations in CUTLASS CuTe<br>- Explains the usage and semantics of the Tiled MMA API parameters<br>- Extends the GEMM kernel from a single instruction to a single-tile operation | tiled-mma |
| 04-tiled-copy | - Explains the core principles of CuTe TiledCopy and its role in data movement between global and shared memory<br>- Describes the API parameters and semantics of TiledCopy<br>- Demonstrates how to implement data copying at the Tile level<br>- Introduces foundational knowledge of GPU global memory access characteristics | tiled-copy |
| 05-block-mma | - Extends Tiled MMA to the Block level for larger-scale GEMM computations<br>- Explains how multiple Tiled MMA operations are combined within a thread block<br>- Describes the tiling and coordination of TiledCopy and TiledMMA at the Block level<br>- Illustrates the hierarchical dataflow from global memory to shared memory to registers for Block-level MMA | block-mma |
| 06-block-copy | Coming soon | Stay tuned |
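
The tiling idea behind notes 01 and 03 can be previewed on the CPU before reading any CUDA. The sketch below is not code from this repository: it is a pure-Python reference that splits a GEMM into 16x8x8 steps (the MMA shape named in 01-minimal-gemm) and checks the result against a plain triple loop, loosely mirroring the "precision validation" step. The helper names `mma_tile`, `tiled_gemm`, `naive_gemm`, and `max_abs_diff` are hypothetical, and the assumption that all sizes are exact multiples of the tile shape is made only to keep the example short.

```python
# CPU reference sketch only: it mimics the tile decomposition, not GPU execution.
# All function names here are hypothetical and do not come from the repository.
import random

MMA_M, MMA_N, MMA_K = 16, 8, 8  # tile shape named in the 01-minimal-gemm note


def mma_tile(a, b, c, m0, n0, k0):
    """Accumulate one 16x8x8 step: C[m0:m0+16, n0:n0+8] += A_tile @ B_tile."""
    for i in range(MMA_M):
        for j in range(MMA_N):
            acc = c[m0 + i][n0 + j]
            for k in range(MMA_K):
                acc += a[m0 + i][k0 + k] * b[k0 + k][n0 + j]
            c[m0 + i][n0 + j] = acc


def tiled_gemm(a, b):
    """Compute C = A @ B by looping over MMA-shaped tiles."""
    m, k, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k, "inner dimensions must match"
    # Assumption for brevity: every dimension is a multiple of the tile shape.
    assert m % MMA_M == 0 and n % MMA_N == 0 and k % MMA_K == 0
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, MMA_M):          # tile rows of C
        for n0 in range(0, n, MMA_N):      # tile columns of C
            for k0 in range(0, k, MMA_K):  # reduction dimension
                mma_tile(a, b, c, m0, n0, k0)
    return c


def naive_gemm(a, b):
    """Plain triple-loop reference used for validation."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


if __name__ == "__main__":
    random.seed(0)
    M, N, K = 32, 16, 24
    A = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]
    B = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]
    print("max abs diff:", max_abs_diff(tiled_gemm(A, B), naive_gemm(A, B)))
```

On a GPU, each call to `mma_tile` would correspond to work done by an MMA instruction across the threads of a warp, and the outer loops would be distributed over warps and thread blocks rather than executed sequentially; the notes themselves cover how CuTe expresses that mapping.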

License

This project is licensed under the MIT License - see the LICENSE file for details.

