Open
Description
Summary
SM120 (Blackwell GeForce: RTX 5090, RTX 5080) CUTLASS 4.x kernels require narrow precision data types.
CUTLASS 4.3.3's SM120 CollectiveBuilder only supports F8F6F4 MMA (FP8/FP6/FP4), NOT FP32/FP16/BF16.
Error when attempting FP32/FP16/BF16 on SM120:
SM120 TmaWarpSpecialized builder currently only supports F8F6F4 MMA
No MMA matches SM120_16x8x32_TN for given data types
Current Status (v0.2.10)
- SM100 (B200): ✅ Enabled - FP32/FP16/BF16 via CUTLASS 4.x
- SM120 (RTX 5090): ❌ Disabled - Falls back to SM89 (CUTLASS 2.x)
Header files prepared but disabled:
- native/ops/matmul_cutlass_sm120.cuh - Ready for FP8 integration
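The current dispatch behaviour, plus where an FP8 path could slot in later, can be sketched roughly as below. The enum names, `select_kernel_path`, and the `sm120_fp8_enabled` flag are hypothetical and only illustrate the status table above; they are not the project's actual identifiers.

```cpp
// Hypothetical kernel families (illustrative names only).
enum class KernelPath { CutlassSm100, CutlassSm120Fp8, Cutlass2xFallback };

enum class InputType { FP32, FP16, BF16, FP8 };

// Pick a GEMM path from the compute capability encoded as major * 10 + minor
// (100 for B200, 120 for RTX 5090) and the input element type.
inline KernelPath select_kernel_path(int sm, InputType in, bool sm120_fp8_enabled) {
    if (sm == 100) {
        return KernelPath::CutlassSm100;       // FP32/FP16/BF16 via CUTLASS 4.x
    }
    if (sm == 120 && in == InputType::FP8 && sm120_fp8_enabled) {
        return KernelPath::CutlassSm120Fp8;    // assumed future path, disabled today
    }
    // Everything else, including SM120 with FP32/FP16/BF16, stays on the
    // existing CUTLASS 2.x (SM89) kernels.
    return KernelPath::Cutlass2xFallback;
}
```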
Requirements for SM120 Support
1. FP8 Data Type Infrastructure
- Add Dtype::FP8_E4M3 and Dtype::FP8_E5M2 to dtype enum
- Memory allocation/deallocation for FP8 tensors
- GPUArray support for FP8 dtype
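A minimal sketch of the enum extension and the size lookup that allocation would need. Only `FP8_E4M3` and `FP8_E5M2` come from this issue; the other enum members and the helper names `dtype_size_bytes` / `tensor_bytes` are assumptions about the existing code.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative dtype enum; existing members are assumed, FP8 entries are new.
enum class Dtype : std::uint8_t { FP32, FP16, BF16, FP8_E4M3, FP8_E5M2 };

// Size of one element in bytes. Both FP8 formats occupy a single byte.
constexpr std::size_t dtype_size_bytes(Dtype d) {
    switch (d) {
        case Dtype::FP32:     return 4;
        case Dtype::FP16:     return 2;
        case Dtype::BF16:     return 2;
        case Dtype::FP8_E4M3: return 1;
        case Dtype::FP8_E5M2: return 1;
    }
    return 0;  // not reached for valid enum values
}

// Allocation size for a dense tensor with num_elements entries.
constexpr std::size_t tensor_bytes(std::size_t num_elements, Dtype d) {
    return num_elements * dtype_size_bytes(d);
}
```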
2. FP8 Conversion Routines
- FP32 ↔ FP8 conversion kernels
- FP16 ↔ FP8 conversion kernels
- BF16 ↔ FP8 conversion kernels
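A host-side reference for FP32 ↔ E4M3 may help when validating the conversion kernels. The sketch below assumes the E4M3 layout with bias 7, no infinities, and `S.1111.111` as NaN, rounds to nearest with ties to even, and saturates out-of-range inputs to the largest finite value (a design choice, not something this issue specifies). The function names are hypothetical, and the encoder is a brute-force search intended for testing, not performance.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode one E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
inline float e4m3_to_float(std::uint8_t bits) {
    const int sign = bits >> 7;
    const int exp = (bits >> 3) & 0xF;
    const int man = bits & 0x7;
    if (exp == 0xF && man == 0x7) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    const float mag = (exp == 0)
        ? std::ldexp(static_cast<float>(man), -9)   // subnormal: (man / 8) * 2^-6
        : std::ldexp(1.0f + man / 8.0f, exp - 7);   // normal: 1.man * 2^(exp - 7)
    return sign ? -mag : mag;
}

// Encode FP32 to E4M3 by searching all finite magnitudes (codes 0x00..0x7E).
inline std::uint8_t float_to_e4m3(float x) {
    if (std::isnan(x)) return 0x7F;
    const std::uint8_t sign = std::signbit(x) ? 0x80 : 0x00;
    const float mag = std::fabs(x);
    if (mag >= 448.0f) {
        return static_cast<std::uint8_t>(sign | 0x7E);  // saturate, covers infinity
    }
    int best = 0;
    float best_err = mag;  // distance to code 0, which decodes to 0.0
    for (int code = 1; code <= 0x7E; ++code) {
        const float err = std::fabs(e4m3_to_float(static_cast<std::uint8_t>(code)) - mag);
        // On an exact tie prefer the code with an even mantissa LSB.
        if (err < best_err || (err == best_err && (code & 1) == 0)) {
            best = code;
            best_err = err;
        }
    }
    return static_cast<std::uint8_t>(sign | best);
}
```

FP16 and BF16 inputs could reuse the same reference by widening to FP32 first, since both convert to FP32 exactly.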
3. CUTLASS SM120 FP8 Kernels
- Update matmul_cutlass_sm120.cuh to use FP8 element types
- FP8 GEMM: cutlass::float_e4m3_t, cutlass::float_e5m2_t
- Mixed precision: FP8 input → FP32 accumulator → FP16/BF16 output
4. Python API
- pygpukit.float8_e4m3, pygpukit.float8_e5m2 dtype constants
- GPUArray.astype() support for FP8
- matmul() dispatch for FP8 inputs
5. LLM Integration (Optional)
- FP8 weight loading from SafeTensors
- FP8 inference path for supported models
Technical Notes
FP8 Formats
| Format | Exponent | Mantissa | Range | Use Case |
|---|---|---|---|---|
| E4M3 | 4 bits | 3 bits | ±448 | Weights, activations |
| E5M2 | 5 bits | 2 bits | ±57344 | Gradients |
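The range column follows from the bit layout. As a rough check, the sketch below computes the largest finite value for each format, assuming E5M2 reserves the top exponent for inf/NaN (IEEE style) while E4M3 reserves only the all-ones mantissa at the top exponent for NaN. `Fp8Format` and `max_finite` are illustrative names.

```cpp
#include <cmath>

struct Fp8Format {
    int exp_bits;
    int man_bits;
    bool ieee_specials;  // true: top exponent reserved for inf/NaN (E5M2)
};

// Largest finite magnitude representable in the given format.
inline float max_finite(const Fp8Format& f) {
    const int bias = (1 << (f.exp_bits - 1)) - 1;
    const int top_exp = (1 << f.exp_bits) - 1;
    const int man_all_ones = (1 << f.man_bits) - 1;
    // E5M2 loses the whole top exponent; E4M3 loses only one mantissa code.
    const int exp_field = f.ieee_specials ? top_exp - 1 : top_exp;
    const int man_field = f.ieee_specials ? man_all_ones : man_all_ones - 1;
    const float significand =
        1.0f + static_cast<float>(man_field) / static_cast<float>(1 << f.man_bits);
    return std::ldexp(significand, exp_field - bias);
}
```

With these assumptions, `max_finite({4, 3, false})` gives 448 (1.75 × 2^8) and `max_finite({5, 2, true})` gives 57344 (1.75 × 2^15).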
CUTLASS FP8 Types
#include <cutlass/float8.h>
using ElementA = cutlass::float_e4m3_t; // E4M3
using ElementB = cutlass::float_e5m2_t; // E5M2
SM120 Tile Constraints
- 101KB shared memory (vs 232KB on SM100)
- No TMA multicast (cluster shape 1x1x1)
- CLC (Cluster Launch Control) scheduler
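To see how the smaller shared memory limits tile and pipeline choices, a back-of-the-envelope stage count can be computed as below. This is only an upper bound: it treats the figures above as KiB (an assumption), counts just the A and B mainloop tiles, and ignores epilogue storage, barriers, and any scale-factor buffers. All names are illustrative.

```cpp
#include <cstddef>

// Shared-memory budgets quoted in this issue, interpreted as KiB (assumption).
constexpr std::size_t kSmemSm120 = 101 * 1024;
constexpr std::size_t kSmemSm100 = 232 * 1024;

// Bytes for one pipeline stage: an M x K tile of A plus an N x K tile of B.
constexpr std::size_t stage_bytes(std::size_t m, std::size_t n, std::size_t k,
                                  std::size_t elem_bytes) {
    return (m * k + n * k) * elem_bytes;
}

// Upper bound on the number of mainloop stages that fit in shared memory.
constexpr std::size_t max_stages(std::size_t smem, std::size_t m, std::size_t n,
                                 std::size_t k, std::size_t elem_bytes) {
    return smem / stage_bytes(m, n, k, elem_bytes);
}
```

For a 128x128x128 tile with 1-byte FP8 elements, each stage needs 32768 bytes, so this bound allows 3 stages on SM120 versus 7 on SM100.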
References
- NVIDIA FP8 Primer
- CUTLASS FP8 GEMM Examples
- Issue #77: [v0.3+] feat: SM100/SM120 Blackwell-specific CUTLASS optimizations (SM100/SM120 investigation)
Priority
Low - RTX 5090 not yet widely available. SM120 falls back to SM89 kernels which work correctly.
Related: #77