feat: SM120 (RTX 5090) support via FP8/FP6/FP4 precision #91

@m96-chan

Summary

SM120 (Blackwell GeForce: RTX 5090, RTX 5080) CUTLASS 4.x kernels require narrow precision data types.

CUTLASS 4.3.3's SM120 CollectiveBuilder only supports F8F6F4 MMA (FP8/FP6/FP4), NOT FP32/FP16/BF16.

Error when attempting FP32/FP16/BF16 on SM120:

SM120 TmaWarpSpecialized builder currently only supports F8F6F4 MMA
No MMA matches SM120_16x8x32_TN for given data types

Current Status (v0.2.10)

  • SM100 (B200): ✅ Enabled - FP32/FP16/BF16 via CUTLASS 4.x
  • SM120 (RTX 5090): ❌ Disabled - Falls back to SM89 (CUTLASS 2.x)

Header files prepared but disabled:

  • native/ops/matmul_cutlass_sm120.cuh - Ready for FP8 integration

Requirements for SM120 Support

1. FP8 Data Type Infrastructure

  • Add Dtype::FP8_E4M3 and Dtype::FP8_E5M2 to dtype enum
  • Memory allocation/deallocation for FP8 tensors
  • GPUArray support for FP8 dtype

2. FP8 Conversion Routines

  • FP32 ↔ FP8 conversion kernels
  • FP16 ↔ FP8 conversion kernels
  • BF16 ↔ FP8 conversion kernels
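A host-side reference encoder can help validate the conversion kernels listed above. The sketch below is one possible FP32 → E4M3 reference, not the project's implementation. It assumes the OCP "FN" flavour of E4M3 (no infinities, max finite 448), round-to-nearest-even, and saturation of out-of-range inputs to the largest finite value. The function name `encode_e4m3` is hypothetical.

```python
import math

E4M3_MAX = 448.0          # largest finite E4M3 value (assumed OCP FN variant)
E4M3_MIN_NORMAL = 2.0 ** -6


def encode_e4m3(x: float) -> int:
    """Return the 8-bit E4M3 pattern for x (round-to-nearest-even, saturating)."""
    if math.isnan(x):
        return 0x7F                      # canonical NaN: S.1111.111
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    a = abs(x)
    if a >= E4M3_MAX:                    # assumption: saturate, including +/-inf
        return sign | 0x7E               # S.1111.110 == 448
    # Unbiased exponent of a, clamped to the subnormal exponent for tiny inputs.
    e = math.frexp(a)[1] - 1 if a >= E4M3_MIN_NORMAL else -6
    quantum = 2.0 ** (e - 3)             # spacing of representable values at e
    q = round(a / quantum)               # Python's round() is ties-to-even
    # q is in [0, 16]. Adding it to the exponent base handles subnormals
    # (q < 8 at e == -6) and mantissa carry (q == 16) in a single expression.
    return sign | (((e + 6) << 3) + q)


print(hex(encode_e4m3(1.0)))     # 0x38
print(hex(encode_e4m3(1.0625)))  # 0x38, tie rounds to the even mantissa
print(hex(encode_e4m3(1e6)))     # 0x7e, saturated to 448
```

Python floats are doubles, so every FP32 input is represented exactly and the rounding matches a single-step FP32 → FP8 conversion. A GPU kernel can be compared against this byte-for-byte over a sweep of inputs.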

3. CUTLASS SM120 FP8 Kernels

  • Update matmul_cutlass_sm120.cuh to use FP8 element types
  • FP8 GEMM: cutlass::float_e4m3_t, cutlass::float_e5m2_t
  • Mixed precision: FP8 input → FP32 accumulator → FP16/BF16 output

4. Python API

  • pygpukit.float8_e4m3, pygpukit.float8_e5m2 dtype constants
  • GPUArray.astype() support for FP8
  • matmul() dispatch for FP8 inputs
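The dispatch rule implied by this issue can be sketched as a small lookup. This is only an illustration of the intended behaviour: the function name `select_matmul_backend`, the dtype strings, and the backend labels are all hypothetical and are not part of the existing pygpukit API.

```python
FP8_DTYPES = {"float8_e4m3", "float8_e5m2"}
WIDE_DTYPES = {"float32", "float16", "bfloat16"}


def select_matmul_backend(sm: int, dtype: str) -> str:
    """Pick a kernel family from the compute capability and input dtype.

    Only the two architectures discussed in this issue are modelled.
    """
    if dtype in FP8_DTYPES:
        if sm == 120:
            # Proposed path: SM120 CollectiveBuilder accepts F8F6F4 MMA only.
            return "cutlass4_sm120_f8f6f4"
        raise NotImplementedError(f"FP8 matmul not sketched for SM{sm}")
    if dtype not in WIDE_DTYPES:
        raise ValueError(f"unknown dtype: {dtype}")
    if sm == 100:
        return "cutlass4_sm100"          # current status: enabled
    if sm == 120:
        return "cutlass2_sm89_fallback"  # current status: SM89 kernels
    raise NotImplementedError(f"SM{sm} is outside the scope of this sketch")


print(select_matmul_backend(120, "float16"))      # cutlass2_sm89_fallback
print(select_matmul_backend(120, "float8_e4m3"))  # cutlass4_sm120_f8f6f4
```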

5. LLM Integration (Optional)

  • FP8 weight loading from SafeTensors
  • FP8 inference path for supported models

Technical Notes

FP8 Formats

| Format | Exponent | Mantissa | Range | Use Case |
|--------|----------|----------|-------|----------|
| E4M3 | 4 bits | 3 bits | ±448 | Weights, activations |
| E5M2 | 5 bits | 2 bits | ±57344 | Gradients |
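The largest finite value of each format follows directly from its bit layout. The decoder below is a minimal sketch assuming the OCP FP8 conventions: E4M3 has bias 7, no infinities, and a single NaN mantissa pattern, while E5M2 has bias 15 and IEEE-style infinities and NaNs. The helper names are hypothetical.

```python
import math


def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int,
               ieee_specials: bool) -> float:
    """Decode one FP8 byte with the given field widths into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    man_max = (1 << man_bits) - 1
    if ieee_specials and exp == exp_max:
        return sign * math.inf if man == 0 else math.nan
    if not ieee_specials and exp == exp_max and man == man_max:
        return math.nan                  # E4M3: only S.1111.111 is NaN
    if exp == 0:                         # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - bias - man_bits)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte: int) -> float:
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_specials=False)


def decode_e5m2(byte: int) -> float:
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def max_finite(decode) -> float:
    """Largest finite value over all 256 bit patterns."""
    return max(v for v in map(decode, range(256)) if math.isfinite(v))


print(max_finite(decode_e4m3))  # 448.0
print(max_finite(decode_e5m2))  # 57344.0
```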

CUTLASS FP8 Types

#include <cutlass/float8.h>
using ElementA = cutlass::float_e4m3_t;  // E4M3
using ElementB = cutlass::float_e5m2_t;  // E5M2

SM120 Tile Constraints

  • 101KB shared memory (vs 232KB on SM100)
  • No TMA multicast (cluster shape 1x1x1)
  • CLC (Cluster Launch Control) scheduler
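The smaller shared memory budget mainly limits how many mainloop stages a given tile shape can hold. The estimate below is a rough sketch: it counts only the A and B tiles per stage, ignores epilogue and barrier storage unless passed via `reserved_bytes`, and reads the 101KB and 232KB figures above as KiB, which is an assumption. The function names are hypothetical.

```python
def stage_bytes(tile_m: int, tile_n: int, tile_k: int,
                bytes_per_elem: int = 1) -> int:
    """Shared memory for one pipeline stage: an MxK tile of A plus an NxK tile of B."""
    return (tile_m * tile_k + tile_n * tile_k) * bytes_per_elem


def max_stages(smem_bytes: int, tile_m: int, tile_n: int, tile_k: int,
               bytes_per_elem: int = 1, reserved_bytes: int = 0) -> int:
    """Number of whole stages that fit in the available shared memory."""
    per_stage = stage_bytes(tile_m, tile_n, tile_k, bytes_per_elem)
    return (smem_bytes - reserved_bytes) // per_stage


SM120_SMEM = 101 * 1024  # assumption: issue's "101KB" read as KiB
SM100_SMEM = 232 * 1024  # assumption: issue's "232KB" read as KiB

# A 128x128x128 FP8 tile (1 byte per element) needs 32 KiB per stage.
print(stage_bytes(128, 128, 128))              # 32768
print(max_stages(SM120_SMEM, 128, 128, 128))   # 3
print(max_stages(SM100_SMEM, 128, 128, 128))   # 7
```

Under these assumptions, the same tile shape gets fewer than half as many stages on SM120 as on SM100, so tile shapes tuned for SM100 may need to shrink.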

Priority

Low - RTX 5090 is not yet widely available, and SM120 already falls back to SM89 kernels, which work correctly.


Related: #77
