feat: SM120 (RTX 5090) support via FP8/FP6/FP4 precision

## Summary

On SM120 (Blackwell GeForce: RTX 5090, RTX 5080), the CUTLASS 4.x kernels require narrow-precision data types.

CUTLASS 4.3.3's SM120 CollectiveBuilder only supports **F8F6F4 MMA** (FP8/FP6/FP4), NOT FP32/FP16/BF16.

Error when attempting FP32/FP16/BF16 on SM120:
```
SM120 TmaWarpSpecialized builder currently only supports F8F6F4 MMA
No MMA matches SM120_16x8x32_TN for given data types
```

## Current Status (v0.2.10)

- **SM100 (B200)**: ✅ Enabled - FP32/FP16/BF16 via CUTLASS 4.x
- **SM120 (RTX 5090)**: ❌ Disabled - Falls back to SM89 (CUTLASS 2.x)

Header files prepared but disabled:
- `native/ops/matmul_cutlass_sm120.cuh` - Ready for FP8 integration

## Requirements for SM120 Support

### 1. FP8 Data Type Infrastructure
- [ ] Add `Dtype::FP8_E4M3` and `Dtype::FP8_E5M2` to dtype enum
- [ ] Memory allocation/deallocation for FP8 tensors
- [ ] GPUArray support for FP8 dtype

### 2. FP8 Conversion Routines
- [ ] FP32 ↔ FP8 conversion kernels
- [ ] FP16 ↔ FP8 conversion kernels
- [ ] BF16 ↔ FP8 conversion kernels
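
A host-side reference can help when validating the conversion kernels listed above. The following is a minimal sketch in pure Python, not the planned CUDA implementation. It assumes the E4M3 variant with bias 7 and a single NaN pattern per sign (`0x7F`/`0xFF`), round-to-nearest-even, and saturation of out-of-range inputs to the largest finite value. The function names `decode_e4m3` and `encode_e4m3` are hypothetical.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                      # the only NaN pattern per sign
    if exp == 0:
        return sign * man * 2.0 ** -9        # subnormal: (man / 8) * 2^-6
    return sign * (1 + man / 8) * 2.0 ** (exp - 7)

# Positive finite codes 0x00..0x7E decode to strictly increasing values.
_POSITIVE = [(decode_e4m3(code), code) for code in range(0x7F)]
_MAX_FINITE = _POSITIVE[-1][0]

def encode_e4m3(x: float) -> int:
    """Encode a Python float to E4M3 (assumed: nearest-even, saturating)."""
    if math.isnan(x):
        return 0x7F
    sign = 0x80 if math.copysign(1.0, x) < 0 else 0x00
    mag = min(abs(x), _MAX_FINITE)           # assumption: saturate, no inf
    # Nearest representable value; on an exact tie prefer the even code,
    # which is the code whose mantissa LSB is 0.
    _, code = min(_POSITIVE, key=lambda vc: (abs(vc[0] - mag), vc[1] & 1))
    return sign | code

if __name__ == "__main__":
    for value in (1.0, -1.0, 1.0625, 1000.0):
        code = encode_e4m3(value)
        print(f"{value:>8} -> 0x{code:02X} -> {decode_e4m3(code)}")
```

Because the table has only 127 positive entries, an exhaustive round-trip check over every code is cheap and makes a reasonable first test for a GPU kernel.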

### 3. CUTLASS SM120 FP8 Kernels
- [ ] Update `matmul_cutlass_sm120.cuh` to use FP8 element types
- [ ] FP8 GEMM: `cutlass::float_e4m3_t`, `cutlass::float_e5m2_t`
- [ ] Mixed precision: FP8 input → FP32 accumulator → FP16/BF16 output

### 4. Python API
- [ ] `pygpukit.float8_e4m3`, `pygpukit.float8_e5m2` dtype constants
- [ ] `GPUArray.astype()` support for FP8
- [ ] `matmul()` dispatch for FP8 inputs
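
The dispatch rule implied by the current status and the checklist above could be sketched as follows. This is an illustration only: the dtype strings mirror the proposed constant names, while `select_matmul_backend` and the backend labels it returns are hypothetical and do not exist in the codebase. Only the SM100 and SM120 cases described in this issue are covered.

```python
FP8_DTYPES = {"float8_e4m3", "float8_e5m2"}
WIDE_DTYPES = {"float32", "float16", "bfloat16"}

def select_matmul_backend(sm: int, dtype: str) -> str:
    """Pick a kernel family from compute capability and input dtype."""
    if dtype not in FP8_DTYPES | WIDE_DTYPES:
        raise ValueError(f"unsupported dtype: {dtype}")
    if sm == 120:
        # The SM120 CUTLASS 4.x builder only accepts F8F6F4 MMA, so wide
        # dtypes keep using the SM89 (CUTLASS 2.x) fallback.
        if dtype in FP8_DTYPES:
            return "cutlass4_sm120_fp8"
        return "cutlass2_sm89_fallback"
    if sm == 100 and dtype in WIDE_DTYPES:
        return "cutlass4_sm100"
    # Anything else is outside what this issue describes.
    raise NotImplementedError(f"no rule for sm={sm}, dtype={dtype}")

if __name__ == "__main__":
    print(select_matmul_backend(120, "float8_e4m3"))
    print(select_matmul_backend(120, "float16"))
    print(select_matmul_backend(100, "bfloat16"))
```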

### 5. LLM Integration (Optional)
- [ ] FP8 weight loading from SafeTensors
- [ ] FP8 inference path for supported models

## Technical Notes

### FP8 Formats
| Format | Exponent | Mantissa | Range | Use Case |
|--------|----------|----------|-------|----------|
| E4M3 | 4 bits | 3 bits | ±448 | Weights, activations |
| E5M2 | 5 bits | 2 bits | ±57344 | Gradients |
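
The largest finite magnitude of each format follows from its bit layout. The sketch below works through that arithmetic. It assumes E5M2 follows the IEEE-style convention (the top exponent code is reserved for inf/NaN) and E4M3 reserves only the all-ones exponent-plus-mantissa pattern for NaN. The helper `max_finite` is a hypothetical name used for illustration.

```python
def max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a small float format with the usual bias."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN.
        max_exp_field = 2 ** exp_bits - 2
        max_mantissa = 2 ** man_bits - 1
    else:
        # Only exponent=all-ones with mantissa=all-ones is NaN, so the top
        # exponent code is still usable with the next-lower mantissa.
        max_exp_field = 2 ** exp_bits - 1
        max_mantissa = 2 ** man_bits - 2
    return 2.0 ** (max_exp_field - bias) * (1 + max_mantissa / 2 ** man_bits)

E4M3_MAX = max_finite(4, 3, ieee_like=False)   # 2^8  * 1.75
E5M2_MAX = max_finite(5, 2, ieee_like=True)    # 2^15 * 1.75

if __name__ == "__main__":
    print(E4M3_MAX, E5M2_MAX)
```

Under these assumptions E4M3 reaches 448 and E5M2 reaches 57344. An E4M3 layout that reserved the whole top exponent code, IEEE-style, would stop at 240 instead, which is worth keeping in mind when comparing figures from different sources.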

### CUTLASS FP8 Types
```cpp
#include <cutlass/float8.h>
using ElementA = cutlass::float_e4m3_t;  // E4M3
using ElementB = cutlass::float_e5m2_t;  // E5M2
```

### SM120 Tile Constraints
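- About 101KB of shared memory available to a kernel (vs about 232KB on SM100), so tile sizes and pipeline depth must be chosen accordingly
- No TMA multicast, so the cluster shape must be 1x1x1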
- CLC (Cluster Launch Control) scheduler
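
To get a feel for what the shared memory limit means for tile selection, the sketch below estimates how many mainloop pipeline stages fit for a given tile shape. It is a rough budget, not CUTLASS's actual stage calculation: it counts only the A and B tiles and ignores epilogue storage, barriers, and scale factors unless they are passed in via `reserved_bytes`. The capacities 101,376 and 232,448 bytes are assumptions chosen to match the rounded figures above, and `stages_that_fit` is a hypothetical helper.

```python
SM120_SMEM_BYTES = 101_376   # assumed; roughly the "101KB" quoted above
SM100_SMEM_BYTES = 232_448   # assumed; roughly the "232KB" quoted above

def stages_that_fit(tile_m: int, tile_n: int, tile_k: int,
                    bits_a: int, bits_b: int,
                    smem_bytes: int, reserved_bytes: int = 0) -> int:
    """Number of A+B tile stages that fit in the given shared memory."""
    bytes_a = tile_m * tile_k * bits_a // 8
    bytes_b = tile_n * tile_k * bits_b // 8
    return (smem_bytes - reserved_bytes) // (bytes_a + bytes_b)

if __name__ == "__main__":
    # 128x128x128 tile with FP8 operands: 32 KiB per stage.
    print("SM120 FP8:", stages_that_fit(128, 128, 128, 8, 8, SM120_SMEM_BYTES))
    print("SM100 FP8:", stages_that_fit(128, 128, 128, 8, 8, SM100_SMEM_BYTES))
    # Same tile with FP4 operands: 16 KiB per stage.
    print("SM120 FP4:", stages_that_fit(128, 128, 128, 4, 4, SM120_SMEM_BYTES))
```

With these numbers a 128x128x128 FP8 tile allows 3 stages on SM120 against 7 on SM100, which illustrates why tile shapes tuned for SM100 may not carry over directly.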

## References

- [NVIDIA FP8 Primer](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html)
- [CUTLASS FP8 GEMM Examples](https://github.com/NVIDIA/cutlass/tree/main/examples)
- Issue #77 (SM100/SM120 investigation)

## Priority

Low - The RTX 5090 is not yet widely available, and SM120 falls back to the SM89 kernels, which work correctly.

---
Related: #77
