[mxfp8 training] update triton_to_mxfp8_dim0 nan handling#4201

Open
danielvegamyhre wants to merge 1 commit intomainfrom
danielvegamyhre/stack/162

Conversation

@danielvegamyhre danielvegamyhre commented Mar 30, 2026

Summary

  • Received reports of NaN loss in MXFP8 training that went away when opting out of the Triton kernel for dim0 quantization (triton_to_mxfp8_dim0).
  • Added unit tests covering special values (NaN, inf, -inf, subnormals, extremely large and small normal values, etc.) to find discrepancies between the torch implementation and the Triton kernel.
  • In short, I came to suspect that the torch reference implementation was also mishandling certain cases. To get Triton to match it, I had to do separate sweeps for NaN, then inf, then -inf (in that order), which killed performance. The CUDA code needs none of this, which was suspicious. So I updated both torch and Triton to follow the same TE RCEIL logic, which doesn't need any of this special handling.
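
For readers unfamiliar with round-up (RCEIL-style) scale selection, here is a minimal scalar sketch of the idea: take the block's absolute maximum, divide by the largest float8_e4m3fn value, and round the result up to the next power of two, stored as a biased E8M0 exponent. This is not the code from this PR or from TE. The function name `rceil_e8m0_scale` is hypothetical, and the outputs chosen for NaN and inf blocks (0xFF and 0xFE) are assumptions made for illustration.

```python
import math

F8E4M3_MAX = 448.0        # largest finite float8_e4m3fn magnitude
E8M0_BIAS = 127           # E8M0 stores a biased power-of-two exponent
E8M0_NAN = 0xFF           # E8M0 reserves 0xFF for NaN
E8M0_MAX_FINITE = 0xFE


def rceil_e8m0_scale(block):
    """Biased E8M0 exponent for one block, scale rounded up to a power of two."""
    if any(math.isnan(v) for v in block):
        return E8M0_NAN                # assumption: any NaN makes the scale NaN
    amax = max(abs(v) for v in block)
    if math.isinf(amax):
        return E8M0_MAX_FINITE         # assumption: +/-inf saturates the scale
    if amax == 0.0:
        return 0
    # frexp returns (m, e) with value == m * 2**e and m in [0.5, 1)
    mantissa, exponent = math.frexp(amax / F8E4M3_MAX)
    # ceil(log2(value)): exact powers of two keep their exponent, others round up
    unbiased = exponent - 1 if mantissa == 0.5 else exponent
    return min(max(unbiased + E8M0_BIAS, 0), E8M0_MAX_FINITE)


# Hypothetical special-value blocks, in the spirit of the unit tests above.
special_blocks = {
    "exact power of two": [448.0, 1.0],
    "just above": [449.0, 1.0],
    "nan": [1.0, float("nan")],
    "+inf": [1.0, float("inf")],
    "-inf": [1.0, float("-inf")],
    "all zero": [0.0, 0.0],
    "subnormal": [1e-45, 0.0],
}
for name, block in special_blocks.items():
    print(f"{name:>20}: 0x{rceil_e8m0_scale(block):02X}")
```

Rounding up means the scaled values never exceed the fp8 range, so the cast itself needs no extra clamping pass for finite inputs.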

Changes

Benchmarks

(torch) dev@gpu-dev-8951ebdf:~/ao$ CUDA_VISIBLE_DEVICES=1 PYTHONPATH=/home/$USER/ao:$PYTHONPATH python benchmarks/mx_formats/cast_bench.py --mode dim0_mxfp8_triton_rceil --M 32768 --K 7168
M 32768 K 7168 BLOCK_SIZE 32
GPU: NVIDIA B200
torch version: 2.11.0+cu130
triton version: 3.6.0
mode: dim0_mxfp8_triton_rceil
time_us 123.9359974861145
mem_bw_gbps 5744.76438195262
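
As a sanity check on the numbers above, the reported bandwidth can be reproduced by hand. The sketch below assumes a bfloat16 input (2 bytes per element), an fp8 output (1 byte per element), and one E8M0 scale byte per 32-element block. These byte counts are assumptions about what the benchmark measures, not taken from the script itself, but they do reproduce the reported figure.

```python
# Values copied from the benchmark output above.
M, K, BLOCK_SIZE = 32768, 7168, 32
time_us = 123.9359974861145

# Assumed traffic: read bf16 input, write fp8 data plus one E8M0 byte per block.
bytes_read = M * K * 2
bytes_written = M * K * 1 + (M * K // BLOCK_SIZE) * 1

mem_bw_gbps = (bytes_read + bytes_written) / (time_us * 1e-6) / 1e9
print(f"mem_bw_gbps {mem_bw_gbps:.2f}")  # about 5744.76, in line with the reported value
```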

danielvegamyhre added a commit that referenced this pull request Mar 30, 2026
…rch reference

stack-info: PR: #4201, branch: danielvegamyhre/stack/162
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from f0b651a to 74a1db5 Compare March 30, 2026 20:14

pytorch-bot bot commented Mar 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4201

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 8 Unrelated Failures

As of commit 82e7fc9 with merge base 3ad1067:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Mar 30, 2026
@danielvegamyhre danielvegamyhre added mx module: training quantize_ api training flow labels Mar 30, 2026
@danielvegamyhre danielvegamyhre marked this pull request as draft March 30, 2026 20:24
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 74a1db5 to e813474 Compare March 30, 2026 20:24
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 30, 2026 20:24
@danielvegamyhre danielvegamyhre marked this pull request as draft March 30, 2026 20:31
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from e813474 to 346fcca Compare March 30, 2026 20:31
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 30, 2026 20:31
@danielvegamyhre danielvegamyhre marked this pull request as draft March 30, 2026 22:09
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 346fcca to 8df8fea Compare March 30, 2026 22:09
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 30, 2026 22:09
@danielvegamyhre danielvegamyhre marked this pull request as draft March 30, 2026 23:09
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 8df8fea to b7c40f1 Compare March 30, 2026 23:09
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 30, 2026 23:09
@danielvegamyhre danielvegamyhre marked this pull request as draft March 30, 2026 23:41
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from b7c40f1 to 46eec1e Compare March 30, 2026 23:41
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 30, 2026 23:41

danielvegamyhre commented Mar 30, 2026

Edit: never mind, I managed to recover the performance.

@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 01:16
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 46eec1e to 43a59a1 Compare March 31, 2026 01:16
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:14
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from cf861ec to c38b55c Compare March 31, 2026 02:15
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 31, 2026 02:15
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:17
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from c38b55c to bf5918a Compare March 31, 2026 02:17
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 31, 2026 02:17
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:19
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from bf5918a to c231e2b Compare March 31, 2026 02:19
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 31, 2026 02:19
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:26
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from c231e2b to 39aae18 Compare March 31, 2026 02:26
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 31, 2026 02:27
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:30
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 39aae18 to 8858c92 Compare March 31, 2026 02:30
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 31, 2026 02:31
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:36
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 8858c92 to c4bf474 Compare March 31, 2026 02:36
@danielvegamyhre danielvegamyhre marked this pull request as ready for review March 31, 2026 02:37
@danielvegamyhre danielvegamyhre marked this pull request as draft March 31, 2026 02:51
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from c4bf474 to 5c6b807 Compare March 31, 2026 02:51
@danielvegamyhre danielvegamyhre force-pushed the danielvegamyhre/stack/162 branch from 5c6b807 to 7dfe87c Compare March 31, 2026 02:53