Commit 91dc506

Ubuntu and claude committed
Fuse butterfly stages in AVX2 SPQLIOS FFT/IFFT assembly
Port the stage-fusion optimization from the AVX-512 assembly to the AVX2/FMA path, dramatically reducing memory traffic by processing multiple butterfly stages in-register before storing.

FFT (forward):
- Fused size-2 + size-4 + size-8 (halfnn=4) pass: load once, apply all 3 butterfly stages in-register, store once. Uses hardcoded W8 twiddle constants. Eliminates 2 full load-store round trips.
- Fused last butterfly + final twist: the last general-loop iteration applies the twist multiply in-register before storing, eliminating 1 more round trip.
- General loop starts at halfnn=8 (halfnn=4 handled by fused pass).

IFFT (inverse):
- Fused first twist + first butterfly (largest halfnn): applies twist and DIF butterfly in one pass, eliminating 1 round trip.
- Fused size-8 + size-4 + size-2 (last 3 stages): same in-register fusion as FFT but in DIF order. Eliminates 2 round trips.

For N=1024, the original code did 9 passes; the fused code does ~5, reducing memory traffic by ~44%.

Benchmarks on AMD EPYC 7542 (Zen 2, 2.9 GHz, AVX2+FMA):
  FFT forward:   1873 → 1407 ns (−24.9%)
  IFFT inverse:  1366 → 1116 ns (−18.3%)
  IFFT+Mul+FFT:  3386 → 2715 ns (−19.8%)
  NAND gate:     12.53 → 11.15 ms (−11.0%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
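The stage-fusion idea described in the commit message can be illustrated with a small scalar model. This is only a sketch, not the assembly in this commit: it assumes a textbook radix-2 DIT butterfly on interleaved complex values, computes twiddles with `cmath` instead of hardcoded W8 constants, and leaves out the SPQLIOS twist and data layout. The names `stage`, `unfused`, and `fused` are hypothetical. The point it shows is that the size-2, size-4, and size-8 stages only touch data inside 8-aligned blocks, so they can all run on a block held locally between one load and one store.

```python
import cmath

def stage(a, half):
    """One in-place radix-2 DIT butterfly pass over the whole list `a`."""
    size = 2 * half
    for base in range(0, len(a), size):
        for j in range(half):
            w = cmath.exp(-2j * cmath.pi * j / size)  # twiddle for this butterfly
            u, v = a[base + j], a[base + j + half] * w
            a[base + j], a[base + j + half] = u + v, u - v

def unfused(a):
    """Three separate passes: every element is loaded and stored three times."""
    for half in (1, 2, 4):
        stage(a, half)
    return 3  # number of full passes over the array

def fused(a):
    """One pass: each block of 8 is loaded once, run through all three
    stages locally (a stand-in for registers), then stored once."""
    for base in range(0, len(a), 8):
        regs = a[base:base + 8]          # single load of the block
        for half in (1, 2, 4):
            stage(regs, half)            # size-2, size-4, size-8 butterflies
        a[base:base + 8] = regs          # single store of the block
    return 1  # number of full passes over the array
```

Both functions expect a list of complex values whose length is a multiple of 8 and produce the same values; only the number of trips through memory differs, which is the saving the commit message counts as "2 full load-store round trips".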
1 parent 0aa6ba0 commit 91dc506

2 files changed: +490 −447 lines changed

