Commit 91dc506
Fuse butterfly stages in AVX2 SPQLIOS FFT/IFFT assembly
Port the stage-fusion optimization from the AVX-512 assembly to the
AVX2/FMA path. Processing several butterfly stages in-register between
one load and one store reduces memory traffic.
FFT (forward):
- Fused size-2 + size-4 + size-8 (halfnn=4) pass: load once, apply all
3 butterfly stages in-register, store once. Uses hardcoded W8 twiddle
constants. Eliminates 2 full load-store round trips.
- Fused last butterfly + final twist: the last general-loop iteration
applies the twist multiply in-register before storing, eliminating
1 more round trip.
- General loop starts at halfnn=8 (halfnn=4 handled by fused pass).
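
As a rough illustration of the fused size-2 + size-4 + size-8 pass, the sketch below compares three separate passes with one fused pass over blocks of 8 complex values. It is a scalar model, not the assembly: the function names, the interleaved complex layout, and the twiddle sign convention exp(-2*pi*i*k/size) are assumptions made for the example, and SPQLIOS's actual data layout may differ.

```python
import cmath

def butterfly_stage(buf, size):
    """Apply one in-place radix-2 DIT butterfly stage to every block of `size` elements."""
    half = size // 2
    for start in range(0, len(buf), size):
        for k in range(half):
            # Assumed twiddle convention; for size 8 these are the "W8" constants.
            w = cmath.exp(-2j * cmath.pi * k / size)
            a = buf[start + k]
            b = buf[start + k + half] * w
            buf[start + k] = a + b
            buf[start + k + half] = a - b

def unfused(data):
    """Three full passes over the array: each stage reads and writes every element."""
    buf = list(data)
    passes = 0
    for size in (2, 4, 8):
        butterfly_stage(buf, size)
        passes += 1
    return buf, passes

def fused(data):
    """One pass: load a block of 8, run all three stages on it, store it once."""
    out = []
    for start in range(0, len(data), 8):
        regs = list(data[start:start + 8])   # stands in for the register file
        for size in (2, 4, 8):
            butterfly_stage(regs, size)
        out.extend(regs)                     # single store per block
    return out, 1

sample = [complex(i, 2 * i - 3) for i in range(32)]
unfused_result, unfused_passes = unfused(sample)
fused_result, fused_passes = fused(sample)
```

Both variants perform the same arithmetic on each block, so the results match; only the number of trips through memory changes (3 versus 1), which is the 2 eliminated round trips described above.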
IFFT (inverse):
- Fused first twist + first butterfly (largest halfnn): applies twist
and DIF butterfly in one pass, eliminating 1 round trip.
- Fused size-8 + size-4 + size-2 (last 3 stages): same in-register
fusion as FFT but in DIF order. Eliminates 2 round trips.
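
The DIF ordering of the inverse can be sketched the same way. The example below is a scalar model with hypothetical names (`dit_stage`, `dif_stage`, `fused_pass`) and an assumed twiddle convention; it omits the twist multiply and only shows that a fused size-8 + size-4 + size-2 DIF pass undoes a fused size-2 + size-4 + size-8 DIT pass up to a factor of 8.

```python
import cmath

def dit_stage(buf, size):
    """Forward radix-2 DIT butterfly: twiddle multiply first, then add/subtract."""
    half = size // 2
    for start in range(0, len(buf), size):
        for k in range(half):
            w = cmath.exp(-2j * cmath.pi * k / size)
            a, b = buf[start + k], buf[start + k + half] * w
            buf[start + k], buf[start + k + half] = a + b, a - b

def dif_stage(buf, size):
    """Inverse radix-2 DIF butterfly: add/subtract first, then conjugate twiddle."""
    half = size // 2
    for start in range(0, len(buf), size):
        for k in range(half):
            w_conj = cmath.exp(2j * cmath.pi * k / size)
            a, b = buf[start + k], buf[start + k + half]
            buf[start + k], buf[start + k + half] = a + b, (a - b) * w_conj

def fused_pass(data, stage, sizes):
    """Load each block of 8 once, apply all listed stages, store once."""
    out = []
    for start in range(0, len(data), 8):
        regs = list(data[start:start + 8])
        for size in sizes:
            stage(regs, size)
        out.extend(regs)
    return out

def forward_head(data):
    return fused_pass(data, dit_stage, (2, 4, 8))   # FFT: smallest stages first

def inverse_tail(data):
    return fused_pass(data, dif_stage, (8, 4, 2))   # IFFT: same stages, reversed

original = [complex(i, 1 - i) for i in range(16)]
# Each DIF stage undoes the matching DIT stage up to a factor of 2, hence 2**3 = 8.
recovered = [v / 8 for v in inverse_tail(forward_head(original))]
```

Running the stages in reverse order with conjugated twiddles is what lets the inverse reuse the same one-load, one-store structure as the forward pass.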
For N=1024, the original code did 9 passes; the fused code does ~5,
reducing memory traffic by ~44%.
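
The ~44% figure follows from the pass counts if memory traffic is assumed to scale with the number of full passes over the data. That proportionality is an assumption of this sketch, and `traffic_reduction` is a hypothetical helper:

```python
def traffic_reduction(passes_before, passes_after):
    """Fraction of memory traffic removed, assuming traffic is proportional to pass count."""
    return (passes_before - passes_after) / passes_before

# Pass counts quoted above for N=1024.
reduction = traffic_reduction(9, 5)
reduction_percent = round(reduction * 100)
```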
Benchmarks on AMD EPYC 7542 (Zen 2, 2.9 GHz, AVX2+FMA):
FFT forward: 1873 → 1407 ns (−24.9%)
IFFT inverse: 1366 → 1116 ns (−18.3%)
IFFT+Mul+FFT: 3386 → 2715 ns (−19.8%)
NAND gate: 12.53 → 11.15 ms (−11.0%)
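
The percentages above can be recomputed from the before/after timings. The snippet is only a consistency check on the reported numbers; the dictionary and helper names are illustrative:

```python
# (before, after) timings as reported above; ns for the FFT rows, ms for the NAND gate.
benchmarks = {
    "FFT forward": (1873, 1407),
    "IFFT inverse": (1366, 1116),
    "IFFT+Mul+FFT": (3386, 2715),
    "NAND gate": (12.53, 11.15),
}

def percent_change(before, after):
    """Relative change in percent; negative values mean the new code is faster."""
    return (after - before) / before * 100.0

changes = {
    name: round(percent_change(before, after), 1)
    for name, (before, after) in benchmarks.items()
}
```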
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 0aa6ba0
2 files changed (+490, -447 lines) in thirdparties/spqlios