Commit 34bcd25
Add Option C/D stage fusion to AVX2 FFT/IFFT assembly
Port the multi-stage fusion passes from the AVX-512 assembly to AVX2,
further reducing memory traffic by processing pairs of butterfly stages
in a single pass over the data.
New fused passes (AVX2):

  FFT (DIT order):
  - Option C: fused halfnn=8+16 using 4-column groups at stride 8.
    Each inner iteration loads 4 re + 4 im YMMs, applies both butterfly
    stages with different twiddle factors, stores once.
  - Option D: fused halfnn=32+64 using 4-column groups at stride 32.
    Same pattern at larger stride.

  IFFT (DIF order):
  - Option D: fused halfnn=64+32 (DIF: large first).
  - Option C: fused halfnn=16+8 (DIF: large first).
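
The fusion idea can be illustrated outside of assembly. The Python sketch below is not the spqlios code: it assumes Python complex numbers instead of split re/im YMM registers, standard radix-2 DIT twiddles computed on the fly instead of precomputed tables, and hypothetical helper names (`twiddle`, `stage`, `fused_stages`). It only shows why a group of 4 columns at stride h is enough to apply stages h and 2h with one load and one store per element.

```python
import cmath

def twiddle(j, h):
    # Standard radix-2 DIT twiddle for half-size h (assumption: the real
    # assembly reads its own precomputed twiddle tables instead).
    return cmath.exp(-2j * cmath.pi * j / (2 * h))

def stage(a, h):
    """One butterfly stage of half-size h: a full pass over the array."""
    for b in range(0, len(a), 2 * h):
        for j in range(h):
            u, v = a[b + j], a[b + j + h] * twiddle(j, h)
            a[b + j], a[b + j + h] = u + v, u - v

def fused_stages(a, h):
    """Stages h and 2h in a single pass; len(a) must be a multiple of 4*h."""
    for b in range(0, len(a), 4 * h):
        for j in range(h):
            # Four columns at stride h, loaded once.
            i0, i1, i2, i3 = b + j, b + j + h, b + j + 2 * h, b + j + 3 * h
            x0, x1, x2, x3 = a[i0], a[i1], a[i2], a[i3]
            # Stage h: pairs (x0, x1) and (x2, x3) share one twiddle.
            w = twiddle(j, h)
            t1, t3 = x1 * w, x3 * w
            x0, x1, x2, x3 = x0 + t1, x0 - t1, x2 + t3, x2 - t3
            # Stage 2h: pairs (x0, x2) and (x1, x3) use different twiddles.
            t2, t3 = x2 * twiddle(j, 2 * h), x3 * twiddle(j + h, 2 * h)
            # Stored once.
            a[i0], a[i1], a[i2], a[i3] = x0 + t2, x1 + t3, x0 - t2, x1 - t3

# Compare two separate passes (halfnn=8, then 16) against one fused pass.
data = [complex(k % 7, -(k % 5)) for k in range(64)]
two_pass = list(data)
stage(two_pass, 8)
stage(two_pass, 16)
one_pass = list(data)
fused_stages(one_pass, 8)
```

In this model, a DIF-order variant would apply the same two stages in the opposite order (large half-size first) on the same four columns.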
For N=1024 (ns4=256), the pass count is now:
FFT: fused(2+4+8) → optC(8+16) → optD(32+64) → fused(128+twist) = 4 passes
IFFT: fused(twist+128) → optD(64+32) → optC(16+8) → fused(8+4+2) = 4 passes
Previously 6 passes each, originally 9. Total memory traffic reduction
vs original: ~56%.
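
The ~56% figure is consistent with a simple model in which every pass reads and writes the whole array once, so memory traffic scales with the pass count. That model is an assumption made here for illustration (it ignores cache effects and twiddle loads), and `traffic_reduction` is a hypothetical helper:

```python
def traffic_reduction(passes_before, passes_after):
    """Fractional drop in traffic if each pass touches the whole array once."""
    return 1 - passes_after / passes_before

vs_original = traffic_reduction(9, 4)  # 9 passes originally, 4 after this commit
vs_previous = traffic_reduction(6, 4)  # 6 passes before this commit

print(f"vs original: {vs_original:.1%}, vs previous: {vs_previous:.1%}")
```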
Benchmarks on AMD EPYC 7542 (Zen 2, 2.9 GHz, AVX2+FMA), original → this commit
(change vs original; "was" is the change before this commit):
  FFT forward:   1873 → 1237 ns   (−33.9%, was −24.9%)
  IFFT inverse:  1366 → 998 ns    (−26.9%, was −18.3%)
  IFFT+Mul+FFT:  3386 → 2415 ns   (−28.7%, was −19.8%)
  NAND gate:     12.53 → 10.59 ms (−15.5%, was −11.0%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parent: 91dc506
2 files changed (+577, -317 lines) in thirdparties/spqlios