Commit 34bcd25

Ubuntu and claude committed
Add Option C/D stage fusion to AVX2 FFT/IFFT assembly
Port the multi-stage fusion passes from the AVX-512 assembly to AVX2, further reducing memory traffic by processing pairs of butterfly stages in a single pass over the data.

New fused passes (AVX2):

FFT (DIT order):
- Option C: fused halfnn=8+16 using 4-column groups at stride 8. Each inner iteration loads 4 re + 4 im YMMs, applies both butterfly stages with different twiddle factors, stores once.
- Option D: fused halfnn=32+64 using 4-column groups at stride 32. Same pattern at larger stride.

IFFT (DIF order):
- Option D: fused halfnn=64+32 (DIF: large first).
- Option C: fused halfnn=16+8 (DIF: large first).

For N=1024 (ns4=256), the pass count is now:
  FFT:  fused(2+4+8) → optC(8+16) → optD(32+64) → fused(128+twist) = 4 passes
  IFFT: fused(twist+128) → optD(64+32) → optC(16+8) → fused(8+4+2) = 4 passes

Previously 6 passes each, originally 9. Total memory traffic reduction vs original: ~56%.

Benchmarks on AMD EPYC 7542 (Zen 2, 2.9 GHz, AVX2+FMA):
  FFT forward:  1873 → 1237 ns (−33.9%, was −24.9%)
  IFFT inverse: 1366 → 998 ns (−26.9%, was −18.3%)
  IFFT+Mul+FFT: 3386 → 2415 ns (−28.7%, was −19.8%)
  NAND gate:    12.53 → 10.59 ms (−15.5%, was −11.0%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
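The fusion idea described above can be illustrated with a minimal scalar sketch in C. This is not the commit's AVX2 assembly: the names `butterfly`, `stage`, `fused_stages`, and `fused_matches_separate` are hypothetical, the data layout (split re/im arrays of doubles) is an assumption, and the twiddle tables in the self-check are arbitrary placeholder values rather than real roots of unity. The sketch only shows that two DIT butterfly stages with half-sizes `h` and `2h` can be applied to a group of four elements at stride `h` in one pass, giving the same result as two separate passes.

```c
#include <assert.h>
#include <stddef.h>

/* One radix-2 butterfly on split re/im data:
   (x[i], x[k]) -> (x[i] + w*x[k], x[i] - w*x[k]). */
static void butterfly(double *re, double *im, size_t i, size_t k,
                      double wr, double wi)
{
    double tr = re[k] * wr - im[k] * wi;
    double ti = re[k] * wi + im[k] * wr;
    re[k] = re[i] - tr;  im[k] = im[i] - ti;
    re[i] = re[i] + tr;  im[i] = im[i] + ti;
}

/* Unfused: one full pass over the data for a single stage with half-size h.
   twr/twi hold h twiddle factors for this stage. */
static void stage(double *re, double *im, size_t n, size_t h,
                  const double *twr, const double *twi)
{
    for (size_t base = 0; base < n; base += 2 * h)
        for (size_t j = 0; j < h; j++)
            butterfly(re, im, base + j, base + j + h, twr[j], twi[j]);
}

/* Fused: stages h and 2h in a single pass. Each inner iteration touches
   four elements at stride h; a vectorised version would hold them in
   registers between the two stages and store them once. */
static void fused_stages(double *re, double *im, size_t n, size_t h,
                         const double *t1r, const double *t1i,  /* h entries  */
                         const double *t2r, const double *t2i)  /* 2h entries */
{
    assert(n % (4 * h) == 0);
    for (size_t base = 0; base < n; base += 4 * h)
        for (size_t j = 0; j < h; j++) {
            size_t i0 = base + j, i1 = i0 + h, i2 = i0 + 2 * h, i3 = i0 + 3 * h;
            /* First stage (half-size h): pairs (i0,i1) and (i2,i3). */
            butterfly(re, im, i0, i1, t1r[j], t1i[j]);
            butterfly(re, im, i2, i3, t1r[j], t1i[j]);
            /* Second stage (half-size 2h): pairs (i0,i2) and (i1,i3). */
            butterfly(re, im, i0, i2, t2r[j], t2i[j]);
            butterfly(re, im, i1, i3, t2r[j + h], t2i[j + h]);
        }
}

/* Self-check: returns 1 if the fused pass agrees with two separate passes.
   The twiddle values are placeholders (assumption); the equivalence does
   not depend on what the twiddles are. */
int fused_matches_separate(void)
{
    enum { N = 64, H = 8 };
    double ar[N], ai[N], br[N], bi[N];
    double t1r[H], t1i[H], t2r[2 * H], t2i[2 * H];

    for (size_t i = 0; i < N; i++) {
        ar[i] = br[i] = (double)(i % 7) - 3.0;
        ai[i] = bi[i] = (double)(i % 5) * 0.5;
    }
    for (size_t j = 0; j < H; j++)     { t1r[j] = 1.0 - 0.1 * j;  t1i[j] = -0.05 * j; }
    for (size_t j = 0; j < 2 * H; j++) { t2r[j] = 1.0 - 0.03 * j; t2i[j] = -0.02 * j; }

    stage(ar, ai, N, H, t1r, t1i);          /* pass 1 over the data */
    stage(ar, ai, N, 2 * H, t2r, t2i);      /* pass 2 over the data */
    fused_stages(br, bi, N, H, t1r, t1i, t2r, t2i);  /* single pass */

    for (size_t i = 0; i < N; i++) {
        double dr = ar[i] - br[i], di = ai[i] - bi[i];
        if (dr > 1e-12 || dr < -1e-12 || di > 1e-12 || di < -1e-12)
            return 0;
    }
    return 1;
}
```

Calling `fused_matches_separate()` from a test harness should return 1. With `h = 8` the group layout mirrors the "4-column groups at stride 8" wording of Option C, though the actual register allocation and twiddle handling in the assembly may differ. On the pass counts quoted above: if memory traffic scales with the number of full passes (an assumption), going from 9 passes to 4 removes 5/9 ≈ 56% of it, which is consistent with the stated figure.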
1 parent 91dc506 commit 34bcd25

2 files changed: +577 -317 lines changed
