Baseline: 152.7 ms/tok (6.55 tok/s) at 12 threads on 5900X (WSL2, DDR4 dual-channel) After: 119.0 ms/tok (8.40 tok/s) — 22% latency reduction, 28% throughput improvement
- Serial route matmul: Changed MoE routing from parallel to serial dispatch. Rayon thread wake-up latency (~30us x 11 threads x 40 layers) dominated for 256 tiny dot products (8KB per expert). Saved 32ms/tok (moe_route: 35.1ms to 3.1ms).
- AVX2 vectorized sigmoid: Replaced scalar sigmoid in SiLU activation with 8-wide AVX2 fast exp approximation. Marginal (~1ms), but cleaner code with no downside.
- Sub-tile splitting: cache line contention from two threads reading same tile data (+7%)
- Flattened MoE dispatch: cross-expert task mixing destroyed L2 cache locality (+22%)
- No nested MoE parallelism: only 8/12 threads active, lost work-stealing benefit (+10%)
- MADV_COLLAPSE (THP alternative): WSL2 returns EINVAL, Hyper-V doesn't support it
- Page table warming: pre-faulting PTEs polluted L1/L2, net slower (+5%)
- PREFETCHNTA (non-temporal prefetch on weights): ~2% improvement but within WSL2 noise
- 24 threads (SMT) is 2.3x slower than 12 threads (physical cores only) on 5900X
- THP does not work in WSL2 (AnonHugePages stays 0 despite madvise calls)
- TLB misses from 4KB pages on a 67GB model account for most of the remaining gap to theoretical bandwidth (~11 GB/s effective vs ~38 GB/s practical STREAM)
- MADV_COLLAPSE would likely work on native Linux and is worth retesting there
- On server start, existing
krasis.logis archived tologs/krasis_YYYYMMDD_HHMMSS.log(timestamped from file mtime) - Fresh
krasis.logstarted for each run - New
--noteparameter writes a run description header at the top of each log logs/directory gitignored (except.gitkeep)