Changelog

2026-02-28 — Decode optimisation: serial route matmul + AVX2 sigmoid

Baseline: 152.7 ms/tok (6.55 tok/s) at 12 threads on 5900X (WSL2, DDR4 dual-channel) After: 119.0 ms/tok (8.40 tok/s) — 22% latency reduction, 28% throughput improvement

Changes kept

Serial route matmul: Changed MoE routing from parallel to serial dispatch. Rayon thread wake-up latency (~30us x 11 threads x 40 layers) dominated for 256 tiny dot products (8KB per expert). Saved 32ms/tok (moe_route: 35.1ms to 3.1ms).
AVX2 vectorized sigmoid: Replaced scalar sigmoid in SiLU activation with 8-wide AVX2 fast exp approximation. Marginal (~1ms), but cleaner code with no downside.

Changes tested and reverted

Sub-tile splitting: cache line contention from two threads reading same tile data (+7%)
Flattened MoE dispatch: cross-expert task mixing destroyed L2 cache locality (+22%)
No nested MoE parallelism: only 8/12 threads active, lost work-stealing benefit (+10%)
MADV_COLLAPSE (THP alternative): WSL2 returns EINVAL, Hyper-V doesn't support it
Page table warming: pre-faulting PTEs polluted L1/L2, net slower (+5%)
PREFETCHNTA (non-temporal prefetch on weights): ~2% improvement but within WSL2 noise

Key findings

24 threads (SMT) is 2.3x slower than 12 threads (physical cores only) on 5900X
THP does not work in WSL2 (AnonHugePages stays 0 despite madvise calls)
TLB misses from 4KB pages on a 67GB model account for most of the remaining gap to theoretical bandwidth (~11 GB/s effective vs ~38 GB/s practical STREAM)
MADV_COLLAPSE would likely work on native Linux and is worth retesting there

2026-02-28 — Log management and run notes

On server start, existing krasis.log is archived to logs/krasis_YYYYMMDD_HHMMSS.log (timestamped from file mtime)
Fresh krasis.log started for each run
New --note parameter writes a run description header at the top of each log
logs/ directory gitignored (except .gitkeep)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Changelog

2026-02-28 — Decode optimisation: serial route matmul + AVX2 sigmoid

Changes kept

Changes tested and reverted

Key findings

2026-02-28 — Log management and run notes

FilesExpand file tree

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

2026-02-28 — Decode optimisation: serial route matmul + AVX2 sigmoid

Changes kept

Changes tested and reverted

Key findings

2026-02-28 — Log management and run notes