Summary
Request to explore ScatterMoE techniques in Megatron Core's MoE implementation. ScatterMoE fuses grouped GEMMs with scattered read/write operations via a Triton kernel, eliminating the need to materialize padded tensors in HBM.
Update:
While ScatterMoE provides the foundational approach, SonicMoE further optimizes for Hopper/Blackwell architectures by reducing activation memory and improving tile utilization; see #2709.
Motivation
Padded MoE implementations copy inputs into padded tensors to handle variable-length expert assignments. This overhead grows with expert count and granularity, and is amplified during training where intermediates are retained for the backward pass.
ScatterMoE pads indices rather than tensors. A scatter2scatter Triton kernel loads tiles using padded indices directly into SRAM. Reported results (8×A100, 1.5B model):
- ~38% higher training throughput vs MegaBlocks
- ~34% lower training memory, ~46% lower inference memory
- Correctness validated via Mixtral 8x7B conversion (≤0.006 error across 11 benchmarks)
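To make the "pad indices, not tensors" idea concrete, here is a minimal NumPy sketch of the host-side bookkeeping such a backend needs. It is illustrative only: the function name and layout are assumptions, not Megatron Core or ScatterMoE API, and the actual gather happens inside the scatter2scatter Triton kernel rather than in Python.

```python
import numpy as np

def pad_indices(expert_ids, num_experts, block=4):
    """Hypothetical sketch of ScatterMoE-style index padding.

    Instead of copying tokens into a padded [experts, capacity, hidden]
    tensor, sort token indices by expert and pad only the 1-D index array
    up to a multiple of the kernel tile size. A scatter2scatter-style
    kernel would gather activation rows via these indices directly into
    SRAM tiles; -1 marks padding slots the kernel masks out.
    """
    order = np.argsort(expert_ids, kind="stable")  # token ids grouped by expert
    padded, starts = [], []
    for e in range(num_experts):
        idx = order[expert_ids[order] == e]        # tokens routed to expert e
        pad = (-len(idx)) % block                  # slots needed to fill last tile
        starts.append(len(padded))                 # tile-aligned start of this expert
        padded.extend(idx.tolist() + [-1] * pad)   # pad the index list, not the data
    return np.array(padded), np.array(starts)

# 6 tokens routed to 2 experts: only the small index array grows,
# the [tokens, hidden] activation tensor is never copied or padded.
ids = np.array([1, 0, 1, 1, 0, 1])
padded, starts = pad_indices(ids, num_experts=2, block=4)
```

Because only indices are padded, the extra HBM cost is O(experts × tile) integers rather than O(experts × capacity × hidden) activations, which is where the reported memory savings come from.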
Requested Feature
Investigate adding a ScatterMoE-style backend as a configurable option in megatron.core.transformer.moe, compatible with existing routers and expert parallelism.
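One possible shape for the configurable option is a backend selector dispatched at expert-MLP construction time. The sketch below is purely hypothetical: the class, field, and backend names are invented for illustration and do not reflect the actual megatron.core.transformer.moe configuration surface.

```python
from dataclasses import dataclass

@dataclass
class MoELayerConfig:
    """Illustrative-only config; field names are hypothetical,
    not the real megatron.core.transformer.moe API."""
    num_experts: int = 8
    expert_backend: str = "padded"  # hypothetical: "padded" | "scattermoe"

def build_expert_mlp(cfg: MoELayerConfig) -> str:
    # Hypothetical dispatch point: routers and expert parallelism stay
    # unchanged; only the grouped-GEMM execution path is swapped.
    if cfg.expert_backend == "scattermoe":
        return "ScatterMoEExperts"      # placeholder for the fused-kernel path
    return "PaddedGroupedExperts"       # placeholder for the existing path

backend = build_expert_mlp(MoELayerConfig(expert_backend="scattermoe"))
```

Keeping the choice at this seam would let existing routers and the expert-parallel communication pattern remain untouched, as the request asks.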
References