Summary
Request to explore ScatterMoE techniques in Megatron Core's MoE implementation. ScatterMoE fuses grouped GEMMs with scattered read/write operations via a Triton kernel, eliminating the need to materialize padded tensors in HBM.
Update:
While ScatterMoE provides the foundational approach, SonicMoE further optimizes for Hopper/Blackwell architectures by reducing activation memory and improving tile utilization; see #2709.
Motivation
Padded MoE implementations copy inputs into padded tensors to handle variable-length expert assignments. This overhead grows with expert count and granularity, and is amplified during training where intermediates are retained for the backward pass.
ScatterMoE pads indices rather than tensors. A scatter2scatter Triton kernel loads tiles using padded indices directly into SRAM. Reported results (8×A100, 1.5B model):
- ~38% higher training throughput vs MegaBlocks
- ~34% lower training memory, ~46% lower inference memory
- Correctness validated via Mixtral 8x7B conversion (≤0.006 error across 11 benchmarks)
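To make the "pad indices, not tensors" idea concrete, here is a minimal NumPy sketch of the host-side bookkeeping such a backend needs. It is illustrative only: the function name and layout are assumptions, not Megatron Core or ScatterMoE API, and the actual gather happens inside the scatter2scatter Triton kernel rather than in Python.

```python
import numpy as np

def pad_indices(expert_ids, num_experts, block=4):
    """Hypothetical sketch of ScatterMoE-style index padding.

    Instead of copying tokens into a padded [experts, capacity, hidden]
    tensor, sort token indices by expert and pad only the 1-D index array
    up to a multiple of the kernel tile size. A scatter2scatter-style
    kernel would gather activation rows via these indices directly into
    SRAM tiles; -1 marks padding slots the kernel masks out.
    """
    order = np.argsort(expert_ids, kind="stable")  # token ids grouped by expert
    padded, starts = [], []
    for e in range(num_experts):
        idx = order[expert_ids[order] == e]        # tokens routed to expert e
        pad = (-len(idx)) % block                  # slots needed to fill last tile
        starts.append(len(padded))                 # tile-aligned start of this expert
        padded.extend(idx.tolist() + [-1] * pad)   # pad the index list, not the data
    return np.array(padded), np.array(starts)

# 6 tokens routed to 2 experts: only the small index array grows,
# the [tokens, hidden] activation tensor is never copied or padded.
ids = np.array([1, 0, 1, 1, 0, 1])
padded, starts = pad_indices(ids, num_experts=2, block=4)
```

Because only indices are padded, the extra HBM cost is O(experts × tile) integers rather than O(experts × capacity × hidden) activations, which is where the reported memory savings come from.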
Requested Feature
Investigate adding a ScatterMoE-style backend as a configurable option in megatron.core.transformer.moe, compatible with existing routers and expert parallelism.
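One possible shape for the configurable option is a backend selector dispatched at expert-MLP construction time. The sketch below is purely hypothetical: the class, field, and backend names are invented for illustration and do not reflect the actual megatron.core.transformer.moe configuration surface.

```python
from dataclasses import dataclass

@dataclass
class MoELayerConfig:
    """Illustrative-only config; field names are hypothetical,
    not the real megatron.core.transformer.moe API."""
    num_experts: int = 8
    expert_backend: str = "padded"  # hypothetical: "padded" | "scattermoe"

def build_expert_mlp(cfg: MoELayerConfig) -> str:
    # Hypothetical dispatch point: routers and expert parallelism stay
    # unchanged; only the grouped-GEMM execution path is swapped.
    if cfg.expert_backend == "scattermoe":
        return "ScatterMoEExperts"      # placeholder for the fused-kernel path
    return "PaddedGroupedExperts"       # placeholder for the existing path

backend = build_expert_mlp(MoELayerConfig(expert_backend="scattermoe"))
```

Keeping the choice at this seam would let existing routers and the expert-parallel communication pattern remain untouched, as the request asks.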
References