
Feature Request: ScatterMoE (Triton-based Sparse MoE with fused scatter/gather GEMMs) #4167

@sbhavani

Description


Summary

Request to explore ScatterMoE techniques in Megatron Core's MoE implementation. ScatterMoE fuses grouped GEMMs with scattered read/write operations via a Triton kernel, eliminating the need to materialize padded tensors in HBM.

Update:
While ScatterMoE provides the foundational approach, SonicMoE further optimizes for Hopper/Blackwell architectures by reducing activation memory and improving tile utilization. See #2709

Motivation

Padded MoE implementations copy inputs into padded tensors to handle variable-length expert assignments. This overhead grows with expert count and granularity, and is amplified during training where intermediates are retained for the backward pass.
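The overhead of padding activations rather than indices can be illustrated with a toy calculation (all names and numbers here are illustrative, not taken from Megatron or ScatterMoE):

```python
# Hypothetical illustration: activation elements allocated by a padded
# MoE dispatch vs. the tokens actually routed.
def padded_dispatch_elements(tokens_per_expert, hidden_size):
    # Padded layouts allocate [num_experts, capacity, hidden], where
    # capacity is the largest per-expert token count.
    capacity = max(tokens_per_expert)
    return len(tokens_per_expert) * capacity * hidden_size

def scatter_dispatch_elements(tokens_per_expert, hidden_size):
    # An index-padded layout stores only the routed tokens themselves;
    # padding is applied to the index lists, not the activation tensor.
    return sum(tokens_per_expert) * hidden_size

# Skewed routing: one hot expert forces a large capacity for all experts.
counts = [512, 64, 32, 16, 8, 8, 4, 4]
hidden = 4096
padded = padded_dispatch_elements(counts, hidden)      # 8 * 512 * 4096
scattered = scatter_dispatch_elements(counts, hidden)  # 648 * 4096
print(padded / scattered)  # roughly 6.3x more activation elements
```

The gap widens with expert count and routing skew, and the padded intermediates are also retained for the backward pass during training.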

ScatterMoE pads indices rather than tensors. A scatter2scatter Triton kernel loads tiles using padded indices directly into SRAM. Reported results (8×A100, 1.5B model):

  • ~38% higher training throughput vs MegaBlocks
  • ~34% lower training memory, ~46% lower inference memory
  • Correctness validated via Mixtral 8x7B conversion (≤0.006 error across 11 benchmarks)
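As a semantic reference, the gather-GEMM-scatter pattern that scatter2scatter fuses can be sketched in unfused NumPy; this sketch materializes the gathered rows for clarity, whereas the actual Triton kernel loads tiles straight into SRAM via the padded index lists (function and variable names below are illustrative, not the real kernel's API):

```python
import numpy as np

def scatter2scatter_reference(x, expert_weights, expert_idx):
    """Unfused reference semantics for a scatter/gather expert GEMM.

    x:              [tokens, d_in] routed input activations
    expert_weights: [num_experts, d_in, d_out]
    expert_idx:     [tokens] expert assignment per token
    """
    out = np.empty((x.shape[0], expert_weights.shape[2]), dtype=x.dtype)
    for e in range(expert_weights.shape[0]):
        rows = np.nonzero(expert_idx == e)[0]        # scattered read
        if rows.size:
            out[rows] = x[rows] @ expert_weights[e]  # scattered write
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
w = rng.standard_normal((2, 4, 3))
idx = np.array([0, 1, 0, 1, 1, 0])
y = scatter2scatter_reference(x, w, idx)
# Each output row equals x[i] @ w[idx[i]], computed without any
# padded [num_experts, capacity, d] intermediate buffer.
assert np.allclose(y[1], x[1] @ w[1])
```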

Requested Feature

Investigate adding a ScatterMoE-style backend as a configurable option in megatron.core.transformer.moe, compatible with existing routers and expert parallelism.
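One possible shape for such an option, sketched as a standalone dataclass; this is not Megatron Core's actual configuration API, and both the field name and the backend strings are invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical sketch only: a validated backend selector for the MoE
# expert GEMM path. Field and value names are invented, not Megatron's.
@dataclass
class MoEBackendConfig:
    moe_gemm_backend: str = "grouped"  # proposed: "grouped" | "scattermoe"

    def validate(self):
        allowed = {"grouped", "scattermoe"}
        if self.moe_gemm_backend not in allowed:
            raise ValueError(f"unknown MoE backend: {self.moe_gemm_backend}")

cfg = MoEBackendConfig(moe_gemm_backend="scattermoe")
cfg.validate()
```

Keeping the router and token-dispatch logic unchanged and swapping only the expert GEMM path would let the backend compose with existing expert parallelism.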


Labels: enhancement (new feature or request)
