Feature Request: Router Replay (R3) for stable RL with MoE models #4168

@sbhavani

Summary

Request to investigate supporting Rollout Routing Replay (R3) in Megatron Core's MoE layer. R3 addresses a training instability specific to MoE models during reinforcement learning: when separate inference and training engines independently compute routing decisions, expert selection diverges for the same inputs, causing policy mismatch that can lead to training collapse.

Motivation

In RL pipelines that use separate inference (e.g., SGLang/vLLM) and training (e.g., Megatron) engines, routers in each engine independently select experts. Even with identical weights, numerical differences cause ~10% of routers to disagree per forward pass, with 94% of tokens differing in at least one layer. This compounding mismatch destabilizes training. In experiments on Qwen3-30B-A3B, 3 of 3 baseline GRPO runs collapsed while all R3 runs completed successfully.
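One way to quantify this kind of mismatch is to compare the expert-selection masks produced by the two engines. The sketch below is a minimal NumPy illustration. The function name `routing_mismatch`, the `[layers, tokens, experts]` mask layout, and both metric definitions are assumptions made for illustration, not the measurement procedure behind the figures quoted above.

```python
import numpy as np

def routing_mismatch(infer_mask, train_mask):
    """Compare boolean expert-selection masks of shape [layers, tokens, experts].

    Returns (router_rate, token_rate):
      router_rate: fraction of (layer, token) routers whose selected expert set differs
      token_rate:  fraction of tokens that differ in at least one layer
    Both definitions are illustrative assumptions.
    """
    infer_mask = np.asarray(infer_mask, dtype=bool)
    train_mask = np.asarray(train_mask, dtype=bool)
    if infer_mask.shape != train_mask.shape:
        raise ValueError("masks must have the same shape")
    # A router "disagrees" if any expert is selected by one engine but not the other.
    differs = (infer_mask != train_mask).any(axis=-1)   # [layers, tokens]
    router_rate = float(differs.mean())
    token_rate = float(differs.any(axis=0).mean())      # any layer, per token
    return router_rate, token_rate
```

Because a token passes through every MoE layer, a small per-router disagreement rate can still translate into a large fraction of tokens affected in at least one layer.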

R3 fixes this by caching the binary expert-selection mask from the inference engine and replaying it during training. Training logits are still computed normally (preserving gradient flow to the router), but expert selection is forced to match inference.
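The replay step described above can be sketched as follows. This is a minimal NumPy illustration, not the R3 authors' code: the helper names `topk_mask` and `replay_gates` are hypothetical, and it assumes the gating weights are a softmax over the training logits restricted to the experts selected at inference.

```python
import numpy as np

def topk_mask(logits, k):
    """Boolean mask marking the k largest logits in each row (one row per token)."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def replay_gates(train_logits, infer_mask):
    """Gating weights from training logits, with expert selection forced to infer_mask.

    Assumes every row of infer_mask selects at least one expert.
    """
    masked = np.where(infer_mask, train_logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                              # unselected experts become 0
    return weights / weights.sum(axis=-1, keepdims=True)

# The two engines see slightly different logits for the same token.
train_logits = np.array([[2.0, 1.0, 0.9, -1.0]])
infer_logits = np.array([[2.0, 0.9, 1.0, -1.0]])

infer_mask = topk_mask(infer_logits, k=2)   # cached during rollout: experts 0 and 2
train_mask = topk_mask(train_logits, k=2)   # what training would pick alone: experts 0 and 1
gates = replay_gates(train_logits, infer_mask)
```

Here the training router on its own would route to experts 0 and 1, while the replayed mask forces experts 0 and 2. The weights still depend on `train_logits`, which is what keeps the router trainable in an autograd framework.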

Results (Qwen3-30B-A3B, GRPO):

  • Average score: 71.83 with R3 vs 62.23 for the baseline (single mini-step SFT)
  • KL divergence reduced ~2× (approaching dense model levels)
  • Eliminates catastrophic training collapse across all configurations tested

Requested Feature

Enable R3-style replay when Megatron is used as the training engine in hybrid RL pipelines. Investigate adding an option in megatron.core.transformer.moe to accept an external expert-selection mask during the forward pass, bypassing the router's top-k selection while still computing gating weights from the training router's logits.
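To make the request concrete, here is a toy sketch of what an optional external mask could look like at the layer level. Nothing here is an existing Megatron Core API: the function `moe_forward`, the `replay_mask` parameter, and the dense NumPy expert computation are all hypothetical and chosen only to keep the example short and self-contained.

```python
import numpy as np

def moe_forward(x, router_w, expert_ws, k, replay_mask=None):
    """Toy MoE layer with an optional external expert-selection mask (hypothetical).

    x:           [tokens, d]   token activations
    router_w:    [d, E]        router weights
    expert_ws:   [E, d, d]     one linear map per expert
    replay_mask: [tokens, E]   boolean mask cached from the inference engine, or None
    Returns (output [tokens, d], mask actually used).
    """
    logits = x @ router_w  # always computed from the training router
    if replay_mask is None:
        # Default path: the router's own top-k selection.
        idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
        mask = np.zeros(logits.shape, dtype=bool)
        np.put_along_axis(mask, idx, True, axis=-1)
    else:
        # Replay path: skip top-k and use the externally supplied selection.
        mask = np.asarray(replay_mask, dtype=bool)
        if mask.shape != logits.shape:
            raise ValueError("replay_mask must have shape [tokens, num_experts]")
    if not mask.any(axis=-1).all():
        raise ValueError("every token needs at least one selected expert")
    # Gating weights: softmax of training logits over the selected experts only.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    gates = np.exp(masked)
    gates = gates / gates.sum(axis=-1, keepdims=True)
    # Dense combine for clarity; a real layer would dispatch tokens sparsely.
    expert_out = np.einsum("td,edh->teh", x, expert_ws)
    return np.einsum("te,teh->th", gates, expert_out), mask
```

Keeping the mask optional means the default top-k behavior is unchanged when no mask is supplied, so the replay path would only be exercised in hybrid RL pipelines that pass one in.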
