Summary
Request to investigate supporting Rollout Routing Replay (R3) in Megatron Core's MoE layer. R3 addresses a training instability specific to MoE models during reinforcement learning: when separate inference and training engines independently compute routing decisions, expert selection diverges for the same inputs, causing policy mismatch that can lead to training collapse.
Motivation
In RL pipelines that use separate inference (e.g., SGLang/vLLM) and training (e.g., Megatron) engines, the routers in each engine independently select experts. Even with identical weights, numerical differences cause ~10% of routers to disagree per forward pass, with 94% of tokens differing in at least one layer. This compounding mismatch destabilizes training: in experiments on Qwen3-30B-A3B, all three baseline GRPO runs collapsed, while every R3 run completed successfully.
R3 fixes this by caching the binary expert-selection mask from the inference engine and replaying it during training. Training logits are still computed normally (preserving gradient flow to the router), but expert selection is forced to match inference.
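For intuition, here is a minimal sketch of the replay step. The function name is illustrative (not from the paper or Megatron), and renormalizing the gating weights over the replayed experts is one possible choice:

```python
import torch

def replay_routing(router_logits: torch.Tensor, replay_mask: torch.Tensor):
    # router_logits: [num_tokens, num_experts], produced by the training router.
    # replay_mask:   [num_tokens, num_experts], binary top-k selection cached
    #                from the inference engine during rollout.
    probs = torch.softmax(router_logits, dim=-1)

    # Expert selection is replayed from inference rather than recomputed via
    # top-k, so training activates exactly the experts used at rollout time.
    gating = probs * replay_mask

    # Renormalize over the replayed experts; gradients still reach the
    # training router through `probs`.
    gating = gating / gating.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    return gating
```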
Results (Qwen3-30B-A3B, GRPO):
- Average score: 71.83 vs. 62.23 for the baseline (single mini-step SFT)
- KL divergence between the training and inference policies reduced by roughly 2× (approaching dense-model levels)
- Catastrophic training collapse eliminated across all configurations tested
Requested Feature
Enable R3-style replay when Megatron is used as the training engine in hybrid RL pipelines. Investigate adding an option in megatron.core.transformer.moe to accept an external expert-selection mask during the forward pass, bypassing the router's top-k selection while still computing gating weights from the training router's logits.
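One possible shape for such an option, sketched as a standalone module for clarity (all names here are hypothetical; the real change would hook into the existing router in megatron.core.transformer.moe rather than add a new class):

```python
import torch

class ReplayableTopKRouter(torch.nn.Module):
    """Hypothetical illustration of the requested option, not Megatron code."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.empty(num_experts, hidden_size))
        torch.nn.init.normal_(self.weight, std=0.02)
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor,
                replay_mask: torch.Tensor | None = None):
        # hidden: [num_tokens, hidden_size]
        logits = torch.nn.functional.linear(hidden, self.weight)
        probs = torch.softmax(logits, dim=-1)

        if replay_mask is None:
            # Default path: top-k selection from the training router's logits.
            _, topk_idx = probs.topk(self.top_k, dim=-1)
            mask = torch.zeros_like(probs).scatter_(-1, topk_idx, 1.0)
        else:
            # R3 path: bypass top-k and replay the inference-time selection.
            mask = replay_mask.to(probs.dtype)

        # Gating weights always come from the training logits, so the router
        # keeps receiving gradients even when selection is replayed.
        gating = probs * mask
        gating = gating / gating.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return gating, mask
```

Keeping the mask as an optional argument leaves the default top-k path untouched, so the feature would be inert outside hybrid RL replay.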