Description
There is a bug when saving MoE checkpoints.
The relevant code is in megatron.training.checkpointing.save_checkpoint:
    if not torch.distributed.is_initialized() \
            or mpu.get_expert_data_parallel_rank() == 0 \
            or ckpt_type != CheckpointType.LEGACY:
        if ckpt_type != CheckpointType.LEGACY:
            sharded_sd_metadata = _build_sharded_state_dict_metadata(args)
            if args.use_distributed_optimizer:
                print_rank_0(f'Storing distributed optimizer sharded state of type'
                             f' {sharded_sd_metadata["distrib_optim_sharding_type"]}')
        else:
            sharded_sd_metadata = None
        state_dict = generate_state_dict(
            args,
            model,
            optimizer,
            opt_param_scheduler,
            rng_state,
            iteration=iteration,
            optim_sd_kwargs=dict(metadata=sharded_sd_metadata),
            model_sd_kwargs=dict(metadata=sharded_sd_metadata),
            rerun_state=rerun_state,
        )
If TP > EP * ETP, for example world_size 8 with TP4, EP2, ETP1, the state_dict is generated only on ranks [0, 1]; the attention state_dicts on ranks [2, 3] are never generated.
As a result, the checkpoint files for TP ranks [2, 3] are missing. When resuming training from the checkpoint, loading fails with a file-not-found error:
[Errno 2] No such file or directory: '/workspace/repos/megatron/tests/functional_tests/train/deepseek/test_results/tp4_pp1_ep2/checkpoints/iter_0000010/mp_rank_02_000/model_optim_rng.pt',
[Errno 2] No such file or directory: '/workspace/repos/megatron/tests/functional_tests/train/deepseek/test_results/tp4_pp1_ep2/checkpoints/iter_0000010/mp_rank_03_001/model_optim_rng.pt'
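The failure can be seen from simple parallel-size arithmetic. A minimal sketch for the configuration above (world_size 8, TP4, PP1, EP2, ETP1), assuming the standard Megatron-Core group-size formulas:

```python
# Illustrative parallel-size arithmetic for the failing configuration.
# Group sizes follow the usual Megatron-Core conventions (assumed here).
world_size, tp, pp, ep, etp = 8, 4, 1, 2, 1

dp_size = world_size // (tp * pp)          # data-parallel group size: 2
edp_size = world_size // (ep * etp * pp)   # expert-data-parallel group size: 4

# Ranks that pass the gate (expert_data_parallel_rank == 0):
saving_ranks = world_size // edp_size      # 2

# A legacy checkpoint needs one file per (TP, PP) model-parallel rank:
files_needed = tp * pp                     # 4 (mp_rank_00 .. mp_rank_03)

assert saving_ranks < files_needed  # only 2 of the 4 TP ranks ever save
print(saving_ranks, files_needed)
```

With only 2 saving ranks but 4 required model-parallel files, mp_rank_02 and mp_rank_03 are never written, which matches the missing-file errors above.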
Reproduce
version: core_v0.16.0
any MoE model on 8 GPUs, with TP > EP * ETP
Solution
The rank that generates the state_dict should be rank 0 of the smaller of the two groups, dp_group and edp_group, so that every model-parallel rank writes its checkpoint file.
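A minimal sketch of the proposed gating as a pure function, with the group sizes and ranks passed in explicitly (the function name and signature are hypothetical, not Megatron's API):

```python
def should_save_legacy(dp_size, dp_rank, edp_size, edp_rank):
    """Gate legacy checkpoint saving on rank 0 of the smaller
    data-parallel group, so that every (TP, PP) model-parallel
    rank ends up writing its checkpoint file.

    This is an illustrative sketch of the proposed fix, not the
    actual Megatron code.
    """
    if dp_size <= edp_size:
        return dp_rank == 0
    return edp_rank == 0
```

In the failing configuration (dp_size=2, edp_size=4), the gate falls back to dp_rank == 0, which holds on exactly one rank per TP group, so all four mp_rank files get written.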