Bug in save_checkpoint when ckpt_format is torch #4200

@LiJunscs

Description

There is a bug when saving MoE checkpoints.
The relevant code, from megatron.training.checkpointing.save_checkpoint, is:

if not torch.distributed.is_initialized() \
      or mpu.get_expert_data_parallel_rank() == 0 \
      or ckpt_type != CheckpointType.LEGACY:
  if ckpt_type != CheckpointType.LEGACY:
      sharded_sd_metadata = _build_sharded_state_dict_metadata(args)
      if args.use_distributed_optimizer:
          print_rank_0(f'Storing distributed optimizer sharded state of type'
                       f' {sharded_sd_metadata["distrib_optim_sharding_type"]}')
  else:
      sharded_sd_metadata = None
  state_dict = generate_state_dict(
      args,
      model,
      optimizer,
      opt_param_scheduler,
      rng_state,
      iteration=iteration,
      optim_sd_kwargs=dict(metadata=sharded_sd_metadata),
      model_sd_kwargs=dict(metadata=sharded_sd_metadata),
      rerun_state=rerun_state,
  )

If TP > EP * ETP, for example world_size 8 with TP4, EP2, ETP1, the state_dict is generated only on ranks [0, 1], and the attention state_dicts on ranks [2, 3] are not generated.

So the checkpoint files for TP ranks [2, 3] are missing. When resuming training from the checkpoint, loading fails with a file-not-found error, for example:

[Errno 2] No such file or directory: '/workspace/repos/megatron/tests/functional_tests/train/deepseek/test_results/tp4_pp1_ep2/checkpoints/iter_0000010/mp_rank_02_000/model_optim_rng.pt',
[Errno 2] No such file or directory: '/workspace/repos/megatron/tests/functional_tests/train/deepseek/test_results/tp4_pp1_ep2/checkpoints/iter_0000010/mp_rank_03_001/model_optim_rng.pt'
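
The rank arithmetic behind these two missing directories can be sketched in plain Python, without Megatron. This is only an illustration under stated assumptions: the TP (or ETP) index is assumed to vary fastest across global ranks, followed by EP, then the data-parallel index, and the directory name pattern `mp_rank_{tp}_{ep}` is inferred from the paths in the error above. The helper names `coords` and `missing_checkpoint_dirs` are hypothetical.

```python
# Sketch of the rank layout for world_size=8, TP=4, PP=1, EP=2, ETP=1.
# Assumption: TP/ETP index varies fastest, then EP, then the data-parallel index.
WORLD_SIZE, TP, EP, ETP = 8, 4, 2, 1

DP = WORLD_SIZE // TP            # dense data-parallel size: 2
EDP = WORLD_SIZE // (EP * ETP)   # expert data-parallel size: 4


def coords(rank):
    """Return (tp_rank, dp_rank, ep_rank, edp_rank) for a global rank."""
    tp_rank = rank % TP
    dp_rank = rank // TP
    ep_rank = (rank // ETP) % EP
    edp_rank = rank // (ETP * EP)
    return tp_rank, dp_rank, ep_rank, edp_rank


def missing_checkpoint_dirs():
    """Directories needed at dense DP rank 0 but skipped by an EDP-rank-0 gate."""
    needed, written = set(), set()
    for rank in range(WORLD_SIZE):
        tp_rank, dp_rank, ep_rank, edp_rank = coords(rank)
        # Name pattern inferred from the error paths in this issue.
        name = f"mp_rank_{tp_rank:02d}_{ep_rank:03d}"
        if dp_rank == 0:
            needed.add(name)    # dense (attention) weights live here
        if edp_rank == 0:
            written.add(name)   # what the current condition lets through
    return sorted(needed - written)


print(missing_checkpoint_dirs())
```

Under these assumptions the script prints `['mp_rank_02_000', 'mp_rank_03_001']`, which matches the two paths reported in the error.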

Reproduce

Version: core_v0.16.0
Any MoE model on 8 GPUs with TP > EP * ETP.

Solution

The rank that generates the state_dict should be data-parallel rank 0 of the smaller of the two groups, dp_group and edp_group.
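
One possible reading of this proposal, written as a standalone predicate rather than as a patch to Megatron: pick rank 0 of whichever data-parallel group has fewer members. The function name `should_generate_state_dict` and its parameters are hypothetical, and the ranks and sizes are passed in as plain integers instead of being read from `mpu`.

```python
def should_generate_state_dict(dp_rank, dp_size, edp_rank, edp_size):
    """Hypothetical gate: use rank 0 of the smaller data-parallel group.

    dp_rank/dp_size describe the dense data-parallel group,
    edp_rank/edp_size the expert data-parallel group.
    """
    if dp_size <= edp_size:
        return dp_rank == 0
    return edp_rank == 0


# Failing case from this issue (world_size 8, TP4, EP2, ETP1): DP=2, EDP=4.
# Global rank 2 has dp_rank 0 but edp_rank 1, so an EDP-only check skips it.
print(should_generate_state_dict(dp_rank=0, dp_size=2, edp_rank=1, edp_size=4))

# Usual case where EDP is the smaller group (e.g. DP=8, EDP=2): EDP rank 0 is used.
print(should_generate_state_dict(dp_rank=1, dp_size=8, edp_rank=0, edp_size=2))
```

Both calls return `True`. Whether this rule is sufficient for every parallel layout (for example with pipeline parallelism) is not verified here.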
