Description
There is a bug when saving MoE checkpoints.
The relevant code is in megatron.training.checkpointing.save_checkpoint:
    if not torch.distributed.is_initialized() \
            or mpu.get_expert_data_parallel_rank() == 0 \
            or ckpt_type != CheckpointType.LEGACY:
        if ckpt_type != CheckpointType.LEGACY:
            sharded_sd_metadata = _build_sharded_state_dict_metadata(args)
            if args.use_distributed_optimizer:
                print_rank_0(f'Storing distributed optimizer sharded state of type'
                             f' {sharded_sd_metadata["distrib_optim_sharding_type"]}')
        else:
            sharded_sd_metadata = None
        state_dict = generate_state_dict(
            args,
            model,
            optimizer,
            opt_param_scheduler,
            rng_state,
            iteration=iteration,
            optim_sd_kwargs=dict(metadata=sharded_sd_metadata),
            model_sd_kwargs=dict(metadata=sharded_sd_metadata),
            rerun_state=rerun_state,
        )
If TP > EP * ETP, for example world_size 8 with TP4, EP2, ETP1, the state_dict is generated only on ranks [0, 1]; the attention state_dicts on ranks [2, 3] are never generated.
As a result, the checkpoint files for TP ranks [2, 3] are missing. When resuming training from the checkpoint, loading fails with a file-not-found error:
[Errno 2] No such file or directory: '/workspace/repos/megatron/tests/functional_tests/train/deepseek/test_results/tp4_pp1_ep2/checkpoints/iter_0000010/mp_rank_02_000/model_optim_rng.pt',
[Errno 2] No such file or directory: '/workspace/repos/megatron/tests/functional_tests/train/deepseek/test_results/tp4_pp1_ep2/checkpoints/iter_0000010/mp_rank_03_001/model_optim_rng.pt'
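The failure can be seen from simple parallel-size arithmetic. A minimal sketch for the configuration above (world_size 8, TP4, PP1, EP2, ETP1), assuming the standard Megatron-Core group-size formulas:

```python
# Illustrative parallel-size arithmetic for the failing configuration.
# Group sizes follow the usual Megatron-Core conventions (assumed here).
world_size, tp, pp, ep, etp = 8, 4, 1, 2, 1

dp_size = world_size // (tp * pp)          # data-parallel group size: 2
edp_size = world_size // (ep * etp * pp)   # expert-data-parallel group size: 4

# Ranks that pass the gate (expert_data_parallel_rank == 0):
saving_ranks = world_size // edp_size      # 2

# A legacy checkpoint needs one file per (TP, PP) model-parallel rank:
files_needed = tp * pp                     # 4 (mp_rank_00 .. mp_rank_03)

assert saving_ranks < files_needed  # only 2 of the 4 TP ranks ever save
print(saving_ranks, files_needed)
```

With only 2 saving ranks but 4 required model-parallel files, mp_rank_02 and mp_rank_03 are never written, which matches the missing-file errors above.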
Reproduce
version: core_v0.16.0
any MoE model on 8 GPUs, with TP > EP * ETP
Solution
The rank that generates the state_dict should be rank 0 of the smaller of the two groups, dp_group and edp_group, so that every model-parallel rank writes its checkpoint file.
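A minimal sketch of the proposed gating as a pure function, with the group sizes and ranks passed in explicitly (the function name and signature are hypothetical, not Megatron's API):

```python
def should_save_legacy(dp_size, dp_rank, edp_size, edp_rank):
    """Gate legacy checkpoint saving on rank 0 of the smaller
    data-parallel group, so that every (TP, PP) model-parallel
    rank ends up writing its checkpoint file.

    This is an illustrative sketch of the proposed fix, not the
    actual Megatron code.
    """
    if dp_size <= edp_size:
        return dp_rank == 0
    return edp_rank == 0
```

In the failing configuration (dp_size=2, edp_size=4), the gate falls back to dp_rank == 0, which holds on exactly one rank per TP group, so all four mp_rank files get written.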