feat: Add safe_serialization option to the dcp checkpoint conversion script#579
giordanobsf wants to merge 1 commit into NVIDIA-NeMo:main
Signed-off-by: Giordano B. Ferreira <giordanobsf@gmail.com>
terrykong left a comment
Thanks for the change. Could you add a unit test for this?
As a heads-up on where we are heading: we will eventually integrate the native PyTorch DCP checkpointing of HF, so this script won't be needed in the future. cc @ffrujeri
It's a little longer horizon, so I'm okay with this change as long as it's tested.
@ffrujeri @joyang-nv is this still relevant with Automodel?
With the Automodel integrations, users will be able to save a checkpoint directly in .safetensors format. But if a previous checkpoint was created in the .dcp format and the user wants to convert it to safetensors and then start training from there, I think this conversion script could be useful? Did I understand this PR correctly?
@ffrujeri I think so. @joyang-nv what is your feedback? How do we plan to transition folks from the earlier DTensor path to the Automodel path?
Gentle bump here @joyang-nv
Deeply sorry for the late reply.
Here is my feedback.
Reviving this thread.
What does this PR do?
Currently, the script converts DCP checkpoints to .bin for the HF checkpoint. This PR allows users to also convert checkpoints to .safetensors.
Issues
Closes #491
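For reviewers who want a feel for the option, here is a minimal sketch of how a boolean flag like --safe-serialization could be parsed and mapped to an output weights file. This is not the script's actual implementation: str2bool, build_parser, and weights_filename are hypothetical helpers, the default of false is an assumption chosen to preserve the current .bin behavior, and only the flag names are taken from the Usage command in this description. The two filenames are the default single-file weight names used by Hugging Face transformers.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse a command-line string such as "true" or "false" into a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: the real script's argument handling is not shown in this PR page.
    parser = argparse.ArgumentParser(description="Convert a DCP checkpoint to HF format")
    parser.add_argument("--config", required=True)
    parser.add_argument("--dcp-ckpt-path", required=True)
    parser.add_argument("--hf-ckpt-path", required=True)
    # Assumption: defaults to False so existing callers keep getting .bin output.
    parser.add_argument("--safe-serialization", type=str2bool, default=False)
    return parser


def weights_filename(safe_serialization: bool) -> str:
    # Default single-file weight names in Hugging Face transformers.
    return "model.safetensors" if safe_serialization else "pytorch_model.bin"
```

Parsing the string explicitly (rather than using type=bool) matters here, because argparse would otherwise treat any non-empty string, including "false", as true.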
Usage
uv run python examples/converters/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf \
    --safe-serialization true
Before your PR is "Ready for review"
Pre checks: