feat: Add safe_serialization option to the dcp checkpoint conversion script#579

Open
giordanobsf wants to merge 1 commit into NVIDIA-NeMo:main from giordanobsf:convert-dcp-to-safetensors

@giordanobsf

What does this PR do ?

Currently, the conversion script writes the Hugging Face (HF) checkpoint as .bin files only. This PR adds an option to write the converted checkpoint as .safetensors instead.

Issues

Closes #491

Usage

  • You can potentially add a usage example below
uv run python examples/converters/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf \
    --safe-serialization true
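
For readers wondering how a value such as `--safe-serialization true` might be parsed, here is a minimal sketch. It assumes the script uses argparse; `str2bool`, `build_parser`, and `weights_filename` are hypothetical names for illustration, not the script's actual code, and the default of false is an assumption based on the current .bin behavior. The two file names are the Hugging Face defaults for unsharded weights.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'true'/'false' style strings into a bool for argparse."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes", "y"}:
        return True
    if lowered in {"false", "0", "no", "n"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the usage example; everything else is assumed.
    parser = argparse.ArgumentParser(description="Convert a DCP checkpoint to HF format")
    parser.add_argument("--config", required=True)
    parser.add_argument("--dcp-ckpt-path", required=True)
    parser.add_argument("--hf-ckpt-path", required=True)
    # Assumed default: False, so existing callers keep getting .bin output.
    parser.add_argument("--safe-serialization", type=str2bool, default=False)
    return parser


def weights_filename(safe_serialization: bool) -> str:
    """Default HF file name for unsharded weights in each format."""
    return "model.safetensors" if safe_serialization else "pytorch_model.bin"
```

Parsing the value through a small converter, rather than `type=bool`, avoids the common pitfall where any non-empty string (including "false") is treated as true.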

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Signed-off-by: Giordano B. Ferreira <giordanobsf@gmail.com>
@github-actions (bot) added the documentation label on Jun 28, 2025
@giordanobsf changed the title from "Add safe_serialization option to the dcp checkpoint conversion script" to "feat: Add safe_serialization option to the dcp checkpoint conversion script" on Jun 30, 2025
Collaborator

@terrykong terrykong left a comment

Thanks for the change. Could you add a unit test for this?

As a heads-up on where we are heading: we will eventually integrate the native PyTorch DCP checkpointing of HF, so this script won't be needed in the future. cc @ffrujeri

That is on a somewhat longer horizon, so I'm okay with this change as long as it's tested.
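
A unit test along the lines requested above could check which weight format ends up in the output directory. The following is only a sketch under stated assumptions: `fake_convert` is a hypothetical stand-in for the real converter (it just creates empty files with the default HF names), and `detect_weight_format` is an illustrative helper, not part of the repository. A real test would call the actual conversion function on a small checkpoint.

```python
import tempfile
from pathlib import Path


def detect_weight_format(ckpt_dir: Path) -> str:
    """Return 'safetensors', 'bin', 'mixed', or 'none' based on file suffixes."""
    suffixes = {p.suffix for p in ckpt_dir.iterdir() if p.is_file()}
    has_safetensors = ".safetensors" in suffixes
    has_bin = ".bin" in suffixes
    if has_safetensors and has_bin:
        return "mixed"
    if has_safetensors:
        return "safetensors"
    if has_bin:
        return "bin"
    return "none"


def fake_convert(out_dir: Path, safe_serialization: bool) -> None:
    """Hypothetical stand-in for the converter: writes empty placeholder files."""
    name = "model.safetensors" if safe_serialization else "pytorch_model.bin"
    (out_dir / name).touch()
    (out_dir / "config.json").write_text("{}")


def test_safe_serialization_selects_format() -> None:
    # Run the (fake) conversion once per flag value and inspect the output.
    for flag, expected in [(True, "safetensors"), (False, "bin")]:
        with tempfile.TemporaryDirectory() as tmp:
            out_dir = Path(tmp)
            fake_convert(out_dir, flag)
            assert detect_weight_format(out_dir) == expected
```

Checking for "mixed" output as well would catch a regression where both formats are written at once.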

@euronymous-aithal
Contributor

@ffrujeri @joyang-nv is this still relevant with Automodel ?

@ffrujeri
Contributor

> @ffrujeri @joyang-nv is this still relevant with Automodel ?

With the Automodel integration, users will be able to save a checkpoint directly in .safetensors format. But if an earlier checkpoint was saved in DCP format and the user wants to convert it to safetensors and resume training from there, I think this conversion script could still be useful. Did I understand this PR correctly?

@euronymous-aithal
Contributor

@ffrujeri I think so. @joyang-nv, what is your feedback? How do we plan to transition folks from the earlier DTensor path to the Automodel path?

@ashors1
Contributor

ashors1 commented Oct 22, 2025

gentle bump here @joyang-nv

@joyang-nv
Member

Deeply sorry for such a late reply.

@joyang-nv
Member

Here is my feedback.
@giordanobsf, my apologies for the very late reply on this PR.
Could you rebase your PR and help test the following:

  1. Converting a checkpoint from the v1 policy worker
  2. Converting a checkpoint from the v2 policy worker
    I know the HF converter might have an issue with v2. It would be great if your PR worked on both paths.

@dhineshkumar-r

Reviving this thread.
@ffrujeri / @terrykong - What's the recommendation in March 2026?
I need this capability as well.

Labels

community-request, documentation, external, x-neospaceai

Development

Successfully merging this pull request may close these issues.

Add safe_serialization (safetensors) option to the checkpoint conversion script (examples/convert_dcp_to_hf.py)

9 participants