feat: Add pinned memory optimizer offload for Megatron policy worker #2248
snivertynv wants to merge 1 commit into main from
Conversation
…olled using the use_pinned_optimizer_offload setting - set to false in a couple of grpo_math* yaml configs as an example. Added test cases for this feature in test_megatron_worker.py Signed-off-by: Sriharsha Niverty <sniverty@nvidia.com>
/ok to test 13e7f01
    else:
        optimizer_state = self.optimizer._get_state()
    ...
    use_pinned = self.cfg.get("use_pinned_optimizer_offload", False)
This is not ideal; we should use self.cfg['use_pinned_optimizer_offload'] instead, and add this field to all necessary base configs under examples/configs to make sure the read doesn't fail.
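The reviewer's point can be sketched as follows. This is a minimal illustration (the class and method names here are hypothetical, not the worker's actual API): direct indexing fails loudly when the base configs were not updated, whereas `.get()` with a default silently masks the missing field.

```python
# Sketch of the reviewer's suggestion (illustrative names, not the real worker).
class PolicyWorker:
    def __init__(self, cfg: dict):
        self.cfg = cfg

    def _use_pinned(self) -> bool:
        # Direct indexing raises KeyError if base configs under
        # examples/configs were not updated, instead of silently
        # falling back to a default.
        return self.cfg["use_pinned_optimizer_offload"]

worker = PolicyWorker({"use_pinned_optimizer_offload": True})
print(worker._use_pinned())  # True
```

The trade-off: `.get(..., False)` keeps old configs working, while direct indexing forces every config to declare the field explicitly, which is what the reviewer asks for.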
    def _get_or_alloc_pinned_buf(
        self, attr_name: str, total_bytes: int
    ) -> torch.Tensor:
        """Return a cached pinned CPU buffer, allocating only on first use or resize."""
I recommend adding a paired def _delete_pinned_buf(self, attr_name). In the optimizer offload case we reuse the same buffer repeatedly and don't want to delete it until the job finishes, but if this API is left here and someone later implements new temporary offloading with it, they will assume deletion is handled when it actually isn't.
So either add a deletion API here too and show people how to use them in pairs, or make this allocation API specific to optimizer offloading (in naming, or by hardcoding it into the _optimizer_to functions), so that folks don't use it for other purposes.
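A paired alloc/delete API along the lines the reviewer suggests might look like this. This is a dependency-free sketch: a `bytearray` stands in for a pinned host tensor, whereas the real code would allocate with something like `torch.empty(total_bytes, dtype=torch.uint8, pin_memory=True)`; the caching-by-attribute-name pattern is the part being illustrated.

```python
class PinnedBufferCache:
    """Sketch of paired alloc/delete APIs for cached pinned host buffers.

    A bytearray stands in for a pinned torch.Tensor here; a real
    implementation would allocate with
    torch.empty(total_bytes, dtype=torch.uint8, pin_memory=True).
    """

    def _get_or_alloc_pinned_buf(self, attr_name: str, total_bytes: int):
        buf = getattr(self, attr_name, None)
        if buf is None or len(buf) != total_bytes:
            # Allocate only on first use or when the requested size changes.
            buf = bytearray(total_bytes)
            setattr(self, attr_name, buf)
        return buf

    def _delete_pinned_buf(self, attr_name: str):
        # Paired teardown: callers doing temporary offloading must call this,
        # otherwise the buffer stays cached until the object is destroyed.
        if hasattr(self, attr_name):
            delattr(self, attr_name)

cache = PinnedBufferCache()
a = cache._get_or_alloc_pinned_buf("opt_buf", 1024)
b = cache._get_or_alloc_pinned_buf("opt_buf", 1024)
print(a is b)  # True: the second call reuses the cached buffer
cache._delete_pinned_buf("opt_buf")
```

Keeping the two methods side by side makes the ownership contract explicit: long-lived users (optimizer offload) skip deletion until shutdown; temporary users must call `_delete_pinned_buf` themselves.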
This significantly improves performance for the optimizer_offload_before_refit pass, which is quite expensive in co-located/syncRL cases.
Enabled/disabled using the use_pinned_optimizer_offload setting (default=disabled). It has been set to false in a couple of grpo_math* yaml configs as an example. Added test cases for this feature in test_megatron_worker.py.
What does this PR do ?
Optimizer D2H/H2D transfers used per-tensor pageable allocations, causing expensive cudaHostAlloc calls and synchronous memcpy on every step. This adds an opt-in mode (use_pinned_optimizer_offload) that coalesces all optimizer state into a single cached pinned buffer, eliminating cudaHostAlloc from the hot path and enabling non-blocking DMA transfers.
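The coalescing idea described above can be sketched as packing every optimizer-state tensor into one flat host buffer with recorded offsets, so there is a single allocation instead of one per tensor. In this sketch numpy arrays stand in for CUDA tensors and a plain array stands in for the pinned buffer; the actual PR would copy into a cached pinned buffer with non-blocking device-to-host transfers. Function names here are illustrative, not the PR's.

```python
import numpy as np

def pack_states(states: dict):
    """Coalesce per-tensor states into one flat byte buffer (sketch).

    numpy arrays stand in for CUDA tensors; the real path would copy each
    tensor into a single cached pinned buffer with non-blocking copies.
    """
    total = sum(v.nbytes for v in states.values())
    flat = np.empty(total, dtype=np.uint8)  # one allocation, not one per tensor
    offsets = {}
    pos = 0
    for name, v in states.items():
        n = v.nbytes
        flat[pos:pos + n] = v.view(np.uint8).ravel()
        offsets[name] = (pos, v.dtype, v.shape)  # bookkeeping for restore
        pos += n
    return flat, offsets

def unpack_state(flat, offsets, name):
    """Recover one tensor's bytes from the flat buffer (the H2D direction)."""
    pos, dtype, shape = offsets[name]
    n = np.dtype(dtype).itemsize * int(np.prod(shape))
    return flat[pos:pos + n].view(dtype).reshape(shape)

states = {
    "exp_avg": np.arange(4, dtype=np.float32),
    "exp_avg_sq": np.ones(3, dtype=np.float32),
}
flat, offs = pack_states(states)
print(unpack_state(flat, offs, "exp_avg"))  # [0. 1. 2. 3.]
```

The design point is that the single backing buffer is allocated (and pinned) once and reused across steps, so the per-step cost is only the memcpy, not the host allocation.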
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information