fix: account for multi-turn chunks in agentic LR scheduler budget (#407) #439

Open

dashitongzhi wants to merge 1 commit into alibaba:main
Conversation
When using `AgentNativeStepEnvManager` for step-level agentic training, each trajectory produces multiple training samples (one per agent turn). The base `PPOConfig.set_max_steps()` only accounts for `rollout_batch_size` (the number of trajectories), causing the LR scheduler to exhaust its step budget far before all pipeline steps complete. For example, 4 trajectories × ~10 turns each ≈ 40 training samples per pipeline step; with `backward_batch_size=4`, that's ~10 optimizer steps per pipeline step instead of the budgeted 1.

This fix:

- Adds an `estimated_chunks_per_traj` config field (default 0 = auto-detect)
- Overrides `set_max_steps` in `AgenticConfig` to multiply the optimizer-step budget by the chunks-per-trajectory estimate
- Auto-detects the estimate from `custom_envs` `max_steps` (conservative: `max_steps // 2`) when not explicitly configured
- Lets users override via `estimated_chunks_per_traj` for precise control

Fixes alibaba#407
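The mismatch can be checked with back-of-the-envelope arithmetic (all values are the hypothetical numbers from the description above, not measured defaults):

```python
# Budget mismatch for step-level rollouts (illustrative numbers only).
rollout_batch_size = 4      # trajectories per pipeline step
turns_per_traj = 10         # chunks emitted per trajectory by the step-level manager
backward_batch_size = 4     # samples consumed per optimizer step

actual_opt_steps = rollout_batch_size * turns_per_traj // backward_batch_size
budgeted_opt_steps = rollout_batch_size // backward_batch_size

print(actual_opt_steps, budgeted_opt_steps)  # 10 1
```

So the scheduler is consumed roughly 10× faster than budgeted, which is why the LR hits zero mid-run.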
Pull request overview
This PR fixes premature LR-scheduler exhaustion during step-level agentic training by making AgenticConfig.set_max_steps() account for the fact that some env managers generate one training sample per turn (chunk) rather than per trajectory.
Changes:

- Added `estimated_chunks_per_traj` to let users explicitly scale the optimizer-step budget for multi-turn (chunked) rollouts.
- Implemented `_get_chunks_per_traj_estimate()` to auto-estimate the chunk multiplier from `custom_envs`.
- Overrode `set_max_steps()` in `AgenticConfig` to multiply actor/critic `training_args.max_steps` by the chunk multiplier.
```python
        "When 0 (default), auto-detected from custom_envs max_steps (conservative: max_steps // 2)."
    },
)
```
Comment on lines +383 to +391:

```python
# Auto-detect from custom_envs: look for max_steps in env configs
max_env_steps = 0
if self.custom_envs:
    for tag, cfg in self.custom_envs.items():
        if hasattr(cfg, 'max_steps') and cfg.max_steps is not None:
            max_env_steps = max(max_env_steps, int(cfg.max_steps))
        elif isinstance(cfg, dict) and 'max_steps' in cfg:
            max_env_steps = max(max_env_steps, int(cfg['max_steps']))
```
Comment on lines +383 to +401:

```python
# Auto-detect from custom_envs: look for max_steps in env configs
max_env_steps = 0
if self.custom_envs:
    for tag, cfg in self.custom_envs.items():
        if hasattr(cfg, 'max_steps') and cfg.max_steps is not None:
            max_env_steps = max(max_env_steps, int(cfg.max_steps))
        elif isinstance(cfg, dict) and 'max_steps' in cfg:
            max_env_steps = max(max_env_steps, int(cfg['max_steps']))

if max_env_steps > 1:
    # Conservative estimate: half of max_steps as average turns per trajectory
    estimate = max(1, max_env_steps // 2)
    logger.info(
        f"Auto-detected estimated_chunks_per_traj={estimate} "
        f"(from custom_envs max_steps={max_env_steps}, using max_steps // 2)"
    )
    return estimate

return 1
```
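For review in isolation, the hunk above can be read as a standalone function. This is a sketch under assumptions: the function name is illustrative, and env configs are modeled as plain dicts (the real code also handles attribute-style configs via `hasattr`):

```python
import logging

logger = logging.getLogger(__name__)

def estimate_chunks_per_traj(custom_envs) -> int:
    """Conservative chunk-multiplier estimate: half the largest env max_steps."""
    max_env_steps = 0
    for tag, cfg in (custom_envs or {}).items():
        if hasattr(cfg, "max_steps") and cfg.max_steps is not None:
            max_env_steps = max(max_env_steps, int(cfg.max_steps))
        elif isinstance(cfg, dict) and "max_steps" in cfg:
            max_env_steps = max(max_env_steps, int(cfg["max_steps"]))
    if max_env_steps > 1:
        # Midpoint heuristic: trajectories rarely use every allowed turn.
        estimate = max(1, max_env_steps // 2)
        logger.info("Auto-detected estimated_chunks_per_traj=%d", estimate)
        return estimate
    return 1  # single-turn or no info: keep parent behavior
```

For example, an env config with `max_steps: 10` yields a multiplier of 5, while `max_steps: 1` (or an empty `custom_envs`) falls through to 1.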
Problem
When using `AgentNativeStepEnvManager` for step-level agentic training, the LR scheduler exhausts its step budget far before all pipeline steps complete, causing the learning rate to drop to zero mid-training.

Root cause: `PPOConfig.set_max_steps()` computes total optimizer steps based on `rollout_batch_size` (number of trajectories), but `AgentNativeStepEnvManager.formulate_rollouts()` creates one training sample per turn, so the actual number of optimizer steps per pipeline step is much higher than budgeted.

Example from the field: 4 trajectories × ~10 turns each ≈ 40 training samples per pipeline step. With `backward_batch_size=4`, that's ~10 optimizer steps per pipeline step, not the 1 the scheduler was budgeted for. In a 200-step run, the LR hit zero at step 123 (38.5% of training with zero LR).

Fixes #407.
Solution
This PR overrides `set_max_steps` in `AgenticConfig` to account for multi-turn chunking:

- Adds an `estimated_chunks_per_traj` field (default 0 = auto-detect)
- Auto-detects from `custom_envs.max_steps`, using `max_steps // 2` as a conservative midpoint
- Lets users set `estimated_chunks_per_traj` explicitly for precise control
- With `max_steps <= 1` in env configs (single-turn), falls back to a multiplier of 1, identical to parent behavior

How it works
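A minimal, self-contained sketch of the override's shape. All classes here are toy stand-ins (the real `PPOConfig`, training args, and `custom_envs` plumbing live in the ROLL codebase and differ in signature); the point is only the multiply-after-parent pattern:

```python
class TrainingArgs:
    """Toy stand-in for the trainer args object holding max_steps."""
    def __init__(self, max_steps: int = 0):
        self.max_steps = max_steps

class PPOConfig:
    """Toy stand-in: budgets optimizer steps from trajectory counts only."""
    def __init__(self):
        self.actor_args = TrainingArgs()

    def set_max_steps(self, pipeline_steps: int, opt_steps_per_pipeline_step: int = 1):
        self.actor_args.max_steps = pipeline_steps * opt_steps_per_pipeline_step

class AgenticConfig(PPOConfig):
    def __init__(self, estimated_chunks_per_traj: int = 0, env_max_steps: int = 1):
        super().__init__()
        self.estimated_chunks_per_traj = estimated_chunks_per_traj
        self.env_max_steps = env_max_steps

    def _get_chunks_per_traj_estimate(self) -> int:
        # Conservative midpoint of the env's max turn count, floor of 1.
        return max(1, self.env_max_steps // 2) if self.env_max_steps > 1 else 1

    def set_max_steps(self, pipeline_steps: int, opt_steps_per_pipeline_step: int = 1):
        # Parent budgets per-trajectory; scale by the chunk multiplier afterwards.
        super().set_max_steps(pipeline_steps, opt_steps_per_pipeline_step)
        multiplier = self.estimated_chunks_per_traj or self._get_chunks_per_traj_estimate()
        self.actor_args.max_steps *= multiplier
```

With `env_max_steps=10` and 200 pipeline steps, the budget becomes 200 × 5 = 1000 optimizer steps instead of 200, so the scheduler no longer runs dry mid-training.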
User-facing config
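A hypothetical YAML fragment; the key names `estimated_chunks_per_traj`, `custom_envs`, and `max_steps` come from this PR, but the surrounding structure and exact placement in the pipeline config are illustrative:

```yaml
# Explicit override of the chunk multiplier; 0 (the default) means auto-detect.
estimated_chunks_per_traj: 8

custom_envs:
  my_env:            # illustrative env tag
    max_steps: 16    # auto-detect would yield 16 // 2 = 8 if the override is unset
```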
Files Changed
- `roll/pipeline/agentic/agentic_config.py`: added the `estimated_chunks_per_traj` field, the `_get_chunks_per_traj_estimate()` method, and the `set_max_steps()` override
The fix is backward-compatible: for single-turn environments (max_steps=1), the auto-detect returns 1 and behavior is identical to the parent class.