[WIP] Feature/grpo#22
Open
Rouzbehat78 wants to merge 16 commits into main from
Conversation
TRL v1.0 adds production-grade GRPOTrainer with native rollout_func,
vLLM colocate/server modes, and async rollouts. Also adds openenv-core
as an optional extra ('uv sync --extra rl-env').
Adds rewards/ (top-level, sibling of job_configs/) for plain Python reward functions referenced from YAML by path. Recipe class bundles multiple rewards + weights per task; subclass to extend.
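A minimal sketch of the recipe idea described above (class and function names here are illustrative, not the PR's actual API): a recipe bundles several plain Python reward functions with per-reward weights and combines them into one score.

```python
# Hypothetical sketch: a Recipe bundles reward functions + weights per task.
# Subclassing a Recipe (or swapping reward lists) extends it to new tasks.
from dataclasses import dataclass
from typing import Callable, Sequence

RewardFn = Callable[[str, str], float]  # (completion, solution) -> score

@dataclass
class Recipe:
    rewards: Sequence[RewardFn]
    weights: Sequence[float]

    def __call__(self, completion: str, solution: str) -> float:
        # Weighted sum of all bundled rewards.
        return sum(w * r(completion, solution)
                   for r, w in zip(self.rewards, self.weights))

# Example plain-Python rewards of the kind rewards/ would hold.
def exact_match(completion: str, solution: str) -> float:
    return 1.0 if completion.strip() == solution.strip() else 0.0

def length_penalty(completion: str, solution: str) -> float:
    return -0.001 * max(0, len(completion) - 256)

math_recipe = Recipe(rewards=[exact_match, length_penalty],
                     weights=[1.0, 0.5])
```

The YAML would then reference each reward function by file path, and the loader would assemble the bundle.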
Adds grpo and vlm_grpo training types with colocate (default) and server-mode vLLM rollouts. VLM GRPO preserves the 0.1x vision encoder LR via a shared helper also used by VLM SFT. Ray Train passes the full dataset to every worker for GRPO since TRL's RepeatSampler handles per-rank distribution.
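The shared vision-encoder LR helper could look roughly like this (the function name, parameter prefix, and scale default are assumptions for illustration): parameters under the vision tower go into an optimizer group at 0.1x the base learning rate, everything else at the base rate.

```python
# Sketch of the shared 0.1x vision-encoder LR logic used by VLM SFT and
# VLM GRPO. Operates on (name, parameter) pairs, as produced by
# model.named_parameters(), and returns torch-style optimizer param groups.
def build_param_groups(named_params, base_lr,
                       vision_prefix="vision_tower",
                       vision_lr_scale=0.1):
    vision, rest = [], []
    for name, p in named_params:
        # Route vision-encoder params to the scaled-LR group.
        (vision if name.startswith(vision_prefix) else rest).append(p)
    return [
        {"params": rest, "lr": base_lr},
        {"params": vision, "lr": base_lr * vision_lr_scale},
    ]
```

The returned list can be passed directly to a torch optimizer constructor in place of `model.parameters()`.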
Example YAMLs for text GRPO (colocate and server modes) and VLM grounding. Unit tests for the reward loader, recipes, and config parser; e2e smoke tests plus SLURM launchers for 1-GPU runs.
- DPO tokenizer: prompt_input_ids → prompt_ids (TRL v1 collator change)
- config_parser: pre-resolve reward paths to absolute on the driver so Ray workers can find them from their sandbox CWD
- e2e fixtures: use_vllm=false until vLLM supports transformers 5.x
- test assertions updated for new column names and absolute paths
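The driver-side path fix could be sketched as follows (function name and config schema are assumptions, not the PR's actual code): relative reward paths from the YAML are made absolute against the config directory before the config is shipped to Ray workers, whose sandbox CWD differs from the driver's.

```python
# Sketch: resolve reward-function paths to absolute on the driver, so Ray
# workers can import them regardless of their sandbox working directory.
from pathlib import Path

def resolve_reward_paths(config: dict, config_dir: str) -> dict:
    base = Path(config_dir).resolve()
    for entry in config.get("rewards", []):
        p = Path(entry["path"])
        if not p.is_absolute():
            # Anchor relative paths at the config's directory, not the CWD.
            entry["path"] = str((base / p).resolve())
    return config
```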
…ment of multi-node in Ray; works with GRPO colocate, SFT, DPO
… need to change the dataset format; SFT converts to (prompt, solution) pairs for GRPO
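The conversion mentioned above could look roughly like this (the column names `messages`, `prompt`, and `solution` follow TRL conventions; the exact mapping in the PR may differ): an SFT-style chat row is split so the final assistant turn becomes the reference solution and the preceding turns become the GRPO prompt.

```python
# Sketch: map an SFT chat row to the (prompt, solution) columns GRPO expects.
def sft_to_grpo(row: dict) -> dict:
    msgs = row["messages"]
    # The last assistant message is the reference answer; everything
    # before it is the prompt the policy will complete.
    assert msgs[-1]["role"] == "assistant"
    return {"prompt": msgs[:-1], "solution": msgs[-1]["content"]}
```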
…wn, a reward hub where recipes for different tasks accumulate for re-use. Each task contains a recipe, essentially a bundle of rewards plus a weight for each reward. Combine rewards and recipes to construct your ideal reward function.
… the images, bad parses, no gradients for image tokens: vlm_grpo trainer: lift images, alias spatial_shapes, VLM-aware logps
LFMVLMGRPOTrainer patches three gaps in TRL's multimodal data path so
LFM2-VL actually gets gradients through the vision tower during GRPO:
- Lift images from prompt message content into the top-level key TRL inspects, so the multimodal branch fires and pixel_values reach the training forward pass (without it, generation still sees images but training silently runs with pixel_values=None).
- Alias the processor's spatial_shapes output to a name on TRL's fixed multimodal kwarg whitelist via a context-scoped __class__ swap, letting the tensor ride from data prep through _compute_loss.
- Override _get_per_token_logps_and_entropies to rename back to spatial_shapes at the model-forward boundary, filter to kwargs the model accepts, and skip TRL's per-sample pixel_values chunking (LFM2-VL returns patch-concatenated pixels, not (B, C, H, W)).
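The first fix, lifting images, could be sketched as follows (the message schema is the standard chat-content list; the top-level key name `images` is an assumption based on TRL's multimodal convention):

```python
# Sketch: hoist image entries embedded in chat-message content to a
# top-level "images" key so TRL's multimodal branch fires during training
# (otherwise training silently runs with pixel_values=None).
def lift_images(example: dict) -> dict:
    images = []
    for msg in example["prompt"]:
        content = msg.get("content")
        if isinstance(content, list):
            for part in content:
                if part.get("type") == "image":
                    images.append(part["image"])
    if images:
        # Return a shallow copy with the lifted key; leave text-only rows as-is.
        example = {**example, "images": images}
    return example
```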
… cleanup, a few examples for text and VLM
…nv with reward length for both VLM and LLM