
[WIP] Feature/grpo#22

Open
Rouzbehat78 wants to merge 16 commits into main from feature/GRPO

Conversation

@Rouzbehat78
Contributor

No description provided.

TRL v1.0 adds production-grade GRPOTrainer with native rollout_func,
vLLM colocate/server modes, and async rollouts. Also adds openenv-core
as an optional extra ('uv sync --extra rl-env').
Adds rewards/ (top-level, sibling of job_configs/) for plain Python
reward functions referenced from YAML by path. Recipe class bundles
multiple rewards + weights per task; subclass to extend.
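A minimal sketch of what such a recipe could look like, assuming TRL's reward-function signature (a list of completions in, one float per completion out); the class and method names here are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# TRL-style reward function: completions (plus extra dataset columns as
# kwargs) in, one float score per completion out.
RewardFn = Callable[..., List[float]]

@dataclass
class Recipe:
    """Hypothetical recipe: a bundle of named rewards plus per-reward weights."""
    rewards: Dict[str, RewardFn] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)

    def add(self, name: str, fn: RewardFn, weight: float = 1.0) -> "Recipe":
        self.rewards[name] = fn
        self.weights[name] = weight
        return self

    def combined(self) -> RewardFn:
        """Collapse the bundle into one weighted-sum reward function."""
        def reward(completions, **kwargs):
            totals = [0.0] * len(completions)
            for name, fn in self.rewards.items():
                w = self.weights[name]
                for i, score in enumerate(fn(completions, **kwargs)):
                    totals[i] += w * score
            return totals
        return reward

# Example rewards: a mild length penalty and a formatting check.
def length_reward(completions, **kwargs):
    return [-len(c) / 1000.0 for c in completions]

def format_reward(completions, **kwargs):
    return [1.0 if c.strip().endswith(".") else 0.0 for c in completions]

recipe = Recipe().add("length", length_reward, 0.2).add("format", format_reward)
```

Subclassing for a new task would then mean pre-populating the bundle with that task's rewards and weights.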
Adds grpo and vlm_grpo training types with colocate (default) and
server-mode vLLM rollouts. VLM GRPO preserves the 0.1x vision encoder
LR via a shared helper also used by VLM SFT. Ray Train passes the full
dataset to every worker for GRPO since TRL's RepeatSampler handles
per-rank distribution.
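The shared 0.1x vision-encoder LR could be implemented as a param-group split along these lines (helper name, `vision_tower` prefix, and scale argument are assumptions for illustration):

```python
import torch
from torch import nn

def build_vlm_param_groups(model: nn.Module, base_lr: float,
                           vision_prefix: str = "vision_tower",
                           vision_lr_scale: float = 0.1):
    """Hypothetical shared helper: train vision-encoder params at a
    scaled-down LR relative to the rest of the model."""
    vision, rest = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (vision if name.startswith(vision_prefix) else rest).append(p)
    return [
        {"params": rest, "lr": base_lr},
        {"params": vision, "lr": base_lr * vision_lr_scale},
    ]

# Usage: hand the groups to the optimizer instead of model.parameters():
# optimizer = torch.optim.AdamW(build_vlm_param_groups(model, 1e-5))
```

Because both VLM SFT and VLM GRPO build their optimizers through one helper like this, the LR policy stays consistent across training types.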
Example YAMLs for text GRPO (colocate + server modes) and VLM
grounding. Unit tests for the reward loader, recipes, and config
parser; e2e smoke tests + SLURM launchers for 1-GPU runs.
- DPO tokenizer: prompt_input_ids → prompt_ids (TRL v1 collator change)
- config_parser: pre-resolve reward paths to absolute on driver so Ray
  workers can find them from their sandbox CWD
- e2e fixtures: use_vllm=false until vllm supports transformers 5.x
- test assertions updated for new column names and absolute paths
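The driver-side path pre-resolution could look like the sketch below; the config key names are assumptions, not the repo's actual schema:

```python
from pathlib import Path

def resolve_reward_paths(config: dict, config_dir: Path) -> dict:
    """Hypothetical sketch: rewrite relative reward-file paths to absolute
    on the driver, so Ray workers (which run from a sandbox CWD) can still
    load them."""
    for reward in config.get("rewards", []):
        path = Path(reward["path"])
        if not path.is_absolute():
            reward["path"] = str((config_dir / path).resolve())
    return config
```

Resolving against the config file's own directory (rather than the driver's CWD) keeps YAML-relative paths stable no matter where the launcher is invoked.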
…ment of multi-node in Ray; works with GRPO colocate, SFT, DPO
… need to change the dataset format; SFT converts to a prompt/solution pair for GRPO
…wn, reward hub where recipes for different tasks accumulate for re-use. Each task has a recipe that is essentially a bundle of rewards plus a weight for each reward. Combine rewards and recipes to construct your ideal reward function
… the images, bad parses, no gradients for image tokens: vlm_grpo trainer: lift images, alias spatial_shapes, VLM-aware logps

  LFMVLMGRPOTrainer patches three gaps in TRL's multimodal data path so
  LFM2-VL actually gets gradient through the vision tower during GRPO:

  - Lift images from prompt message content into the top-level
    key TRL inspects, so the multimodal branch fires and pixel_values
    reach the training forward pass (without it, generation still sees
    images but training silently runs with pixel_values=None).
  - Alias the processor's spatial_shapes output to a whitelisted kwarg via a
    context-scoped __class__ swap, letting the tensor ride TRL's fixed
    multimodal kwarg whitelist from data prep through _compute_loss.
  - Override _get_per_token_logps_and_entropies to rename back to
    spatial_shapes at the model-forward boundary, filter to kwargs the
    model accepts, and skip TRL's per-sample pixel_values chunking
    (LFM2-VL returns patch-concatenated pixels, not (B, C, H, W)).
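The first patch (lifting images) can be sketched as below; the chat-message shape mirrors the common `{"type": "image", ...}` content-block convention, and the top-level key name is an assumption, not TRL's exact internal field:

```python
def lift_images(example: dict, image_key: str = "images") -> dict:
    """Hypothetical sketch of the image-lifting step: pull image entries
    out of chat-message content blocks into a top-level key, so a
    multimodal code path that only inspects example[image_key] fires and
    pixel_values reach the training forward pass."""
    images = []
    for message in example.get("prompt", []):
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if block.get("type") == "image" and "image" in block:
                    images.append(block["image"])
    if images:
        # Return a copy with the lifted key; leave the original untouched.
        example = {**example, image_key: images}
    return example
```

Without a step like this, generation (which reads the message content directly) still sees the images, but the training pass silently runs with pixel_values=None, exactly the failure mode described above.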
