
[WIP] Feature/grpo#22

Open
Rouzbehat78 wants to merge 16 commits into main from feature/GRPO

Conversation

@Rouzbehat78
Contributor

No description provided.

TRL v1.0 adds production-grade GRPOTrainer with native rollout_func,
vLLM colocate/server modes, and async rollouts. Also adds openenv-core
as an optional extra ('uv sync --extra rl-env').
Adds rewards/ (top-level, sibling of job_configs/) for plain Python
reward functions referenced from YAML by path. Recipe class bundles
multiple rewards + weights per task; subclass to extend.
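A minimal sketch of what such a recipe could look like, assuming TRL's reward-function signature (a list of completions in, one float per completion out); the class and method names here are illustrative, not the repo's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# TRL-style reward function: completions (plus extra dataset columns as
# kwargs) in, one float score per completion out.
RewardFn = Callable[..., List[float]]

@dataclass
class Recipe:
    """Hypothetical recipe: a bundle of named rewards plus per-reward weights."""
    rewards: Dict[str, RewardFn] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)

    def add(self, name: str, fn: RewardFn, weight: float = 1.0) -> "Recipe":
        self.rewards[name] = fn
        self.weights[name] = weight
        return self

    def combined(self) -> RewardFn:
        """Collapse the bundle into one weighted-sum reward function."""
        def reward(completions, **kwargs):
            totals = [0.0] * len(completions)
            for name, fn in self.rewards.items():
                w = self.weights[name]
                for i, score in enumerate(fn(completions, **kwargs)):
                    totals[i] += w * score
            return totals
        return reward

# Example rewards: a mild length penalty and a formatting check.
def length_reward(completions, **kwargs):
    return [-len(c) / 1000.0 for c in completions]

def format_reward(completions, **kwargs):
    return [1.0 if c.strip().endswith(".") else 0.0 for c in completions]

recipe = Recipe().add("length", length_reward, 0.2).add("format", format_reward)
```

Subclassing for a new task would then mean pre-populating the bundle with that task's rewards and weights.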
Adds grpo and vlm_grpo training types with colocate (default) and
server-mode vLLM rollouts. VLM GRPO preserves the 0.1x vision encoder
LR via a shared helper also used by VLM SFT. Ray Train passes the full
dataset to every worker for GRPO since TRL's RepeatSampler handles
per-rank distribution.
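The shared 0.1x vision-encoder LR could be implemented as a param-group split along these lines (helper name, `vision_tower` prefix, and scale argument are assumptions for illustration):

```python
import torch
from torch import nn

def build_vlm_param_groups(model: nn.Module, base_lr: float,
                           vision_prefix: str = "vision_tower",
                           vision_lr_scale: float = 0.1):
    """Hypothetical shared helper: train vision-encoder params at a
    scaled-down LR relative to the rest of the model."""
    vision, rest = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (vision if name.startswith(vision_prefix) else rest).append(p)
    return [
        {"params": rest, "lr": base_lr},
        {"params": vision, "lr": base_lr * vision_lr_scale},
    ]

# Usage: hand the groups to the optimizer instead of model.parameters():
# optimizer = torch.optim.AdamW(build_vlm_param_groups(model, 1e-5))
```

Because both VLM SFT and VLM GRPO build their optimizers through one helper like this, the LR policy stays consistent across training types.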
Example YAMLs for text GRPO (colocate + server modes) and VLM
grounding. Unit tests for the reward loader, recipes, and config
parser; e2e smoke tests + SLURM launchers for 1-GPU runs.
- DPO tokenizer: prompt_input_ids → prompt_ids (TRL v1 collator change)
- config_parser: pre-resolve reward paths to absolute on driver so Ray
  workers can find them from their sandbox CWD
- e2e fixtures: use_vllm=false until vllm supports transformers 5.x
- test assertions updated for new column names and absolute paths
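The driver-side path pre-resolution could look like the sketch below; the config key names are assumptions, not the repo's actual schema:

```python
from pathlib import Path

def resolve_reward_paths(config: dict, config_dir: Path) -> dict:
    """Hypothetical sketch: rewrite relative reward-file paths to absolute
    on the driver, so Ray workers (which run from a sandbox CWD) can still
    load them."""
    for reward in config.get("rewards", []):
        path = Path(reward["path"])
        if not path.is_absolute():
            reward["path"] = str((config_dir / path).resolve())
    return config
```

Resolving against the config file's own directory (rather than the driver's CWD) keeps YAML-relative paths stable no matter where the launcher is invoked.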
…ment of multi-node in Ray; works with GRPO colocate, SFT, DPO
… need to change the dataset format; SFT converts to a prompt/solution pair for GRPO
…wn, reward hub where recipes for different tasks accumulate for re-use. Each task has a recipe that is essentially a bundle of rewards plus a weight for each reward. Combine rewards and recipes to construct your ideal reward function
… the images, bad parses, no gradients for image tokens: vlm_grpo trainer: lift images, alias spatial_shapes, VLM-aware logps

  LFMVLMGRPOTrainer patches three gaps in TRL's multimodal data path so
  LFM2-VL actually gets gradient through the vision tower during GRPO:

  - Lift images from prompt message content into the top-level
    key TRL inspects, so the multimodal branch fires and pixel_values
    reach the training forward pass (without it, generation still sees
    images but training silently runs with pixel_values=None).
  - Alias the processor's spatial_shapes output to a whitelisted kwarg via a
    context-scoped __class__ swap, letting the tensor ride TRL's fixed
    multimodal kwarg whitelist from data prep through _compute_loss.
  - Override _get_per_token_logps_and_entropies to rename back to
    spatial_shapes at the model-forward boundary, filter to kwargs the
    model accepts, and skip TRL's per-sample pixel_values chunking
    (LFM2-VL returns patch-concatenated pixels, not (B, C, H, W)).
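The first patch (lifting images) can be sketched as below; the chat-message shape mirrors the common `{"type": "image", ...}` content-block convention, and the top-level key name is an assumption, not TRL's exact internal field:

```python
def lift_images(example: dict, image_key: str = "images") -> dict:
    """Hypothetical sketch of the image-lifting step: pull image entries
    out of chat-message content blocks into a top-level key, so a
    multimodal code path that only inspects example[image_key] fires and
    pixel_values reach the training forward pass."""
    images = []
    for message in example.get("prompt", []):
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if block.get("type") == "image" and "image" in block:
                    images.append(block["image"])
    if images:
        # Return a copy with the lifted key; leave the original untouched.
        example = {**example, image_key: images}
    return example
```

Without a step like this, generation (which reads the message content directly) still sees the images, but the training pass silently runs with pixel_values=None, exactly the failure mode described above.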
