
feat: add experimental native RL stack and arithmetic validation benchmark#6

Open
PastaPastaPasta wants to merge 27 commits into ARahim3:main from PastaPastaPasta:codex/rl-reference-grpo

Conversation

@PastaPastaPasta

Summary

This branch is the full native RL bring-up compared to main.

It adds the internal RL runtime, RL model-role plumbing, checkpoint/resume support, public RL API surface, multiple RL trainers/configs, a TRL-compat patch layer, and an experimental arithmetic GRPO validation benchmark for Qwen 3.

This is still very experimental: absolutely prototype-quality, heavily vibe-coded (GPT 5.4), and not tested to the level this much surface area would normally require.

The goal was to keep API compatibility with unsloth; I have not manually verified that to be the case.

What changed

  • adds native RL runtime infrastructure in mlx_tune/_rl_runtime.py
  • adds RL model-role builders, reward/value/reference helpers, and checkpoint role handling
  • adds public RL API helpers in mlx_tune/rl_api.py
  • extends exports in mlx_tune/__init__.py
  • adds or significantly expands trainers/configs for:
    • Reward modeling
    • DPO
    • ORPO
    • GRPO
    • PPO
    • OnlineDPO
    • KTO
    • SimPO
  • adds TRL compatibility patching in mlx_tune/trl_compat.py
  • updates examples, including a new Qwen 3 arithmetic GRPO validation benchmark
  • adds targeted RL/runtime/model-role/integration tests

Why

The goal of this branch is to get a real native RL training stack into the repo and make it testable with a deterministic benchmark.

The arithmetic benchmark is there mainly as a validation harness:

  • easy to generate
  • easy to score
  • no judge model
  • deterministic reward
  • useful for checking whether GRPO is actually moving policy behavior

In a local proof run with the new arithmetic benchmark, held-out exact match and solution-tag adherence improved materially after GRPO. That is encouraging, but it should be treated as an experiment, not proof that the broader stack is production-ready.

Important caveats

This PR is large and risky.

  • It changes a lot of training/runtime surface area at once.
  • The implementation is experimental and was put together quickly.
  • A meaningful amount of it is vibe coded.
  • There are tests, but relative to the size of this change the testing is still minimal.
  • I would not treat the current APIs or behaviors as stable.
  • I would expect follow-up fixes, edge cases, and cleanup.

Suggested reviewer mindset

Please review this as:

  • a large experimental RL branch
  • an attempt to get end-to-end functionality working
  • not a polished or production-ready training subsystem

I would focus on:

  • correctness of trainer semantics
  • checkpoint/resume behavior
  • reward/data plumbing
  • rollout/runtime edge cases
  • API shape and maintenance risk
  • obvious regressions against existing training flows

Validation performed

Locally, I ran targeted RL tests plus a real arithmetic GRPO proof run:

  • baseline eval
  • GRPO training
  • post-RL eval and comparison

That gives some confidence the path is live, but not nearly enough confidence for the full scope of this branch.


How To Try It

If you want to sanity check that the RL path actually works, the easiest entrypoint is the arithmetic GRPO benchmark.

1. Generate a deterministic dataset

python examples/10_qwen3_arithmetic_grpo_validation.py generate \
  --output-dir /tmp/qwen3_arith_demo \
  --train-size 256 \
  --val-size 32 \
  --test-size 32 \
  --force-generate

2. Measure the zero-shot baseline

python examples/10_qwen3_arithmetic_grpo_validation.py baseline \
  --output-dir /tmp/qwen3_arith_demo \
  --model-name mlx-community/Qwen3-1.7B-4bit \
  --max-completion-length 128 \
  --max-seq-length 384

This writes:

  • /tmp/qwen3_arith_demo/baseline_outputs.jsonl
  • /tmp/qwen3_arith_demo/baseline_metrics.json

3. Run a small GRPO training pass

python examples/10_qwen3_arithmetic_grpo_validation.py train \
  --output-dir /tmp/qwen3_arith_demo \
  --model-name mlx-community/Qwen3-1.7B-4bit \
  --max-completion-length 128 \
  --max-seq-length 384 \
  --max-steps 30 \
  --per-device-train-batch-size 2 \
  --rollout-batch-size 2 \
  --num-generations 2 \
  --logging-steps 5 \
  --eval-steps 10 \
  --save-steps 10

This writes:

  • /tmp/qwen3_arith_demo/rl_training_summary.json
  • /tmp/qwen3_arith_demo/post_rl_outputs.jsonl
  • /tmp/qwen3_arith_demo/post_rl_metrics.json

4. Compare before vs after RL

python examples/10_qwen3_arithmetic_grpo_validation.py compare \
  --output-dir /tmp/qwen3_arith_demo

This writes:

  • /tmp/qwen3_arith_demo/comparison.json
  • /tmp/qwen3_arith_demo/comparison.md
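The comparison step boils down to a per-metric delta between the two eval runs. The sketch below illustrates that, using metric names mentioned elsewhere in this PR; the actual `comparison.json` schema is an assumption, and the numbers are placeholders, not results from the proof run:

```python
# Illustrative before/after delta, in the spirit of the `compare` step.
# The schema and numbers here are assumptions, not the real file contents.
def metric_deltas(baseline: dict, post_rl: dict) -> dict:
    """Post-RL minus baseline for every metric present in both runs."""
    return {k: post_rl[k] - baseline[k] for k in baseline if k in post_rl}

# Placeholder values for demonstration only.
baseline = {"exact_match": 0.05, "solution_tag_rate": 0.40, "avg_reward": 0.10}
post_rl = {"exact_match": 0.60, "solution_tag_rate": 0.95, "avg_reward": 0.65}
deltas = metric_deltas(baseline, post_rl)
```

Positive deltas across the board are the signal that the RL loop actually moved behavior.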

What to expect

The benchmark is intentionally simple:

  • the model can emit whatever it wants in <think>
  • only the integer inside <solution>...</solution> is scored
  • reward is deterministic

In my local proof run on this branch:

  • baseline exact match was very low
  • post-GRPO exact match and solution-tag adherence improved materially

So if the stack is functioning, you should usually see:

  • higher solution_tag_rate
  • higher avg_reward
  • often higher exact_match

Important caveats

This is only a smoke/proof path. It does not prove the whole RL stack is correct or stable; it is just the fastest way for a maintainer to watch an actual RL loop move model behavior on this branch.


ARahim3 commented Mar 7, 2026

Hi @PastaPastaPasta ,
Thanks a lot for the contribution - really appreciate the effort here. This is a substantial change, so I’m going to take some time to review it carefully before deciding on the scope and next steps.

@PastaPastaPasta
Author

Totally agree. The reason I built this is that I wanted to play with RL on my Mac but never could. I'm also going to be testing it in a not-so-toy project.

Figured it was better to open a PR than leave it sitting in my fork.

No pressure on review, but if you find things I'm happy to resolve them.

