feat: add experimental native RL stack and arithmetic validation benchmark #6
Open
PastaPastaPasta wants to merge 27 commits into ARahim3:main from
Conversation
Owner
Hi @PastaPastaPasta,
Author
Totally agree. The reason I built this is that I wanted to play with RL on my Mac but never could. I'm going to be testing it in a not-so-toy project as well. Figured it was better to open a PR than leave it sitting in my fork. No pressure on the review, but if you find things, I'm happy to resolve them.
Summary
This branch is the full native RL bring-up compared to main. It adds the internal RL runtime, RL model-role plumbing, checkpoint/resume support, a public RL API surface, multiple RL trainers/configs, a TRL-compat patch layer, and an experimental arithmetic GRPO validation benchmark for Qwen 3.
This is still very experimental. It is absolutely prototype-quality, heavily vibe-coded (GPT 5.4), and not tested to the level this amount of surface area would normally require.
The goal was to keep API compatibility with unsloth; I have not manually verified that to be the case.
What changed
- `mlx_tune/_rl_runtime.py`
- `mlx_tune/rl_api.py`
- `mlx_tune/__init__.py`
- `mlx_tune/trl_compat.py`
Why
The goal of this branch is to get a real native RL training stack into the repo and make it testable with a deterministic benchmark.
The arithmetic benchmark is there mainly as a validation harness:
In a local proof run with the new arithmetic benchmark, held-out exact match and solution-tag adherence improved materially after GRPO. That is encouraging, but it should be treated as an experiment, not proof that the broader stack is production-ready.
Important caveats
This PR is large and risky.
Suggested reviewer mindset
Please review this as:
I would focus on:
Validation performed
Locally, I ran targeted RL tests plus a real arithmetic GRPO proof run:
That gives some confidence the path is live, but not nearly enough confidence for the full scope of this branch.
How To Try It
If you want to sanity check that the RL path actually works, the easiest entrypoint is the arithmetic GRPO benchmark.
1. Generate a deterministic dataset
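The actual dataset command for this step was not captured above, so here is a hypothetical, minimal generator instead (the function name, output path, and JSONL fields are illustrative, not the branch's real interface). It only shows the property that matters for this step: a fixed seed yields an identical dataset on every run.

```python
import json
import operator
import os
import random

# Supported operations for the toy arithmetic problems.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_arith_dataset(path: str, n: int = 200, seed: int = 0) -> None:
    """Write n seeded arithmetic problems as JSONL records."""
    rng = random.Random(seed)  # deterministic: same seed -> same dataset
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        for _ in range(n):
            a, b = rng.randint(10, 99), rng.randint(10, 99)
            op = rng.choice(list(OPS))
            record = {
                "prompt": f"Compute {a} {op} {b}. Put only the result in <solution> tags.",
                "answer": str(OPS[op](a, b)),
            }
            f.write(json.dumps(record) + "\n")

make_arith_dataset("/tmp/qwen3_arith_demo/train.jsonl")
```

Determinism is the whole point here: a reproducible problem set is what lets the before/after comparison in step 4 mean anything.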
2. Measure the zero-shot baseline
This writes:
- `/tmp/qwen3_arith_demo/baseline_outputs.jsonl`
- `/tmp/qwen3_arith_demo/baseline_metrics.json`
3. Run a small GRPO training pass
This writes:
- `/tmp/qwen3_arith_demo/rl_training_summary.json`
- `/tmp/qwen3_arith_demo/post_rl_outputs.jsonl`
- `/tmp/qwen3_arith_demo/post_rl_metrics.json`
4. Compare before vs after RL
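The branch ships its own comparison step; since that command was not captured above, here is a hedged stand-in (the function name and file layout are assumptions) that diffs two flat metrics JSON files the way the before/after comparison describes:

```python
import json

def compare_metrics(baseline_path: str, post_path: str, out_path: str) -> dict:
    """Hypothetical stand-in for the branch's compare step.

    Assumes each metrics file is a flat JSON dict of floats,
    e.g. {"exact_match": ..., "solution_tag_rate": ...}.
    """
    with open(baseline_path) as f:
        base = json.load(f)
    with open(post_path) as f:
        post = json.load(f)
    # delta > 0 means the metric improved after GRPO
    diff = {
        k: {"baseline": base[k], "post_rl": post[k], "delta": post[k] - base[k]}
        for k in base if k in post
    }
    with open(out_path, "w") as f:
        json.dump(diff, f, indent=2)
    return diff
```

With the paths from steps 2 and 3, this would be called as `compare_metrics("/tmp/qwen3_arith_demo/baseline_metrics.json", "/tmp/qwen3_arith_demo/post_rl_metrics.json", "/tmp/qwen3_arith_demo/comparison.json")`.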
This writes:
- `/tmp/qwen3_arith_demo/comparison.json`
- `/tmp/qwen3_arith_demo/comparison.md`
What to expect
The benchmark is intentionally simple:
- responses use `<think>` reasoning plus `<solution>...</solution>` tags; the `<solution>` content is scored
In my local proof run on this branch:
So if the stack is functioning, you should usually see:
- `solution_tag_rate`
- `avg_reward`
- `exact_match`
Important caveats
This is only a smoke/proof path.
It does not prove the whole RL stack is correct or stable.
It is just the fastest way for a maintainer to see an actual RL loop move model behavior in this branch.
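For reviewers who want the scoring rule concretely: here is an illustrative sketch of `<solution>`-tag exact match and tag adherence. The branch's actual reward/metric code may parse or normalize differently; this is only to show the shape of the check.

```python
import re

# Non-greedy match of the first <solution>...</solution> span;
# DOTALL lets the answer span multiple lines.
_SOLUTION_RE = re.compile(r"<solution>(.*?)</solution>", re.DOTALL)

def score(response: str, gold: str) -> dict:
    """Score one model response against the gold answer."""
    m = _SOLUTION_RE.search(response)
    has_tag = m is not None
    pred = m.group(1).strip() if has_tag else ""
    return {
        "solution_tag_rate": 1.0 if has_tag else 0.0,       # format adherence
        "exact_match": 1.0 if pred == gold.strip() else 0.0,  # held-out accuracy
    }
```

A response with no `<solution>` tag scores zero on both metrics, which is why `solution_tag_rate` tends to move first during training.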