Verl is the most popular open-source reinforcement learning framework for LLMs, supporting PPO, GRPO, and other algorithms.
Also see search-tooling/ and this blog for tool-augmented “search” workflows (Search-R1 style), including Google Search–backed inference and a Wikipedia FAISS retrieval service used for inference and training.
SkyPilot makes RL training easy and cost-effective:
- Get GPUs instantly across clouds and Kubernetes
- 3x cheaper with managed spot instances
- Zero setup - handles distributed Ray clusters automatically
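Each launch below points at a SkyPilot task YAML in the repo. As a rough sketch only, a minimal SkyPilot task for RL training could look like the following (field names follow SkyPilot's task YAML format, but every value here is an illustrative assumption, not the contents of `llm/verl/verl-ppo.yaml`):

```yaml
# Illustrative sketch of a SkyPilot task for verl training.
# The actual recipes in llm/verl/ will differ.
resources:
  accelerators: A100:8   # any cloud or Kubernetes cluster with this GPU qualifies
  use_spot: true         # assumption: spot instances for cost savings

num_nodes: 1

secrets:
  WANDB_API_KEY: null    # injected at launch time via --secret WANDB_API_KEY

setup: |
  # assumption: install verl and its dependencies
  pip install verl

run: |
  # assumption: a PPO entrypoint; the real recipe invokes verl's trainer
  python -m verl.trainer.main_ppo
```

SkyPilot resolves the `resources` request across clouds and Kubernetes and, for multi-node tasks, brings up the Ray cluster automatically.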
Launch single-node agent training:

```bash
# PPO
sky launch -c verl-ppo llm/verl/verl-ppo.yaml --secret WANDB_API_KEY --num-nodes 1 -y

# PPO, passing a Hugging Face token for gated models
sky launch -c verl-ppo llm/verl/verl-ppo.yaml --secret WANDB_API_KEY --secret HF_TOKEN --num-nodes 1 -y

# GRPO
sky launch -c verl-grpo llm/verl/verl-grpo.yaml --secret WANDB_API_KEY --num-nodes 1 -y

# GRPO, passing a Hugging Face token for gated models
sky launch -c verl-grpo llm/verl/verl-grpo.yaml --secret WANDB_API_KEY --secret HF_TOKEN --num-nodes 1 -y
```

Launch a 2-node RLHF training job on the cheapest available GPUs:
```bash
sky launch -c verl llm/verl/multinode.yaml
```

Monitor training progress:
```bash
sky logs verl
```

Training logs show PPO optimization progress with reward metrics.
Access the Ray dashboard:

```bash
sky status --endpoint 8280 verl
```

The Ray dashboard provides real-time monitoring of distributed training across all nodes.

