
Awesome RL for AI Agents

A curated list of recent progress and resources on Reinforcement Learning for AI Agents.



Reinforcement learning (RL) is rapidly becoming a driving force for AI agents that can reason, act, and adapt in the real world. While large language models (LLMs) provide powerful priors for reasoning, they remain static without feedback. RL closes this gap by enabling agents to learn from interactions—through self-reflection, outcome-based rewards, and tool or human feedback.

This repository curates up-to-date resources on RL for AI agents, organized along three main axes:

  • Agentic workflows without training – prompting strategies that enhance reasoning without fine-tuning.
  • Evaluation and benchmarks – systematic tests for reasoning, tool use, and automation.
  • RL for single and multi-agent systems – advancing self-evolution, efficient tool use, and collaboration.

Tables provide quick overviews, while accompanying descriptions highlight deeper insights.
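The interaction-and-feedback loop described above can be sketched minimally: an agent acts in an environment, receives a scalar outcome reward at the end of the episode, and a learner consumes the trajectory. All names here (ToyEnv, rollout, the reward formula) are illustrative stand-ins, not from any particular framework.

```python
# Minimal sketch of outcome-based reward: the agent acts until the episode
# ends, then the whole trajectory is scored once. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class ToyEnv:
    target: int = 3
    state: int = 0
    def step(self, action: int) -> tuple[int, bool]:
        self.state += action
        done = self.state >= self.target
        return self.state, done

@dataclass
class Trajectory:
    actions: list = field(default_factory=list)
    reward: float = 0.0

def rollout(env: ToyEnv, policy) -> Trajectory:
    traj = Trajectory()
    done = False
    while not done:
        action = policy(env.state)
        traj.actions.append(action)
        _, done = env.step(action)
    # Outcome-based reward: fewer steps to the target scores higher.
    traj.reward = 1.0 / len(traj.actions)
    return traj

traj = rollout(ToyEnv(), policy=lambda s: 1)
print(traj.reward)  # 3 steps of +1 → reward 1/3
```

In a real agentic RL setup the environment is a tool sandbox or web browser and the policy is an LLM, but the trajectory-then-reward shape is the same.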


Agentic Workflows without Training

Title Short title Venue Year Materials Description
Tree of Thoughts: Deliberate Problem Solving with Large Language Models ToT ICML 2023 Paper Search over reasoning trees to explore alternatives before committing.
Reflexion: Language Agents with Verbal Reinforcement Learning Reflexion NeurIPS 2023 Paper Self-critique and retry loops that emulate feedback without training.
Self-Refine: Iterative Refinement with Self-Feedback Self-Refine NeurIPS 2023 Paper Iterative editing using self-generated feedback to improve outputs.
ReAct: Synergizing Reasoning and Acting in Language Models ReAct ICLR 2023 Paper Interleaves chain-of-thought with tool calls for grounded reasoning.
SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks SwiftSage ACL 2023 Paper Splits fast vs slow planning to balance cost and performance.
DynaSaur: Large Language Agents Beyond Predefined Actions DynaSaur arXiv 2024 Paper Dynamically extends the agent’s action space beyond fixed tool sets.

Agent Evaluation and Benchmarks

Title Short title Venue Year Materials Description
GAIA: A Benchmark for General AI Assistants GAIA arXiv 2023 Paper 466 real-world tasks spanning tools and reasoning.
TaskBench: Benchmarking Large Language Models for Task Automation TaskBench EMNLP 2023 Paper Evaluates multi-step automation and tool integration.
AgentBench: Evaluating LLMs as Agents AgentBench arXiv 2023 Paper 51 scenarios to test agentic behaviors and robustness.
ACEBench: Who Wins the Match Point in Tool Usage? ACEBench arXiv 2025 Paper Fine-grained tool-use evaluation with step sensitivity.
Agent Leaderboard (Galileo) Galileo LB HF 2024 Dataset Community leaderboard built around GAIA-style tasks.
Agentic Predictor: Performance Prediction for Agentic Workflows Agentic Predictor arXiv 2025 Paper Predicts workflow performance for better design-time choices.

Agent Training Frameworks

Title Short title Year 🌟 Stars Materials Description
Agent Lightning: Train ANY AI Agents with Reinforcement Learning Agent Lightning 2025 Paper | Code Unified MDP; decouples execution and training with scalable workers.
SkyRL-v0: Train Real-World Long-Horizon Agents via RL SkyRL-v0 2025 Blog | Code Online RL pipeline for long-horizon agent training.
OpenManus-RL: Live-Streamed RL Tuning Framework for LLM Agents OpenManus-RL 2025 Code | Dataset Live-streamed tuning of LLM agents with dataset support.
MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems MASLab 2025 Paper | Code Unified MAS codebase integrating 20+ multi-agent system methods.
VerlTool: Towards Holistic Agentic RL with Tool Use VerlTool 2025 Paper | Code Modular ARLT; supports asynchronous rollouts.
L0: Reinforcement Learning to Become General Agents L0 2025 Paper | Code Scalable RL pipeline; NB-Agent scaffold; concurrent worker pool.
verl-agent: Extension of veRL for LLM Agents verl-agent 2025 Code Step-independent multi-turn rollouts; memory modules; GiGPO RL algorithm.
ART: Agent Reinforcement Trainer ART 2025 Code Python harness for GRPO-based RL; OpenAI API-compatible; notebook examples.
AReaL: Ant Reasoning RL for LLMs AReaL 2025 Paper | Code Fully async RL system; scalable from 1→1K GPUs; open & reproducible.
Agent-R1: End-to-End RL for Tool-using Agents Agent-R1 2025 -- (No public repo found) Multi-tool coordination; process rewards; reward normalization (described, not yet open).
siiRL: Scalable Infrastructure for Interactive RL Agents siiRL 2025 Paper | Code Infrastructure and algorithms for large-scale RL training.
slime: Self-Improving LLM Agents slime 2025 Blog | Code Continuous improvement framework for LLM agents.
ROLL: RL for Open-Ended LLM Agents ROLL 2025 Paper | Code Alibaba’s RL framework for multi-task LLM agents.
MARTI: Multi-Agent RL with Tool Integration MARTI 2025 Code Tsinghua’s tool-augmented multi-agent RL.
RL2: Reinforcement Learning Reloaded RL2 2025 Code Individual repo exploring advanced RL.
verifiers: Benchmarking LLM Verification verifiers 2025 Code Verification-focused RL experiments.
oat: Optimizing Agent Training oat 2024 Paper | Code NUS / Sea AI’s agent optimization framework.
veRL: Volcengine RL Framework veRL 2024 Paper | Code ByteDance’s general-purpose RL framework.
OpenRLHF: Open Reinforcement Learning from Human Feedback OpenRLHF 2023 Paper | Code Open-source RLHF training platform.
TRL: Transformer Reinforcement Learning TRL 2019 Code HuggingFace’s RL library for transformers.

RL for Single Agent

Reinforcement learning methods that focus on individual agents (typically LLMs), enabling them to adapt, self-improve, and use tools effectively.

Title Short title Venue Year Materials Description
TTRL: Test-Time Reinforcement Learning TTRL ICLR 2025 Paper Inference-time RL via majority-vote rewards.
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models ProRL ICLR 2025 Paper KL-control with reference resets for longer reasoning.
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning RAGEN / StarPO ICLR 2025 Paper Multi-turn critic-based RL for evolving behaviors.
Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution Alita GAIA LB 2025 Paper Modular framework for online self-evolution.
Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement Gödel Agent ACL / arXiv 2024–2025 Paper Recursive self-modification with reasoning loops.
Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents Darwin GM arXiv 2025 Paper Darwinian exploration for open-ended agent improvement.
SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning SkyRL-v0 arXiv / GitHub 2025 Blog | Code Long-horizon online RL training pipeline.

RL for Tool Use & Agent Training

Title Short title Venue Year Materials Description
AGILE: RL Framework for LLM Agents AGILE arXiv 2024 Paper Combines RL, memory, and tool use.
AgentOptimizer: Functions as Learnable Weights AgentOptimizer ICML 2024 Paper Offline training with learnable tool weights.
FireAct: Fine-tuning LLM Agents FireAct arXiv 2023 Paper Multi-task SFT baseline for RL comparison.
Tool-Integrated Reinforcement Learning ToRL arXiv 2025 Paper Large-scale tool-integrated RL training.
ToolRL: Reward is All Tool Learning Needs ToolRL arXiv 2025 Paper Studies reward shaping for tool use.
ARTIST: Unified Reasoning & Tools ARTIST arXiv 2025 Paper Joint reasoning + tool integration.
ZeroTIR: Scaling Law for Tool RL ZeroTIR arXiv 2025 Paper Scaling behavior of tool-augmented RL.
OTC: Acting Less is Reasoning More OTC arXiv 2025 Paper Optimizes efficiency by reducing unnecessary tool calls.
WebAgent-R1: End-to-End Multi-Turn RL WebAgent-R1 arXiv 2025 Paper Trains web agents on multi-turn environments.
GiGPO: Group-in-Group PPO GiGPO arXiv 2025 Paper Hierarchical PPO for agent training.
Nemotron-Research-Tool-N1 Nemotron-Tool-N1 arXiv 2025 Paper Pure RL setup for tool reasoning.
CATP-LLM: Cost-Aware Tool Planning CATP-LLM ICCV / arXiv 2024–2025 Paper | Code Optimizes tool usage under cost constraints.
Tool-Star: Multi-Tool RL via Hierarchical Rewards Tool-Star arXiv 2025 Paper Reinforcement with structured multi-tool reasoning.
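A recurring idea in this table (most explicitly in OTC's "acting less is reasoning more") is reward shaping that penalizes tool calls: correct outcomes earn a base reward and each call subtracts a small cost, pushing the agent toward minimal tool use. The weights below are illustrative, not taken from any of the papers.

```python
# Cost-penalized reward shaping for tool use (illustrative weights).
def shaped_reward(correct: bool, tool_calls: int,
                  base: float = 1.0, call_cost: float = 0.1) -> float:
    outcome = base if correct else 0.0
    return outcome - call_cost * tool_calls

# A correct answer with 2 tool calls beats a correct answer with 6:
print(shaped_reward(True, 2))   # 0.8
print(shaped_reward(True, 6))   # 0.4
print(shaped_reward(False, 0))  # 0.0
```

Choosing `call_cost` is the delicate part: too high and the policy stops calling tools even when it needs them, too low and the penalty has no effect.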

Memory & Knowledge Management

Title Short title Venue Year Materials Description
Memory-R1: RL Memory Manager Memory-R1 arXiv 2025 Paper RL-based memory controller for better retrieval.
A-MEM: Agentic Memory for LLM Agents A-MEM arXiv 2025 Paper Zettelkasten-style dynamic memory management.
KnowAgent: Knowledge-Augmented Planning KnowAgent NAACL Findings 2025 Paper Planning with structured knowledge bases.

Fine-Grained RL & Trajectory Calibration

Title Short title Venue Year Materials Description
StepTool: Multi-Step Tool Usage StepTool CIKM 2025 Paper Step-grained rewards for tool usage.
RLTR: Process-Centric Rewards RLTR arXiv 2025 Paper Rewards good reasoning trajectories, not just outcomes.
SPA-RL: Stepwise Progress Attribution SPA-RL arXiv 2025 Paper Credits progress at intermediate steps.
STeCa: Step-Level Trajectory Calibration STeCa ACL Findings 2025 Paper Calibrates suboptimal steps for better learning.
SWEET-RL: Multi-Turn Collaborative RL SWEET-RL arXiv 2025 Paper Multi-turn reasoning with collaborative critic.
ATLaS: Critical Step Selection ATLaS ACL 2025 Paper Focuses learning on critical reasoning steps.

Algorithm Families (PPO, DPO, GRPO, etc.)

Summarizes key algorithm families, objectives, and available implementations.

Method Year Objective Clip KL Penalty Mechanism Signal Link Resource
PPO family
PPO 2017 Policy gradient Yes No Policy ratio clipping Reward Paper -
VAPO 2025 Policy gradient Yes Adaptive Adaptive KL penalty + variance control Reward + variance Paper -
PF-PPO 2024 Policy gradient Yes Yes Policy filtration Noisy reward Paper Code
VinePPO 2024 Policy gradient Yes Yes Unbiased value estimates Reward Paper Code
PSGPO 2024 Policy gradient Yes Yes Process supervision Process reward Paper -
DPO family
DPO 2023 Preference optimization No Yes Implicit reward Human preference Paper -
β-DPO 2024 Preference optimization No Adaptive Dynamic KL coefficient Human preference Paper Code
SimPO 2024 Preference optimization No Scaled Avg log-prob as implicit reward Human preference Paper Code
IPO 2024 Identity preference optimization No No Preference classification Rank Paper -
KTO 2024 Kahneman-Tversky optimization No Yes Prospect-theoretic value function Binary desirability Paper Code
ORPO 2024 Odds ratio preference optimization No No Odds-ratio penalty on SFT loss (reference-free) Human preference Paper Code
Step-DPO 2024 Step-wise preference No Yes Step-level supervision Step preference Paper Code
LCPO 2025 Length-conditioned PO No Yes Length preference Reward Paper -
GRPO family
GRPO 2024 Policy gradient (group reward) Yes Yes Group-based relative reward, no value estimates Group reward Paper -
DAPO 2025 Surrogate of GRPO Yes Yes Decoupled clip + dynamic sampling Dynamic group reward Paper Code | Model | Website
GSPO 2025 Surrogate of GRPO Yes Yes Sequence-level clipping & reward Smooth group reward Paper -
GMPO 2025 Surrogate of GRPO Yes Yes Geometric mean of token rewards Margin-based reward Paper Code
ProRL 2025 Same as GRPO Yes Yes Reference policy reset Group reward Paper Model
Posterior-GRPO 2025 Same as GRPO Yes Yes Rewards only successful processes Process reward Paper -
Dr.GRPO 2025 Unbiased GRPO Yes Yes Removes bias in optimization Group reward Paper Code | Model
Step-GRPO 2025 Same as GRPO Yes Yes Rule-based reasoning reward Step-wise reward Paper Code | Model
SRPO 2025 Same as GRPO Yes Yes Two-stage history resampling Reward Paper Model
GRESO 2025 Same as GRPO Yes Yes Pre-rollout filtering Reward Paper Code | Website
StarPO 2025 Same as GRPO Yes Yes Reasoning-guided multi-turn Group reward Paper Code | Website
GHPO 2025 Policy gradient Yes Yes Adaptive prompt refinement Reward Paper Code
Skywork R1V2 2025 GRPO (hybrid signal) Yes Yes Selective buffer, multimodal reward Multimodal Paper Code | Model
ASPO 2025 GRPO (shaped advantage) Yes Yes Clipped advantage bias Group reward Paper -
TreePO 2025 Surrogate of GRPO Yes Yes Self-guided rollout Group reward Paper Code | Model | Website
EDGE-GRPO 2025 Same as GRPO Yes Yes Entropy-driven advantage + error correction Group reward Paper Code | Model
DARS 2025 Same as GRPO Yes No Multi-stage hardest problems Group reward Paper Code | Model
CHORD 2025 Weighted GRPO + SFT Yes Yes Auxiliary supervised loss Group reward Paper Code
PAPO 2025 Surrogate of GRPO Yes Yes Implicit perception loss Group reward Paper Code | Model | Website
Pass@k Training 2025 Same as GRPO Yes Yes Pass@k metric as reward Group reward Paper Code
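The two mechanisms the table keeps referring to are easy to state concretely: PPO's ratio clipping, and GRPO's group-relative advantage (each reward normalized against its group's mean and standard deviation, with no learned value function). This is a plain-Python illustration of the math, not a training implementation.

```python
# PPO clipping and GRPO group-relative advantages, in isolation.
import math

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    # Normalize each reward against the group's mean/std (no critic).
    mean = sum(group_rewards) / len(group_rewards)
    var = sum((r - mean) ** 2 for r in group_rewards) / len(group_rewards)
    std = math.sqrt(var) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in group_rewards]

def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    # Pessimistic min over the raw and clipped surrogate objectives.
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)

adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)                      # [1.0, -1.0, -1.0, 1.0]
print(ppo_clip_term(1.5, 1.0))  # 1.2 — ratio clipped to 1 + eps
```

Most of the GRPO variants in the table keep exactly this advantage computation and change what feeds it: the reward definition, the sampling of the group, or the clipping geometry.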

Cost-Aware Reasoning & Budget-Constrained RL

As agents scale, cost, latency, and efficiency become critical. These works tackle budget-aware reasoning, token efficiency, and cost-sensitive planning.

Title Short title Venue Year Materials Description
Cost-Augmented Monte Carlo Tree Search for LLM-Assisted Planning CATS arXiv 2025 Paper Incorporates cost into MCTS for planning under constraints.
Token-Budget-Aware LLM Reasoning TALE arXiv 2024 Paper Allocates token budget optimally across reasoning steps.
FrugalGPT: Using LLMs While Reducing Cost FrugalGPT arXiv 2023 Paper Early exploration of cost minimization by routing queries.
Efficient Contextual LLM Cascades via Budget-Constrained Policy Learning TREACLE arXiv 2024 Paper Learns cascades balancing budget and accuracy.
BudgetMLAgent: Cost-Effective Multi-Agent System for ML Automation BudgetMLAgent AIMLSystems 2025 Multi-agent framework designed for cost efficiency.
The Cost of Dynamic Reasoning: A Systems View Systems Cost arXiv 2025 Paper Measures latency, energy, and financial cost of agent reasoning.
Budget-Aware Evaluation of LLM Reasoning Strategies BudgetEval EMNLP 2024 Paper Proposes evaluation framework accounting for budget limits.
LLM Cascades with Mixture of Thoughts for Cost-Efficient Reasoning MoT Cascade ICLR / arXiv 2024 Paper | Code Uses “mixture of thoughts” cascades for efficiency.
BudgetThinker: Budget-Aware LLM Reasoning with Control Tokens BudgetThinker arXiv 2025 Paper Introduces control tokens to manage budget during inference.
Beyond One-Preference-Fits-All Alignment: Multi-Objective DPO MODPO arXiv 2023–2024 Paper Extends DPO with multi-objective alignment.
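The cascade idea behind FrugalGPT and TREACLE can be sketched in a few lines: try cheap models first and escalate only while the budget allows and confidence stays low. Model names, costs, and the confidence threshold below are all illustrative assumptions.

```python
# Budget-constrained model cascade (FrugalGPT/TREACLE-style sketch).
def cascade(query: str, models: list[tuple[str, float]],
            answer_fn, budget: float, threshold: float = 0.8):
    spent = 0.0
    answer = None
    for name, cost in models:        # ordered cheapest → most capable
        if spent + cost > budget:    # respect the hard budget
            break
        spent += cost
        answer, conf = answer_fn(name, query)
        if conf >= threshold:        # confident enough → stop escalating
            break
    return answer, spent

# Stand-in scorer: the big model is confident, the small one is not.
def fake_answer(model: str, query: str):
    return (f"{model}:{query}", 0.9 if model == "big" else 0.5)

print(cascade("q", [("small", 1.0), ("big", 10.0)], fake_answer, budget=20.0))
# ('big:q', 11.0) — escalated once, total spend 11.0
```

The learning-based versions in the table replace the fixed threshold with a trained routing policy, but the budget accounting is the same.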

RL for Multi-Agent Systems

Planning

Title Short title Venue Year Materials Description
OWL: Optimized Workforce Learning for Real-World Automation OWL arXiv 2025 Paper Planner + workers
Profile-Aware Maneuvering for GAIA by AWorld AWorld NeurIPS 2024 Paper Guard agents
Plan-over-Graph: Towards Parallelable Agent Schedule Plan-over-Graph arXiv 2025 Paper Graph scheduling
LLM-Based Multi-Agent Reinforcement Learning: Directions MARL Survey arXiv 2024 Paper Survey
Self-Resource Allocation in Multi-Agent LLM Systems Self-ResAlloc arXiv 2025 Paper Planner vs orchestrator
MASLab (duplicate listing) MASLab arXiv 2025 Paper Unified MAS APIs
Dynamic Speculative Agent Planning DSP arXiv 2025 Paper Lossless agent planning acceleration via dynamic speculation; trades off latency vs cost; no pre-deployment setup.

Collaboration

Title Short title Venue Year Materials Description
ACC-Collab: Actor-Critic for Multi-Agent Collaboration ACC-Collab ICLR 2025 Paper Joint actor-critic
Chain of Agents: Collaborating on Long-Context Tasks Chain of Agents arXiv 2024 Paper Long-context chains
Scaling LLM-Based Multi-Agent Collaboration Scaling MAC arXiv 2024 Paper Scaling study
MMAC-Copilot: Multi-Modal Agent Collaboration MMAC-Copilot arXiv 2024 Paper Multi-modal collab
CORY: Sequential Cooperative Multi-Agent Reinforcement Learning CORY NeurIPS 2024 Paper | Code Role-swapping PPO
OpenManus-RL (duplicate listing) OpenManus-RL GitHub 2025 Code Live-streamed tuning
MAPoRL: Multi-Agent Post-Co-Training with Reinforcement Learning MAPoRL arXiv 2025 Paper Co-refine + verifier

Embodied Agents & World Models

Title Short title Venue Year Materials Description
An Embodied Generalist Agent in 3D World LEO ICML 2024 Paper 3D embodied agent
DreamerV3: Mastering Diverse Domains through World Models DreamerV3 arXiv 2023 Paper World-model RL
World-Model-Augmented Web Agent WMA Web Agent ICLR/arXiv 2025 Paper Simulative web agent
WorldCoder: Model-Based LLM Agent WorldCoder arXiv 2024 Paper Code-based world model
WALL-E 2.0: World Alignment via Neuro-Symbolic Learning WALL-E 2.0 arXiv 2025 Paper Neuro-symbolic alignment
WorldLLM: Curiosity-Driven World Modeling WorldLLM arXiv 2025 Paper Curiosity + world model
SimuRA: Simulative Reasoning Architecture with World Model SimuRA arXiv 2025 Paper Mental simulation

Surveys & Position Papers

Title Short title Venue Year Materials Description
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey ARL-Surv arXiv 2025 Paper Comprehensive ARL landscape
Budget-Aware Evaluation of LLM Reasoning Strategies BudgetEval EMNLP 2024 Paper Budget-aware reasoning evaluation
Alignment & Preference Optimization in LLM Agents Align-Pos arXiv 2023 Paper Alignment and multi-objective methods
A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence SE Survey arXiv 2025 Paper Taxonomy and methods for self-evolving agents.

Task Agents

Search & Research Agents

Method Category Base LLM Link Resource
DeepRetrieval External Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct Paper Code
Search-R1 External Qwen2.5-3B/7B-Base/Instruct Paper Code
R1-Searcher External Qwen2.5-7B, Llama3.1-8B-Instruct Paper Code
WebThinker External QwQ-32B, DeepSeek-R1-Distilled-Qwen-7B/14B/32B Paper Code
WebSailor External Qwen2.5-3B/7B/32B/72B Paper Code
SSRL Internal Qwen2.5-1.5B/3B/7B/14B/32B/72B-Instruct, Llama-3.2-1B/8B-Instruct Paper Code
OpenAI Deep Research External OpenAI Models Blog Website
Perplexity DeepResearch External - Blog Website

Code Agents

Method RL Reward Type Base LLM Link Resource
AceCoder Outcome Qwen2.5-Coder-7B-Base/Instruct Paper Code
DeepCoder-14B Outcome DeepSeek-R1-Distilled-Qwen-14B Blog Code
CodeBoost Process Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct Paper Code
R1-Code-Interpreter Outcome Qwen2.5-7B/14B-Instruct-1M Paper Code
SWE-RL Outcome Llama-3.3-70B-Instruct Paper Code
Satori-SWE Outcome Qwen-2.5-Math-7B Paper Code

Mathematical Agents

Method Reward Link Resource
ARTIST Outcome Paper -
ToRL Outcome Paper Code
ZeroTIR Outcome Paper Code
TTRL Outcome Paper Code
DeepSeek-Prover-v1.5 Formal Paper Code
Leanabell-Prover Formal Paper Code

GUI Agents

Method Paradigm Environment Link Resource
MM-Navigator Vanilla VLM - Paper Code
SeeAct Vanilla VLM - Paper Code
GUI-R1 RL Static Paper Code
UI-R1 RL Static Paper Code
InFiGUI-R1 RL Static Paper Code
UI-TARS RL Interactive Paper Code

Concluding Remarks

Reinforcement learning for AI agents is evolving rapidly, driving breakthroughs in reasoning, autonomy, and collaboration. As new methods and frameworks emerge, staying current is essential for both research and practical deployment. This curated list aims to help the community navigate this dynamic landscape and contribute to it.

💡 Pull requests welcome to keep this list up to date!

References

https://github.com/xhyumiracle/Awesome-AgenticLLM-RL-Papers

https://github.com/0russwest0/Awesome-Agent-RL

https://github.com/thinkwee/AgentsMeetRL

🌟 If you find this resource helpful, star the repo and share your favorite RL agent papers or frameworks! Let's build the future of intelligent agents together.
