A curated list of recent progress and resources on Reinforcement Learning for AI Agents.
Reinforcement learning (RL) is rapidly becoming a driving force for AI agents that can reason, act, and adapt in the real world. While large language models (LLMs) provide powerful priors for reasoning, they remain static without feedback. RL closes this gap by enabling agents to learn from interactions—through self-reflection, outcome-based rewards, and tool or human feedback.
This repository curates up-to-date resources on RL for AI agents, organized along three main axes:
- Agentic workflows without training – prompting strategies that enhance reasoning without fine-tuning.
- Evaluation and benchmarks – systematic tests for reasoning, tool use, and automation.
- RL for single and multi-agent systems – advancing self-evolution, efficient tool use, and collaboration.
Tables provide quick overviews, while accompanying descriptions highlight deeper insights.
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Tree of Thoughts: Deliberate Problem Solving with Large Language Models | ToT | NeurIPS | 2023 | Paper | Search over reasoning trees to explore alternatives before committing. |
| Reflexion: Language Agents with Verbal Reinforcement Learning | Reflexion | NeurIPS | 2023 | Paper | Self-critique and retry loops that emulate feedback without training. |
| Self-Refine: Iterative Refinement with Self-Feedback | Self-Refine | NeurIPS | 2023 | Paper | Iterative editing using self-generated feedback to improve outputs. |
| ReAct: Synergizing Reasoning and Acting in Language Models | ReAct | ICLR | 2023 | Paper | Interleaves chain-of-thought with tool calls for grounded reasoning. |
| SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks | SwiftSage | NeurIPS | 2023 | Paper | Splits fast vs slow planning to balance cost and performance. |
| DynaSaur: Large Language Agents Beyond Predefined Actions | DynaSaur | arXiv | 2024 | Paper | Dynamically extends the agent’s action space beyond fixed tool sets. |
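
The self-critique and retry pattern shared by the Reflexion and Self-Refine entries above can be sketched as a plain control loop. This is a minimal illustration rather than either paper's implementation: `refine_loop`, `toy_generate`, and `toy_critique` are hypothetical names, and in practice both callables would wrap LLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: in practice both callables would wrap LLM calls.
Generate = Callable[[str, List[str]], str]
Critique = Callable[[str, str], Tuple[bool, str]]


def refine_loop(task: str, generate: Generate, critique: Critique,
                max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Draft, self-critique, and retry until the critic accepts or rounds run out."""
    feedback_log: List[str] = []
    draft = generate(task, feedback_log)
    for _ in range(max_rounds):
        accepted, feedback = critique(task, draft)
        if accepted:
            break
        feedback_log.append(feedback)          # verbal feedback, no weight updates
        draft = generate(task, feedback_log)   # retry conditioned on past feedback
    return draft, feedback_log


# Toy stand-ins so the sketch runs without a model.
def toy_generate(task: str, feedback: List[str]) -> str:
    return f"{task} (revision {len(feedback)})"


def toy_critique(task: str, draft: str) -> Tuple[bool, str]:
    return draft.endswith("(revision 2)"), f"rejected: {draft}"


final, log = refine_loop("summarize", toy_generate, toy_critique)
print(final, len(log))
```

The feedback stays in the prompt context (`feedback_log`) instead of being turned into a gradient, which is what makes this family training-free.
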
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| GAIA: A Benchmark for General AI Assistants | GAIA | arXiv | 2023 | Paper | 466 real-world tasks spanning tools and reasoning. |
| TaskBench: Benchmarking Large Language Models for Task Automation | TaskBench | EMNLP | 2023 | Paper | Evaluates multi-step automation and tool integration. |
| AgentBench: Evaluating LLMs as Agents | AgentBench | arXiv | 2023 | Paper | 8 environments to test agentic behaviors and robustness. |
| ACEBench: Who Wins the Match Point in Tool Usage? | ACEBench | arXiv | 2025 | Paper | Fine-grained tool-use evaluation with step sensitivity. |
| Agent Leaderboard (Galileo) | Galileo LB | HF | 2024 | Dataset | Community leaderboard built around GAIA-style tasks. |
| Agentic Predictor: Performance Prediction for Agentic Workflows | Agentic Predictor | arXiv | 2025 | Paper | Predicts workflow performance for better design-time choices. |
| Title | Short title | Year | Materials | Description |
|---|---|---|---|---|
| Agent Lightning: Train ANY AI Agents with Reinforcement Learning | Agent Lightning | 2025 | Paper, Code | Unified MDP; decouples execution and training with scalable workers. |
| SkyRL-v0: Train Real-World Long-Horizon Agents via RL | SkyRL-v0 | 2025 | Blog, Code | Online RL pipeline for long-horizon agent training. |
| OpenManus-RL: Live-Streamed RL Tuning Framework for LLM Agents | OpenManus-RL | 2025 | Code, Dataset | Live-streamed tuning of LLM agents with dataset support. |
| MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems | MASLab | 2025 | Paper, Code | Unified MAS codebase integrating 20+ multi-agent system methods. |
| VerlTool: Towards Holistic Agentic RL with Tool Use | VerlTool | 2025 | Paper, Code | Modular ARLT; supports asynchronous rollouts. |
| L0: Reinforcement Learning to Become General Agents | L0 | 2025 | Paper, Code | Scalable RL pipeline; NB-Agent scaffold; concurrent worker pool. |
| verl-agent: Extension of veRL for LLM Agents | verl-agent | 2025 | Code | Step-independent multi-turn rollouts; memory modules; GiGPO RL algorithm. |
| ART: Agent Reinforcement Trainer | ART | 2025 | Code | Python harness for GRPO-based RL; OpenAI API-compatible; notebook examples. |
| AReaL: Ant Reasoning RL for LLMs | AReaL | 2025 | Paper, Code | Fully async RL system; scalable from 1→1K GPUs; open & reproducible. |
| Agent-R1: End-to-End RL for Tool-using Agents | Agent-R1 | 2025 | -- (no public repo found) | Multi-tool coordination; process rewards; reward normalization (described, not yet open). |
| siiRL: Scalable Infrastructure for Interactive RL Agents | siiRL | 2025 | Paper, Code | Infrastructure and algorithms for large-scale RL training. |
| slime: Self-Improving LLM Agents | slime | 2025 | Blog, Code | Continuous improvement framework for LLM agents. |
| ROLL: RL for Open-Ended LLM Agents | ROLL | 2025 | Paper, Code | Alibaba’s RL framework for multi-task LLM agents. |
| MARTI: Multi-Agent RL with Tool Integration | MARTI | 2025 | Code | Tsinghua’s tool-augmented multi-agent RL. |
| RL2: Reinforcement Learning Reloaded | RL2 | 2025 | Code | Individual repo exploring advanced RL. |
| verifiers: Benchmarking LLM Verification | verifiers | 2025 | Code | Verification-focused RL experiments. |
| oat: Optimizing Agent Training | oat | 2024 | Paper, Code | NUS / Sea AI’s agent optimization framework. |
| veRL: Volcengine RL Framework | veRL | 2024 | Paper, Code | ByteDance’s general-purpose RL framework. |
| OpenRLHF: Open Reinforcement Learning from Human Feedback | OpenRLHF | 2023 | Paper, Code | Open-source RLHF training platform. |
| TRL: Transformer Reinforcement Learning | TRL | 2019 | Code | HuggingFace’s RL library for transformers. |
Reinforcement learning methods that focus on individual agents (typically LLMs), enabling them to adapt, self-improve, and use tools effectively.
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| TTRL: Test-Time Reinforcement Learning | TTRL | ICLR | 2025 | Paper | Inference-time RL via majority-vote rewards. |
| ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models | ProRL | ICLR | 2025 | Paper | KL-control with reference resets for longer reasoning. |
| RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning | RAGEN / StarPO | ICLR | 2025 | Paper | Multi-turn critic-based RL for evolving behaviors. |
| Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution | Alita | GAIA LB | 2025 | Paper | Modular framework for online self-evolution. |
| Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement | Gödel Agent | ACL / arXiv | 2024–2025 | Paper | Recursive self-modification with reasoning loops. |
| Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents | Darwin GM | arXiv | 2025 | Paper | Darwinian exploration for open-ended agent improvement. |
| SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning | SkyRL-v0 | arXiv / GitHub | 2025 | Blog, Code | Long-horizon online RL training pipeline. |
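
The majority-vote reward mentioned for TTRL in the table above can be illustrated with a few lines of Python. This is a simplified sketch, assuming answers have already been extracted from sampled responses as comparable strings; `majority_vote_rewards` is a hypothetical helper name, not code from the paper.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_rewards(answers: List[str]) -> Tuple[str, List[float]]:
    """Use the most common sampled answer as a pseudo-label and reward agreement.

    Assumption: ties are broken by first occurrence, which is how
    Counter.most_common orders equal counts.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Four sampled answers to the same unlabeled question.
label, rewards = majority_vote_rewards(["42", "41", "42", "7"])
print(label, rewards)
```

No ground-truth label is involved: the reward only measures agreement with the consensus, so it can be computed at test time.
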
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| AGILE: RL Framework for LLM Agents | AGILE | arXiv | 2024 | Paper | Combines RL, memory, and tool use. |
| AgentOptimizer: Functions as Learnable Weights | AgentOptimizer | ICML | 2024 | Paper | Offline training with learnable tool weights. |
| FireAct: Fine-tuning LLM Agents | FireAct | arXiv | 2023 | Paper | Multi-task SFT baseline for RL comparison. |
| Tool-Integrated Reinforcement Learning | ToRL | arXiv | 2025 | Paper | Large-scale tool-integrated RL training. |
| ToolRL: Reward is All Tool Learning Needs | ToolRL | arXiv | 2025 | Paper | Studies reward shaping for tool use. |
| ARTIST: Unified Reasoning & Tools | ARTIST | arXiv | 2025 | Paper | Joint reasoning + tool integration. |
| ZeroTIR: Scaling Law for Tool RL | ZeroTIR | arXiv | 2025 | Paper | Scaling behavior of tool-augmented RL. |
| OTC: Acting Less is Reasoning More | OTC | arXiv | 2025 | Paper | Optimizes efficiency by reducing unnecessary tool calls. |
| WebAgent-R1: End-to-End Multi-Turn RL | WebAgent-R1 | arXiv | 2025 | Paper | Trains web agents on multi-turn environments. |
| GiGPO: Group-in-Group Policy Optimization | GiGPO | arXiv | 2025 | Paper | Hierarchical group-based policy optimization for agent training. |
| Nemotron-Research-Tool-N1 | Nemotron-Tool-N1 | arXiv | 2025 | Paper | Pure RL setup for tool reasoning. |
| CATP-LLM: Cost-Aware Tool Planning | CATP-LLM | ICCV / arXiv | 2024–2025 | Paper, Code | Optimizes tool usage under cost constraints. |
| Tool-Star: Multi-Tool RL via Hierarchical Rewards | Tool-Star | arXiv | 2025 | Paper | Reinforcement with structured multi-tool reasoning. |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Memory-R1: RL Memory Manager | Memory-R1 | arXiv | 2025 | Paper | RL-based memory controller for better retrieval. |
| A-MEM: Agentic Memory for LLM Agents | A-MEM | arXiv | 2025 | Paper | Zettelkasten-style dynamic memory management. |
| KnowAgent: Knowledge-Augmented Planning | KnowAgent | NAACL Findings | 2025 | Paper | Planning with structured knowledge bases. |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| StepTool: Multi-Step Tool Usage | StepTool | CIKM | 2025 | Paper | Step-grained rewards for tool usage. |
| RLTR: Process-Centric Rewards | RLTR | arXiv | 2025 | Paper | Rewards good reasoning trajectories, not just outcomes. |
| SPA-RL: Stepwise Progress Attribution | SPA-RL | arXiv | 2025 | Paper | Credits progress at intermediate steps. |
| STeCa: Step-Level Trajectory Calibration | STeCa | ACL Findings | 2025 | Paper | Calibrates suboptimal steps for better learning. |
| SWEET-RL: Multi-Turn Collaborative RL | SWEET-RL | arXiv | 2025 | Paper | Multi-turn reasoning with collaborative critic. |
| ATLaS: Critical Step Selection | ATLaS | ACL | 2025 | Paper | Focuses learning on critical reasoning steps. |
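
To see why the step-level and process-centric rewards in the table above matter for credit assignment, compare the return each step receives under an outcome-only reward and under per-step rewards. This is a generic sketch of discounted returns-to-go, not any listed paper's method; `returns_to_go` and the reward values are illustrative assumptions.

```python
from typing import List


def returns_to_go(step_rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each step to the end of the trajectory."""
    running = 0.0
    out: List[float] = []
    for r in reversed(step_rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


# Outcome-only: a single terminal reward, so every step gets identical credit.
outcome_only = returns_to_go([0.0, 0.0, 1.0])

# Step-level: a (hypothetical) process reward penalizes the wasteful second step.
step_level = returns_to_go([0.5, -0.2, 1.0])

print(outcome_only)
print(step_level)
```

With the outcome-only signal, all three steps see the same return, so the learner cannot tell the helpful first step from the wasteful second one; per-step rewards separate them.
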
The table below summarizes key algorithm families, their objectives, and available implementations.
| Method | Year | Objective | Clip | KL Penalty | Mechanism | Signal | Link | Resource |
|---|---|---|---|---|---|---|---|---|
| PPO family | ||||||||
| PPO | 2017 | Policy gradient | Yes | No | Policy ratio clipping | Reward | Paper | - |
| VAPO | 2025 | Policy gradient | Yes | Adaptive | Adaptive KL penalty + variance control | Reward + variance | Paper | - |
| PF-PPO | 2024 | Policy gradient | Yes | Yes | Policy filtration | Noisy reward | Paper | Code |
| VinePPO | 2024 | Policy gradient | Yes | Yes | Unbiased value estimates | Reward | Paper | Code |
| PSGPO | 2024 | Policy gradient | Yes | Yes | Process supervision | Process reward | Paper | - |
| DPO family | ||||||||
| DPO | 2023 | Preference optimization | No | Yes | Implicit reward | Human preference | Paper | - |
| β-DPO | 2024 | Preference optimization | No | Adaptive | Dynamic KL coefficient | Human preference | Paper | Code |
| SimPO | 2024 | Preference optimization | No | Scaled | Avg log-prob as implicit reward | Human preference | Paper | Code |
| IPO | 2024 | Identity preference optimization | No | Yes | Preference classification | Rank | Paper | - |
| KTO | 2024 | Kahneman-Tversky optimization | No | Yes | Prospect-theoretic utility over desirable/undesirable outputs | Binary feedback | Paper | Code |
| ORPO | 2024 | Odds ratio preference optimization | No | No | Odds-ratio penalty added to the SFT loss; no reference model | Human preference | Paper | Code |
| Step-DPO | 2024 | Step-wise preference | No | Yes | Step-level supervision | Step preference | Paper | Code |
| LCPO | 2025 | Length-controlled policy optimization | No | Yes | Length preference | Reward | Paper | - |
| GRPO family | ||||||||
| GRPO | 2024 | Policy gradient (group reward) | Yes | Yes | Group-based relative reward, no value estimates | Group reward | Paper | - |
| DAPO | 2025 | Surrogate of GRPO | Yes | No | Decoupled clip + dynamic sampling | Dynamic group reward | Paper | Code, Model, Website |
| GSPO | 2025 | Surrogate of GRPO | Yes | Yes | Sequence-level clipping & reward | Smooth group reward | Paper | - |
| GMPO | 2025 | Surrogate of GRPO | Yes | Yes | Geometric mean of token rewards | Margin-based reward | Paper | Code |
| ProRL | 2025 | Same as GRPO | Yes | Yes | Reference policy reset | Group reward | Paper | Model |
| Posterior-GRPO | 2025 | Same as GRPO | Yes | Yes | Rewards only successful processes | Process reward | Paper | - |
| Dr.GRPO | 2025 | Unbiased GRPO | Yes | Yes | Removes bias in optimization | Group reward | Paper | Code, Model |
| Step-GRPO | 2025 | Same as GRPO | Yes | Yes | Rule-based reasoning reward | Step-wise reward | Paper | Code, Model |
| SRPO | 2025 | Same as GRPO | Yes | Yes | Two-stage history resampling | Reward | Paper | Model |
| GRESO | 2025 | Same as GRPO | Yes | Yes | Pre-rollout filtering | Reward | Paper | Code, Website |
| StarPO | 2025 | Same as GRPO | Yes | Yes | Reasoning-guided multi-turn | Group reward | Paper | Code, Website |
| GHPO | 2025 | Policy gradient | Yes | Yes | Adaptive prompt refinement | Reward | Paper | Code |
| Skywork R1V2 | 2025 | GRPO (hybrid signal) | Yes | Yes | Selective buffer, multimodal reward | Multimodal | Paper | Code, Model |
| ASPO | 2025 | GRPO (shaped advantage) | Yes | Yes | Clipped advantage bias | Group reward | Paper | - |
| TreePO | 2025 | Surrogate of GRPO | Yes | Yes | Self-guided rollout | Group reward | Paper | Code, Model, Website |
| EDGE-GRPO | 2025 | Same as GRPO | Yes | Yes | Entropy-driven advantage + error correction | Group reward | Paper | Code, Model |
| DARS | 2025 | Same as GRPO | Yes | No | Multi-stage hardest problems | Group reward | Paper | Code, Model |
| CHORD | 2025 | Weighted GRPO + SFT | Yes | Yes | Auxiliary supervised loss | Group reward | Paper | Code |
| PAPO | 2025 | Surrogate of GRPO | Yes | Yes | Implicit perception loss | Group reward | Paper | Code, Model, Website |
| Pass@k Training | 2025 | Same as GRPO | Yes | Yes | Pass@k metric as reward | Group reward | Paper | Code |
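
The three families in the table differ mainly in how they turn a learning signal into a loss. The sketch below shows the core per-sample quantities in plain Python: the PPO-style clipped surrogate, the GRPO-style group-relative advantage (which replaces a learned value estimate), and the DPO loss with its implicit reward. It is a simplified illustration under stated assumptions: scalar sequence-level log-probabilities, population standard deviation for the group normalization, and no KL term. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style: standardize each reward against the other samples of the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new: float, logp_old: float, advantage: float,
                      clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO: -log sigmoid of the beta-scaled difference in implicit rewards."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))


# Four rollouts for one prompt with binary outcome rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_surrogate(math.log(1.5), 0.0, advantages[0]))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Because the advantage is computed within the group, no critic network is needed, which matches the "no value estimates" note in the GRPO row.
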
As agents scale, cost, latency, and efficiency become critical. These works tackle budget-aware reasoning, token efficiency, and cost-sensitive planning.
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| Cost-Augmented Monte Carlo Tree Search for LLM-Assisted Planning | CATS | arXiv | 2025 | Paper | Incorporates cost into MCTS for planning under constraints. |
| Token-Budget-Aware LLM Reasoning | TALE | arXiv | 2024 | Paper | Allocates token budget optimally across reasoning steps. |
| FrugalGPT: Using LLMs While Reducing Cost | FrugalGPT | arXiv | 2023 | Paper | Early exploration of cost minimization by routing queries. |
| Efficient Contextual LLM Cascades via Budget-Constrained Policy Learning | TREACLE | arXiv | 2024 | Paper | Learns cascades balancing budget and accuracy. |
| BudgetMLAgent: Cost-Effective Multi-Agent System for ML Automation | BudgetMLAgent | AIMLSystems | 2025 | — | Multi-agent framework designed for cost efficiency. |
| The Cost of Dynamic Reasoning: A Systems View | Systems Cost | arXiv | 2025 | Paper | Measures latency, energy, and financial cost of agent reasoning. |
| Budget-Aware Evaluation of LLM Reasoning Strategies | BudgetEval | EMNLP | 2024 | Paper | Proposes evaluation framework accounting for budget limits. |
| LLM Cascades with Mixture of Thoughts for Cost-Efficient Reasoning | MoT Cascade | ICLR / arXiv | 2024 | Paper, Code | Uses “mixture of thoughts” cascades for efficiency. |
| BudgetThinker: Budget-Aware LLM Reasoning with Control Tokens | BudgetThinker | arXiv | 2025 | Paper | Introduces control tokens to manage budget during inference. |
| Beyond One-Preference-Fits-All Alignment: Multi-Objective DPO | MODPO | arXiv | 2023–2024 | Paper | Extends DPO with multi-objective alignment. |
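
Several entries above (FrugalGPT, TREACLE, MoT Cascade) build on the idea of a model cascade: try a cheap model first and escalate only when its answer looks unreliable. The sketch below shows that control flow with a fixed confidence threshold and a hard budget. It is an assumption-laden toy, not any paper's learned policy: `cascade`, the model tuples, and the confidence scores are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# (name, cost per call, callable returning (answer, confidence in [0, 1])).
# All three fields are hypothetical placeholders for real model endpoints.
Model = Tuple[str, float, Callable[[str], Tuple[str, float]]]


def cascade(query: str, models: List[Model], threshold: float = 0.8,
            budget: float = float("inf")) -> Tuple[Optional[Tuple[str, str]], float]:
    """Query models from cheapest to most expensive; stop once one is confident
    enough or the next call would exceed the budget."""
    spent = 0.0
    best: Optional[Tuple[str, str]] = None
    for name, cost, fn in models:
        if spent + cost > budget:
            break                      # cannot afford the next tier
        answer, confidence = fn(query)
        spent += cost
        best = (name, answer)          # keep the latest (most capable) answer
        if confidence >= threshold:
            break                      # good enough, skip pricier models
    return best, spent


toy_models: List[Model] = [
    ("small", 1.0, lambda q: ("maybe", 0.4)),
    ("large", 10.0, lambda q: ("sure", 0.95)),
]

print(cascade("What is 2 + 2?", toy_models))
print(cascade("What is 2 + 2?", toy_models, budget=5.0))
```

The learned variants in the table replace the fixed threshold with a trained scorer or policy, but the escalate-on-low-confidence structure is the same.
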
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| OWL: Optimized Workforce Learning for Real-World Automation | OWL | arXiv | 2025 | Paper | Planner + workers |
| Profile-Aware Maneuvering for GAIA by AWorld | AWorld | NeurIPS | 2024 | Paper | Guard agents |
| Plan-over-Graph: Towards Parallelable Agent Schedule | Plan-over-Graph | arXiv | 2025 | Paper | Graph scheduling |
| LLM-Based Multi-Agent Reinforcement Learning: Directions | MARL Survey | arXiv | 2024 | Paper | Survey |
| Self-Resource Allocation in Multi-Agent LLM Systems | Self-ResAlloc | arXiv | 2025 | Paper | Planner vs orchestrator |
| MASLab (duplicate listing) | MASLab | arXiv | 2025 | Paper | Unified MAS APIs |
| Dynamic Speculative Agent Planning | DSP | arXiv | 2025 | Paper | Lossless agent planning acceleration via dynamic speculation; trades off latency vs cost; no pre-deployment setup. |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| ACC-Collab: Actor-Critic for Multi-Agent Collaboration | ACC-Collab | ICLR | 2025 | Paper | Joint actor-critic |
| Chain of Agents: Collaborating on Long-Context Tasks | Chain of Agents | arXiv | 2024 | Paper | Long-context chains |
| Scaling LLM-Based Multi-Agent Collaboration | Scaling MAC | arXiv | 2024 | Paper | Scaling study |
| MMAC-Copilot: Multi-Modal Agent Collaboration | MMAC-Copilot | arXiv | 2024 | Paper | Multi-modal collab |
| CORY: Sequential Cooperative Multi-Agent Reinforcement Learning | CORY | NeurIPS | 2024 | Paper, Code | Role-swapping PPO |
| OpenManus-RL (duplicate listing) | OpenManus-RL | GitHub | 2025 | Code | Live-streamed tuning |
| MAPoRL: Multi-Agent Post-Co-Training with Reinforcement Learning | MAPoRL | arXiv | 2025 | Paper | Co-refine + verifier |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| An Embodied Generalist Agent in 3D World | LEO | ICML | 2024 | Paper | 3D embodied agent |
| DreamerV3: Mastering Diverse Domains through World Models | DreamerV3 | arXiv | 2023 | Paper | World-model RL |
| World-Model-Augmented Web Agent | WMA Web Agent | ICLR/arXiv | 2025 | Paper | Simulative web agent |
| WorldCoder: Model-Based LLM Agent | WorldCoder | arXiv | 2024 | Paper | Code-based world model |
| WALL-E 2.0: World Alignment via Neuro-Symbolic Learning | WALL-E 2.0 | arXiv | 2025 | Paper | Neuro-symbolic alignment |
| WorldLLM: Curiosity-Driven World Modeling | WorldLLM | arXiv | 2025 | Paper | Curiosity + world model |
| SimuRA: Simulative Reasoning Architecture with World Model | SimuRA | arXiv | 2025 | Paper | Mental simulation |
| Title | Short title | Venue | Year | Materials | Description |
|---|---|---|---|---|---|
| The Landscape of Agentic Reinforcement Learning for LLMs: A Survey | ARL-Surv | arXiv | 2025 | Paper | Comprehensive ARL landscape |
| Budget-Aware Evaluation of LLM Reasoning Strategies | BudgetEval | EMNLP | 2024 | Paper | Budget-aware reasoning evaluation |
| Alignment & Preference Optimization in LLM Agents | Align-Pos | arXiv | 2023 | Paper | Alignment and multi-objective methods |
| A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence | SE Survey | arXiv | 2025 | Paper | Taxonomy and methods for self-evolving agents. |
| Method | Category | Base LLM | Link | Resource |
|---|---|---|---|---|
| DeepRetrieval | External | Qwen2.5-3B-Instruct, Llama-3.2-3B-Instruct | Paper | Code |
| Search-R1 | External | Qwen2.5-3B/7B-Base/Instruct | Paper | Code |
| R1-Searcher | External | Qwen2.5-7B, Llama3.1-8B-Instruct | Paper | Code |
| WebThinker | External | QwQ-32B, DeepSeek-R1-Distilled-Qwen-7B/14B/32B | Paper | Code |
| WebSailor | External | Qwen2.5-3B/7B/32B/72B | Paper | Code |
| SSRL | Internal | Qwen2.5-1.5B/3B/7B/14B/32B/72B-Instruct, Llama-3.2-1B/8B-Instruct | Paper | Code |
| OpenAI Deep Research | External | OpenAI Models | Blog | Website |
| Perplexity DeepResearch | External | - | Blog | Website |
| Method | RL Reward Type | Base LLM | Link | Resource |
|---|---|---|---|---|
| AceCoder | Outcome | Qwen2.5-Coder-7B-Base/Instruct | Paper | Code |
| DeepCoder-14B | Outcome | DeepSeek-R1-Distilled-Qwen-14B | Blog | Code |
| CodeBoost | Process | Qwen2.5-Coder-7B-Instruct, Llama-3.1-8B-Instruct | Paper | Code |
| R1-Code-Interpreter | Outcome | Qwen2.5-7B/14B-Instruct-1M | Paper | Code |
| SWE-RL | Outcome | Llama-3.3-70B-Instruct | Paper | Code |
| Satori-SWE | Outcome | Qwen-2.5-Math-7B | Paper | Code |
| Method | Reward | Link | Resource |
|---|---|---|---|
| ARTIST | Outcome | Paper | - |
| ToRL | Outcome | Paper | Code |
| ZeroTIR | Outcome | Paper | Code |
| TTRL | Outcome | Paper | Code |
| DeepSeek-Prover-v1.5 | Formal | Paper | Code |
| Leanabell-Prover | Formal | Paper | Code |
| Method | Paradigm | Environment | Link | Resource |
|---|---|---|---|---|
| MM-Navigator | Vanilla VLM | - | Paper | Code |
| SeeAct | Vanilla VLM | - | Paper | Code |
| GUI-R1 | RL | Static | Paper | Code |
| UI-R1 | RL | Static | Paper | Code |
| InFiGUI-R1 | RL | Static | Paper | Code |
| UI-TARS | RL | Interactive | Paper | Code |
Reinforcement learning for AI agents is evolving rapidly, driving advances in reasoning, autonomy, and collaboration. As new methods and frameworks emerge, staying current matters for both research and practical deployment. This curated list aims to help the community navigate this fast-moving landscape and to encourage contributions.
💡 Pull requests welcome to keep this list up to date!
https://github.com/xhyumiracle/Awesome-AgenticLLM-RL-Papers
https://github.com/0russwest0/Awesome-Agent-RL
https://github.com/thinkwee/AgentsMeetRL
🌟 If you find this resource helpful, star the repo and share your favorite RL agent papers or frameworks! Let's build the future of intelligent agents together.
