Waypoint

A Python SDK for making LLM agent workflows fault-tolerant via event sourcing.

When an agent crashes mid-execution, Waypoint lets you resume from the last successful step—without re-running LLM calls or tool invocations that already completed. It does this by logging every step's input/output to an append-only PostgreSQL journal, then replaying from checkpoints on recovery.

Getting Started

Prerequisites

Docker + Docker Compose
uv (Python 3.13+)

Clone & Start

git clone git@github.com:aybruhm/waypoint.git
cd waypoint
make up

This starts PostgreSQL and the API gateway, which listens on http://localhost:9654. The gateway auto-reloads on code changes.

Run Migrations

make run_migrations

Run Examples

# 3-step agent, no LLM
uv run python -m sdk.examples.simple_agent

# Mocked LLM + crash recovery demo
uv run python -m sdk.examples.agent_with_llm_mock

Stop

make down

Makefile Reference

Command	Description
`make up` / `make start`	Build & start containers (detached)
`make down` / `make stop`	Stop & remove containers
`make run_migrations`	Apply pending Alembic migrations
`make revert_migrations`	Roll back last migration
`make add_migration MSG="msg"`	Auto-generate new migration
`make show_current_db_head`	Show current migration version
`make show_db_heads`	List all migration heads

What It Solves

LLM agent crashes create three problems:

Wasted spend: LLM calls that succeeded before the crash get re-invoked on retry.
Lost context: No record of what happened, what state the agent was in, or which step failed.
Duplicate effects: Retrying a non-idempotent tool call (e.g., an API write) repeats its side effects and can create duplicate records.

Waypoint avoids all three by persisting every step's result. On crash, you resume from the checkpoint—cached LLM responses return instantly, tool outputs are reused, and execution continues from the next step.

Architecture

Agent Code
    ↓
@checkpoint decorators (Waypoint SDK)
    ↓
┌────────────────┬─────────────────┬──────────────────┐
│ Event Journal  │ Checkpoint Mgr  │ Replay Engine    │
│ (append-only)  │ (progress)      │ (deterministic)  │
└────────────────┴─────────────────┴──────────────────┘
    ↓
PostgreSQL
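To make the bottom layer of the diagram concrete, here is a minimal sketch of an append-only journal. It is an illustration under stated assumptions, not Waypoint's actual schema: it swaps in the standard-library sqlite3 module for PostgreSQL so it runs anywhere, and the table name, column names, and the helpers `open_journal`, `append_event`, and `read_events` are all hypothetical.

```python
import json
import sqlite3

# Hypothetical schema for illustration only. Waypoint's real tables are
# defined by its Alembic migrations and live in PostgreSQL, not SQLite.
SCHEMA = """
CREATE TABLE journal (
    seq          INTEGER PRIMARY KEY AUTOINCREMENT,
    execution_id TEXT NOT NULL,
    step_name    TEXT NOT NULL,
    input_json   TEXT NOT NULL,
    output_json  TEXT NOT NULL
)
"""


def open_journal() -> sqlite3.Connection:
    """Create an in-memory journal (a stand-in for the PostgreSQL database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def append_event(conn, execution_id, step_name, step_input, step_output) -> int:
    """Append one step record. Rows are only ever inserted, never updated."""
    cur = conn.execute(
        "INSERT INTO journal (execution_id, step_name, input_json, output_json) "
        "VALUES (?, ?, ?, ?)",
        (execution_id, step_name, json.dumps(step_input), json.dumps(step_output)),
    )
    conn.commit()
    return cur.lastrowid


def read_events(conn, execution_id):
    """Return one execution's steps in the order they were appended."""
    rows = conn.execute(
        "SELECT step_name, input_json, output_json FROM journal "
        "WHERE execution_id = ? ORDER BY seq",
        (execution_id,),
    ).fetchall()
    return [(name, json.loads(i), json.loads(o)) for name, i, o in rows]
```

Because every row carries an execution ID and a monotonically increasing sequence number, a single table can hold the steps of all executions while still allowing one execution to be read back in order.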

Core Concepts

Concept	Description
Execution	A single run of an agent workflow, identified by a UUID.
Step	A decorated async function (`@checkpoint("name")`). Each step runs once per execution.
Checkpoint	A persisted record of a step's input/output + execution position.
Event Journal	Append-only log of all steps across all executions (PostgreSQL).
Replay	Reconstructing state by reading checkpoints in order, skipping re-execution.
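The concepts in the table can be pictured as a small data model. The sketch below is an assumption made for illustration: the class names mirror the table, but the field names (`position`, `step_input`, `step_output`), the `record` method, and the `to_json` helper are hypothetical, and it uses standard-library dataclasses rather than the Pydantic models the project lists in its stack.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class Checkpoint:
    """A persisted record of one step's input/output and its position."""
    execution_id: str
    step_name: str
    position: int      # 0-based order of the step within the execution
    step_input: Any
    step_output: Any


@dataclass
class Execution:
    """A single run of a workflow, identified by a UUID."""
    execution_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    checkpoints: list[Checkpoint] = field(default_factory=list)

    def record(self, step_name: str, step_input: Any, step_output: Any) -> Checkpoint:
        """Append a checkpoint for a completed step."""
        cp = Checkpoint(
            self.execution_id, step_name, len(self.checkpoints),
            step_input, step_output,
        )
        self.checkpoints.append(cp)
        return cp


def to_json(cp: Checkpoint) -> str:
    """Serialize a checkpoint to JSON, as it might be stored in the journal."""
    return json.dumps(asdict(cp), sort_keys=True)
```

Making `Checkpoint` frozen reflects the append-only idea: once a step's result is recorded, it is read back but not modified.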

How It Works

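# `checkpoint` is the decorator provided by the Waypoint SDK.
@checkpoint("step_name")
async def my_step(payload):
    output = payload  # replace with the step's real work (LLM call, tool call, ...)
    return output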

The decorator:

Checks if a checkpoint exists for this step in the current execution.
If yes: returns cached output immediately (no function execution).
If no: runs the function, persists input/output as a checkpoint, returns output.

After a crash, create a new Waypoint instance and call resume(execution_id). The SDK rebuilds state from the journal and continues from the first step that has no checkpoint.
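The recovery flow can be illustrated with a toy workflow runner. The sketch below does not use the Waypoint API: `run_workflow` is a hypothetical helper, the journal is a plain list of `(step_name, output)` pairs instead of PostgreSQL rows, and the crash is simulated by raising an exception on the first attempt of one step.

```python
import asyncio


async def run_workflow(steps, journal, state=None):
    """Run steps in order, reusing any output already recorded in the journal."""
    completed = {name: output for name, output in journal}
    executed = []  # names of steps that actually ran during this call
    for name, fn in steps:
        if name in completed:
            state = completed[name]  # replay: reuse the recorded output
            continue
        state = await fn(state)
        journal.append((name, state))  # record the step only after it succeeds
        executed.append(name)
    return state, executed


attempts = {"summarize": 0}


async def fetch(_):
    return ["doc-a", "doc-b"]


async def summarize(docs):
    attempts["summarize"] += 1
    if attempts["summarize"] == 1:
        raise RuntimeError("simulated crash")
    return f"summary of {len(docs)} docs"


async def publish(summary):
    return f"published: {summary}"


STEPS = [("fetch", fetch), ("summarize", summarize), ("publish", publish)]


async def demo():
    journal = []
    try:
        await run_workflow(STEPS, journal)
    except RuntimeError:
        pass  # the "process" died; the journal holds only the fetch step
    # Resume: same steps, same journal, and execution picks up at summarize.
    return await run_workflow(STEPS, journal), journal


(result, executed), journal = asyncio.run(demo())
```

On the resumed run only `summarize` and `publish` execute; `fetch` is served from the journal, which is the behavior the paragraph above describes.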

Key Properties

Deterministic replay: Completed steps return their recorded outputs, so replaying an execution yields the same results without re-running any step.
LLM call caching: Cached responses are returned on replay (zero token cost).
Framework-agnostic: Works with LangChain, CrewAI, custom async agents, FastAPI, etc.
Minimal integration: Add one @checkpoint decorator per step (about 3 lines of change per step).
Full history: Query every step, error, and state transition by execution ID.

When to Use

Long-running agent workflows (minutes to hours) where crashes are expensive.
Cost-sensitive apps where re-calling LLMs on retry is unacceptable.
Teams needing audit trails for agent behavior and debugging.
Agent-as-a-service platforms running untrusted/user-submitted agents.

When Not to Use (Next Steps)

Distributed/multi-machine workflows (Waypoint is single-process).
High-throughput task queues (use Celery, Temporal, etc.).
Simple chatbots with no multi-step orchestration.

Stack

Python 3.13+
asyncio
FastAPI (gateway demo only; SDK is framework-agnostic)
PostgreSQL (events + checkpoints)
Pydantic + JSON serialization
