An AI agent with behavioral routing. One key, every model.
npx otto-agent
Otto is a personal AI agent that automatically routes your prompts to the best model for the task:
- Code tasks → Claude Sonnet (highest code quality per dollar)
- Reasoning → Claude Sonnet or Opus (behavioral consistency: 0.89)
- Creative → GPT-4o (most creative output)
- Simple questions → Claude Haiku (fast, cheap)
- Sensitive topics → Claude Opus (highest manipulation resistance)
Routing decisions are based on ConstellationBench — an open behavioral benchmark that measures how models actually behave under pressure, not just how they score on multiple choice tests.
One requirement: an OpenRouter API key. One key gives you access to every model.
npx otto-agent
First run walks you through setup:
- Your name
- Agent name (default: Otto)
- OpenRouter API key
- Personality vibe: Chill / Direct / Hype / Coach
Config saved to ~/.otto/config.yaml. Soul file at ~/.otto/soul.md.
| Command | What it does |
|---|---|
| `/help` | Show all commands |
| `/models` | List available models with consistency scores |
| `/route <text>` | Show which model would handle a prompt |
| `/budget` | Show today's token spending vs daily limit |
| `/clear` | Clear conversation history |
| `/soul` | Show the agent's soul file |
| `/vibe <type>` | Change personality (chill/direct/hype/coach) |
| `/exit` | Quit |
- Behavioral routing — automatically picks the best model based on task type
- Token budget — daily spending limit with auto-concise mode over 70%
- Streaming — responses stream in real-time
- Soul file — customize the agent's personality in markdown
- BYOK — your key, your models, your data. Nothing goes through us.
- 9 models — Claude, GPT-4o, Gemini, DeepSeek, Llama, Mistral, Kimi K2.6
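
The token budget feature above can be pictured as a small threshold check. This is only a sketch: the 70% threshold comes from the feature list, but the function name `budgetMode`, the `exhausted` state, and the behavior at 100% are assumptions, not Otto's actual implementation.

```typescript
// 70% threshold taken from the feature list above.
const CONCISE_THRESHOLD = 0.7;

// "exhausted" is an assumed state; the README does not say what happens at 100%.
type BudgetMode = "normal" | "concise" | "exhausted";

// spentToday and dailyLimit must use the same unit (tokens or currency).
function budgetMode(spentToday: number, dailyLimit: number): BudgetMode {
  if (dailyLimit <= 0) throw new Error("dailyLimit must be positive");
  const used = spentToday / dailyLimit;
  if (used >= 1) return "exhausted";
  return used > CONCISE_THRESHOLD ? "concise" : "normal";
}
```

With a daily limit of 100 units, spending 50 would stay in normal mode and spending 71 would switch to concise mode under these assumptions.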
Every prompt is classified by task type (code, reasoning, creative, simple, sensitive, general). Each task type maps to the model with the best price-to-consistency ratio for that kind of work.
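
A minimal sketch of this classify-then-map flow is shown below. The task-to-model table follows the list at the top of this README; the keyword rules, the short-prompt heuristic, the fallback model for `general`, and the names `classify` and `route` are all hypothetical, since the README does not describe how Otto classifies prompts.

```typescript
type TaskType = "code" | "reasoning" | "creative" | "simple" | "sensitive" | "general";

// Task-to-model table from the list at the top of this README.
// The "general" entry is an assumption; the README names no model for it.
const ROUTES: Record<TaskType, string> = {
  code: "Claude Sonnet",
  reasoning: "Claude Sonnet",
  creative: "GPT-4o",
  simple: "Claude Haiku",
  sensitive: "Claude Opus",
  general: "Claude Sonnet", // assumed fallback
};

// Hypothetical keyword rules, checked in order; the first match wins.
const RULES: Array<[TaskType, RegExp]> = [
  ["sensitive", /\b(medical|legal|self-harm|diagnos\w*)\b/i],
  ["code", /\b(function|bug|compile|refactor|typescript|python|stack trace)\b/i],
  ["creative", /\b(poem|story|lyrics|slogan)\b/i],
  ["reasoning", /\b(prove|why|compare|trade-?offs?|step by step)\b/i],
];

function classify(prompt: string): TaskType {
  for (const [task, pattern] of RULES) {
    if (pattern.test(prompt)) return task;
  }
  // Assumed heuristic: short prompts with no keyword hit count as simple questions.
  return prompt.trim().split(/\s+/).length <= 8 ? "simple" : "general";
}

function route(prompt: string): { task: TaskType; model: string } {
  const task = classify(prompt);
  return { task, model: ROUTES[task] };
}
```

Calling `route("Fix the bug in this function")` in this sketch returns the `code` task and Claude Sonnet, which is roughly what `/route <text>` reports for a prompt.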
Consistency scores come from ConstellationBench, which tests whether models maintain their reasoning under adversarial pressure: social pressure, authority framing, and leading questions. A score of 0.89 means the model holds its position 89% of the time when challenged. A model at 0.42 holds only 42% of the time, so it folds more often than not.
This matters because a model that changes its answer when you say "are you sure?" is not reliable — regardless of how well it scores on MMLU.
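
To make the arithmetic concrete, here is a sketch that reads the score as the fraction of challenged trials in which the model kept its original answer. The `Trial` shape and function names are illustrative assumptions; ConstellationBench's actual scoring may weight or aggregate trials differently.

```typescript
// One adversarial trial: did the model keep its original answer after being challenged?
// This record shape is hypothetical, not ConstellationBench's data format.
interface Trial {
  pressure: "social" | "authority" | "leading";
  heldPosition: boolean;
}

// Fraction of trials in which the model held its position (0 to 1).
function consistencyScore(trials: Trial[]): number {
  if (trials.length === 0) throw new Error("need at least one trial");
  const held = trials.filter((t) => t.heldPosition).length;
  return held / trials.length;
}

// Fraction of trials in which the model changed its answer.
function foldRate(score: number): number {
  return 1 - score;
}
```

Under this reading, a score of 0.42 corresponds to a fold rate of about 0.58.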
- Your API key is stored locally in ~/.otto/config.yaml
- All inference goes directly from your machine to OpenRouter
- Nothing is sent to Airlock servers. Ever.
- No telemetry. No analytics. No tracking.
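
For illustration, a direct request from your machine to OpenRouter might be assembled as below. The endpoint and bearer-token header follow OpenRouter's OpenAI-compatible chat completions API; the helper name `buildRequest` and its return shape are assumptions for this sketch, not Otto's internals.

```typescript
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical helper: builds the request only, so nothing leaves the machine
// until the caller passes the result to fetch.
function buildRequest(apiKey: string, model: string, messages: ChatMessage[]) {
  return {
    url: OPENROUTER_URL,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      // stream: true asks for the response to be streamed as it is generated.
      body: JSON.stringify({ model, messages, stream: true }),
    },
  };
}

// Usage (performs a real network call, so it is left commented out):
// const { url, init } = buildRequest(key, "openai/gpt-4o", [{ role: "user", content: "hi" }]);
// const res = await fetch(url, init);
```

The only host in the request is openrouter.ai, which matches the bring-your-own-key model described above.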
MIT — Airlock Technologies LLC