Skip to content

GalenChen320/Otter

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

164 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

logo

Otter

An Agent code evaluation framework with native multi-turn feedback iteration.

English | 中文

Coverage Python License

Why Otter

Mainstream code benchmarks use snapshot-style evaluation — one input, one output. But real-world programming involves iterating based on compiler errors, test failures, and other feedback. This feedback-driven iteration is the core of programming ability.

Otter integrates evaluation feedback into the evaluation loop, letting agents work like real developers: write code → run → read errors → fix → run again, until the tests pass or the maximum number of turns is reached.

   ┌────────────────────────────┐
   ↓                            │
Proposer ───→ Executor ───→ Evaluator
                                │
                          Pass? │
                                ↓
                               End

Quick Start

Prerequisites: Python >= 3.11, Docker

# Install
pip install -e .

# Configure
cp .env.example .env
# Edit .env with your API credentials

# Run evaluation
otter run

Configuration

All parameters are managed via .env files. The CLI only accepts --env to select a config file:

otter run                    # uses .env by default
otter run --env .env.local   # specify a config file

See Environment Variable Configuration for the full parameter reference.

Output Structure

Results are saved under experiments/ as a directory tree, with a full record for each turn of each problem:

experiments/{experiment_id}/
└── {task_id}#{sample_id}/
    ├── turn_1/
    │   ├── prop_input/    # Proposer input (created if proposer enabled)
    │   ├── prop_output/   # Proposer output (created if proposer enabled)
    │   ├── exec_input/    # Executor input (created if executor enabled)
    │   ├── exec_output/   # Executor output (created if executor enabled)
    │   ├── eval_input/    # Evaluator input (created if evaluator enabled)
    │   ├── eval_output/   # Evaluator output (created if evaluator enabled)
    │   └── meta.json     # Turn verdict {"passed": true/false}
    ├── turn_2/           # Turn 2 (if turn 1 failed and max_turns > 1)
    │   └── ...
    └── ...

Supported Datasets

Dataset Status Description
MBPP+ Fully supported Function-level Python problems
EvalPlus (HumanEval+) Fully supported Rigorous LLM4Code benchmarks
LiveCodeBench Planned Contamination-free live coding problems
SWE-Bench Planned Real-world GitHub issue resolution
Tau2Bench Planned Multi-turn agentic task evaluation
TerminalBench Planned Terminal-based coding tasks
SWE-CI Planned CI-driven software engineering tasks

License

Apache License 2.0

About

An agent evaluation framework with native multi-turn feedback iteration.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors