An Agent code evaluation framework with native multi-turn feedback iteration.
Mainstream code benchmarks use snapshot-style evaluation: one input, one output. But real-world programming involves iterating on compiler errors, test failures, and other feedback, and this feedback-driven iteration is at the core of programming ability.
Otter integrates evaluation feedback into the evaluation loop, letting agents work like real developers: write code → run → read errors → fix → run again, until the tests pass or the maximum number of turns is reached.
```
   ┌────────────────────────────────────┐
   ↓                                    │
Proposer ───→ Executor ───→ Evaluator ──┘
                                │
                              Pass?
                                ↓
                               End
```
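The turn loop above can be sketched in a few lines of Python. This is a hypothetical illustration of the control flow only, not Otter's actual API: the `propose`/`execute`/`evaluate` callables and the `TurnResult` type are assumptions made for the sketch.

```python
# Hypothetical sketch of the multi-turn feedback loop. The component
# interfaces here are illustrative assumptions, not Otter's real API.
from dataclasses import dataclass


@dataclass
class TurnResult:
    passed: bool    # did the solution pass the Evaluator's tests?
    feedback: str   # error/test output fed back into the next turn


def run_episode(propose, execute, evaluate, max_turns: int = 3):
    """Iterate propose -> execute -> evaluate until the tests pass
    or the turn budget is exhausted. Returns (turns_used, passed)."""
    feedback = None
    for turn in range(1, max_turns + 1):
        code = propose(feedback)      # Proposer: write (or fix) code
        output = execute(code)        # Executor: run it in a sandbox
        result = evaluate(output)     # Evaluator: check against tests
        if result.passed:
            return turn, True
        feedback = result.feedback    # feed errors into the next turn
    return max_turns, False
```

With toy components, an agent that fixes its code after reading feedback passes on turn 2; one that never changes its answer exhausts the budget.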
Prerequisites: Python >= 3.11, Docker
```bash
# Install
pip install -e .

# Configure
cp .env.example .env
# Edit .env with your API credentials

# Run evaluation
otter run
```

All parameters are managed via `.env` files. The CLI only accepts `--env` to select a config file:

```bash
otter run                  # uses .env by default
otter run --env .env.local # specify a config file
```

See Environment Variable Configuration for the full parameter reference.
Results are saved under experiments/ as a directory tree, with a full record for each turn of each problem:
```
experiments/{experiment_id}/
└── {task_id}#{sample_id}/
    ├── turn_1/
    │   ├── prop_input/    # Proposer input   (created if proposer enabled)
    │   ├── prop_output/   # Proposer output  (created if proposer enabled)
    │   ├── exec_input/    # Executor input   (created if executor enabled)
    │   ├── exec_output/   # Executor output  (created if executor enabled)
    │   ├── eval_input/    # Evaluator input  (created if evaluator enabled)
    │   ├── eval_output/   # Evaluator output (created if evaluator enabled)
    │   └── meta.json      # Turn verdict {"passed": true/false}
    ├── turn_2/            # Turn 2 (if turn 1 failed and max_turns > 1)
    │   └── ...
    └── ...
```
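Because each turn writes a `meta.json` verdict, the tree is easy to post-process. A minimal sketch of a summary helper, assuming only the layout and the `{"passed": true/false}` schema described above (this function is not part of Otter itself):

```python
# Hypothetical analysis helper: for each problem directory, find the
# first turn that passed. Assumes the experiments/ layout shown above.
import json
from pathlib import Path


def summarize(experiment_dir: Path) -> dict:
    """Map each {task_id}#{sample_id} dir to the first passing turn (or None)."""
    results = {}
    for problem in sorted(p for p in experiment_dir.iterdir() if p.is_dir()):
        first_pass = None
        turns = sorted(problem.glob("turn_*"),
                       key=lambda p: int(p.name.split("_")[1]))
        for turn_dir in turns:
            meta = json.loads((turn_dir / "meta.json").read_text())
            if meta.get("passed"):
                first_pass = int(turn_dir.name.split("_")[1])
                break
        results[problem.name] = first_pass
    return results
```

Turn directories are sorted numerically (not lexicographically), so `turn_10` correctly sorts after `turn_2`.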
| Dataset | Status | Description |
|---|---|---|
| MBPP+ | Fully supported | Function-level Python problems |
| EvalPlus (HumanEval+) | Fully supported | Rigorous LLM4Code benchmarks |
| LiveCodeBench | Planned | Contamination-free live coding problems |
| SWE-Bench | Planned | Real-world GitHub issue resolution |
| Tau2Bench | Planned | Multi-turn agentic task evaluation |
| TerminalBench | Planned | Terminal-based coding tasks |
| SWE-CI | Planned | CI-driven software engineering tasks |
