AudioCapBench

A benchmark for evaluating audio captioning models across three domains: environmental sound, music, and speech.

1,000 samples | LLM-as-Judge + reference metrics

Quick Setup

# 1. Install dependencies
bash install.sh
source .venv/bin/activate

# 2. Set up credentials
cp credentials.env.template credentials.env
# Edit credentials.env with your API keys (OpenAI, Gemini, HuggingFace)

# 3. Build evaluation dataset (downloads audio from HuggingFace)
source credentials.env
python -m audiocapbench.build_dataset --output-dir data/audio_caption

Quick Evaluation

# Evaluate a model
source credentials.env && python -m audiocapbench.evaluate \
    --provider openai --model gpt-4o-audio-preview \
    --data-dir data/audio_caption \
    --credentials credentials.env \
    --concurrency 10 --max-tokens 8192 --no-aac-metrics

# Quick test (10 samples, no LLM judge)
source credentials.env && python -m audiocapbench.evaluate \
    --provider openai --model gpt-4o-audio-preview \
    --data-dir data/audio_caption \
    --credentials credentials.env \
    --max-samples 10 --no-aac-metrics

# Single category
source credentials.env && python -m audiocapbench.evaluate \
    --provider gemini --model gemini-2.5-flash \
    --data-dir data/audio_caption \
    --credentials credentials.env \
    --category music --concurrency 10 --max-tokens 8192 --no-aac-metrics

More Details

Supported Models

Provider	Models	API Type
OpenAI	`gpt-4o-audio-preview`, `gpt-audio`, `gpt-audio-mini`, `gpt-4o-mini-audio-preview`	Chat Completions
OpenAI	`gpt-4o-realtime-preview`, `gpt-realtime`, `gpt-realtime-mini`	Realtime WebSocket
Gemini	`gemini-2.0-flash`, `gemini-2.5-flash-lite`, `gemini-2.5-flash`, `gemini-2.5-pro`, `gemini-3-flash-preview`, `gemini-3-pro-preview`	Vertex AI / API key

Evaluation Dataset

Category	Source	Samples
Sound	Clotho v2 test + AudioCaps test	200 + 200
Music	MusicCaps eval set	300
Speech	Emo Speech Caption	300
Total		1,000

Evaluation Metrics

LLM-as-Judge (GPT-4.1): Accuracy, Completeness, and Hallucination, each scored from 0 to 10. The Overall score is the average of the three.
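
As an illustration of how the Overall judge score combines the three dimensions, here is a minimal sketch. The function name `overall_judge_score` is hypothetical and not part of the `audiocapbench` package. It averages the three values as described above and makes no assumption about how the judge orients the Hallucination scale.

```python
def overall_judge_score(accuracy: float, completeness: float, hallucination: float) -> float:
    """Average three 0-10 judge scores into a single overall score.

    Hypothetical helper for illustration; not the package's own code.
    """
    scores = {
        "accuracy": accuracy,
        "completeness": completeness,
        "hallucination": hallucination,
    }
    # Reject values outside the 0-10 range used by the judge.
    for name, value in scores.items():
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must be in [0, 10], got {value}")
    return sum(scores.values()) / len(scores)


print(overall_judge_score(8, 7, 9))  # 8.0
```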

Reference-based: METEOR, BLEU-4, ROUGE-L (via NLTK + rouge-score).
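
To give a feel for what ROUGE-L measures, the sketch below computes an F-measure from the longest common subsequence (LCS) of tokens. It is a standalone illustration that assumes lowercased whitespace tokenization; it is not the `rouge-score` implementation the benchmark relies on, so its numbers may differ from reported results. The names `lcs_length` and `rouge_l_f1` are hypothetical.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F-measure with simple whitespace tokenization (illustrative only)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("a dog barks loudly", "a dog barks")
print(round(score, 3))  # 0.857
```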

Credentials

Copy the template and fill in your keys:

cp credentials.env.template credentials.env

Variable	Required for	How to get
`OPENAI_API_KEY`	OpenAI models + LLM judge	platform.openai.com
`GEMINI_API_KEY`	Gemini models (API key mode)	aistudio.google.com
`VERTEX_PROJECT`	Gemini models (Vertex AI mode)	GCP project ID + `gcloud auth`
`HF_TOKEN`	Dataset download	huggingface.co/settings/tokens
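
Before launching a run, it can help to check that the variables from the table are set. The sketch below is a hypothetical pre-flight check, not part of `audiocapbench`: the task labels in `REQUIRED_VARS` and the function `missing_credentials` are assumptions made for illustration, and only the variable names come from the table above.

```python
import os

# Task labels are illustrative; variable names follow the credentials table.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "llm-judge": ["OPENAI_API_KEY"],
    "gemini-api-key": ["GEMINI_API_KEY"],
    "gemini-vertex": ["VERTEX_PROJECT"],
    "dataset-download": ["HF_TOKEN"],
}


def missing_credentials(tasks, env=None):
    """Return the variables needed by `tasks` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    needed = []
    for task in tasks:
        for var in REQUIRED_VARS[task]:
            if var not in needed:  # keep order, avoid duplicates
                needed.append(var)
    return [var for var in needed if not env.get(var)]


# Example with an explicit mapping instead of the real environment.
print(missing_credentials(["gemini-api-key", "dataset-download"], {"HF_TOKEN": "hf_example"}))
# ['GEMINI_API_KEY']
```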

Project Structure

AudioCapBench/
├── audiocapbench/           # Main package
│   ├── build_dataset.py     # Dataset builder (downloads from HuggingFace)
│   ├── evaluate.py          # Evaluation pipeline
│   ├── models.py            # Model clients (OpenAI, Gemini, Qwen)
│   ├── metrics.py           # Metrics (aac-metrics, NLTK fallback, LLM judge)
│   └── config.py            # Config & credential loading
├── eval_data_ids/           # Curated 1000-sample eval subset (CSV)
├── configs/default.yaml     # Default configuration
├── install.sh               # Setup script
├── credentials.env.template # Credentials template
└── results/                 # Evaluation results (JSON)

License

This project is licensed under the Apache-2.0 license.

Individual datasets retain their original licenses:

Clotho: Tampere University License
AudioCaps: CC-BY-NC-4.0
MusicCaps: CC-BY-SA-4.0
Emo Speech Caption: See dataset card

Citation

If you find this project useful, please cite our paper:

@article{Qiu2025LoCoBenchAgentAI,
  title={AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech},
  author={Jielin Qiu and Jianguo Zhang and Zixiang Chen and Liangwei Yang and Ming Zhu and Juntao Tan and Haolin Chen and Wenting Zhao and Rithesh Murthy and Roshan Ram and Akshara Prabhakar and Shelby Heinecke and Caiming Xiong and Silvio Savarese and Huan Wang},
  journal={ArXiv},
  year={2026},
  volume={abs/2602.23649}
}
