BirdBox is a comprehensive system for detecting and evaluating bird calls in audio recordings using deep learning. It leverages YOLO (You Only Look Once) object detection on spectrogram images to identify and localize bird vocalizations in time and frequency.
- Multiple Audio Formats - Supports WAV, FLAC, OGG, MP3 (WAV/FLAC recommended for best results)
- Arbitrary-Length Audio Processing - Handles audio from seconds to hours
- Song Reconstruction - Automatically merges temporally adjacent detections into continuous bird songs (see the sketch after this list)
- Batch Processing - Processes entire directories of audio files
- PCEN Normalization - Per-Channel Energy Normalization for robust spectral features
- Comprehensive Evaluation - F-beta analysis, confusion matrices, optimal threshold finding
- Multiple Output Formats - JSON, CSV (compatible with annotation formats)
- Model Agnostic - Works with .pt, .onnx, .engine model formats
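To illustrate what song reconstruction does, here is a minimal sketch (not the actual BirdBox implementation): detections of the same species are merged whenever the silence between them is shorter than a configurable gap threshold. The detection keys match the output format shown further below; the sample data is made up.

```python
def merge_into_songs(detections, song_gap_threshold=0.1):
    """Sketch: merge same-species detections whose temporal gap is below
    song_gap_threshold (seconds) into a single continuous song."""
    merged = []
    for det in sorted(detections, key=lambda d: (d["species"], d["time_start"])):
        last = merged[-1] if merged else None
        if (last is not None
                and last["species"] == det["species"]
                and det["time_start"] - last["time_end"] <= song_gap_threshold):
            # Extend the previous song and keep the higher confidence
            last["time_end"] = max(last["time_end"], det["time_end"])
            last["confidence"] = max(last["confidence"], det["confidence"])
        else:
            merged.append(dict(det))
    return merged

songs = merge_into_songs([
    {"species": "apapane", "time_start": 1.00, "time_end": 1.50, "confidence": 0.8},
    {"species": "apapane", "time_start": 1.55, "time_end": 2.00, "confidence": 0.9},
])
# -> one merged song from 1.0 s to 2.0 s (gap of 0.05 s is below the threshold)
```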
Trained YOLO models for this task can be found on the TUC-Cloud. Alternatively, you can train your own model on a custom dataset using the code in the BirdBox-Train repository (not yet publicly available).
To specify the model via the CLI, pass the relative path of the model file as the `--model` command-line argument.
If you use the code as a package, set the `model` parameter of the corresponding function to the relative path of the model file.
Important: The species mapping in the `conf.yaml` file the model was trained with and the `DATASETS[model_name]` dictionary in `src/config.py` have to match.
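As a purely hypothetical illustration (the actual class indices, species codes, and dictionary layout depend on your model and dataset), the entry in `src/config.py` has to mirror the class names from the model's `conf.yaml`:

```python
# src/config.py (hypothetical excerpt)
# The indices and species codes must mirror the `names` section of the
# conf.yaml the model was trained with, e.g. names: {0: iiwi, 1: apapane}.
DATASETS = {
    "Hawaii": {
        0: "iiwi",
        1: "apapane",
    },
}
```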
```bash
# Clone the repository
git clone https://github.com/birdnet-team/BirdBox.git
cd BirdBox

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

The easiest way to use BirdBox is through the interactive web interface:

```bash
streamlit run src/streamlit/app.py
```

Then open your browser to http://localhost:8501 and:
- Upload audio files (WAV, FLAC, OGG, MP3)
- Select a model from the dropdown
- Adjust detection parameters with sliders
- View PCEN spectrograms with bounding boxes
- Download results as JSON or CSV
If everything is set up correctly, the Streamlit web interface will look like this:
```bash
# Detect birds in a single audio file (supports WAV, FLAC, OGG, MP3)
python src/inference/detect_birds.py \
    --audio path/to/recording.wav \
    --model models/best.pt \
    --species-mapping Hawaii

# Or process an entire directory (batch processing)
python src/inference/detect_birds.py \
    --audio path/to/audio/folder \
    --model models/best.pt \
    --species-mapping Hawaii
```

A typical evaluation workflow looks like this:

```bash
# Step 1: Run comprehensive detection with a low confidence threshold
python src/inference/detect_birds.py \
    --audio data/test_audio/ \
    --model models/best.pt \
    --species-mapping Hawaii \
    --conf 0.001 \
    --output-path results/all_detections \
    --output-format json

# Step 2: Analyze F-beta scores to find the optimal threshold
python src/evaluation/f_beta_score_analysis.py \
    --detections results/all_detections.json \
    --labels data/test_labels.csv \
    --beta 2.0 \
    --output-path results/f_beta_analysis

# Step 3: Filter detections to the optimal threshold (e.g., 0.35)
python src/evaluation/filter_detections.py \
    --input results/all_detections.json \
    --conf 0.35 \
    --output-path results/filtered_detections \
    --format all

# Step 4: Generate confusion matrix
python src/evaluation/confusion_matrix_analysis.py \
    --detections results/filtered_detections.csv \
    --labels data/test_labels.csv \
    --output-path results/confusion_matrix

# Step 5: Examine the results in the results/ directory
```

BirdBox can also be used directly as a Python package:

```python
from inference.detect_birds import BirdCallDetector

# Initialize detector
detector = BirdCallDetector(
    model_path="models/best.pt",
    species_mapping="Hawaii",
    conf_threshold=0.001,
    song_gap_threshold=0.1
)

# Detect birds (supports WAV, FLAC, OGG, MP3)
detections = detector.detect(
    "path/to/audio.wav",  # or .flac, .ogg, .mp3
    output_path="results/detections"
)

# Print summary
detector.print_summary(detections)

# Access detection data
for det in detections:
    print(f"{det['species']}: {det['time_start']:.1f}s - {det['time_end']:.1f}s "
          f"(confidence: {det['confidence']:.3f})")
```

The F-beta analysis is available from Python as well:

```python
from evaluation.f_beta_score_analysis import FBetaScoreAnalyzer

# Create analyzer
analyzer = FBetaScoreAnalyzer(
    iou_threshold=0.5,
    beta=2.0,
    use_optimal_matching=True
)

# Analyze performance across confidence thresholds
results_df = analyzer.analyze_confidence_thresholds(
    detections_path="results/all_detections.json",
    labels_path="data/ground_truth.csv",
    confidence_thresholds=[0.1, 0.2, 0.3, 0.4, 0.5]
)

# Find optimal thresholds
optimal_df = analyzer.find_optimal_thresholds(results_df)
print(optimal_df)
```

Tips for better performance:

- Use GPU acceleration (automatically detected)
- Use TensorRT models (`.engine`) for NVIDIA GPUs
- Use a low confidence threshold for comprehensive detection and filter later
- Adjust `song_gap_threshold` based on species vocalization patterns
- Adjust `iou_threshold` to fit the specific use case
- Tune the β parameter of the Fβ analysis to fit the specific use case (see the sketch after this list)
  - β < 1 puts more weight on precision
  - β > 1 puts more weight on recall
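For intuition, here is the generic Fβ formula as a small standalone sketch (not a BirdBox function): with precision P and recall R, Fβ = (1 + β²) · P · R / (β² · P + R).

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """Generic F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Example: a detector with high precision (0.9) but low recall (0.5)
p, r = 0.9, 0.5
print(f"F0.5 = {f_beta(p, r, 0.5):.2f}")  # ~0.78, rewards the high precision
print(f"F1   = {f_beta(p, r, 1.0):.2f}")  # ~0.64, balanced
print(f"F2   = {f_beta(p, r, 2.0):.2f}")  # ~0.55, penalizes the low recall
```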
"No detections found"
- Lower the confidence threshold (`--conf 0.001`)
- Check if the audio file is in a supported format (WAV, FLAC, OGG, MP3)
- Verify the model is trained on similar species
- If using MP3/OGG, try a WAV/FLAC version of the same recording
"Poor detection performance"
- Use lossless formats (WAV/FLAC) instead of lossy (MP3/OGG)
- The model was trained on WAV files, so lossy compression can affect accuracy
- Ensure MP3/OGG files use high bitrate (≥256 kbps) if you must use them
"Out of memory errors"
- Process shorter audio files
- Reduce PCEN segment length in config
- Use smaller YOLO model (e.g., yolo11n instead of yolo11l)
"No matching files in evaluation"
- Check filename formats (tools auto-normalize extensions)
- Verify ground truth CSV has correct column names
- Ensure audio filenames match between detections and labels (see the sketch after this list)
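If you want to check the matching yourself, the normalization is roughly of this kind (an illustrative sketch, not the exact logic of the evaluation tools):

```python
from pathlib import Path

def normalize_name(filename: str) -> str:
    # Drop the directory and the audio extension so that, e.g.,
    # "data/test_audio/XC1234.WAV" and "XC1234.mp3" refer to the same recording
    return Path(filename).stem.lower()

print(normalize_name("data/test_audio/XC1234.WAV") == normalize_name("XC1234.mp3"))  # True
```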
Feel free to use BirdBox for your acoustic analyses and research. If you do, please cite as:
```bibtex
@article{kahl2021birdnet,
  title={BirdNET: A deep learning solution for avian diversity monitoring},
  author={Kahl, Stefan and Wood, Connor M and Eibl, Maximilian and Klinck, Holger},
  journal={Ecological Informatics},
  volume={61},
  pages={101236},
  year={2021},
  publisher={Elsevier}
}
```

Our work in the K. Lisa Yang Center for Conservation Bioacoustics is made possible by the generosity of K. Lisa Yang to advance innovative conservation technologies to inspire and inform the conservation of wildlife and habitats.
The development of BirdNET is supported by the German Federal Ministry of Research, Technology and Space (FKZ 01|S22072), the German Federal Ministry for the Environment, Climate Action, Nature Conservation and Nuclear Safety (FKZ 67KI31040E), the German Federal Ministry of Economic Affairs and Energy (FKZ 16KN095550), the Deutsche Bundesstiftung Umwelt (project 39263/01) and the European Social Fund.
BirdNET is a joint effort of partners from academia and industry. Without these partnerships, this project would not have been possible. Thank you!


