Releases: CouncilDataProject/speakerbox
Speakerbox: Few-Shot Learning for Speaker Identification with Transformers
Speakerbox is a library which enables:
- the creation of audio-based speaker identification datasets
- training an audio-based speaker identification transformer model
- applying a pre-trained audio-based speaker identification model to a new audio file and predicting which portions of the audio belong to each known speaker
This release completes the work for our Journal of Open Source Software (JOSS) paper.
The changes from v1.0.0 include:
- An example video attached to the README which demonstrates how to use this library (on a toy example) -- YouTube Video Link.
- A more thorough workflow diagram attached to the README which explains how all the components of this library fit together.
- The example data used for model reproduction is now available for download directly from a Python command.
- Upgraded to newer dependency versions.
- The JOSS paper content: paper.md.
- Upgraded linting with ruff.
- Minor improvements to logging.
Speakerbox v1.0.0
Speakerbox is a library for few-shot fine-tuning of a Transformer for speaker identification. This initial release has all the functionality needed to quickly generate a training set and fine-tune a model for use in downstream analysis tasks.
Given a set of multi-speaker recordings:
example/
├── 0.wav
├── 1.wav
├── 2.wav
├── 3.wav
├── 4.wav
└── 5.wav
Where each recording contains some or all of a set of speakers, for example:
- 0.wav -- contains speakers: A, B, C, D, E
- 1.wav -- contains speakers: B, D, E
- 2.wav -- contains speakers: A, B, C
- 3.wav -- contains speakers: A, B, C, D, E
- 4.wav -- contains speakers: A, C, D
- 5.wav -- contains speakers: A, B, C, D, E
You want to train a model that classifies portions of audio in future recordings,
not included in your original training set, as one of the N known speakers.
f(audio) -> [(start_time, end_time, speaker), (start_time, end_time, speaker), ...]
i.e. f(audio) -> [(2.4, 10.5, "A"), (10.8, 14.1, "D"), (14.8, 22.7, "B"), ...]
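To make the output format above concrete, here is a minimal, self-contained sketch of how such predicted segments could be represented and aggregated in Python. The `SpeakerSegment` class and the aggregation code are illustrative only and are not part of the Speakerbox API:

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    # Start and end of the classified portion of audio, in seconds
    start_time: float
    end_time: float
    # Label of the predicted known speaker
    speaker: str

    @property
    def duration(self) -> float:
        return self.end_time - self.start_time


# The example output from above, expressed as typed records
segments = [
    SpeakerSegment(2.4, 10.5, "A"),
    SpeakerSegment(10.8, 14.1, "D"),
    SpeakerSegment(14.8, 22.7, "B"),
]

# Total speaking time attributed to each speaker
totals: dict[str, float] = {}
for seg in segments:
    totals[seg.speaker] = totals.get(seg.speaker, 0.0) + seg.duration
```

A downstream analysis task might, for example, use `totals` to measure how long each speaker held the floor in a meeting recording.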
The speakerbox library contains methods for both generating datasets for annotation
and for utilizing multiple audio annotation schemes to train such a model.
The following table shows model performance results as the dataset size increases:
| dataset_size | mean_accuracy | mean_precision | mean_recall | mean_training_duration_seconds |
|---|---|---|---|---|
| 15-minutes | 0.874 ± 0.029 | 0.881 ± 0.037 | 0.874 ± 0.029 | 101 ± 1 |
| 30-minutes | 0.929 ± 0.006 | 0.940 ± 0.007 | 0.929 ± 0.006 | 186 ± 3 |
| 60-minutes | 0.937 ± 0.020 | 0.940 ± 0.017 | 0.937 ± 0.020 | 453 ± 7 |
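The ± values above read as means and standard deviations over repeated training runs. As a minimal sketch of how such aggregates can be computed, using placeholder accuracies (NOT the actual Speakerbox results):

```python
from statistics import mean, stdev

# Hypothetical per-run accuracies from repeated fine-tuning runs
# (placeholder numbers, not the measured results in the table above)
run_accuracies = [0.91, 0.93, 0.935, 0.925, 0.94]

# Mean and sample standard deviation across runs
acc_mean = mean(run_accuracies)
acc_std = stdev(run_accuracies)

print(f"accuracy: {acc_mean:.3f} ± {acc_std:.3f}")
```

The same aggregation would be applied per dataset size (15, 30, and 60 minutes) to each metric column.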
Please see our documentation for more details.
