A Python library for transcribing audio and video with automatic speaker diarization. Uses WhisperX for high-quality speech-to-text and pyannote.audio for speaker detection, labeling, and fingerprinting.
- Speech-to-text transcription using WhisperX (faster-whisper / CTranslate2)
- Speaker diarization — automatically detect and label who said what
- Speaker fingerprinting — enroll speakers and recognize them across recordings using voice embeddings
- Cross-recording speaker identification — match speakers between different audio files
- REST API — built-in FastAPI server for transcription and speaker management
- GPU accelerated — PyTorch + CUDA for fast inference
- Multiple output formats — text and dictionary/JSON
- Configurable model sizes — tiny, base, small, medium, large
- Audio is loaded and preprocessed
- WhisperX transcribes speech to text with word-level timestamps
- pyannote.audio detects speaker turns (diarization)
- Speaker embeddings are extracted using WeSpeaker ResNet34-LM (256-dim)
- Embeddings are matched against enrolled speaker profiles in LanceDB for identification
- Results are returned with speaker labels, timestamps, and optional speaker names
Install with pip:

```bash
pip install diarized-transcriber
```

With optional extras:

```bash
# FastAPI server
pip install diarized-transcriber[server]

# Speaker profiles (fingerprinting + cross-recording ID)
pip install diarized-transcriber[profiles]

# Everything
pip install diarized-transcriber[server,profiles]
```

Requirements:

- Python 3.10+
- CUDA-capable GPU
- PyTorch with CUDA support
- HuggingFace token for pyannote.audio model access
Set your HuggingFace token so pyannote.audio can download its models:

```bash
export HF_TOKEN="<your-huggingface-token>"
```

```python
from diarized_transcriber import TranscriptionEngine, MediaContent, MediaSource
from diarized_transcriber.utils.formatting import format_transcript

engine = TranscriptionEngine(model_size="base")

content = MediaContent(
    id="example-1",
    title="Example Media",
    media_url="https://example.com/media.mp3",
    source=MediaSource(type="podcast")
)

result = engine.transcribe(content)

print(format_transcript(result, output_format="text", group_by_speaker=True))
```

The `TranscriptionEngine` accepts different Whisper model sizes:
| Model | Speed | Accuracy |
|---|---|---|
| tiny | Fastest | Lowest |
| base | Fast | Good |
| small | Medium | Better |
| medium | Slow | High |
| large | Slowest | Highest |
```python
engine = TranscriptionEngine(
    model_size="medium",
    compute_type="float16"  # or "float32" for higher precision
)
```

Install the server extra and start the API:

```bash
pip install diarized-transcriber[server]
python -m diarized_transcriber.api.server
```

The server runs on port 8000 and provides endpoints for transcription and speaker management (`/speakers/*`).
When installed with the `profiles` extra, the API exposes endpoints for managing speaker profiles:

- `POST /speakers/enroll` — Enroll a new speaker from audio
- `GET /speakers/` — List all enrolled speakers
- `GET /speakers/{id}` — Get speaker profile details
- `PUT /speakers/{id}` — Update a speaker profile
- `DELETE /speakers/{id}` — Remove a speaker profile
- `POST /speakers/identify` — Identify a speaker from audio
- `POST /speakers/merge` — Merge two speaker profiles
```python
transcript = format_transcript(
    result,
    output_format="text",
    include_timestamps=True,
    timestamp_format="HH:MM:SS.mmm",
    group_by_speaker=True
)
```

```python
transcript_dict = format_transcript(
    result,
    output_format="dict",
    include_timestamps=True
)
```

The library provides specific exceptions:
- `GPUConfigError` — GPU/CUDA configuration issues
- `ModelLoadError` — Model loading failures
- `AudioProcessingError` — Audio processing problems
- `TranscriptionError` — General transcription failures
- `DiarizationError` — Speaker diarization issues
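A typical handling pattern treats setup problems as fatal and per-item failures as skippable. A sketch using stand-in exception classes so it runs standalone (in real use, import the classes above from `diarized_transcriber`; whether they share a common base class is not stated here):

```python
# Stand-ins so this sketch is self-contained; import the real
# classes from diarized_transcriber in actual code.
class TranscriptionError(Exception): ...
class GPUConfigError(Exception): ...
class ModelLoadError(Exception): ...

def transcribe_safely(engine, content):
    """Fail fast on environment errors; log and skip per-item failures."""
    try:
        return engine.transcribe(content)
    except (GPUConfigError, ModelLoadError):
        raise  # broken environment: retrying other items won't help
    except TranscriptionError as exc:
        print(f"skipping {content!r}: {exc}")
        return None
```

This keeps a batch job running past one bad file while still surfacing GPU or model-loading misconfiguration immediately.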
Contributions are welcome! Please open an issue or submit a pull request.
MIT License — see LICENSE for details.