Multi speaker training on xttsv2 #516
Hi, I’m using a fine-tuning script based on coqui-ai-TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py, and my dataset metadata lists samples from multiple speakers. Should I:

1. Provide each speaker as a separate speaker_id / speaker_name in the metadata, so the model learns to distinguish them (like a classification task)?
2. Treat all samples as coming from a single generic speaker and just drop the speaker_name, since XTTS uses speaker embeddings from reference audio (speaker_wav) and may not require explicit speaker labels?

I want to understand how XTTSv2 uses speaker information internally: does providing speaker labels during fine-tuning have benefits, or does it rely entirely on the learned voice embeddings from speaker_wav?
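For reference, here is a minimal sketch of what the second option would look like: collapsing all samples onto one generic speaker name in an LJSpeech-style pipe-delimited metadata file. The column layout and the generic speaker name `coqui` are assumptions for illustration, not the exact format the recipe requires.

```python
import csv
from pathlib import Path

# Hypothetical sample list: (audio file, transcript, original speaker).
# The per-speaker label is deliberately discarded below.
samples = [
    ("wavs/spk1_0001.wav", "Hello there.", "speaker_1"),
    ("wavs/spk2_0001.wav", "Good morning.", "speaker_2"),
]

out = Path("metadata.csv")
with out.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for audio, text, _speaker in samples:
        # Every row gets the same generic speaker name.
        writer.writerow([audio, text, "coqui"])
```

After this runs, metadata.csv contains one `audio|text|speaker` row per sample, all attributed to the single generic speaker.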
Replies: 1 comment
XTTS doesn't use speaker labels; instead, it creates speaker embeddings from the reference audio file.
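To make the idea concrete: a speaker encoder maps a variable-length waveform to a fixed-size vector, so speaker identity comes from the audio itself rather than from a label. The toy function below is only a stand-in for that concept (XTTS's real encoder is a trained neural network, not mean/variance pooling):

```python
# Toy illustration of the idea behind a speaker encoder: map a
# variable-length waveform to a fixed-size embedding vector.
# This mean/variance pooling over frame energies is NOT what XTTS
# actually computes; it only shows the shape of the operation.

def toy_speaker_embedding(waveform, frame_size=4):
    frames = [waveform[i:i + frame_size]
              for i in range(0, len(waveform) - frame_size + 1, frame_size)]
    # Per-frame energy, then pool to a fixed-length summary.
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    mean = sum(energies) / len(energies)
    var = sum((e - mean) ** 2 for e in energies) / len(energies)
    return [mean, var]  # fixed size regardless of input length

emb_a = toy_speaker_embedding([0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.1, 0.0])
emb_b = toy_speaker_embedding([0.05] * 64)
```

Both embeddings have the same dimensionality even though the inputs differ in length, which is why the model can condition on any reference clip (speaker_wav) without ever seeing a speaker label.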