Multi speaker training on xttsv2 #516
Hi, I’m using a fine-tuning script based on coqui-ai-TTS/recipes/ljspeech/xtts_v2/train_gpt_xtts.py, and my dataset metadata lists samples from multiple speakers. Should I:

1. Provide each speaker as a separate speaker_id / speaker_name in the metadata, so the model learns to distinguish them (like a classification task)?
2. Treat all samples as coming from a single generic speaker and just drop the speaker_name, since XTTS uses speaker embeddings from reference audio (speaker_wav) and may not require explicit speaker labels?

I want to understand how XTTSv2 uses speaker information internally: does providing speaker labels during fine-tuning have benefits, or does it rely entirely on the learned voice embeddings from speaker_wav?
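For reference, here is a minimal sketch of what the second option would look like: collapsing all samples onto one generic speaker name in an LJSpeech-style pipe-delimited metadata file. The column layout and the generic speaker name `coqui` are assumptions for illustration, not the exact format the recipe requires.

```python
import csv
from pathlib import Path

# Hypothetical sample list: (audio file, transcript, original speaker).
# The per-speaker label is deliberately discarded below.
samples = [
    ("wavs/spk1_0001.wav", "Hello there.", "speaker_1"),
    ("wavs/spk2_0001.wav", "Good morning.", "speaker_2"),
]

out = Path("metadata.csv")
with out.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="|")
    for audio, text, _speaker in samples:
        # Every row gets the same generic speaker name.
        writer.writerow([audio, text, "coqui"])
```

After this runs, metadata.csv contains one `audio|text|speaker` row per sample, all attributed to the single generic speaker.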
Replies: 1 comment
XTTS doesn't use speaker labels; instead, it creates speaker embeddings from the reference audio file.
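To make the idea concrete: a speaker encoder maps a variable-length waveform to a fixed-size vector, so speaker identity comes from the audio itself rather than from a label. The toy function below is only a stand-in for that concept (XTTS's real encoder is a trained neural network, not mean/variance pooling):

```python
# Toy illustration of the idea behind a speaker encoder: map a
# variable-length waveform to a fixed-size embedding vector.
# This mean/variance pooling over frame energies is NOT what XTTS
# actually computes; it only shows the shape of the operation.

def toy_speaker_embedding(waveform, frame_size=4):
    frames = [waveform[i:i + frame_size]
              for i in range(0, len(waveform) - frame_size + 1, frame_size)]
    # Per-frame energy, then pool to a fixed-length summary.
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    mean = sum(energies) / len(energies)
    var = sum((e - mean) ** 2 for e in energies) / len(energies)
    return [mean, var]  # fixed size regardless of input length

emb_a = toy_speaker_embedding([0.1, -0.2, 0.3, 0.0, 0.2, -0.1, 0.1, 0.0])
emb_b = toy_speaker_embedding([0.05] * 64)
```

Both embeddings have the same dimensionality even though the inputs differ in length, which is why the model can condition on any reference clip (speaker_wav) without ever seeing a speaker label.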