Problem
When using a voice model cloned from Japanese audio to synthesize Chinese or Korean text via the /v1/tts endpoint (S2-Pro), shared CJK characters (漢字) are incorrectly pronounced in Japanese instead of the target language.
Examples
- Chinese text: 十二月十五日 is read as Japanese "じゅうにがつじゅうごにち" instead of Chinese "shí'èr yuè shíwǔ rì"
- Korean text: in 2년, the digit 2 is read as Japanese "に" instead of Korean "이"
This happens because the automatic language detection appears to be biased by the reference voice's language. Since Japanese, Chinese, and Korean share many characters (漢字/汉字/한자), the TTS engine cannot reliably determine the intended language from text alone when the reference voice is in a different CJK language.
Workarounds Attempted
- Phoneme annotations for every character: fixes pronunciation but makes speech very unnatural
- Partial phoneme annotations (first few characters only): annotated parts sound unnatural, and the rest still reverts to Japanese pronunciation
- Rewriting text (e.g., spelling out numbers in target language): does not help for shared characters like 月, 日, 年
Proposed Solution
Add an optional language parameter to the TTS request body:
{
  "text": "十二月十五日",
  "reference_id": "model_id",
  "language": "zh"
}
This would let the caller explicitly specify the pronunciation language for the given text, regardless of the reference voice's original language.
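A minimal sketch of how a request with the proposed parameter might be built. The `language` field and its accepted values ("zh", "ko", etc.) are the proposal, not an existing API feature, and the helper function below is hypothetical:

```python
import json

def build_tts_request(text, reference_id, language=None):
    """Build the /v1/tts request body; `language` is the proposed optional field."""
    body = {"text": text, "reference_id": reference_id}
    if language is not None:
        # Proposed: force pronunciation language, e.g. "zh" or "ko".
        body["language"] = language
    return body

payload = build_tts_request("十二月十五日", "model_id", language="zh")
print(json.dumps(payload, ensure_ascii=False))
```

Because the field is optional, omitting it would preserve today's behavior (automatic detection), so existing clients would be unaffected.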
Use Case
We are building a bilingual learning audio service that:
- Transcribes audio (e.g., Japanese podcast) with speaker diarization
- Clones each speaker's voice using Fish Audio
- Translates segments to a target language (e.g., Chinese, Korean)
- Synthesizes the translated text using the cloned voice
This cross-lingual voice cloning workflow works well for non-CJK target languages (e.g., Japanese to English), but breaks down for CJK to CJK translation due to shared characters.
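To make the pipeline concrete, here is a high-level sketch of the workflow described above. Every function name is a hypothetical placeholder standing in for the real service (transcription, cloning, translation, synthesis), not an actual Fish Audio API call; the point is where the proposed `language` value would flow in:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str
    text: str

# Hypothetical stand-ins for the real services in the pipeline.
def transcribe_with_diarization(audio_path):
    # e.g. an ASR model plus speaker diarization; returns per-speaker segments
    return [Segment("A", "十二月十五日です")]

def clone_voice(audio_path, speaker):
    # voice cloning step; returns a reference_id for that speaker
    return f"model_{speaker}"

def translate(text, target_lang):
    # any MT service; shared CJK characters survive translation unchanged
    return "十二月十五日"

def synthesize(text, reference_id, language):
    # the proposed call: pass `language` so 月/日 are read in the target language
    return {"text": text, "reference_id": reference_id, "language": language}

def build_lesson(audio_path, target_lang):
    requests = []
    for seg in transcribe_with_diarization(audio_path):
        ref = clone_voice(audio_path, seg.speaker)
        translated = translate(seg.text, target_lang)
        requests.append(synthesize(translated, ref, target_lang))
    return requests
```

The target language is already known at the translation step, so threading it through to synthesis costs the client nothing.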
Environment
- Model: S2-Pro
- API endpoint:
/v1/tts
- Source language: Japanese
- Target languages affected: Chinese (zh), Korean (ko)