Qwen3-TTS-JP

A Windows-native fork of Qwen3-TTS with a modern, multilingual Web UI.

The original Qwen3-TTS was developed primarily for Linux environments, and FlashAttention 2 is recommended. However, FlashAttention 2 does not work on Windows. This fork enables direct execution on Windows without WSL2 or Docker, provides a modern Web UI supporting 10 languages, and adds automatic transcription via Whisper.

Mac (Apple Silicon) users: For the best experience on Mac, please use Qwen3-TTS-Mac-GeneLab -- fully optimized for Apple Silicon with MLX + PyTorch dual engine, 8bit/4bit quantization, and 10-language Web UI.

Custom Voice -- Speech synthesis with preset speakers

Voice Design -- Describe voice characteristics to synthesize

Voice Clone -- Clone voice from reference audio

Settings -- GPU / VRAM / Model information

Related Projects

Platform	Repository	Description
Windows	This repository	Windows-native + multilingual Web UI
macOS (Apple Silicon)	Qwen3-TTS-Mac-GeneLab	Fully optimized for Apple Silicon Mac (MLX + PyTorch dual engine, 10-language Web UI)

Features

Windows Native Support

No FlashAttention 2 required: Uses PyTorch's standard SDPA (Scaled Dot Product Attention) via the --no-flash-attn option
No WSL2/Docker required: Runs directly on Windows
RTX 50 series support: Includes instructions for installing PyTorch nightly builds for NVIDIA Blackwell architecture (sm_120)
SoX dependency avoided: Works without SoX (warnings are displayed but can be safely ignored)

Modern Web UI & Multilingual Support

10-language UI: Japanese / English / Chinese / Korean / Russian / Spanish / Italian / German / French / Portuguese -- switch instantly via dropdown
4-tab layout: Custom Voice / Voice Design / Voice Clone / Settings -- access all features regardless of model type; unloaded models are downloaded automatically on first use
GPU / VRAM monitoring: Check real-time usage in the Settings tab; CUDA cache clearing also available
Whisper automatic transcription: Automates reference audio text input for voice cloning (uses faster-whisper)
Whisper model selection: Choose from 5 models depending on your needs
- tiny - Fastest & smallest (39M parameters)
- base - Fast (74M parameters)
- small - Balanced (244M parameters) *Default
- medium - High accuracy (769M parameters)
- large-v3 - Highest accuracy (1550M parameters)

System Requirements

OS: Windows 10/11 (native environment, no WSL2 required)
GPU: NVIDIA GPU (CUDA compatible)
- RTX 30/40 series: Works with stable PyTorch
- RTX 50 series (Blackwell): Requires PyTorch nightly build (cu128)
Python: 3.10 or higher
VRAM: 8GB or more recommended (varies by model size)

Installation

1. Clone the Repository

git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP

2. Create and Activate Virtual Environment

python -m venv .venv
.venv\Scripts\activate

3. Install Dependencies

pip install -e .
pip install faster-whisper

4. Install PyTorch (CUDA Version)

Install according to your CUDA version.

# For CUDA 12.x
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# For RTX 50 series (sm_120), nightly build is required
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

Usage

Starting the GUI

From the Command Line

# CustomVoice model (preset speakers)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 127.0.0.1 --port 7860 --no-flash-attn

# Base model (with voice cloning)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 127.0.0.1 --port 7860 --no-flash-attn

Open http://127.0.0.1:7860 in your browser.

Quick Launch with Batch Files (Recommended)

Create a batch file like the following for double-click launching:

run-gui.bat (for CustomVoice model):

@echo off
chcp 65001 >nul
title Qwen3-TTS GUI
cd /d "%~dp0"
.venv\Scripts\python.exe -m qwen_tts.cli.demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 127.0.0.1 --port 7860 --no-flash-attn
pause

run-voice-clone.bat (for Base model / voice cloning):

@echo off
chcp 65001 >nul
title Qwen3-TTS Voice Clone
cd /d "%~dp0"
.venv\Scripts\python.exe -m qwen_tts.cli.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 127.0.0.1 --port 7860 --no-flash-attn
pause

Advanced Launcher (Auto Port Selection & Auto Browser Launch)

For a more convenient launch method, you can use the following Python launcher:

launch_gui.py (click to expand)

# coding=utf-8
import socket
import subprocess
import sys
import time
import webbrowser
import threading
import urllib.request
import urllib.error

def find_free_port(start_port=7860, max_attempts=100):
    """Find an available port"""
    for port in range(start_port, start_port + max_attempts):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(('127.0.0.1', port))
                return port
        except OSError:
            continue
    raise RuntimeError(f"All ports {start_port}-{start_port + max_attempts} are in use")

def wait_for_server_and_open_browser(url, timeout=180):
    """Wait for server startup then open browser"""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            req = urllib.request.Request(url, method='HEAD')
            urllib.request.urlopen(req, timeout=2)
            webbrowser.open(url)
            return True
        except (urllib.error.URLError, ConnectionRefusedError, TimeoutError):
            time.sleep(2)
    return False

def main():
    port = find_free_port(7860)
    url = f"http://127.0.0.1:{port}"
    
    threading.Thread(target=wait_for_server_and_open_browser, args=(url, 180), daemon=True).start()
    
    subprocess.run([
        sys.executable, "-m", "qwen_tts.cli.demo",
        "Qwen/Qwen3-TTS-12Hz-1.7B-Base",  # or CustomVoice
        "--ip", "127.0.0.1",
        "--port", str(port),
        "--no-flash-attn"
    ])

if __name__ == "__main__":
    main()

Features:

Auto port selection: Automatically detects a free port if 7860 is in use
Auto browser launch: Detects server startup completion and automatically opens the browser
Character encoding fix: UTF-8 encoding support

Voice Cloning Steps

Upload an audio file to "Reference Audio"
Select a model under "Whisper Model" (first-time download may take some time)
Click "Auto Transcribe"
The transcription result is automatically entered in "Reference Audio Text"
Edit the text if necessary
Enter the "Text to Synthesize"
Click "Generate Audio"

Windows Native Support Details

This fork achieves Windows-native operation through the following measures:

Issue	Original	This Fork's Solution
FlashAttention 2	Linux-only, cannot build on Windows	Use SDPA via `--no-flash-attn` option
SoX dependency	Assumes installation	Works without it (warnings can be ignored)
RTX 50 series	Not supported	Nightly build instructions included
Environment setup	conda (Linux-oriented)	venv (Windows standard)

Note: The --no-flash-attn option is required. Without it, the application will fail to start with a FlashAttention 2 import error.

Detailed Windows Native Environment Setup

Resolved Technical Issues

During the development of this fork, the following Windows-specific issues were identified and resolved:

1. CUDA Support for RTX 50 Series (Blackwell/sm_120)

Problem: Stable PyTorch does not support the latest GPUs like RTX 5090 (sm_120)

RuntimeError: CUDA error: no kernel image is available for execution on the device
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.

Solution: Use PyTorch nightly (cu128)

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

2. FlashAttention 2 Not Supported on Windows

Problem: FlashAttention 2 is Linux-only and cannot be built or run on Windows

ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: 
the package flash_attn seems to be not installed.

Solution: Use PyTorch's standard SDPA (Scaled Dot Product Attention) via the --no-flash-attn option

Attention Implementation	Speed	Memory Efficiency	Windows Support
flash_attention_2	Fastest	Best	Not supported
sdpa (PyTorch native)	Fast	Good	Supported
eager (standard)	Normal	Normal	Supported

3. SoX Dependency Avoidance

Problem: Some audio processing requires SoX, but it is not installed by default on Windows

SoX could not be found!

Solution: Qwen3-TTS core functionality works without SoX. Warnings can be safely ignored.

4. Console Character Encoding (cp932 Encoding)

Problem: In Japanese Windows environments, non-ASCII characters are garbled due to cp932 encoding

UnicodeEncodeError: 'cp932' codec can't encode character...

Solution: Explicitly set UTF-8 encoding

import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', errors='replace')

Or run chcp 65001 in the batch file

5. torchao Compatibility Warning

Problem: Version mismatch warning between PyTorch nightly and torchao

Skipping import of cpp extensions due to incompatible torch version 2.11.0.dev+cu128 for torchao version 0.15.0

Solution: Warning only, no impact on operation. Can be safely ignored.

6. Hugging Face Symlink Warning

Problem: Creating symbolic links on Windows requires administrator privileges

huggingface_hub cache-system uses symlinks by default...

Solution:

Enable Developer Mode in Windows Settings
Or ignore the warning (no impact on operation)

Verification Script

To verify that the environment is set up correctly:

import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Expected output (for RTX 5090):

PyTorch version: 2.11.0.dev20260123+cu128
CUDA available: True
CUDA version: 12.8
GPU: NVIDIA GeForce RTX 5090
GPU Memory: 31.8 GB

Troubleshooting

Symptom	Cause	Solution
`no kernel image is available`	Using stable PyTorch	Install nightly (cu128)
`FlashAttention2 cannot be used`	FlashAttention not supported on Windows	Add `--no-flash-attn` option
`SoX could not be found`	SoX not installed	Can be ignored (no impact on core functionality)
GPU not recognized	CUDA driver outdated	Install latest driver
Character garbling	cp932 encoding	`chcp 65001` or UTF-8 setting

License

This project is released under the Apache License 2.0.

Open Source Software Used

Software	License	Copyright
Qwen3-TTS	Apache License 2.0	Copyright 2026 Alibaba Cloud
faster-whisper	MIT License	Copyright SYSTRAN
OpenAI Whisper	MIT License	Copyright OpenAI

For details, see the NOTICE file.

Disclaimer

Disclaimer Regarding Audio Generation

The audio generated by this system is automatically produced by an AI model and may contain inaccurate or inappropriate content
Generated audio does not represent the views of the developers and does not constitute professional advice
Users assume all risks and responsibilities related to the use, distribution, or reliance on generated audio

Voice Cloning Warning

Cloning or using another person's voice without their consent may constitute a violation of portrait rights and publicity rights
Please use the voice cloning feature only for lawful purposes with the consent of the person whose voice is being cloned
Use for malicious purposes such as fraud, impersonation, defamation, or deepfakes is strictly prohibited

Legal Liability

The developers assume no liability for any damages arising from the use of this software
All legal liability arising from illegal use shall be borne by the user
This software is provided "as is" without any warranty

Acknowledgments

Original developer: Alibaba Cloud Qwen Team
Original repository: QwenLM/Qwen3-TTS

Citation

To cite the original Qwen3-TTS:

@article{Qwen3-TTS,
  title={Qwen3-TTS Technical Report},
  author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.github		.github
assets		assets
docs		docs
examples		examples
finetuning		finetuning
qwen_tts		qwen_tts
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
NOTICE		NOTICE
Qwen3-TTS-JP.bat		Qwen3-TTS-JP.bat
README.md		README.md
launch_gui.py		launch_gui.py
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

Qwen3-TTS-JP

Custom Voice -- Speech synthesis with preset speakers

Voice Design -- Describe voice characteristics to synthesize

Voice Clone -- Clone voice from reference audio

Settings -- GPU / VRAM / Model information

Related Projects

Features

Windows Native Support

Modern Web UI & Multilingual Support

System Requirements

Installation

1. Clone the Repository

2. Create and Activate Virtual Environment

3. Install Dependencies

4. Install PyTorch (CUDA Version)

Usage

Starting the GUI

From the Command Line

Quick Launch with Batch Files (Recommended)

Advanced Launcher (Auto Port Selection & Auto Browser Launch)

Voice Cloning Steps

Windows Native Support Details

Detailed Windows Native Environment Setup

Resolved Technical Issues

1. CUDA Support for RTX 50 Series (Blackwell/sm_120)

2. FlashAttention 2 Not Supported on Windows

3. SoX Dependency Avoidance

4. Console Character Encoding (cp932 Encoding)

5. torchao Compatibility Warning

6. Hugging Face Symlink Warning

Verification Script

Troubleshooting

License

Open Source Software Used

Disclaimer

Disclaimer Regarding Audio Generation

Voice Cloning Warning

Legal Liability

Acknowledgments

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages