A Windows-native fork of Qwen3-TTS with a modern, multilingual Web UI.
The original Qwen3-TTS targets Linux and recommends FlashAttention 2, which cannot be built or run on Windows. This fork runs directly on Windows without WSL2 or Docker, provides a modern Web UI in 10 languages, and adds automatic transcription via Whisper.
Mac (Apple Silicon) users: For the best experience on Mac, please use Qwen3-TTS-Mac-GeneLab, which is optimized for Apple Silicon with an MLX + PyTorch dual engine, 8-bit/4-bit quantization, and a 10-language Web UI.
| Platform | Repository | Description |
|---|---|---|
| Windows | This repository | Windows-native + multilingual Web UI |
| macOS (Apple Silicon) | Qwen3-TTS-Mac-GeneLab | Fully optimized for Apple Silicon Mac (MLX + PyTorch dual engine, 10-language Web UI) |
- No FlashAttention 2 required: Uses PyTorch's standard SDPA (Scaled Dot Product Attention) via the --no-flash-attn option
- No WSL2/Docker required: Runs directly on Windows
- RTX 50 series support: Includes instructions for installing PyTorch nightly builds for NVIDIA Blackwell architecture (sm_120)
- SoX dependency avoided: Works without SoX (warnings are displayed but can be safely ignored)
- 10-language UI: Japanese / English / Chinese / Korean / Russian / Spanish / Italian / German / French / Portuguese -- switch instantly via dropdown
- 4-tab layout: Custom Voice / Voice Design / Voice Clone / Settings -- access all features regardless of model type; unloaded models are downloaded automatically on first use
- GPU / VRAM monitoring: Check real-time usage in the Settings tab; CUDA cache clearing also available
- Whisper automatic transcription: Automates reference audio text input for voice cloning (uses faster-whisper)
- Whisper model selection: Choose from 5 models depending on your needs
  - tiny - Fastest and smallest (39M parameters)
  - base - Fast (74M parameters)
  - small - Balanced (244M parameters), the default
  - medium - High accuracy (769M parameters)
  - large-v3 - Highest accuracy (1550M parameters)
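The GPU / VRAM monitoring and cache clearing described in the list above can be approximated outside the Web UI with PyTorch's CUDA utilities. This is a minimal sketch, not the fork's implementation: `format_vram` and `report_vram` are hypothetical helper names, while `torch.cuda.mem_get_info()` and `torch.cuda.empty_cache()` are standard PyTorch calls.

```python
def format_vram(free_bytes, total_bytes):
    """Format free/total byte counts as a 'used / total GiB (percent)' string."""
    gib = 1024 ** 3
    used = total_bytes - free_bytes
    percent = 100.0 * used / total_bytes if total_bytes else 0.0
    return f"{used / gib:.1f} / {total_bytes / gib:.1f} GiB ({percent:.0f}%)"


def report_vram(clear_cache=False):
    """Print VRAM usage for the current CUDA device, optionally clearing the cache."""
    import torch  # imported lazily so format_vram works without PyTorch

    if not torch.cuda.is_available():
        print("CUDA is not available")
        return
    if clear_cache:
        # Releases cached blocks held by PyTorch's allocator (hypothetical use here)
        torch.cuda.empty_cache()
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(format_vram(free_bytes, total_bytes))
```

Calling `report_vram(clear_cache=True)` from the activated virtual environment prints a single usage line after releasing PyTorch's cached blocks.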
- OS: Windows 10/11 (native environment, no WSL2 required)
- GPU: NVIDIA GPU (CUDA compatible)
  - RTX 30/40 series: Works with stable PyTorch
  - RTX 50 series (Blackwell): Requires PyTorch nightly build (cu128)
- Python: 3.10 or higher
- VRAM: 8GB or more recommended (varies by model size)
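The OS and Python requirements above can be checked with a short preflight script before installing anything. This is a minimal sketch: `check_requirements` is a hypothetical helper, and it does not inspect the GPU or VRAM.

```python
import platform
import sys

MIN_PYTHON = (3, 10)  # minimum version taken from the requirements list


def check_requirements(version=None, system=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version = tuple(version or sys.version_info[:2])
    system = system or platform.system()
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher required"
        )
    if system != "Windows":
        problems.append(f"This fork targets Windows, but the OS is {system}")
    return problems
```

Running `print(check_requirements())` on a supported machine should print an empty list.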
git clone https://github.com/hiroki-abe-58/Qwen3-TTS-JP.git
cd Qwen3-TTS-JP

python -m venv .venv
.venv\Scripts\activate

pip install -e .
pip install faster-whisper

Install PyTorch according to your CUDA version.
# For CUDA 12.x
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# For RTX 50 series (sm_120), nightly build is required
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

# CustomVoice model (preset speakers)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 127.0.0.1 --port 7860 --no-flash-attn
# Base model (with voice cloning)
qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 127.0.0.1 --port 7860 --no-flash-attn

Open http://127.0.0.1:7860 in your browser.
Create a batch file like the following for double-click launching:
run-gui.bat (for CustomVoice model):
@echo off
chcp 65001 >nul
title Qwen3-TTS GUI
cd /d "%~dp0"
.venv\Scripts\python.exe -m qwen_tts.cli.demo Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice --ip 127.0.0.1 --port 7860 --no-flash-attn
pause

run-voice-clone.bat (for Base model / voice cloning):
@echo off
chcp 65001 >nul
title Qwen3-TTS Voice Clone
cd /d "%~dp0"
.venv\Scripts\python.exe -m qwen_tts.cli.demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --ip 127.0.0.1 --port 7860 --no-flash-attn
pause

For a more convenient launch method, you can use the following Python launcher:
launch_gui.py
# coding=utf-8
import socket
import subprocess
import sys
import threading
import time
import urllib.error
import urllib.request
import webbrowser


def find_free_port(start_port=7860, max_attempts=100):
    """Return the first port at or above start_port that can be bound locally."""
    for port in range(start_port, start_port + max_attempts):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                s.bind(('127.0.0.1', port))
                return port
        except OSError:
            continue
    raise RuntimeError(f"All ports {start_port}-{start_port + max_attempts - 1} are in use")


def wait_for_server_and_open_browser(url, timeout=180):
    """Poll url until the server responds, then open it in the default browser."""
    start_time = time.time()
    while time.time() - start_time < timeout:
        try:
            req = urllib.request.Request(url, method='HEAD')
            urllib.request.urlopen(req, timeout=2)
        except urllib.error.HTTPError:
            pass  # any HTTP response, even an error status, means the server is up
        except (urllib.error.URLError, ConnectionRefusedError, TimeoutError):
            time.sleep(2)
            continue
        webbrowser.open(url)
        return True
    return False


def main():
    port = find_free_port(7860)
    url = f"http://127.0.0.1:{port}"
    # Open the browser from a background thread once the server responds
    threading.Thread(target=wait_for_server_and_open_browser, args=(url, 180), daemon=True).start()
    subprocess.run([
        sys.executable, "-m", "qwen_tts.cli.demo",
        "Qwen/Qwen3-TTS-12Hz-1.7B-Base",  # or CustomVoice
        "--ip", "127.0.0.1",
        "--port", str(port),
        "--no-flash-attn"
    ])


if __name__ == "__main__":
    main()

Features:
- Auto port selection: Automatically detects a free port if 7860 is in use
- Auto browser launch: Detects server startup completion and automatically opens the browser
- Character encoding fix: UTF-8 encoding support
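One way a launcher can provide UTF-8 support for the server process is to set CPython's `PYTHONUTF8` and `PYTHONIOENCODING` environment variables before starting it. The sketch below is an assumption about how this could be done, not a description of the fork's launcher; `utf8_env` is a hypothetical helper name.

```python
import os


def utf8_env(base_env=None):
    """Return a copy of the environment that forces UTF-8 I/O in child Python processes."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONUTF8"] = "1"            # enable Python's UTF-8 mode (Python 3.7+)
    env["PYTHONIOENCODING"] = "utf-8"  # encoding for stdin/stdout/stderr
    return env
```

The result can be passed to the child process, for example `subprocess.run(command, env=utf8_env())`, leaving the parent's environment untouched.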
- Upload an audio file to "Reference Audio"
- Select a model under "Whisper Model" (first-time download may take some time)
- Click "Auto Transcribe"
- The transcription result is automatically entered in "Reference Audio Text"
- Edit the text if necessary
- Enter the "Text to Synthesize"
- Click "Generate Audio"
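The automatic transcription step can also be reproduced outside the Web UI with faster-whisper. This is a minimal sketch under stated assumptions: `WhisperModel` and its `transcribe` method belong to faster-whisper's public API, but the helper names `join_segments` and `transcribe_reference`, and the `device` and `compute_type` values, are illustrative choices rather than the fork's settings.

```python
def join_segments(segments):
    """Concatenate the text of transcription segments into one string."""
    return "".join(segment.text for segment in segments).strip()


def transcribe_reference(audio_path, model_size="small"):
    """Transcribe a reference audio file with faster-whisper."""
    # Lazy import; requires `pip install faster-whisper`
    from faster_whisper import WhisperModel

    # device and compute_type are assumptions, not the fork's configuration
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path)
    print(f"Detected language: {info.language}")
    return join_segments(segments)
```

The returned string corresponds to what would be placed in "Reference Audio Text"; it should still be reviewed and edited by hand where the transcription is wrong.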
This fork achieves Windows-native operation through the following measures:
| Issue | Original | This Fork's Solution |
|---|---|---|
| FlashAttention 2 | Linux-only, cannot build on Windows | Use SDPA via --no-flash-attn option |
| SoX dependency | Assumes installation | Works without it (warnings can be ignored) |
| RTX 50 series | Not supported | Nightly build instructions included |
| Environment setup | conda (Linux-oriented) | venv (Windows standard) |
Note: The --no-flash-attn option is required. Without it, the application will fail to start with a FlashAttention 2 import error.
During the development of this fork, the following Windows-specific issues were identified and resolved:
Problem: Stable PyTorch does not support the latest GPUs like RTX 5090 (sm_120)
RuntimeError: CUDA error: no kernel image is available for execution on the device
NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation.
Solution: Use PyTorch nightly (cu128)
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128

Problem: FlashAttention 2 is Linux-only and cannot be built or run on Windows
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error:
the package flash_attn seems to be not installed.
Solution: Use PyTorch's standard SDPA (Scaled Dot Product Attention) via the --no-flash-attn option
| Attention Implementation | Speed | Memory Efficiency | Windows Support |
|---|---|---|---|
| flash_attention_2 | Fastest | Best | Not supported |
| sdpa (PyTorch native) | Fast | Good | Supported |
| eager (standard) | Normal | Normal | Supported |
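The choice between the backends in the table above can be sketched as a small selection function. In Hugging Face Transformers these names are passed as the `attn_implementation` argument to `from_pretrained`; how this fork wires `--no-flash-attn` internally is an assumption here, and `pick_attn_implementation` is a hypothetical helper. Unlike the application, which fails with an import error when the flag is omitted, this sketch falls back to SDPA whenever `flash_attn` is not installed.

```python
import importlib.util


def pick_attn_implementation(no_flash_attn):
    """Choose an attention backend name, preferring FlashAttention 2 when usable."""
    flash_installed = importlib.util.find_spec("flash_attn") is not None
    if not no_flash_attn and flash_installed:
        return "flash_attention_2"
    return "sdpa"  # PyTorch-native scaled dot product attention
```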
Problem: Some audio processing requires SoX, but it is not installed by default on Windows
SoX could not be found!
Solution: Qwen3-TTS core functionality works without SoX. Warnings can be safely ignored.
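To confirm whether the warning applies on a given machine, the presence of a SoX executable can be checked from Python. This is a minimal sketch; `sox_available` is a hypothetical helper and only looks for `sox` on `PATH`.

```python
import shutil


def sox_available():
    """Return True if a `sox` executable is found on PATH."""
    return shutil.which("sox") is not None
```

A result of `False` matches the situation described above, where the warning appears but core functionality is unaffected.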
Problem: In Japanese Windows environments, non-ASCII characters are garbled due to cp932 encoding
UnicodeEncodeError: 'cp932' codec can't encode character...
Solution: Explicitly set UTF-8 encoding
import sys
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8', errors='replace')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8', errors='replace')

Alternatively, run chcp 65001 in the batch file.
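On Python 3.7 and later, the same effect can be achieved without replacing the stream objects, using `TextIOWrapper.reconfigure`. This is a sketch of an alternative, not the fork's code; `force_utf8` is a hypothetical helper name.

```python
def force_utf8(stream):
    """Switch a text stream to UTF-8 in place, replacing unencodable characters."""
    # reconfigure() exists on io.TextIOWrapper (Python 3.7+); other stream
    # types are returned unchanged
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8", errors="replace")
    return stream
```

Typical usage is `force_utf8(sys.stdout)` and `force_utf8(sys.stderr)` at the top of the entry script, before anything is printed.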
Problem: Version mismatch warning between PyTorch nightly and torchao
Skipping import of cpp extensions due to incompatible torch version 2.11.0.dev+cu128 for torchao version 0.15.0
Solution: This is only a warning and does not affect operation. It can be safely ignored.
Problem: Creating symbolic links on Windows requires administrator privileges
huggingface_hub cache-system uses symlinks by default...
Solution:
- Enable Developer Mode in Windows Settings
- Or ignore the warning (no impact on operation)
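Whether the current user can create symbolic links can be tested directly, which shows if enabling Developer Mode took effect. This is a minimal sketch; `can_create_symlinks` is a hypothetical helper that only probes a temporary directory.

```python
import os
import tempfile


def can_create_symlinks():
    """Return True if the current user can create symbolic links in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w", encoding="utf-8") as f:
            f.write("test")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            return False  # typically a privilege error on Windows
        return os.path.islink(link)
```

If this returns `False` and the warning is a nuisance, huggingface_hub documents the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable for silencing it; it needs to be set before the library is imported.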
To verify that the environment is set up correctly:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
if torch.cuda.is_available():  # the device queries below raise an error without a CUDA device
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Expected output (for RTX 5090):
PyTorch version: 2.11.0.dev20260123+cu128
CUDA available: True
CUDA version: 12.8
GPU: NVIDIA GeForce RTX 5090
GPU Memory: 31.8 GB
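The "no kernel image is available" error occurs when the installed PyTorch build contains no kernels for the GPU's compute capability, so it can help to compare the two explicitly. This is a minimal sketch: `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list` are standard PyTorch calls, while `arch_name`, `is_arch_supported`, and `check_current_gpu` are hypothetical helpers. The exact-match test is a simplification, since a build may also ship `compute_XX` PTX entries.

```python
def arch_name(capability):
    """Convert a (major, minor) compute capability into an 'sm_XY' string."""
    major, minor = capability
    return f"sm_{major}{minor}"


def is_arch_supported(capability, arch_list):
    """Return True if the build's architecture list names this GPU's architecture."""
    return arch_name(capability) in arch_list


def check_current_gpu():
    """Print whether the installed PyTorch build targets the first CUDA device."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not torch.cuda.is_available():
        print("CUDA is not available")
        return
    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    status = "supported" if is_arch_supported(capability, arch_list) else "NOT listed"
    print(f"{arch_name(capability)}: {status} (build targets: {', '.join(arch_list)})")
```

For an RTX 5090 the capability is `(12, 0)`, which maps to `sm_120`. If that name is not among the build targets, installing the nightly cu128 build is the fix described earlier.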
| Symptom | Cause | Solution |
|---|---|---|
| no kernel image is available | Using stable PyTorch | Install nightly (cu128) |
| FlashAttention2 cannot be used | FlashAttention not supported on Windows | Add --no-flash-attn option |
| SoX could not be found | SoX not installed | Can be ignored (no impact on core functionality) |
| GPU not recognized | CUDA driver outdated | Install latest driver |
| Character garbling | cp932 encoding | chcp 65001 or UTF-8 setting |
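The message-based rows of the table above can be turned into a small lookup for scanning console output. This is an illustrative sketch only: `KNOWN_ISSUES` and `diagnose` are hypothetical names, and the match is a plain substring test against the messages quoted in this document.

```python
# (symptom substring, suggested fix) pairs taken from the troubleshooting table
KNOWN_ISSUES = [
    ("no kernel image is available", "Install the PyTorch nightly build (cu128)."),
    ("FlashAttention2", "Add the --no-flash-attn option."),
    ("SoX could not be found", "Safe to ignore; core functionality is unaffected."),
    ("cp932", "Run chcp 65001 or set UTF-8 encoding."),
]


def diagnose(log_text):
    """Return the suggested fixes whose symptom text appears in the log."""
    return [fix for symptom, fix in KNOWN_ISSUES if symptom in log_text]
```

Passing the captured console output to `diagnose` returns zero or more suggestions in table order.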
This project is released under the Apache License 2.0.
| Software | License | Copyright |
|---|---|---|
| Qwen3-TTS | Apache License 2.0 | Copyright 2026 Alibaba Cloud |
| faster-whisper | MIT License | Copyright SYSTRAN |
| OpenAI Whisper | MIT License | Copyright OpenAI |
For details, see the NOTICE file.
- The audio generated by this system is automatically produced by an AI model and may contain inaccurate or inappropriate content
- Generated audio does not represent the views of the developers and does not constitute professional advice
- Users assume all risks and responsibilities related to the use, distribution, or reliance on generated audio
- Cloning or using another person's voice without their consent may constitute a violation of portrait rights and publicity rights
- Please use the voice cloning feature only for lawful purposes with the consent of the person whose voice is being cloned
- Use for malicious purposes such as fraud, impersonation, defamation, or deepfakes is strictly prohibited
- The developers assume no liability for any damages arising from the use of this software
- All legal liability arising from illegal use shall be borne by the user
- This software is provided "as is" without any warranty
- Original developer: Alibaba Cloud Qwen Team
- Original repository: QwenLM/Qwen3-TTS
To cite the original Qwen3-TTS:
@article{Qwen3-TTS,
title={Qwen3-TTS Technical Report},
author={Hangrui Hu and Xinfa Zhu and Ting He and Dake Guo and Bin Zhang and Xiong Wang and Zhifang Guo and Ziyue Jiang and Hongkun Hao and Zishan Guo and Xinyu Zhang and Pei Zhang and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
journal={arXiv preprint arXiv:2601.15621},
year={2026}
}


