An AI-powered tool to transcribe and summarize videos and podcasts — supports YouTube, TikTok, Bilibili, Apple Podcasts, SoundCloud, and 30+ platforms.
- 🎥 Multi-Platform Support: Works with YouTube, TikTok, Bilibili, Apple Podcasts, SoundCloud, and 30+ more
- ⚡ Subtitle-First Architecture: For platforms with native subtitles (e.g. YouTube), transcripts are extracted directly with no audio download. Whisper is used only as a fallback, so videos that have subtitles finish in seconds rather than minutes.
- 🗣️ Intelligent Transcription: High-accuracy speech-to-text using Faster-Whisper when subtitles aren't available
- 🤖 AI Text Optimization: Automatic typo correction, sentence completion, and intelligent paragraphing
- 🌍 Multi-Language Summaries: Generate intelligent summaries in multiple languages
- 🔧 Bring Your Own Model: Configure any OpenAI-compatible API endpoint (OpenAI, OpenRouter, local LLM, etc.) directly in the UI — enter your API Base URL and API Key, then click Fetch to auto-discover all available models and select the one you want
- ⚙️ Conditional Translation: Auto-translates the transcript when the summary language differs from the source language
- 📱 Mobile-Friendly: Responsive interface that works on phones and tablets
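The subtitle-first decision described in the feature list can be sketched as a small function. This is a minimal illustration, not the project's actual code: it assumes a metadata dictionary shaped like the one yt-dlp returns from `extract_info(url, download=False)`, where `subtitles` and `automatic_captions` map language codes to track lists. The name `choose_mode` and the preference order are hypothetical.

```python
from typing import Optional, Tuple


def choose_mode(info: dict, preferred_langs: Tuple[str, ...] = ("en",)) -> Tuple[str, Optional[str]]:
    """Return ("subtitle", lang) if a usable track exists, else ("whisper", None)."""
    # Uploaded subtitles are checked before auto-generated captions.
    for key in ("subtitles", "automatic_captions"):
        tracks = info.get(key) or {}
        # Prefer the requested languages when they have at least one track.
        for lang in preferred_langs:
            if tracks.get(lang):
                return "subtitle", lang
        # Otherwise accept any language that has at least one track.
        for lang, entries in tracks.items():
            if entries:
                return "subtitle", lang
    # No subtitle track at all: fall back to audio download plus Whisper.
    return "whisper", None
```

With this shape, a video that only has auto-captions in Chinese would resolve to `("subtitle", "zh")`, while an audio-only podcast with no tracks would resolve to `("whisper", None)`.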
- Python 3.8+
- FFmpeg
- An API key from any OpenAI-compatible provider (OpenAI, OpenRouter, etc.) — configured directly in the UI, no server-side env var needed
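Before installing, the first two prerequisites can be checked from Python. The following is a minimal sketch using only the standard library; the function name `missing_prerequisites` is hypothetical and not part of the project.

```python
import shutil
import sys
from typing import List, Tuple


def missing_prerequisites(min_python: Tuple[int, int] = (3, 8)) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    # Compare the running interpreter against the documented minimum version.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ required, found {found}")
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg not found on PATH")
    return problems
```

Calling `missing_prerequisites()` and printing the result gives a quick pass/fail summary before running the install script.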
# Clone the repository
git clone https://github.com/wendy7756/AI-Video-Transcriber.git
cd AI-Video-Transcriber
# Run installation script
chmod +x install.sh
./install.sh
# Clone the repository
git clone https://github.com/wendy7756/AI-Video-Transcriber.git
cd AI-Video-Transcriber
# Using Docker Compose (easiest)
cp .env.example .env
# Edit .env file if you want server-side defaults (optional)
docker-compose up -d
# Or using Docker directly
docker build -t ai-video-transcriber .
docker run -p 8000:8000 ai-video-transcriber
- Install Python Dependencies
# Use a virtualenv: externally managed Pythons (PEP 668, e.g. Homebrew on macOS) block system-wide pip installs
python3 -m venv venv
source venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
- Install FFmpeg
# macOS
brew install ffmpeg
# Ubuntu/Debian
sudo apt update && sudo apt install ffmpeg
# CentOS/RHEL (ffmpeg is usually not in the base repositories; you may need to enable RPM Fusion first)
sudo yum install ffmpeg
- Configure Environment Variables (optional)
# If you prefer server-side defaults, set these — otherwise configure via the UI
export OPENAI_API_KEY="your_api_key_here"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1" # any OpenAI-compatible endpoint
python3 start.py
After the service starts, open your browser and visit http://localhost:8000
To avoid SSE disconnections during long processing, start in production mode (hot-reload disabled):
python3 start.py --prod
This keeps the SSE connection stable throughout long tasks (30–60+ min).
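To illustrate how a `--prod` flag can map to disabled hot-reload, here is a minimal sketch. It assumes the FastAPI app is served with uvicorn and lives at `backend.main:app`; both are assumptions based on the project layout, and the real `start.py` may work differently. The function names are hypothetical.

```python
import argparse
from typing import List, Optional


def reload_enabled(argv: Optional[List[str]] = None) -> bool:
    """Hot-reload stays on by default and is switched off by --prod."""
    parser = argparse.ArgumentParser(description="Start the transcriber server")
    parser.add_argument("--prod", action="store_true", help="disable hot-reload")
    args = parser.parse_args(argv)
    return not args.prod


def main(argv: Optional[List[str]] = None) -> None:
    # Imported lazily so the flag parsing above can be used without uvicorn.
    import uvicorn  # assumption: the server is uvicorn

    # A reload restarts the worker process, which drops open SSE streams;
    # running with reload=False avoids that during long transcriptions.
    uvicorn.run("backend.main:app", host="0.0.0.0", port=8000, reload=reload_enabled(argv))
```

The design point is that a file-watcher restart and a long-lived streaming response do not mix well, so long jobs are better run with reload off.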
source venv/bin/activate
export OPENAI_API_KEY=your_api_key_here # optional: server-side default
# export OPENAI_BASE_URL=https://openrouter.ai/api/v1 # optional: server-side default
python3 start.py --prod
- Enter Video URL: Paste a video link from YouTube, Bilibili, or other supported platforms
- Select Summary Language: Choose the output language from the dropdown next to the input area
- (Optional) Configure AI Model: Click AI Settings to expand the panel
- Enter your API Base URL (e.g. https://openrouter.ai/api/v1) and API Key
- Click Fetch to auto-load all models from that provider
- Select the model you want, or leave blank to use the server default
- Start Processing: Click the Transcribe button. The progress bar shows which mode is active:
- ⚡ Subtitle (green) — native subtitles found, transcript extracted in seconds
- 🎙 Whisper (amber) — no subtitles available, downloading audio for transcription
- View Results: Review the optimized transcript and AI summary
- If transcript language ≠ selected summary language, a Translation tab appears automatically
- Download Files: Save Markdown-formatted files (Transcript / Translation / Summary)
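The rule behind the Translation tab in the steps above (translate only when the transcript language differs from the selected summary language) can be sketched as follows. This is an illustrative assumption about how language codes might be compared, not the project's actual logic; both function names are hypothetical.

```python
def base_language(code: str) -> str:
    """Reduce a tag such as 'zh-CN' or 'en_US' to its primary subtag ('zh', 'en')."""
    return code.replace("_", "-").split("-")[0].strip().lower()


def needs_translation(transcript_lang: str, summary_lang: str) -> bool:
    """True when the transcript and the requested summary are in different languages."""
    return base_language(transcript_lang) != base_language(summary_lang)
```

Comparing only the primary subtag means regional variants such as `en-US` and `en-GB` count as the same language. It also treats `zh-CN` and `zh-TW` as equal, which may or may not be what a given deployment wants.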
- FastAPI: Modern Python web framework
- yt-dlp: Video downloading and processing
- Faster-Whisper: Efficient speech transcription
- OpenAI API: Intelligent text summarization
- HTML5 + CSS3: Responsive interface design
- JavaScript (ES6+): Modern frontend interactions
- Marked.js: Markdown rendering
- Font Awesome: Icon library
AI-Video-Transcriber/
├── backend/ # Backend code
│ ├── main.py # FastAPI main application
│ ├── video_processor.py # Video processing module
│ ├── transcriber.py # Transcription module
│ ├── summarizer.py # Summary module
│ └── translator.py # Translation module
├── static/ # Frontend files
│ ├── index.html # Main page
│ └── app.js # Frontend logic
├── temp/ # Temporary files directory
├── Dockerfile # Docker image configuration
├── docker-compose.yml # Docker Compose configuration
├── .dockerignore # Docker ignore rules
├── .env.example # Environment variables template
├── requirements.txt # Python dependencies
├── start.py # Startup script
└── README.md # Project documentation
| Variable | Description | Default | Required |
|---|---|---|---|
| OPENAI_API_KEY | API key (server-side default) | - | No (can be set in the UI instead) |
| HOST | Server address | 0.0.0.0 | No |
| PORT | Server port | 8000 | No |
| WHISPER_MODEL_SIZE | Whisper model size | base | No |
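As an illustration of how these variables and their defaults fit together, the sketch below loads them into a small settings object. The `Settings` class, `load_settings` function, and the list of accepted model sizes are hypothetical; the project's own configuration code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: only the five sizes named in this README are accepted.
VALID_MODEL_SIZES = ("tiny", "base", "small", "medium", "large")


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]
    host: str
    port: int
    whisper_model_size: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    size = env.get("WHISPER_MODEL_SIZE", "base")
    if size not in VALID_MODEL_SIZES:
        raise ValueError(f"Unsupported WHISPER_MODEL_SIZE: {size!r}")
    return Settings(
        # An empty string is treated the same as an unset key.
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        whisper_model_size=size,
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to exercise, for example `load_settings({"PORT": "8001"})`.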
| Model | Parameters | English-only | Multilingual | Speed | Memory Usage |
|---|---|---|---|---|---|
| tiny | 39 M | ✓ | ✓ | Fast | Low |
| base | 74 M | ✓ | ✓ | Medium | Low |
| small | 244 M | ✓ | ✓ | Medium | Medium |
| medium | 769 M | ✓ | ✓ | Slow | Medium |
| large | 1550 M | ✗ | ✓ | Very Slow | High |
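For reference, a model size from the table is passed to Faster-Whisper roughly as shown below. This is a minimal sketch that assumes CPU inference with int8 quantization; `transcribe_file` and `format_line` are hypothetical helper names, and the project's transcriber module may use different options.

```python
from typing import List, Tuple


def format_line(start: float, end: float, text: str) -> str:
    """Render one segment as a timestamped transcript line."""
    return f"[{start:.2f}s -> {end:.2f}s] {text.strip()}"


def transcribe_file(audio_path: str, model_size: str = "base") -> Tuple[str, List[str]]:
    """Return the detected language code and the timestamped transcript lines."""
    # Imported lazily; requires `pip install faster-whisper`.
    from faster_whisper import WhisperModel

    # Assumption: CPU with int8 weights, a common low-memory configuration.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    # `segments` is a generator, so transcription runs as it is consumed.
    segments, info = model.transcribe(audio_path, beam_size=5)
    lines = [format_line(seg.start, seg.end, seg.text) for seg in segments]
    return info.language, lines
```

Switching `model_size` from `base` to `tiny` trades some accuracy for speed and memory, which is the trade-off the table summarizes.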
A: Transcription speed depends on video length, Whisper model size, and hardware performance. Try using smaller models (like tiny or base) to improve speed.
A: All platforms supported by yt-dlp, including but not limited to: YouTube, TikTok, Facebook, Instagram, Twitter, Bilibili, Youku, iQiyi, Tencent Video, etc.
A: AI features require an API key from any OpenAI-compatible provider (OpenAI, OpenRouter, etc.). You can enter it directly in the AI Settings panel in the UI — no server restart needed. Alternatively, set OPENAI_API_KEY as an environment variable for a server-side default.
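To check a Base URL and API key outside the UI, you can query the provider's model list yourself. The sketch below assumes the provider follows the OpenAI convention of `GET {base_url}/models` returning a JSON object with a `data` list of entries that each have an `id`. The function names are hypothetical, and this is not necessarily how the Fetch button is implemented.

```python
import json
import urllib.request
from typing import List


def models_url(base_url: str) -> str:
    """Build the model-list URL, tolerating a trailing slash on the base URL."""
    return base_url.rstrip("/") + "/models"


def parse_model_ids(payload: dict) -> List[str]:
    """Extract and sort the model ids from an OpenAI-style list response."""
    return sorted(item["id"] for item in payload.get("data", []) if "id" in item)


def fetch_model_ids(base_url: str, api_key: str, timeout: float = 10.0) -> List[str]:
    """Call the endpoint with a bearer token and return the available model ids."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_model_ids(json.load(response))
```

For example, `fetch_model_ids("https://openrouter.ai/api/v1", "your_api_key_here")` should return a list of ids if the key and URL are valid; an HTTP 401 usually points to a bad key.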
A: In most cases this is an environment configuration issue rather than a code bug. Please check:
- Ensure a virtualenv is activated: source venv/bin/activate
- Install deps inside the venv: pip install -r requirements.txt
- Configure your API key in the AI Settings panel, or set OPENAI_API_KEY as an env var
- Install FFmpeg: brew install ffmpeg (macOS) / sudo apt install ffmpeg (Debian/Ubuntu)
- If port 8000 is occupied, stop the old process or change PORT
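The last item in the checklist, an occupied port, can be verified with a few lines of standard-library Python. This is a minimal sketch; `port_in_use` is a hypothetical helper, not part of the project.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If `port_in_use(8000)` returns `True` before you start the service, another process holds the port, and setting `PORT` to a free value is the simplest workaround.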
A: The system can process videos of any length, but processing time will increase accordingly. For very long videos, consider using smaller Whisper models.
A: Docker provides the easiest deployment method:
Prerequisites:
- Install Docker Desktop from https://www.docker.com/products/docker-desktop/
- Ensure Docker service is running
Quick Start:
# Clone and setup
git clone https://github.com/wendy7756/AI-Video-Transcriber.git
cd AI-Video-Transcriber
cp .env.example .env
# Edit .env file to set server-side defaults (optional)
# Start with Docker Compose (recommended)
docker-compose up -d
# Or build and run manually
docker build -t ai-video-transcriber .
docker run -p 8000:8000 --env-file .env ai-video-transcriber
Common Docker Issues:
- Port conflict: Change the port mapping to -p 8001:8000 if port 8000 is occupied
- Permission denied: Ensure Docker Desktop is running and you have proper permissions
- Build fails: Check disk space (need ~2GB free) and network connection
- Container won't start: Check Docker logs with docker logs <container_id>
Docker Commands:
# View running containers
docker ps
# Check container logs
docker logs ai-video-transcriber-ai-video-transcriber-1
# Stop service
docker-compose down
# Rebuild after changes
docker-compose build --no-cache
A: Memory usage varies depending on the deployment method and workload:
Docker Deployment:
- Base memory: ~128MB for idle container
- During processing: 500MB - 2GB depending on video length and Whisper model
- Docker image size: ~1.6GB disk space required
- Recommended: 4GB+ RAM for smooth operation
Traditional Deployment:
- Base memory: ~50-100MB for FastAPI server
- Whisper models memory usage:
- tiny: ~150MB
- base: ~250MB
- small: ~750MB
- medium: ~1.5GB
- large: ~3GB
- Peak usage: Base + Model + Video processing (~500MB additional)
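The peak-usage formula above (base + model + about 500MB for video processing) can be turned into a quick estimate. The sketch below uses the approximate figures from this section, rounding 1.5GB and 3GB to 1500MB and 3000MB and taking the upper end of the 50-100MB server range. The names are hypothetical and the numbers are rough guides, not measurements.

```python
from typing import Optional

# Approximate model memory in MB, taken from the list in this section.
MODEL_MEMORY_MB = {"tiny": 150, "base": 250, "small": 750, "medium": 1500, "large": 3000}


def estimate_peak_mb(model_size: str, server_mb: int = 100, processing_mb: int = 500) -> int:
    """Peak usage = FastAPI server + Whisper model + video processing overhead."""
    return server_mb + MODEL_MEMORY_MB[model_size] + processing_mb


def largest_model_within(budget_mb: int) -> Optional[str]:
    """Pick the largest model whose estimated peak fits in the memory budget."""
    # The dictionary is ordered from smallest to largest model.
    fitting = [name for name in MODEL_MEMORY_MB if estimate_peak_mb(name) <= budget_mb]
    return fitting[-1] if fitting else None
```

Under these assumptions, a 2GB budget (2048MB) leaves room for `small` but not `medium`, whose estimated peak is about 2100MB.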
Memory Optimization Tips:
# Use smaller Whisper model to reduce memory usage
WHISPER_MODEL_SIZE=tiny # or base
# For Docker, limit container memory if needed
docker run -m 1g -p 8000:8000 --env-file .env ai-video-transcriber
# Monitor memory usage
docker stats ai-video-transcriber-ai-video-transcriber-1
A: If you encounter network-related errors during video downloading or API calls, try these solutions:
Common Network Issues:
- Video download fails with "Unable to extract" or timeout errors
- OpenAI API calls return connection timeout or DNS resolution failures
- Docker image pull fails or is extremely slow
Solutions:
- Switch VPN/Proxy: Try connecting to a different VPN server or switch your proxy settings
- Check Network Stability: Ensure your internet connection is stable
- Retry After Network Change: Wait 30-60 seconds after changing network settings before retrying
- Use Alternative Endpoints: If using custom OpenAI endpoints, verify they're accessible from your network
- Docker Network Issues: Restart Docker Desktop if container networking fails
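The "wait, then retry" advice above can also be automated when scripting against the service or a provider. The helper below is a generic sketch, not project code: `retry` is a hypothetical name, the 30-second delay mirrors the lower end of the suggested wait, and only `OSError` subclasses (which include connection and timeout errors) are retried by default.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 30.0,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            # Re-raise the last failure so the caller sees the real error.
            if attempt == attempts:
                raise
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists so the wait can be replaced in tests or shortened in scripts without changing the retry logic.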
Quick Network Test:
# Test video platform access
curl -I https://www.youtube.com/
# Test your AI provider endpoint
curl -I https://openrouter.ai
# Test Docker Hub access
docker pull hello-world
- Supports 100+ languages through Whisper
- Automatic language detection
- High accuracy for major languages
- English
- Chinese (Simplified)
- Japanese
- Korean
- Spanish
- French
- German
- Portuguese
- Russian
- Arabic
- And more...
Hardware Requirements:
- Minimum: 4GB RAM, dual-core CPU
- Recommended: 8GB RAM, quad-core CPU
- Ideal: 16GB RAM, multi-core CPU, SSD storage
Processing Time Estimates:
| Video Length | Subtitle Mode | Whisper Mode | Notes |
|---|---|---|---|
| 1 minute | ~5s | 30s–1 min | Subtitle mode needs no audio download |
| 5 minutes | ~10s | 2–5 min | YouTube auto-captions trigger subtitle mode |
| 15 minutes | ~15s | 5–15 min | Most YouTube videos support subtitle mode |
| 30+ minutes | ~20s | 15–60 min | Podcast/audio-only always uses Whisper |
We welcome Issues and Pull Requests!
- Fork the project
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- yt-dlp - Powerful video downloading tool
- Faster-Whisper - Efficient Whisper implementation
- FastAPI - Modern Python web framework
- OpenAI - Intelligent text processing API
For questions or suggestions, please submit an Issue or contact Wendy.
This tool is the open-source part of sipsip.ai.
The full product goes further:
- 📧 Daily email briefs — follow your favorite creators and get an AI-curated digest in your inbox every morning
- ⚡ Transcribe & summarize any video or podcast on demand
- 🌐 Multi-language support across all features
Free to start — no credit card required.
➡️ sipsip.ai
If you find this project helpful, please consider giving it a star!
