Production-grade AI voice agent built on FreeSWITCH — handles real phone calls end-to-end using Speech-to-Text, LLM reasoning, and Text-to-Speech.
⚡ This project is under active development and evolving into a full AI Voice Platform (VoIP + LLM + Multi-Agent system)
When someone calls in, this system handles the entire conversation autonomously:
- FreeSWITCH receives the SIP call and streams raw audio
- STT Service (faster-whisper, CPU int8) converts live speech to text — with VAD filtering and in-process resampling
- Agent Service sends the transcript to an LLM for context-aware, intelligent response generation
- TTS Service (Piper TTS) converts the LLM response back into natural speech
- FreeSWITCH plays the audio back to the caller — completing the loop
🎧 Listen to a real call → — 175 seconds, 6–8 turns, ~1 s average latency on CPU

End-to-end latency: ~1.5–2 seconds on CPU | ~800 ms on GPU
- Real-time SIP call handling via FreeSWITCH
- Live Speech-to-Text via faster-whisper (CPU int8 — no GPU needed)
- VAD filter — silence detection cuts unnecessary STT calls
- In-process audio resampling (scipy/soundfile — no ffmpeg, ~5ms vs ~120ms)
- VoIP-specific STT corrections (FreeSWITCH, SIP trunk mis-transcription fixes)
- LLM-based response generation (Groq / OpenAI — pluggable)
- TTS-friendly LLM output — strips markdown, converts lists to spoken sentences
- Text-to-Speech playback (Piper TTS — naturalness tuning: noise_scale, length_scale)
- Pipeline Service :8004 — single HTTP call from FreeSWITCH handles STT→LLM→TTS
- Async httpx with persistent connection pool — eliminates per-request TCP overhead
- Per-stage latency logging (STT / Agent / TTS timings in logs)
- Multi-turn conversation memory per call (30-min session TTL)
- CPU & GPU auto-detection
- Supervisor-based service orchestration
- Call simulator for local testing
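The VoIP-specific STT corrections and TTS-friendly output listed above can be illustrated with a small sketch. The function names and the correction map below are illustrative assumptions, not the project's actual code:

```python
import re

# Hypothetical correction map for common VoIP mis-transcriptions
# (illustrative entries; the real service may use a different list).
STT_CORRECTIONS = {
    r"\bfree switch\b": "FreeSWITCH",
    r"\bsip trunk\b": "SIP trunk",
    r"\bsip\b": "SIP",
}

def fix_voip_terms(transcript: str) -> str:
    """Apply domain-specific corrections to a raw STT transcript."""
    for pattern, replacement in STT_CORRECTIONS.items():
        transcript = re.sub(pattern, replacement, transcript, flags=re.IGNORECASE)
    return transcript

def make_tts_friendly(text: str) -> str:
    """Strip markdown markers and flatten list bullets into spoken sentences."""
    text = re.sub(r"[*_`#]+", "", text)                   # remove markdown markers
    text = re.sub(r"^\s*[-•]\s*", "", text, flags=re.M)   # drop bullet prefixes
    return re.sub(r"\s*\n\s*", " ", text).strip()         # collapse lines for speech

print(fix_voip_terms("my free switch sip trunk is down"))
# → "my FreeSWITCH SIP trunk is down"
```

The same idea extends to any domain vocabulary Whisper tends to mangle; the map just needs longer phrases ordered before their substrings.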
- Streaming LLM — first audio token target ~300ms
- Barge-in — caller interrupts TTS mid-sentence
- Docker-based deployment
- Streaming STT for lower latency
- Multi-language support
- Call analytics dashboard
- Multi-agent orchestration (Supervisor + Agents)
- CRM integration (HubSpot / Salesforce)
- Voice biometrics / speaker identification
- SaaS deployment (multi-tenant AI voice platform)
Replace legacy IVR systems with an AI agent that understands natural language — no more "press 1 for billing". Handles inbound support calls, account queries, and call routing without a human agent in the loop.
Automate outbound calls for appointment reminders, payment follow-ups, and customer surveys. The agent handles natural responses, objections, and can escalate to a live agent when needed.
Front desk bot for handling room bookings, check-in queries, restaurant reservations, and local recommendations — available 24/7 without staffing costs.
Automate appointment booking, rescheduling, and patient reminders. The agent can handle FAQs, collect basic intake information, and transfer complex cases to staff.
Handle balance enquiries, transaction alerts, EMI reminders, and basic account support calls — integrated with your existing telephony infrastructure via SIP.
Automatically call and qualify inbound leads, ask discovery questions, schedule site visits, and log outcomes — before a human agent ever picks up the phone.
Caller (SIP Phone / PSTN)
|
v
FreeSWITCH (SIP + RTP)
|
v
Pipeline Service :8004
|
v
+-------------+
| Flow |
| |
| 1. STT | ---> STT Service :8001 (Whisper)
| 2. Agent | ---> Agent Service :8003 (LLM - Groq/OpenAI)
| 3. TTS | ---> TTS Service :8002 (Piper)
+-------------+
|
v
Audio Response → FreeSWITCH → Caller
- Incoming SIP call hits FreeSWITCH
- FreeSWITCH records/streams caller audio
- Audio is sent to a single Agent API endpoint
- Agent Service internally processes:
- Speech-to-Text (Whisper)
- LLM response generation
- Text-to-Speech synthesis
- Final audio response is returned to FreeSWITCH
- FreeSWITCH plays the response to the caller
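The feature list mentions multi-turn conversation memory per call with a 30-minute session TTL. One way to sketch that store (class and field names here are illustrative assumptions, not the project's actual implementation):

```python
import time

SESSION_TTL_SECS = 30 * 60  # 30-minute session TTL, as in the feature list

class SessionStore:
    """Keeps conversation history per call, evicting idle sessions."""

    def __init__(self, ttl: float = SESSION_TTL_SECS):
        self.ttl = ttl
        self._sessions: dict[str, dict] = {}  # session_id -> {"last_seen", "history"}

    def append(self, session_id: str, role: str, text: str) -> None:
        self._evict_expired()
        entry = self._sessions.setdefault(session_id, {"last_seen": 0.0, "history": []})
        entry["last_seen"] = time.monotonic()
        entry["history"].append({"role": role, "content": text})

    def history(self, session_id: str) -> list:
        self._evict_expired()
        entry = self._sessions.get(session_id)
        return entry["history"] if entry else []

    def _evict_expired(self) -> None:
        now = time.monotonic()
        stale = [s for s, e in self._sessions.items() if now - e["last_seen"] > self.ttl]
        for sid in stale:
            del self._sessions[sid]
```

A per-call `session_id` keys the history, so the LLM sees prior turns; expiry on access keeps memory bounded without a background reaper.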
Instead of calling multiple services (STT → LLM → TTS) from FreeSWITCH,
the system uses a unified Agent API.
Advantages:
- Reduces network latency (only one external API call)
- Improves response time for real-time conversations
- Keeps FreeSWITCH logic simple and clean
- Enables internal optimization (GPU processing, batching, caching)
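The unified flow can be sketched with stubbed stages. In the real system each stage is a separate HTTP service (:8001, :8003, :8002); the in-process stubs below only show the chaining behind the single pipeline call:

```python
import asyncio

# Stubbed stages standing in for the real HTTP calls to the STT, Agent, and TTS services.
async def stt(audio: bytes) -> str:
    return "what is my balance"               # pretend Whisper transcribed the audio

async def agent(text: str, session_id: str) -> str:
    return f"Here is the answer to: {text}"   # pretend LLM reply

async def tts(text: str) -> bytes:
    return text.encode()                      # pretend Piper synthesized audio

async def pipeline(audio: bytes, session_id: str) -> bytes:
    """Single entry point: STT -> Agent -> TTS, one call from FreeSWITCH."""
    transcript = await stt(audio)
    reply = await agent(transcript, session_id)
    return await tts(reply)

audio_out = asyncio.run(pipeline(b"\x00" * 160, "call-42"))
```

In the real service each stage would be an `httpx.AsyncClient` request, with one long-lived client reused across requests so the persistent connection pool avoids per-request TCP setup.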
| Service | Port | Technology | Description |
|---|---|---|---|
| FreeSWITCH | 5060 (SIP), 16384–16400 (RTP) | FreeSWITCH | SIP registration, call routing, media handling, ESL |
| STT Service | 8001 | faster-whisper | Real-time speech-to-text. Auto-detects CPU or GPU. |
| TTS Service | 8002 | Piper TTS | Converts LLM text response to audio |
| Agent Service | 8003 | FastAPI + LLM | Core call logic — connects STT, LLM, TTS, and FreeSWITCH |
| Pipeline Service | 8004 | FastAPI | Orchestrates STT → Agent → TTS in a single HTTP call |
| Simulator Service | — | Custom | Call simulator for local testing without a real SIP endpoint |
- Python — FastAPI, Uvicorn, faster-whisper (CPU int8), Piper TTS, httpx, scipy, soundfile, numpy
- Lua — FreeSWITCH call scripting and dialplan logic
- FreeSWITCH — SIP/RTP media server, ESL integration
- Groq / OpenAI — LLM backend (pluggable, llama-3.1-8b-instant default)
- Homer + sngrep — SIP capture and call latency tracing
- Supervisor — Process orchestration for all services
- Ubuntu 22.04 LTS
- Root access
- Public IP (recommended for SIP registration)
- Python 3.10+
| Port Range | Protocol | Service |
|---|---|---|
| 5060 | UDP/TCP | FreeSWITCH SIP (Signaling) |
| 16384–16400 | UDP | RTP (Media) |
| 8001 | TCP | STT Service |
| 8002 | TCP | TTS Service |
| 8003 | TCP | Agent Service |
| 8004 | TCP | Pipeline Service |
| 8021 | TCP | FreeSWITCH ESL (Internal only) |
cd /root
git clone https://github.com/doshiankit/ai-voice-agent.git
cd ai-voice-agent
cp .env.example .env
# Edit .env and add your API keys
chmod +x scripts/install.sh scripts/freeswitch_install.sh
./scripts/install.sh

Copy .env.example to .env and fill in your values:
cp .env.example .env

| Variable | Required | Description |
|---|---|---|
| `GROQ_API_KEY` | Yes | LLM backend — get free key at console.groq.com |
| `GROQ_MODEL` | Yes | LLM model (default: llama-3.1-8b-instant) |
| `OPENAI_API_KEY` | Optional | Alternative LLM backend |
| `WHISPER_MODEL` | Yes | Model size: tiny / base / small / medium |
| `AGENT_SYSTEM_PROMPT` | Optional | Customize AI persona for your use case |
| `STT_URL` | Yes | STT service URL (default: http://127.0.0.1:8001) |
| `AGENT_URL` | Yes | Agent service URL (default: http://127.0.0.1:8003) |
| `TTS_URL` | Yes | TTS service URL (default: http://127.0.0.1:8002) |
| `VOICEBOT_PIPELINE_URL` | Yes | Pipeline endpoint called by FreeSWITCH |
| `VOICEBOT_RECORD_MAX_SECS` | Yes | Max recording seconds per turn (default: 6) |
| `VOICEBOT_RECORD_SIL_MS` | Yes | Silence threshold in ms to stop recording (default: 500) |
| `VOICEBOT_MAX_TURNS` | Yes | Max conversation turns per call (default: 8) |
| `VOICEBOT_HELLO_TEXT` | Optional | Greeting message spoken to the caller |
| `VOICEBOT_BYE_TEXT` | Optional | Goodbye message spoken to the caller |
The `install.sh` script fully automates setup:
- Installs system packages (build-essential, Python3, pip, ffmpeg, etc.)
- Installs FreeSWITCH with required modules
- Creates isolated Python virtual environments per service
- Auto-detects CPU or GPU — installs appropriate PyTorch version
- Pins NumPy to `1.26.4` for the STT service (torch 2.2.1 + faster-whisper compatibility)
- Configures Supervisor to manage all services
- Starts all services automatically on completion
# Check all services are running
supervisorctl status
# Expected output:
# agent_service RUNNING
# stt_service RUNNING
# tts_service RUNNING
# simulator_service RUNNING
# pipeline_service     RUNNING

# Restart all services
supervisorctl restart all
# Stop all services
supervisorctl stop all
# Restart a single service
supervisorctl restart agent_service
# View live logs
tail -f /var/log/supervisor/agent_service.log

When deploying on Vast.ai GPU instances, open the following ports:
5060 UDP/TCP
16384-16400 UDP
8001-8004 TCP
ai-voice-agent/
├── services/
│ ├── stt_service/ # Whisper speech-to-text
│ ├── tts_service/ # Piper text-to-speech
│ ├── agent_service/ # LLM call logic
│ ├── pipeline_service/ # Orchestrates STT → Agent → TTS
│ └── simulator_service/ # Call simulator for testing
│
├── freeswitch/ # FreeSWITCH dialplan + config + Lua scripts
│
├── scripts/
│ ├── install.sh # Main installer
│ ├── freeswitch_install.sh
│ ├── start_all.sh
│ └── start_supervisor.sh
│
├── config/ # Service configuration files
├── supervisor/ # Supervisor process configs
│
├── docker-compose.yml # Docker orchestration
├── requirements.txt # Base Python dependencies
├── .env.example # Environment variable template
└── test_config.py # Installation verification script
Key packages pinned in requirements.txt:
| Package | Version | Purpose |
|---|---|---|
| fastapi | 0.104.1 | Service API framework |
| uvicorn | 0.24.0 | ASGI server |
| faster-whisper | 1.0.3 | CPU-optimised speech-to-text (int8) |
| scipy | 1.15.3 | Audio resampling (in-process, no ffmpeg) |
| soundfile | 0.13.1 | WAV file I/O |
| numpy | 1.26.4 | Numerical computing (STT — torch 2.2.1 compatible) |
| httpx | 0.28.1 | Async HTTP client with connection pooling |
| pydantic | 2.5.0 | Data validation |
| tiktoken | 0.12.0 | Token counting |
Note: Virtual environments are not committed to git. They are created per service by `install.sh`.
# Build and start all services
docker-compose up --build
# Run in background
docker-compose up -d
# View logs
docker-compose logs -f agent_service

# Verify all services are running
supervisorctl status
# Test STT service directly
curl -X POST http://localhost:8001/transcribe \
-F "file=@test_audio.wav"
# Test TTS service directly
curl -G "http://localhost:8002/synthesize" \
--data-urlencode "text=Hello, how can I help you today?" \
--data-urlencode "format=wav" \
--data-urlencode "sample_rate=8000" \
-o test_output.wav
# Test Agent service directly
curl -X POST http://localhost:8003/chat \
-H "Content-Type: application/json" \
-d '{"text": "My SIP trunk calls are dropping"}'
# Test full Pipeline (STT + Agent + TTS in one call)
curl -X POST http://localhost:8004/pipeline \
-F "audio=@test_audio.wav" \
-F "session_id=test123" \
-o pipeline_response.wav
# Check health of all services
curl http://localhost:8001/health
curl http://localhost:8002/health
curl http://localhost:8003/health
curl http://localhost:8004/health

- CPU and GPU modes are handled automatically by the installer
- Designed for single-server deployment
- STT service requires NumPy `1.26.4` — torch 2.2.1 is not compatible with NumPy 2.x
- FreeSWITCH ESL port `8021` should not be exposed publicly
Contributions are welcome!
If you have ideas to improve performance, scalability, or features, feel free to:
- Open an issue for discussion
- Submit a pull request
- Suggest new use cases or integrations
If you find this project useful or have suggestions, feel free to share feedback via issues.
- Multi-agent orchestration (Supervisor + Agents)
- Emotion-aware voice responses
- Advanced real-time call analytics
- Multi-tenant SaaS deployment
- Voice personalization and speaker recognition
If you found this project helpful, consider giving it a star ⭐ — it helps others discover it.
Ankit Doshi — 13 years VoIP/Telecom engineering
FreeSWITCH | SIP | AI Voice | PHP | Python | Lua