A voice agent that uses the OpenAI Realtime API for speech-to-text, generates responses with GPT, and streams natural-sounding replies via text-to-speech, all over WebSocket.

Voice Agent Prototype

Introduction

This prototype implements a real-time voice agent using the OpenAI ecosystem, built around Realtime Speech-to-Text, an LLM, and Text-to-Speech.

There are two main architectures for implementing a voice agent:

1. Speech-to-Speech

  • Low latency, near real-time
  • Less control and predictability
  • APIs used:
    • Realtime API

2. Realtime Speech-to-Text → LLM → Text-to-Speech (current approach)

  • More control and predictability, at the cost of added latency
  • APIs used:
    • Realtime API (transcription), Chat Completions API, Speech API

API Streaming Capabilities

| API | Input | Output | Streaming Support |
|-----|-------|--------|-------------------|
| Realtime API | Audio (stream) | Audio (stream) | ✅ Full duplex |
| Transcription | Audio (full) | Text (stream) | Output only |
| Chat Completions | Text (full) | Text (stream) | Output only |
| Speech | Text (full) | Audio (stream) | Output only |

Note: The Realtime API offers true full-duplex streaming, enabling the lowest latency. Our current implementation uses the STT → LLM → TTS flow, which introduces latency due to sequential processing.

Latency Optimizations

To reduce latency in the current pipeline:

  1. Realtime API transcription – transcribes the ongoing audio input as it arrives: browser microphone streams audio → our server forwards the stream → OpenAI Realtime transcription → text
  2. In-memory audio buffer – avoids file I/O delays
  3. Streaming audio playback – streams audio from OpenAI to the browser as it's generated
  4. PCM format – a raw, low-latency audio format that reduces encoding/decoding overhead and speeds up responses from the OpenAI API

Here we use the Realtime API's transcription mode in place of the Transcription API to reduce latency. See Latency Analysis for details.
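Optimization 2 above can be as simple as accumulating incoming PCM chunks in a `bytearray` until the end of speech. A minimal sketch (the class name and API are illustrative, not taken from this repo):

```python
class AudioBuffer:
    """Accumulates raw PCM chunks in memory, avoiding file I/O delays."""

    def __init__(self):
        self._chunks = bytearray()

    def append(self, chunk: bytes) -> None:
        """Add one incoming PCM chunk from the WebSocket."""
        self._chunks.extend(chunk)

    def flush(self) -> bytes:
        """Return everything buffered so far and reset the buffer."""
        data = bytes(self._chunks)
        self._chunks.clear()
        return data


# Demo: two chunks in, one contiguous byte string out.
buf = AudioBuffer()
buf.append(b"\x00\x01")
buf.append(b"\x02\x03")
audio = buf.flush()
```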

Tech Stack

| Component | Tool / API | Reason |
|-----------|------------|--------|
| Audio Format | PCM (streaming in/out) | Raw format ensures low-latency transmission |
| Transport | WebSocket | Enables async, bi-directional streaming |
| STT | gpt-4o-mini-transcribe (OpenAI) | Accurate, streaming-capable speech-to-text |
| LLM | gpt-4.1-mini (OpenAI ChatCompletion) | Low-latency, concise responses |
| TTS | gpt-4o-mini-tts (OpenAI) | Fast, natural-sounding speech with stream output |
| Backend | Python + FastAPI + WebSocket | Async-native server suitable for real-time apps |
| Client | HTML5 + JS + Web Audio API | Lightweight browser-based voice interaction |

WebSocket vs WebRTC

  • Client ↔ OpenAI: WebRTC (for native client integration)
  • Client ↔ Server ↔ OpenAI: WebSocket (for server-side control and architecture flexibility)

We use WebSocket for browser compatibility and easier server orchestration.
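With the Client ↔ Server ↔ OpenAI topology, the server's job is essentially to relay audio in both directions. As a dependency-free stand-in for the FastAPI/OpenAI WebSocket plumbing, the relay loop can be modeled with asyncio queues (all names here are illustrative):

```python
import asyncio


async def relay(source: asyncio.Queue, sink: asyncio.Queue) -> None:
    """Forward audio chunks from one side of the bridge to the other.
    A None sentinel marks end of stream."""
    while True:
        chunk = await source.get()
        await sink.put(chunk)
        if chunk is None:
            break


async def demo():
    # In the real server these would be a browser WebSocket
    # and an OpenAI WebSocket connection, respectively.
    client_to_server = asyncio.Queue()
    server_to_openai = asyncio.Queue()

    # The "client" sends two chunks, then closes the stream.
    for chunk in (b"aa", b"bb", None):
        await client_to_server.put(chunk)

    await relay(client_to_server, server_to_openai)

    out = []
    while not server_to_openai.empty():
        out.append(await server_to_openai.get())
    return out


result = asyncio.run(demo())
```

A real endpoint would run two such relays concurrently (uplink and downlink), which is why an async-native server like FastAPI fits here.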

Audio Format Considerations

| Format | Pros | Cons |
|--------|------|------|
| PCM | Fastest response, no decoding needed | Larger raw size |
| WAV | Similar to PCM but includes headers | Slightly larger |
| webm/Opus | Smaller size | Slower to decode, adds delay |
  • Preferred:
    • PCM: 24 kHz, 16-bit, mono, little-endian, for the Realtime API
    • WAV: input file for STT
  • Refer to OpenAI's supported formats
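The Web Audio API delivers microphone samples as 32-bit floats in [-1.0, 1.0], so they must be converted to 16-bit little-endian PCM before being sent to OpenAI. A stdlib-only sketch of that conversion (shown in Python for illustration; in the actual client this runs in browser JS):

```python
import struct


def float_to_pcm16(samples):
    """Convert float samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    # Scale to the int16 range and clamp to guard against clipping.
    ints = [max(-32768, min(32767, int(s * 32767))) for s in samples]
    # "<" = little-endian, "h" = signed 16-bit integer.
    return struct.pack("<%dh" % len(ints), *ints)


pcm = float_to_pcm16([0.0, 0.5, -0.5, 1.0])
```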

Streaming vs Real-time

Streaming ≠ Real-time

| Capability | Description |
|------------|-------------|
| Real-time STT | Audio streamed in → text streamed out as it's recognized |
| Real-time TTS | Text streamed in → audio streamed out as it's synthesized |
| Chat LLM | Text streamed out as the response is generated (token by token) |

For true real-time, the system must handle stream-in and stream-out simultaneously across STT, LLM, and TTS.
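Chaining stream-in/stream-out stages is natural with async generators: each stage consumes items as the previous stage yields them, so all three stages run concurrently instead of waiting for full inputs. A toy sketch with stand-in stages (the real stages would call the OpenAI APIs):

```python
import asyncio


async def stt(audio_chunks):
    """Toy STT stand-in: yields a word per audio chunk as it arrives."""
    async for chunk in audio_chunks:
        yield chunk.decode()


async def llm(words):
    """Toy LLM stand-in: yields tokens while upstream text is still streaming."""
    async for word in words:
        yield word.upper()


async def tts(tokens):
    """Toy TTS stand-in: yields audio bytes per token."""
    async for token in tokens:
        yield token.encode()


async def mic():
    """Simulated microphone stream."""
    for chunk in (b"hello", b"world"):
        yield chunk


async def pipeline():
    out = []
    # Stages are chained: each pulls from the previous as data flows.
    async for audio in tts(llm(stt(mic()))):
        out.append(audio)
    return out


result = asyncio.run(pipeline())
```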

System Architecture

Client (Mic Input)
     ↓ WebSocket
Audio Stream
     ↓
Server
     ↓
Audio Stream
     ↓
[Real-time API/Transcription] GPT-4o mini Transcribe
     ↓
[Chat Completions API] GPT-4.1 mini (1-word response)
     ↓
[Speech API] GPT-4o mini tts
     ↓ WebSocket
Audio Stream
     ↓
Streaming audio playback on browser

Getting Started

Requirements

  • Python 3.10+

Install dependencies

pip install -r requirements.txt
pip install -r requirements-dev.txt

Setup .env

Place your .env file with OPENAI_API_KEY at the project root.
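The `.env` file only needs to expose `OPENAI_API_KEY` to the process. Most setups use python-dotenv for this; purely for illustration, a minimal stdlib loader might look like the following (the demo uses a throwaway `DEMO_API_KEY` so it never touches a real key):

```python
import os
import tempfile


def load_env(path=".env"):
    """Minimal .env loader: put KEY=VALUE lines into os.environ.
    Existing environment variables are not overridden."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())


# Demo with a throwaway file and a throwaway variable name.
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write("DEMO_API_KEY=sk-test\n")
load_env(f.name)
key = os.environ.get("DEMO_API_KEY")
```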

Run Development Server

uvicorn main:app --reload

Run Production Server

python main.py

Access Client

Open your browser at http://127.0.0.1:8000/client

Example: Test Real-time Transcription

Uses OpenAI Realtime API directly.

python -m examples.test_realtime_transcribe_client

Demo

Watch the video

Implementation Checklist

  • Audio WebSocket Server
    • Buffer audio input
    • Serve static client (HTML + JS) for demo
  • Speech-to-Text (Realtime API/STT)
    • Transcribe complete audio input
    • Transcribe ongoing audio input, with voice activity detection (VAD)
  • LLM Response (GPT-4.1 mini)
  • Text-to-Speech (TTS)
    • Send full audio response
    • Stream partial audio for long responses
  • Browser Client
    • Record mic audio and send via WebSocket (PCM16, 24kHz, mono)
    • Play back PCM audio in real-time
    • Use voice activity detection (VAD) to replace manual recording
  • [ ] Concurrent load testing
  • [ ] Decouple STT / LLM / TTS into microservices

Latency Analysis

Example 1: Short speech

2025-05-21 15:40:19.353 - INFO - HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"
2025-05-21 15:40:19.450 - INFO - 
[Speech detected]
2025-05-21 15:40:19.936 - INFO - 
[Speech ended]
2025-05-21 15:40:20.767 - INFO - 
[Transcription completed]
transcribed_text: How are you?
2025-05-21 15:40:20.767 - INFO - LLM start...
2025-05-21 15:40:21.265 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-21 15:40:21.323 - INFO - llm_response_text: Fine.
2025-05-21 15:40:21.323 - INFO - TTS start...
2025-05-21 15:40:21.906 - INFO - HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"

Audio ended → STT: ~0.8s. LLM Response: ~0.5s. TTS Start to First Output: ~0.6s.

Example 2: Longer speech

2025-05-21 15:50:38.447 - INFO - 
[Speech detected]
2025-05-21 15:50:47.164 - INFO - 
[Speech ended]
2025-05-21 15:50:49.435 - INFO - 
[Transcription completed]
transcribed_text: Could you tell me about your feeling if the weather is bad or if it's a cloudy day, how will you feel? Will you feel good or will you feel bad?
2025-05-21 15:50:49.435 - INFO - LLM start...
2025-05-21 15:50:49.953 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2025-05-21 15:50:49.968 - INFO - llm_response_text: Neutral
2025-05-21 15:50:49.968 - INFO - TTS start...
2025-05-21 15:50:50.527 - INFO - HTTP Request: POST https://api.openai.com/v1/audio/speech "HTTP/1.1 200 OK"

Audio ended → STT: ~2.3s (shorter than the ~4s taken by the Transcription API). LLM Response: ~0.6s. TTS Start to First Output: ~0.6s.

The STT stage is thus the primary latency bottleneck, especially for longer speech inputs.
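The per-stage numbers above fall straight out of the log timestamps. For the short-speech run, for example:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"  # timestamp format used in the logs above


def delta_s(start: str, end: str) -> float:
    """Seconds elapsed between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Example 1 from the logs above:
stt = delta_s("2025-05-21 15:40:19.936", "2025-05-21 15:40:20.767")  # speech ended -> transcription done
llm = delta_s("2025-05-21 15:40:20.767", "2025-05-21 15:40:21.323")  # LLM start -> response text
tts = delta_s("2025-05-21 15:40:21.323", "2025-05-21 15:40:21.906")  # TTS start -> first HTTP 200
```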

Further Improvements

Reduce Latency

  • Host closer to OpenAI (e.g., in Azure)
  • Deploy on-prem STT/TTS models (e.g., Whisper, Coqui)
  • Use true real-time services by leveraging streaming across the whole data pipeline:
    • Streaming STT input → Streaming LLM → Streaming TTS

Scalability

  • Split system into microservices for STT, LLM, and TTS

Security

  • Use SSL for secure audio input/output transport

Functionality

  • Refactor the project
  • Add test cases
  • Handle HTTP errors
    • Return HTTP 400 for missing/invalid input
    • Return HTTP 502 with meaningful errors for STT, LLM, or TTS failures
  • Handle edge cases and failures in WebSocket messages
  • Use state machine for fine-grained control over STT/LLM/TTS pipeline
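The state-machine idea in the last bullet can be sketched with an `Enum` plus a transition table; the states and allowed transitions below are illustrative, not the project's actual design:

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()      # buffering mic audio, waiting for VAD
    TRANSCRIBING = auto()   # STT in progress
    THINKING = auto()       # LLM generating a response
    SPEAKING = auto()       # TTS streaming audio back

# Legal transitions for the STT -> LLM -> TTS pipeline.
TRANSITIONS = {
    State.LISTENING: {State.TRANSCRIBING},
    State.TRANSCRIBING: {State.THINKING, State.LISTENING},  # back on empty transcript
    State.THINKING: {State.SPEAKING},
    State.SPEAKING: {State.LISTENING},
}


def advance(current: State, target: State) -> State:
    """Move to the target state, rejecting illegal transitions."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


s = advance(State.LISTENING, State.TRANSCRIBING)
```

Rejecting illegal transitions makes it easier to handle WebSocket edge cases, e.g. audio arriving while a reply is still being spoken.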

Bonus

We can build a fully local desktop voice agent using OpenAI's Realtime API with full streaming support.

Conclusion

This project demonstrates how to prototype a real-time voice assistant using OpenAI's APIs with low-latency audio streaming, simplified architecture, and scalable components.
