12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,18 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- **Barge-In: Interrupt TTS Playback** (VM-606, GH-211)
- Users can interrupt TTS playback by speaking, enabling natural conversation flow
- `VOICEMODE_BARGE_IN=true` enables the feature (opt-in, default: false)
- `VOICEMODE_BARGE_IN_VAD` controls detection sensitivity (0-3, default: 2)
- `VOICEMODE_BARGE_IN_MIN_MS` sets minimum speech threshold (default: 150ms)
- Captured speech is passed directly to STT for seamless conversation
- Works with both buffered and streaming TTS modes
- Requires `webrtcvad` library (auto-installed with VoiceMode)
- Target latency: <100ms from voice onset to TTS stop

## [8.1.0] - 2026-02-02

### Added
80 changes: 80 additions & 0 deletions docs/concepts/architecture.md
@@ -117,6 +117,86 @@ Text → TTS Service → Audio Stream → Format Conversion → Speaker
3. **Format Conversion**: FFmpeg handles formats
4. **Playback**: PyAudio for speaker output

### Barge-In (TTS Interruption)

Barge-in enables natural conversation by allowing users to interrupt TTS playback:

```
TTS Playing ──┬── BargeInMonitor ──→ Voice Detected ──→ Interrupt Player
              │         │                                      │
              │   (VAD Analysis)                         (Stop Playback)
              │         │                                      │
              └─────────┴──── Captured Audio ──→ STT ──→ Response
```

**Components:**

1. **BargeInMonitor** (`barge_in.py`): Monitors microphone during TTS
- Uses WebRTC VAD for speech detection
- Captures audio buffer from voice onset
- Fires interrupt callback when speech threshold met

2. **NonBlockingAudioPlayer**: Extended with interrupt support
- `interrupt()` method stops playback immediately
- `was_interrupted()` indicates barge-in occurred
- Clean resource shutdown on interrupt

3. **Conversation Flow Integration**:
- Monitor starts when TTS playback begins
- On voice detection: TTS stops, captured audio flows to STT
- Listening chime skipped (user already speaking)
- Normal conversation continues with interrupted speech
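
The detection step in the component list above can be sketched as follows. This is a minimal illustration, not the code in `barge_in.py`: the class name `SpeechOnsetDetector`, its `feed()` method, and the injected `is_speech` predicate are all hypothetical, and the 20ms frame length is an assumption taken from the latency breakdown in this document.

```python
from typing import Callable, List, Optional

FRAME_MS = 20  # assumed frame length; not confirmed against barge_in.py


class SpeechOnsetDetector:
    """Hypothetical stand-in for the detection logic of BargeInMonitor."""

    def __init__(
        self,
        is_speech: Callable[[bytes], bool],
        min_speech_ms: int = 150,
        on_interrupt: Optional[Callable[[], None]] = None,
    ) -> None:
        self.is_speech = is_speech          # VAD predicate, injected by the caller
        self.min_speech_ms = min_speech_ms  # mirrors VOICEMODE_BARGE_IN_MIN_MS
        self.on_interrupt = on_interrupt    # e.g. the player's interrupt method
        self.captured: List[bytes] = []     # audio kept from voice onset
        self.triggered = False

    def feed(self, frame: bytes) -> bool:
        """Process one audio frame; return True once barge-in has triggered."""
        if self.triggered:
            self.captured.append(frame)  # keep recording for STT
            return True
        if self.is_speech(frame):
            self.captured.append(frame)
            if len(self.captured) * FRAME_MS >= self.min_speech_ms:
                self.triggered = True
                if self.on_interrupt is not None:
                    self.on_interrupt()
        else:
            # Silence before the threshold is met discards the partial capture.
            self.captured.clear()
        return self.triggered
```

In a real setup the predicate could wrap `webrtcvad.Vad(mode).is_speech(frame, sample_rate)`, which accepts 10, 20, or 30ms frames of 16-bit mono PCM at 8, 16, 32, or 48kHz. Injecting the predicate keeps the threshold logic testable without a microphone.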

**Configuration:**
- `VOICEMODE_BARGE_IN=true` enables the feature
- `VOICEMODE_BARGE_IN_VAD` controls detection sensitivity (0-3)
- `VOICEMODE_BARGE_IN_MIN_MS` sets minimum speech duration threshold

**Performance Target:** <100ms from voice onset to TTS stop

### Barge-In Performance Characteristics

Measured performance characteristics from automated testing:

| Metric | Average | Max | Target |
|--------|---------|-----|--------|
| Interrupt callback latency | <5ms | <10ms | <50ms |
| Voice onset to TTS stop | <20ms | <50ms | <100ms |
| VAD check per chunk | <5ms | <20ms | - |
| Buffer append operation | <1ms | <10ms | - |
| Cross-thread interrupt latency | <20ms | <50ms | - |

**Latency Breakdown:**

The total latency from when the user starts speaking to when TTS stops consists of:

1. **VAD Processing** (~10-20ms): WebRTC VAD analyzes 20ms audio chunks
2. **Speech Threshold** (configurable, default 150ms): Minimum speech duration to confirm intentional interruption
3. **Callback Invocation** (<5ms): Signaling from monitor to player
4. **Player Stop** (<5ms): Stopping audio output stream

Note: The 150ms speech threshold is a deliberate guard against false positives and is not counted as system latency. System latency, measured from confirmed speech detection to TTS stop, is typically under 50ms.
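
As a rough worked example of the breakdown above, the confirmation delay can be computed from the threshold and the frame length. The helper names are hypothetical, and the sketch assumes the threshold is evaluated on whole 20ms frames, which may not match the actual implementation.

```python
import math


def frames_to_confirm(min_speech_ms: int, frame_ms: int = 20) -> int:
    """Number of whole speech frames needed to reach the threshold."""
    return math.ceil(min_speech_ms / frame_ms)


def confirmation_delay_ms(min_speech_ms: int, frame_ms: int = 20) -> int:
    """Time from voice onset until the threshold is met, in milliseconds."""
    return frames_to_confirm(min_speech_ms, frame_ms) * frame_ms


if __name__ == "__main__":
    # Default threshold of 150ms with 20ms frames.
    print(frames_to_confirm(150), confirmation_delay_ms(150))
```

Under these assumptions the default 150ms threshold rounds up to 8 frames, or 160ms of confirmed speech, before the system latency described above begins.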

**CPU Overhead:**

- BargeInMonitor objects are lightweight (~1KB memory footprint)
- VAD checking sustains 50+ checks per second (one per 20ms chunk) without becoming a bottleneck
- Audio buffer operations are O(1) with lock protection
- Background thread has minimal impact during idle periods

**Memory Usage:**

- Audio buffer grows linearly with captured speech duration
- 5 seconds of captured audio at 24kHz, 16-bit (single channel): ~240KB
- Buffers are cleared on silence (when barge-in hasn't triggered)
- Memory is released when monitor is stopped
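
The buffer size quoted above follows from simple PCM arithmetic. The helper below is a hypothetical illustration that assumes uncompressed single-channel audio.

```python
def pcm_bytes(
    seconds: float,
    sample_rate: int = 24_000,
    sample_width_bytes: int = 2,  # 16-bit samples
    channels: int = 1,            # assumption: mono capture
) -> int:
    """Size in bytes of uncompressed PCM audio of the given duration."""
    return int(seconds * sample_rate * sample_width_bytes * channels)


if __name__ == "__main__":
    print(pcm_bytes(5))  # 5 s * 24000 samples/s * 2 bytes = 240000 bytes
```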

**Thread Safety:**

- All buffer operations are protected by `threading.Lock`
- Events use `threading.Event` for signal coordination
- Callback invocation is thread-safe across monitoring and playback threads
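
A minimal sketch of the locking pattern described above, using only the standard library. The class `CapturedAudioBuffer` and its method names are hypothetical and are not taken from the VoiceMode source.

```python
import threading
from typing import List


class CapturedAudioBuffer:
    """Hypothetical lock-protected buffer shared by monitor and playback threads."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._chunks: List[bytes] = []
        # Set by the monitoring thread, awaited by whoever stops playback.
        self.speech_detected = threading.Event()

    def append(self, chunk: bytes) -> None:
        with self._lock:
            self._chunks.append(chunk)  # amortized O(1)

    def clear(self) -> None:
        with self._lock:
            self._chunks.clear()

    def snapshot(self) -> bytes:
        """Return a copy of everything captured so far."""
        with self._lock:
            return b"".join(self._chunks)
```

Storing chunks in a list and joining only on read keeps each append cheap, which matches the O(1) buffer operations noted in the CPU overhead list.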

## Service Architecture

### Service Lifecycle
30 changes: 30 additions & 0 deletions docs/guides/configuration.md
@@ -248,6 +248,36 @@ VOICEMODE_EVENT_LOG=false # Log all events
VOICEMODE_CONVERSATION_LOG=false # Log conversations
```

### Barge-In (Interrupt TTS)

Barge-in allows users to interrupt TTS playback by speaking. When enabled, VoiceMode monitors the microphone during TTS and stops playback as soon as voice activity has been sustained for the configured minimum duration, allowing natural conversational flow.

```bash
# Enable barge-in feature (default: false, opt-in)
VOICEMODE_BARGE_IN=true

# VAD aggressiveness for barge-in detection (0-3)
# 0: Very permissive - triggers easily, may have false positives
# 1: Permissive - good for quiet environments
# 2: Moderate - balanced for most environments (default)
# 3: Aggressive - only triggers on clear speech
VOICEMODE_BARGE_IN_VAD=2

# Minimum speech duration in milliseconds before triggering (default: 150)
# Higher values prevent false positives from brief sounds
VOICEMODE_BARGE_IN_MIN_MS=150
```
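
For illustration, the three variables above could be read and validated as shown below. This is a sketch, not VoiceMode's actual configuration loader: `BargeInSettings` and `load_barge_in_settings` are hypothetical names, and the set of accepted truthy strings is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class BargeInSettings:
    enabled: bool = False
    vad_aggressiveness: int = 2
    min_speech_ms: int = 150


def load_barge_in_settings(env: Mapping[str, str] = os.environ) -> BargeInSettings:
    """Parse the barge-in variables, falling back to the documented defaults."""
    # Assumption: these strings count as "true"; the real parser may differ.
    raw_enabled = env.get("VOICEMODE_BARGE_IN", "false").strip().lower()
    enabled = raw_enabled in {"1", "true", "yes", "on"}

    vad = int(env.get("VOICEMODE_BARGE_IN_VAD", "2"))
    if not 0 <= vad <= 3:
        raise ValueError(f"VOICEMODE_BARGE_IN_VAD must be 0-3, got {vad}")

    min_ms = int(env.get("VOICEMODE_BARGE_IN_MIN_MS", "150"))
    if min_ms < 0:
        raise ValueError(f"VOICEMODE_BARGE_IN_MIN_MS must be >= 0, got {min_ms}")

    return BargeInSettings(enabled, vad, min_ms)
```

Rejecting out-of-range values early avoids passing an unsupported aggressiveness level to the VAD at runtime.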

**Requirements and behavior:**
- The `webrtcvad` library is required (installed automatically with VoiceMode)
- Works with both buffered and streaming TTS modes
- Captured speech is passed directly to STT for seamless conversation

**Use cases:**
- Natural conversation flow without waiting for TTS to finish
- Quick corrections or interjections
- Time-sensitive interactions

### Development Settings

```bash
42 changes: 42 additions & 0 deletions docs/reference/converse-parameters.md
@@ -177,6 +177,48 @@ Skip text-to-speech, show text only.
- When voice isn't needed
- Text-only mode

## Barge-In (TTS Interruption)

Barge-in allows users to interrupt TTS playback by speaking, enabling more natural conversation flow.

### Enabling Barge-In

Barge-in is controlled by environment variables, not converse parameters:

```bash
# Enable barge-in (default: false)
export VOICEMODE_BARGE_IN=true

# VAD aggressiveness (0-3, default: 2)
export VOICEMODE_BARGE_IN_VAD=2

# Minimum speech duration in ms (default: 150)
export VOICEMODE_BARGE_IN_MIN_MS=150
```

### How It Works

1. When TTS playback starts, VoiceMode monitors the microphone
2. WebRTC VAD analyzes audio for speech activity
3. When voice is detected and sustained past the threshold:
- TTS playback stops immediately
- Captured speech (from voice onset) is passed to STT
- Listening chime is skipped (user is already speaking)
- Conversation continues normally
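
The hand-off in step 3 can be sketched as a small decision function. `FakePlayer` and `plan_listen_step` are hypothetical; only the method names `interrupt()` and `was_interrupted()` come from the architecture notes in this change, and their real signatures are assumed here.

```python
from dataclasses import dataclass


class FakePlayer:
    """Hypothetical stand-in exposing the two player methods named in the docs."""

    def __init__(self) -> None:
        self._interrupted = False

    def interrupt(self) -> None:
        self._interrupted = True

    def was_interrupted(self) -> bool:
        return self._interrupted


@dataclass(frozen=True)
class ListenPlan:
    play_chime: bool    # whether to play the listening chime
    stt_prefix: bytes   # audio already captured from voice onset


def plan_listen_step(player: FakePlayer, captured_audio: bytes) -> ListenPlan:
    """Decide how the listening phase starts after TTS playback ends."""
    if player.was_interrupted():
        # The user is already speaking: skip the chime and reuse the capture.
        return ListenPlan(play_chime=False, stt_prefix=captured_audio)
    return ListenPlan(play_chime=True, stt_prefix=b"")
```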

### Requirements

- `webrtcvad` library (installed automatically)
- `wait_for_response=true` (default)
- TTS not skipped via `skip_tts`

### Tuning Tips

- **False positives** (TTS stops randomly): Increase `VOICEMODE_BARGE_IN_VAD` (try 3) or `VOICEMODE_BARGE_IN_MIN_MS` (try 200-300)
- **Slow response**: Decrease `VOICEMODE_BARGE_IN_MIN_MS` (try 100)
- **Quiet environment**: Lower `VOICEMODE_BARGE_IN_VAD` (try 1)
- **Noisy environment**: Raise `VOICEMODE_BARGE_IN_VAD` (try 3)

## Endpoint Requirements

STT/TTS services must expose OpenAI-compatible endpoints: