Skip to content

feat: add barge-in interrupt detection#268

Closed
MatiasComercio wants to merge 1 commit intombailey:masterfrom
MatiasComercio:feat/barge-in-vad-fixes
Closed

feat: add barge-in interrupt detection#268
MatiasComercio wants to merge 1 commit intombailey:masterfrom
MatiasComercio:feat/barge-in-vad-fixes

Conversation

@MatiasComercio
Copy link
Copy Markdown

Summary

This PR adds Barge-in (Interrupt Detection): Users can interrupt TTS playback by speaking, with echo suppression and ring buffer to capture pre-trigger audio

These changes enable natural conversation flow where users can interrupt the AI's response without waiting, and recordings stop automatically after silence.

Motivation

  • No capability to interrupt TTS playback (must wait for full response)

Changes

Core Functionality Added

1. Barge-in System

  • DuplexBargeInPlayer class for full-duplex audio monitoring
  • Monitors microphone during TTS playback
  • Echo suppression (input must be 30% louder than output)
  • Ring buffer captures 500ms pre-trigger audio to prevent lost syllables
  • Configurable energy threshold (default: 300)
  • Minimum speech duration (default: 20ms)

2. Streaming Layer Updates

  • Added skip_playback parameter throughout streaming stack
  • Allows TTS generation without playback (for interrupted audio)
  • Preserves audio files for potential replay

Files Modified

File                          Added  Removed  Total
voice_mode/config.py              6        2      8
voice_mode/core.py                8        3     11
voice_mode/streaming.py          46       33     79
voice_mode/tools/converse.py    284       32    316
────────────────────────────────────────────────
Total:                          344       70    414

voice_mode/config.py (8 lines)

  • Added BARGE_IN_ENERGY_THRESHOLD = 300
  • Added BARGE_IN_MIN_SPEECH_MS = 20

voice_mode/streaming.py (79 lines)

  • Added skip_playback parameter to stream_pcm_audio()
  • Added skip_playback parameter to stream_tts_audio()
  • Added skip_playback parameter to stream_with_buffering()
  • Guarded all stream operations with skip_playback checks
  • Store audio path in metrics for barge-in replay

voice_mode/core.py (11 lines)

  • Added skip_playback parameter to text_to_speech()
  • Updated docstring
  • Pass parameter through to streaming layer

voice_mode/tools/converse.py (316 lines)

  • Added DuplexBargeInPlayer class (170+ lines of barge-in logic)
  • Added barge_in parameter to converse() function
  • Modified text_to_speech_with_failover() with force_save_audio and skip_playback
  • Integrated barge-in detection flow
  • Added audio concatenation (prepending barge-in audio to VAD recording)
  • Added conditional finished chime (skip if interrupted)

API Changes

New backward-compatible parameter in converse():

await converse(
    message="Hello, how can I help?",
    barge_in=True  # Enable interrupt detection
)

Default: barge_in=False (preserves existing behavior)

Environment Variables

Optional configuration via environment:

VOICEMODE_BARGE_IN_THRESHOLD=300      # Energy threshold for detection
VOICEMODE_BARGE_IN_MIN_SPEECH=20     # Minimum speech duration (ms)

Testing

Manual Testing Performed

  1. Basic VAD (Silence Detection)

    • Recording now stops automatically ~1s after user stops speaking
    • No manual interruption needed
    • Works with default settings
  2. Barge-in (TTS Interruption)

    • TTS stops within 300-600ms of user speaking
    • Ring buffer captures full utterance (no lost syllables)
    • Echo suppression prevents false triggers
  3. Backward Compatibility

    • Existing code works unchanged (barge_in defaults to False)
    • All new parameters have sensible defaults

Test Metrics

  • TTS Play Latency: 0.3-0.6s
  • Record Duration: 2-4s with automatic silence stop
  • Barge-in Latency: 300-600ms (threshold dependent)
  • Ring Buffer: Captures 500ms pre-trigger audio

Test Commands

# Enable debug logging
export VOICEMODE_DEBUG=true

# Test basic conversation (with barge-in enabled)
uv run voicemode converse

# Check audio debug files
ls ~/.voicemode/audio/

Automated Testing

All existing tests pass:

pytest
pytest --cov=voice_mode

Note: new tests for DuplexBargeInPlayer barge-in detection logic cover:

  • Energy threshold detection with synthetic data
  • Ring buffer size management (deque-based)
  • Echo suppression logic
  • Thread-safety of shared state
  • skip_playback parameter propagation through streaming functions

Voicemode Version

Patches were developed and tested against voicemode 8.1.0.

Checklist

  • Code follows project style conventions
  • Type hints added where appropriate
  • Functions documented with docstrings
  • Backward compatibility maintained (all new parameters optional)
  • Existing tests pass (pytest)
  • Coverage checks pass (pytest --cov=voice_mode)
  • Manual audio testing completed
  • Debug logging added for troubleshooting
  • Clear commit message following conventions
  • Changes organized into single logical commit

Dependencies

No new direct dependencies added to project requirements.

Platform Compatibility

Tested on macOS with standard audio hardware. Linux/Windows compatibility should be maintained as all audio operations use existing VoiceMode abstractions.

- Implement DuplexBargeInPlayer for real-time interrupt detection
- Add barge-in configuration constants and skip_playback parameter
- Add 23 unit tests covering energy detection, echo suppression, thread-safety
- Fix VAD silence detection with wall-clock timeout and reduced post-barge-in min_duration
- Update dependency to webrtcvad-wheels for easier installation

Applies to voicemode v8.1.0
@MatiasComercio
Copy link
Copy Markdown
Author

Just checking the opened PRs, the #238 is exactly the same. Closing and in case anything from here is useful ping me or use it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant