A complete on-device voice AI pipeline for Linux (Raspberry Pi 5, x86_64, ARM64). All inference runs locally — no cloud, no API keys.
Pipeline: Wake Word -> VAD -> STT -> LLM -> TTS
Microphone (ALSA)
│
▼
Wake Word Detection (openWakeWord / "Hey Jarvis") [optional]
│
▼
Voice Activity Detection (Silero VAD)
│ Buffers speech, detects silence timeout
▼
Speech-to-Text (Whisper Tiny EN)
│
▼
Large Language Model (Qwen2.5 0.5B Q4)
│
▼
Text-to-Speech (Piper Lessac Medium)
│
▼
Speaker (ALSA)
linux-voice-assistant/
├── src/
│ ├── main.cpp # Entry point, CLI parsing, main loop
│ ├── audio/
│ │ ├── audio_capture.h/cpp # ALSA mic input (16kHz, 16-bit PCM, mono)
│ │ └── audio_playback.h/cpp # ALSA speaker output (multi-rate)
│ ├── pipeline/
│ │ └── voice_pipeline.h/cpp # Full pipeline: VAD -> STT -> LLM -> TTS
│ └── config/
│ └── model_config.h # Model paths, IDs, availability checks
├── tests/
│ └── test_pipeline.cpp # Feed WAV file through pipeline (no mic needed)
├── scripts/
│ └── download-models.sh # Download all required models
├── CMakeLists.txt # Build configuration
├── build.sh # End-to-end build script
└── README.md
- Linux (Raspberry Pi 5, Ubuntu, Debian, etc.)
- CMake 3.16+
- C++17 compiler (g++ or clang++)
- ALSA development headers:
sudo apt install libasound2-dev
# 1. Build everything (SDK + models + assistant)
./build.sh
# 2. Run the voice assistant
./build/voice-assistant
# With wake word detection:
./build/voice-assistant --wakeword# Step 1: Download Sherpa-ONNX
cd ../../sdk/runanywhere-commons
./scripts/linux/download-sherpa-onnx.sh
# Step 2: Build runanywhere-commons
./scripts/build-linux.sh --shared
# Step 3: Download models
cd ../../Playground/linux-voice-assistant
./scripts/download-models.sh
# Step 4: Build
mkdir -p build && cd build
cmake ..
cmake --build . -j$(nproc)
# Step 5: Run
./voice-assistant| Component | Model | Size | Framework |
|---|---|---|---|
| VAD | Silero VAD | ~2 MB | ONNX |
| STT | Whisper Tiny EN | ~150 MB | ONNX (Sherpa) |
| LLM | Qwen2.5 0.5B Q4 | ~500 MB | llama.cpp |
| TTS | Piper Lessac Medium | ~65 MB | ONNX (Sherpa) |
| Wake Word | openWakeWord "Hey Jarvis" | ~20 MB | ONNX |
Download models:
# Required models (VAD, STT, LLM, TTS)
./scripts/download-models.sh
# Optional: Wake word model
./scripts/download-models.sh --wakeword
# Select a different LLM:
./scripts/download-models.sh --model qwen3-1.7b
./scripts/download-models.sh --model llama-3.2-3b
./scripts/download-models.sh --model qwen3-4b# Basic usage (always listening)
./build/voice-assistant
# With wake word ("Hey Jarvis" to activate)
./build/voice-assistant --wakeword
# Select audio devices
./build/voice-assistant --list-devices
./build/voice-assistant --input hw:1,0 --output hw:0,0
# Test pipeline with a WAV file (no microphone needed)
./build/test-pipeline path/to/audio.wav- ALSA-based microphone input
- 16kHz, 16-bit PCM, mono (optimal for STT)
- Threaded capture with callback delivery
- Device enumeration support
- ALSA-based speaker output
- Dynamic sample rate reinitialization (22050Hz TTS, 16kHz, etc.)
- Underrun recovery
- Wake Word Detection — openWakeWord ONNX with "Hey Jarvis" model
- Voice Activity Detection — Silero VAD with silence timeout (1.5s)
- Speech-to-Text — Whisper Tiny EN via
rac_voice_agent_transcribe - LLM Response — Local inference via
rac_voice_agent_process_voice_turn - Text-to-Speech — Piper neural TTS via
rac_voice_agent_synthesize_speech
- Hardcoded model IDs and paths for predictable behavior
- Model availability checking before pipeline initialization
- Base directory:
~/.local/share/runanywhere/Models/
"ALSA: Cannot open audio device"
- Check available devices:
aplay -l(output) andarecord -l(input) - Try specifying a device:
--input hw:1,0
"Models are missing"
- Run
./scripts/download-models.shto download all required models - For wake word:
./scripts/download-models.sh --wakeword
No audio output
- Check volume:
alsamixer - Verify output device:
speaker-test -D default -c 2
Slow LLM response on Raspberry Pi
- Use a smaller model:
./scripts/download-models.sh --model qwen3-0.6b - Ensure adequate cooling (throttling reduces performance)