Description
Describe the bug
The speechinterface MCP extension generates phantom transcriptions from silence and background noise. Instead of returning an empty result when no speech is present, it hallucinates text in multiple languages, treating silence and ambient noise as speech input and producing transcriptions of words that were never spoken.
To Reproduce
Steps to reproduce the behavior:
- Enable the 'speechinterface' extension in Goose Desktop settings
- Activate voice input/microphone feature
- Remain silent or allow only background/ambient noise to be captured
- Observe that the system generates transcribed text in various languages despite no actual speech
Expected behavior
When there is only silence or background noise, the transcription should be empty or return no result. The system should distinguish between actual speech and silence/noise, and only generate transcriptions when real speech is detected.
Please provide the following information
- OS & Arch: [macOS (Apple Silicon)]
- Interface: [UI (Goose Desktop)]
- Version: [1.21.1]
- Extensions enabled: [speechinterface, developer, computercontroller, memory, databricks, websearch]
- Provider & Model: [Databricks → goose-claude-4-5-sonnet]
Additional context
This appears to be related to the underlying speech-to-text model (likely OpenAI Whisper) generating transcriptions even from non-speech audio. This is a well-documented behavior of Whisper models, which "hallucinate" text when given silence. The usual mitigation is to run Voice Activity Detection (VAD) before sending audio to the transcription model, so that only segments containing actual speech are processed. Without VAD, the model attempts to transcribe silence and produces random text, often in multiple languages.
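To illustrate the kind of gate VAD provides, here is a minimal, self-contained sketch of an energy-based silence filter over 16-bit PCM frames. This is not the extension's actual code, and the function names and the `threshold` value are hypothetical; a production fix would more likely use a trained VAD (e.g. WebRTC VAD or Silero VAD) rather than a raw energy threshold, but the gating principle is the same: frames below the speech threshold never reach the transcription model.

```python
import math
import struct

def rms(frame_bytes):
    """Root-mean-square energy of a frame of 16-bit little-endian mono PCM."""
    samples = struct.unpack("<%dh" % (len(frame_bytes) // 2), frame_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame_bytes, threshold=500.0):
    """Crude energy-based VAD: frames below the threshold count as silence.
    The threshold value here is an illustrative assumption, not a tuned default."""
    return rms(frame_bytes) > threshold

def frames_to_transcribe(frames, threshold=500.0):
    """Gate: only frames that pass the VAD check go on to the STT model."""
    return [f for f in frames if is_speech(f, threshold)]

# Synthetic demo: 160 samples (10 ms at 16 kHz) of silence vs. a loud
# 440 Hz tone standing in for speech-level energy.
silence = struct.pack("<160h", *([0] * 160))
tone = struct.pack(
    "<160h",
    *(int(8000 * math.sin(2 * math.pi * 440 * i / 16000)) for i in range(160)),
)

print(is_speech(silence))  # False -> frame is dropped, nothing is transcribed
print(is_speech(tone))     # True  -> frame is forwarded to the STT model
```

With this gate in place, a capture session containing only silence yields an empty frame list, so the transcription model is never invoked and cannot hallucinate text.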