Description
Describe the bug
The speechinterface MCP extension generates phantom transcriptions from silence and background noise. Instead of returning an empty result when no speech is present, it hallucinates text in multiple languages, treating silence and ambient noise as speech input and producing transcriptions of words that were never spoken.
To Reproduce
Steps to reproduce the behavior:
- Enable the 'speechinterface' extension in Goose Desktop settings
- Activate voice input/microphone feature
- Remain silent or allow only background/ambient noise to be captured
- Observe that the system generates transcribed text in various languages despite no actual speech
Expected behavior
When there is only silence or background noise, the transcription should be empty or return no result. The system should distinguish between actual speech and silence/noise, and only generate transcriptions when real speech is detected.
Please provide the following information
- OS & Arch: [macOS (Apple Silicon)]
- Interface: [UI (Goose Desktop)]
- Version: [1.21.1]
- Extensions enabled: [speechinterface, developer, computercontroller, memory, databricks, websearch]
- Provider & Model: [Databricks → goose-claude-4-5-sonnet]
Additional context
This appears to be related to the underlying speech-to-text model (likely OpenAI Whisper) generating transcriptions even from non-speech audio. This is a well-documented behavior of Whisper models, which "hallucinate" text when given silence. The usual mitigation is to run Voice Activity Detection (VAD) before sending audio to the transcription model, so that only segments containing actual speech are processed. Without VAD, the model attempts to transcribe silence and produces random text, often in multiple languages.
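To illustrate the kind of gate VAD provides, here is a minimal, self-contained sketch of an energy-based silence filter over 16-bit PCM frames. This is not the extension's actual code, and the function names and the `threshold` value are hypothetical; a production fix would more likely use a trained VAD (e.g. WebRTC VAD or Silero VAD) rather than a raw energy threshold, but the gating principle is the same: frames below the speech threshold never reach the transcription model.

```python
import math
import struct

def rms(frame_bytes):
    """Root-mean-square energy of a frame of 16-bit little-endian mono PCM."""
    samples = struct.unpack("<%dh" % (len(frame_bytes) // 2), frame_bytes)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame_bytes, threshold=500.0):
    """Crude energy-based VAD: frames below the threshold count as silence.
    The threshold value here is an illustrative assumption, not a tuned default."""
    return rms(frame_bytes) > threshold

def frames_to_transcribe(frames, threshold=500.0):
    """Gate: only frames that pass the VAD check go on to the STT model."""
    return [f for f in frames if is_speech(f, threshold)]

# Synthetic demo: 160 samples (10 ms at 16 kHz) of silence vs. a loud
# 440 Hz tone standing in for speech-level energy.
silence = struct.pack("<160h", *([0] * 160))
tone = struct.pack(
    "<160h",
    *(int(8000 * math.sin(2 * math.pi * 440 * i / 16000)) for i in range(160)),
)

print(is_speech(silence))  # False -> frame is dropped, nothing is transcribed
print(is_speech(tone))     # True  -> frame is forwarded to the STT model
```

With this gate in place, a capture session containing only silence yields an empty frame list, so the transcription model is never invoked and cannot hallucinate text.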