Spoken Dialogue Models Survey

A survey of spoken dialogue models (SDMs) with speech input and speech output.

The detailed description and the table can be found in the TiCo paper.

Intermediate Representation (IR)

Since the textless NLP paradigm (e.g., GSLM, dGSLM), it has been observed that directly modeling raw speech signals—even with phonetic or discrete speech tokens—remains challenging. Modern SDMs usually introduce an intermediate representation (IR) for semantic planning, most commonly in the form of text. In addition to the direct guidance of the spoken content, Text IR can be extended with:

(+R.) Reasoning (e.g., chain-of-thought, explicit thinking mode)
(+S.) Style guidance for the spoken response
(+Tool) Tool calling signals
(+Time) Spoken Time Tokens for time-controlled speech generation

Generation Pattern

The IR and speech tokens can be generated in different patterns, each with trade-offs in latency and speech–text conditioning:

Sequential — Text first, then speech tokens. Chunking can potentially enable streaming generation.
Parallel — Text and speech tokens generated simultaneously from shared representations. Delay patterns can improve quality.
Interleaved — Text and speech tokens produced in a single mixed sequence, allowing direct conditioning of speech on text.

Speech Representation

SDMs generate speech tokens that are decoded into waveforms by a vocoder or speech codec decoder:

Phonetic tokens — Quantized features from foundation speech encoders (e.g., HuBERT, Whisper). Capture linguistic content but lack speaker/style information; usually require a vocoder (e.g. Hifi-GAN) to condition on the speaker, style embedding for synthesis. Also called semantic tokens.
Acoustic tokens — Derived from neural speech codecs via residual vector quantization (RVQ). Can be directly decoded into waveforms. Recent work distills phonetic/semantic information into early layers to preserve linguistic structure.

Models

Model	Date	IR	Speech Rep.	Pattern	Link	Notes
TiCo (Qwen)	2026-03	Text (+Time)	Acoustic token	Sequential	2603.22267	Spoken Time Tokens enables time-controllable speech generation in SDMs
MiniCPM-o 4.5	2026-02	Text (+R.)	Acoustic token	Interleaved	MiniCPM-o	TDM full-duplex model. Supports visual modality. No directly verified model-specific arXiv page found.
PersonaPlex	2026-01	Text	Acoustic token	Parallel	2602.06053	Follows the Moshi architecture. Dual-channel full-duplex model.
Mimo-Audio	2025-12	Text (+R.)	Acoustic token	Interleaved	2512.23808	Interleaves text tokens and “audio patches”, including a delay pattern. (arXiv title uses “MiMo-Audio”.)
LFM2-Audio	2025-11	Text	Acoustic token	Interleaved	2511.23404	Supports both interleaved and sequential patterns, adapting to different tasks. Covered in the broader LFM2 Technical Report.
Stream RAG	2025-10	Text (+Tool)	Acoustic token	Sequential	2510.02044	Enables the SDM to trigger tool queries in parallel with the user’s speech. (arXiv title uses “Stream RAG”.)
SCoT	2025-10	Text	Acoustic token	Interleaved	2510.02066	CoT framework for SDMs. Blockwise streaming full-duplex model.
Moshi-CoT	2025-10	Text (+R.)	Acoustic token	Parallel	2510.07497	CoT-tuned Moshi performs text reasoning in the “text monologue” stream to enable the “thinking while listening” paradigm.
Qwen3-Omni	2025-09	Text (+R.)	Acoustic token	Sequential	2509.17765	Thinker-Talker architecture. Supports explicit thinking mode, tool calling, and visual modality.
STITCH	2025-07	Text (+R.)	Phonetic token	Interleaved	2507.15375	Backbone: GLM-4-Voice. The paper discusses multiple interleaving patterns among text, reasoning, and speech.
Step-Audio 2	2025-07	Text (+R., Tool)	Phonetic token	Interleaved	2507.16632	Uses multimodal RAG to support grounded responses and timbre/style control.
LLaMA-Omni 2	2025-05	Text	Phonetic token	Parallel	2505.02625	Gate fusion of LLM hidden states and text tokens for improved speech quality. (arXiv title uses “LLaMA-Omni2”.)
Kimi-Audio	2025-04	Text	Phonetic token	Parallel	2504.18425	Shared LLM with a text head and an audio head.
Qwen2.5-Omni	2025-03	Text	Acoustic token	Sequential	2503.20215	Thinker-Talker architecture. Supports visual modality.
Baichuan-Audio	2025-02	Text	Acoustic token	Interleaved	2502.17239	Text-guided speech generation with an independent audio head.
VITA-1.5	2025-01	Text	Acoustic token	Sequential	2501.01957	NAR + AR speech decoder taking the LLM embedding as input.
SLAM-Omni	2024-12	Text	Phonetic token	Parallel	2412.15649	“Semantic group modeling” enables generation of multiple phonetic tokens per text token.
GLM-4-Voice	2024-12	Text	Phonetic token	Interleaved	2412.02612	Single speech codebook paired with a flow-matching speech decoder.
Freeze-Omni	2024-11	Text	Acoustic token	Sequential	2411.00774	TDM-based full-duplex interaction; speech decoder conditioned on text tokens and LLM hidden states.
Mini-Omni2	2024-10	Text	Acoustic token	Parallel	2410.11190	Parallel decoding with a delay pattern.
SyncLLM	2024-09	—	Phonetic token	Direct	2409.15594	Interleaves user speech and model speech for full-duplex dialogue. (arXiv title uses “Synchronous LLMs as Full-Duplex Dialogue Agents”.)
LLaMA-Omni	2024-09	Text	Phonetic token	Parallel	2409.06666	CTC speech decoder maps LLM response states to phonetic tokens for streaming speech synthesis.
Moshi	2024-07	Text	Acoustic token	Parallel	2410.00037	Dual-channel full-duplex model. Parallel decoding of text and acoustic tokens with a delay pattern.
SPIRIT-LM	2024-02	Text (+S.)	Phonetic token	Interleaved	2402.05755	Interleaves text and speech in one stream; the expressive version adds pitch and style tokens. (arXiv title uses “Spirit LM”.)
SpeechGPT	2023-05	Text (+R.)	Phonetic token	Sequential	2305.11000	Uses “Chain-of-Modality”. Expands the LLM vocabulary with phonetic tokens.
dGSLM	2022-03	—	Phonetic token	Direct	2203.16502	“Dual-tower” architecture for dual-channel full-duplex modeling. Direct modeling of two-channel phonetic tokens.

Citation

This survey originated from our paper, TiCo: Time-Controllable Training for Spoken Dialogue Models, which introduces Spoken Time Tokens into the intermediate representation of SDMs to enable explicit time controllability.

If you find this repo useful, please cite:

@article{chang2026tico,
      title={TiCo: Time-Controllable Training for Spoken Dialogue Models},
      author={Kai-Wei Chang and Wei-Chih Chen and En-Pei Hu and Hung-yi Lee and James Glass},
      journal={arXiv preprint arXiv:2603.22267},
      year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
assets		assets
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spoken Dialogue Models Survey

Intermediate Representation (IR)

Generation Pattern

Speech Representation

Models

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Spoken Dialogue Models Survey

Intermediate Representation (IR)

Generation Pattern

Speech Representation

Models

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages