feat: text-to-music mode (generate from prompt alone, no input audio) by leszko · Pull Request #255 · daydreamlive/DEMON

leszko · 2026-06-12T14:40:50Z

Summary

Adds a text-to-music mode: generate music in realtime from the text prompt alone, with no input audio — selectable at session start and switchable mid-session, in both the backend and the web app.

ACE-Step's trained "no reference audio" signal is the checkpoint's canonical silence latent (the model forward uses it to simulate text2music mode), so a text-only session is a normal streaming session whose source latent and structure context are the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes — only session construction and wiring.

Changes

Contract (registry-first):

SessionConfig gains text2music: bool and text2music_duration_s: float = 60.0 — auto-projected into GET /api/protocol and the generated TS types.
swap_source command gains a text2music field (no binary PCM frame, mirrors use_server_source).
wireContract.gen.ts regenerated.
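As a rough illustration of the contract additions above, here is a minimal sketch assuming a dataclass-style config. Only the two field names, their types, and the 60.0 default come from this PR; the `False` default, the `swap_source_command` helper, and the `"type"` key are hypothetical, and the real registry projection into GET /api/protocol is not shown.

```python
from dataclasses import asdict, dataclass


@dataclass
class SessionConfig:
    """Illustrative subset: only the two fields this PR adds."""

    # The PR states the type only; a False default is an assumption.
    text2music: bool = False
    # Default value as stated in the PR description.
    text2music_duration_s: float = 60.0


def swap_source_command(text2music: bool) -> dict:
    """Hypothetical JSON payload for a mid-session swap.

    Only the text2music field is taken from the PR; the "type" key
    is assumed for illustration. No binary PCM frame accompanies it.
    """
    return {"type": "swap_source", "text2music": text2music}


config = SessionConfig(text2music=True)
wire_config = asdict(config)  # plain dict, ready for JSON serialization
```

Keeping the flag on the config object, rather than in a separate handshake message, is presumably what lets it be projected into the protocol and the generated TS types without extra wiring.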

Engine (acestep/streaming):

source.py: text2music_waveform() — silent placeholder that seeds the playback ring (the user hears generated slices stream in over silence) — and resolve_text2music_source() — canonical-silence PreparedSource via EmptyLatent plus fixed 120 BPM / C major / 4 conditioning defaults. Skips VAE encode, semantic extract, librosa beat-tracking (returns 0 BPM on silence, which would poison the text conditioning), CNN key detection, and stem extraction.
session.py: create() and the swap path branch on the flag; swap_source() accepts text2music.
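The two source helpers can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in source.py: the 48 kHz mono format, the `silent_placeholder` and `Text2MusicDefaults` names, and the reading of the trailing "4" as a time signature are all assumptions, and the silence-latent PreparedSource itself is not modelled.

```python
from array import array
from dataclasses import dataclass


@dataclass(frozen=True)
class Text2MusicDefaults:
    """Fixed conditioning used instead of running detection on silence."""

    bpm: float = 120.0        # beat-tracking silence would yield 0 BPM
    key: str = "C major"      # stands in for CNN key detection
    time_signature: int = 4   # assumption: the PR's "4" means beats per bar


def silent_placeholder(duration_s: float, sample_rate: int = 48_000) -> array:
    """Return duration_s seconds of mono float silence.

    The sample rate and channel layout are assumptions; the real helper
    seeds the playback ring with a buffer of this kind.
    """
    n_samples = int(round(duration_s * sample_rate))
    return array("f", [0.0]) * n_samples


waveform = silent_placeholder(60.0)
defaults = Text2MusicDefaults()
```

Hard-coding the defaults avoids feeding a detected 0 BPM into the text conditioning, which is the failure mode the PR calls out.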

Transport + SDK:

ws_adapter.py synthesizes the silent source server-side for both the init handshake and t2m swaps; no audio frame crosses the wire.
web/sdk/protocol.ts: connect() skips the PCM frame when config.text2music is set; new sendSwapTextToMusic().
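The handshake rule (JSON config always, PCM frame only for audio-sourced sessions) can be sketched as below. `build_init_frames` is a hypothetical helper written in Python for illustration; the actual client logic lives in web/sdk/protocol.ts, and its frame layout is not described in this PR.

```python
import json
from typing import List, Optional, Union


def build_init_frames(
    config: dict, pcm: Optional[bytes] = None
) -> List[Union[str, bytes]]:
    """Return the frames a client would send for the init handshake.

    Text-to-music sessions send only the JSON config; the server
    synthesizes the silent source, so no audio crosses the wire.
    """
    frames: List[Union[str, bytes]] = [json.dumps(config)]
    if not config.get("text2music"):
        if pcm is None:
            raise ValueError("audio-sourced sessions need a PCM frame")
        frames.append(pcm)
    return frames


t2m_frames = build_init_frames({"text2music": True})
audio_frames = build_init_frames({"text2music": False}, pcm=b"\x00\x00")
```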

Web app:

New client-side sentinel source (web/lib/text2music.ts) so every picker surface works unchanged: pinned TEXT TO MUSIC sleeve in the crate fan, pinned entry in the CORE-tab track picker, option in the lite select.
useStartSession / useFixtureSwap translate the sentinel into the wire flag; text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.
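The sentinel translation can be summarized in a few lines. The sketch below is in Python for illustration only (the real hooks are TypeScript), and the sentinel value, the `start_params` helper, and the returned keys are all hypothetical.

```python
TEXT2MUSIC_SENTINEL = "__text2music__"  # hypothetical sentinel source name


def start_params(source_name: str, requested_denoise: float) -> dict:
    """Translate a picker selection into session start parameters."""
    if source_name == TEXT2MUSIC_SENTINEL:
        return {
            "text2music": True,
            "source": None,         # nothing to upload: the source is silence
            "denoise": 1.0,         # snap to full denoise
            "denoise_gate": False,  # skip the hear-the-source-first gate
        }
    return {
        "text2music": False,
        "source": source_name,
        "denoise": requested_denoise,
        "denoise_gate": True,
    }


text_session = start_params(TEXT2MUSIC_SENTINEL, requested_denoise=0.4)
```

Routing the mode through a sentinel source name means the picker components only ever see "a source", so they need no text-to-music branch of their own.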

Testing

tests/unit (149 passed; contract drift guards included), npm run typecheck, npm run build.
Headless WS session against a live pod: silent initial buffer in → non-silent generated slices out (rms ~0.27), live prompt re-encode applied mid-stream.
Browser: start session → swap to Text to music (denoise snaps to 1, music fills the silent buffer) → live prompt change via Tags panel → swap back to a fixture. Reconnect path re-derives the t2m config from the store sentinel.
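The "silent in, non-silent out" check in the headless test comes down to comparing RMS levels. A minimal sketch, assuming float samples in [-1, 1] and an arbitrary silence threshold; neither the threshold nor these helper names come from the PR's test code.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples: Sequence[float], threshold: float = 1e-4) -> bool:
    """Treat a block as silent when its RMS falls below the assumed threshold."""
    return rms(samples) < threshold


initial_buffer = [0.0] * 1024          # what the session is seeded with
generated_slice = [0.27, -0.27] * 512  # stand-in for a slice at the reported level
```

With these stand-ins, `is_silent(initial_buffer)` holds and `is_silent(generated_slice)` does not, which mirrors the pass condition described above.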

🤖 Generated with Claude Code

The model already treats the checkpoint's canonical silence latent as "no reference audio" (its trained text2music conditioning), so a text-only session is a normal streaming session whose source latent and structure context are both the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes, only session construction and wiring:

- SessionConfig: text2music + text2music_duration_s fields (projected into /api/protocol and the generated TS types); swap_source command gains a text2music flag for mid-session switches, no binary frame.
- streaming/source.py: text2music_waveform() (silent placeholder that seeds the playback ring) and resolve_text2music_source() (canonical silence PreparedSource + fixed 120 BPM / C major / 4 defaults; librosa beat-tracking on silence returns 0 BPM and would poison the text conditioning, so detection is skipped, as is stem extraction).
- ws_adapter: synthesizes the silent source server-side; no PCM upload on the wire in either the init handshake or the swap path.
- SDK: connect() skips the binary frame when config.text2music is set (mirrors use_server_fixture); new sendSwapTextToMusic().
- Web app: "Text to music" appears as a pinned source in the crate fan, the CORE-tab track picker, and the lite select, via a client-side sentinel source name. Text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.

Verified end-to-end on GPU: headless WS session (silent initial buffer in, non-silent slices out, live prompt re-encode applied) and in the browser (swap to text mode, prompt change, swap back to a fixture).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

leszko marked this pull request as draft June 12, 2026 14:51

leszko wants to merge 1 commit into main from rafal/feat/text2music
