feat: text-to-music mode (generate from prompt alone, no input audio) #255

Draft
leszko wants to merge 1 commit into main from rafal/feat/text2music

@leszko leszko commented Jun 12, 2026

Collaborator

Summary

Adds a text-to-music mode: generate music in real time from the text prompt alone, with no input audio. The mode is selectable at session start and switchable mid-session, in both the backend and the web app.

ACE-Step's trained "no reference audio" signal is the checkpoint's canonical silence latent (the model forward uses it to simulate text2music mode), so a text-only session is a normal streaming session whose source latent and structure context are the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes — only session construction and wiring.

Changes

Contract (registry-first):

  • SessionConfig gains text2music: bool and text2music_duration_s: float = 60.0 — auto-projected into GET /api/protocol and the generated TS types.
  • swap_source command gains a text2music field (no binary PCM frame, mirrors use_server_source).
  • wireContract.gen.ts regenerated.
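
As a rough illustration of the contract change, the sketch below models the two new config fields and a flag-only swap command. Only `text2music`, `text2music_duration_s`, and the `swap_source` command name come from this PR; the `prompt` field, the `False` default, and the JSON message shape are assumptions made for the example, not the actual registry definitions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SessionConfig:
    """Hypothetical subset of the session config (not the real class)."""

    prompt: str = ""                     # assumed field, for illustration only
    text2music: bool = False             # from the PR; the default is an assumption
    text2music_duration_s: float = 60.0  # from the PR, including the default


def swap_source_command(text2music: bool) -> str:
    """Build a swap_source command as JSON text.

    The message shape is an assumption. The point is that a text-to-music
    swap is described by a flag alone, with no binary PCM frame after it.
    """
    return json.dumps({"type": "swap_source", "text2music": text2music})


config = SessionConfig(prompt="warm analog techno", text2music=True)
print(asdict(config))
print(swap_source_command(True))
```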

Engine (acestep/streaming):

  • source.py: adds text2music_waveform(), a silent placeholder that seeds the playback ring (the user hears generated slices stream in over silence), and resolve_text2music_source(), which builds a canonical-silence PreparedSource via EmptyLatent plus fixed 120 BPM / C major / 4 conditioning defaults. The text path skips VAE encode, semantic extraction, librosa beat-tracking (which returns 0 BPM on silence and would poison the text conditioning), CNN key detection, and stem extraction.
  • session.py: create() and the swap path branch on the flag; swap_source() accepts text2music.
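
The two engine helpers can be pictured with a minimal standard-library sketch. It is not the code in source.py: the sample rate, channel count, `ConditioningDefaults` type, and `resolve_conditioning()` helper are all hypothetical, and reading the bare "4" in the defaults as a time signature is an assumption.

```python
from array import array
from dataclasses import dataclass
from typing import Callable

SAMPLE_RATE = 48_000  # assumption: the PR does not state the sample rate
CHANNELS = 2          # assumption: interleaved stereo


def text2music_waveform(duration_s: float) -> array:
    """Return interleaved float32 silence covering duration_s seconds."""
    n_values = int(round(duration_s * SAMPLE_RATE)) * CHANNELS
    return array("f", [0.0]) * n_values


@dataclass(frozen=True)
class ConditioningDefaults:
    """Fixed conditioning used instead of detection on a silent source."""

    bpm: float = 120.0
    key: str = "C major"
    time_signature: int = 4  # assumption: the PR lists a bare "4"


def resolve_conditioning(
    text2music: bool, detect: Callable[[], ConditioningDefaults]
) -> ConditioningDefaults:
    """Use fixed defaults in text mode; otherwise run the supplied detector.

    Detection is skipped in text mode because beat-tracking on silence
    would report 0 BPM, which the PR says poisons the text conditioning.
    """
    if text2music:
        return ConditioningDefaults()
    return detect()
```

Passing the detector in as a callable keeps the sketch independent of any audio library; the real resolver presumably calls its analysis stages directly.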

Transport + SDK:

  • ws_adapter.py synthesizes the silent source server-side for both the init handshake and t2m swaps; no audio frame crosses the wire.
  • web/sdk/protocol.ts: connect() skips the PCM frame when config.text2music is set; new sendSwapTextToMusic().
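
The handshake rule described above (no PCM frame when `text2music` is set) can be sketched independently of the real SDK. The SDK itself is TypeScript; this Python version only illustrates the framing decision, and the `init` message shape and the `handshake_frames()` helper are hypothetical.

```python
import json
from typing import List, Optional, Union

Frame = Union[str, bytes]


def handshake_frames(config: dict, pcm: Optional[bytes]) -> List[Frame]:
    """Return the frames a client would send to open a session.

    A JSON init frame always goes first (its shape is an assumption).
    The binary PCM frame follows only for audio-source sessions; in
    text-to-music mode the server synthesizes the silent source itself.
    """
    frames: List[Frame] = [json.dumps({"type": "init", "config": config})]
    if not config.get("text2music"):
        if pcm is None:
            raise ValueError("an audio-source session needs a PCM frame")
        frames.append(pcm)
    return frames


# Text-only session: one JSON frame, nothing binary on the wire.
print(len(handshake_frames({"text2music": True}, None)))
# Audio-source session: JSON frame followed by the PCM payload.
print(len(handshake_frames({"text2music": False}, b"\x00\x00")))
```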

Web app:

  • New client-side sentinel source (web/lib/text2music.ts) so every picker surface works unchanged: pinned TEXT TO MUSIC sleeve in the crate fan, pinned entry in the CORE-tab track picker, option in the lite select.
  • useStartSession / useFixtureSwap translate the sentinel into the wire flag; text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.

Testing

  • tests/unit (149 passed; contract drift guards included), npm run typecheck, npm run build.
  • Headless WS session against a live pod: silent initial buffer in → non-silent generated slices out (rms ~0.27), live prompt re-encode applied mid-stream.
  • Browser: start session → swap to Text to music (denoise snaps to 1, music fills the silent buffer) → live prompt change via Tags panel → swap back to a fixture. Reconnect path re-derives the t2m config from the store sentinel.
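
The "silent in, non-silent out" check in the headless test comes down to comparing RMS levels. A minimal sketch of such a check follows; the helper names and the silence threshold are assumptions, not the test code from this PR.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(samples: Sequence[float], threshold: float = 1e-4) -> bool:
    """Treat a block as silent when its RMS falls below threshold (assumed value)."""
    return rms(samples) < threshold


initial_buffer = [0.0] * 1024            # the silent placeholder source
generated_slice = [0.27, -0.27] * 512    # stand-in for a generated slice
print(is_silent(initial_buffer), is_silent(generated_slice))
```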

🤖 Generated with Claude Code

The model already treats the checkpoint's canonical silence latent as
"no reference audio" (its trained text2music conditioning), so a
text-only session is a normal streaming session whose source latent and
structure context are both the silence latent, diffusing from pure
noise at denoise=1.0. No model or pipeline changes — only session
construction and wiring:

- SessionConfig: text2music + text2music_duration_s fields (projected
  into /api/protocol and the generated TS types); swap_source command
  gains a text2music flag for mid-session switches, no binary frame.
- streaming/source.py: text2music_waveform() (silent placeholder that
  seeds the playback ring) and resolve_text2music_source() (canonical
  silence PreparedSource + fixed 120 BPM / C major / 4 defaults —
  librosa beat-tracking on silence returns 0 BPM and would poison the
  text conditioning, so detection is skipped, as is stem extraction).
- ws_adapter: synthesizes the silent source server-side; no PCM upload
  on the wire in either the init handshake or the swap path.
- SDK: connect() skips the binary frame when config.text2music is set
  (mirrors use_server_fixture); new sendSwapTextToMusic().
- Web app: "Text to music" appears as a pinned source in the crate fan,
  the CORE-tab track picker, and the lite select, via a client-side
  sentinel source name. Text sessions bypass the hear-the-source-first
  denoise gate (the source is silence) and snap denoise to 1.0.

Verified end-to-end on GPU: headless WS session (silent initial buffer
in, non-silent slices out, live prompt re-encode applied) and in the
browser (swap to text mode, prompt change, swap back to a fixture).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@leszko leszko marked this pull request as draft June 12, 2026 14:51