feat: text-to-music mode (generate from prompt alone, no input audio) #255
Draft
leszko wants to merge 1 commit into
The model already treats the checkpoint's canonical silence latent as "no reference audio" (its trained text2music conditioning), so a text-only session is a normal streaming session whose source latent and structure context are both the silence latent, diffusing from pure noise at denoise=1.0. No model or pipeline changes, only session construction and wiring:

- SessionConfig: text2music + text2music_duration_s fields (projected into /api/protocol and the generated TS types); swap_source command gains a text2music flag for mid-session switches, no binary frame.
- streaming/source.py: text2music_waveform() (silent placeholder that seeds the playback ring) and resolve_text2music_source() (canonical silence PreparedSource + fixed 120 BPM / C major / 4 defaults). librosa beat-tracking on silence returns 0 BPM and would poison the text conditioning, so detection is skipped, as is stem extraction.
- ws_adapter: synthesizes the silent source server-side; no PCM upload on the wire in either the init handshake or the swap path.
- SDK: connect() skips the binary frame when config.text2music is set (mirrors use_server_fixture); new sendSwapTextToMusic().
- Web app: "Text to music" appears as a pinned source in the crate fan, the CORE-tab track picker, and the lite select, via a client-side sentinel source name. Text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.

Verified end-to-end on GPU: headless WS session (silent initial buffer in, non-silent slices out, live prompt re-encode applied) and in the browser (swap to text mode, prompt change, swap back to a fixture).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
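The source-resolution rule in the commit message (silent placeholder, fixed conditioning defaults, no detection on silence) can be illustrated with a minimal sketch. This is not the code in `streaming/source.py`: the names `Conditioning`, `silent_placeholder`, and `resolve_conditioning` are hypothetical, the 48 kHz sample rate is an assumption, and reading the trailing "4" as a time signature is a guess. Only the 120 BPM / C major / 4 values come from the description above.

```python
from dataclasses import dataclass
from typing import List, Optional

SAMPLE_RATE_HZ = 48_000  # assumption: the real sample rate is not stated in this PR


@dataclass(frozen=True)
class Conditioning:
    """Hypothetical container for the musical metadata fed to text conditioning."""
    bpm: float
    key: str
    time_signature: int  # assumption: the "4" in the PR text is a time signature


# Fixed defaults named in the PR description (120 BPM / C major / 4).
TEXT2MUSIC_DEFAULTS = Conditioning(bpm=120.0, key="C major", time_signature=4)


def silent_placeholder(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> List[float]:
    """Return mono silence long enough to seed a playback buffer."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return [0.0] * int(round(duration_s * sample_rate_hz))


def resolve_conditioning(text2music: bool, detected: Optional[Conditioning]) -> Conditioning:
    """Pick conditioning values, ignoring detection results for text-only sessions."""
    if text2music:
        # Beat tracking on silence reports 0 BPM, so detection is never consulted here.
        return TEXT2MUSIC_DEFAULTS
    if detected is None or detected.bpm <= 0:
        raise ValueError("audio-backed sessions need a valid detected tempo")
    return detected
```

The point of the branch is that a text-only session never passes a detected tempo of 0 BPM into the prompt conditioning; the defaults stand in for every detector that would otherwise run on silence.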
Summary
Adds a text-to-music mode: generate music in realtime from the text prompt alone, with no input audio — selectable at session start and switchable mid-session, in both the backend and the web app.
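As a rough illustration of "selectable at session start and switchable mid-session", the sketch below models which frames a client might send in each case. It is written in Python for brevity although the real SDK is TypeScript. The `{"type": ...}` message envelope and the helper names `init_frames` and `swap_frames` are assumptions; only the `text2music`, `text2music_duration_s`, `use_server_fixture`, and `swap_source` names come from this PR.

```python
import json
from typing import List, Optional, Union

Frame = Union[str, bytes]  # text frames carry JSON, binary frames carry PCM


def init_frames(config: dict, pcm: Optional[bytes] = None) -> List[Frame]:
    """Frames a client would send to open a session."""
    # Hypothetical envelope; only the config keys come from the PR description.
    frames: List[Frame] = [json.dumps({"type": "init", "config": config})]
    if config.get("text2music") or config.get("use_server_fixture"):
        return frames  # the server supplies the source, so no PCM upload
    if pcm is None:
        raise ValueError("an audio-backed session needs PCM data")
    frames.append(pcm)
    return frames


def swap_frames(text2music: bool = False, pcm: Optional[bytes] = None) -> List[Frame]:
    """Frames a client would send to change the source mid-session."""
    if text2music:
        # Text-to-music swap: a single JSON command, no binary frame.
        return [json.dumps({"type": "swap_source", "text2music": True})]
    if pcm is None:
        raise ValueError("swapping to an audio source needs PCM data")
    return [json.dumps({"type": "swap_source"}), pcm]
```

In both paths the text-to-music case reduces to one JSON message, which is why no audio needs to cross the wire.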
ACE-Step's trained "no reference audio" signal is the checkpoint's canonical silence latent (the model forward uses it to simulate text2music mode), so a text-only session is a normal streaming session whose source latent and structure context are the silence latent, diffusing from pure noise at `denoise=1.0`. No model or pipeline changes, only session construction and wiring.

Changes
Contract (registry-first):
- `SessionConfig` gains `text2music: bool` and `text2music_duration_s: float = 60.0`, auto-projected into `GET /api/protocol` and the generated TS types.
- `swap_source` command gains a `text2music` field (no binary PCM frame, mirrors `use_server_source`).
- `wireContract.gen.ts` regenerated.

Engine (`acestep/streaming`):
- `source.py`: `text2music_waveform()` (silent placeholder that seeds the playback ring; the user hears generated slices stream in over silence) and `resolve_text2music_source()` (canonical-silence `PreparedSource` via `EmptyLatent` plus fixed 120 BPM / C major / 4 conditioning defaults). Skips VAE encode, semantic extract, librosa beat-tracking (returns 0 BPM on silence, which would poison the text conditioning), CNN key detection, and stem extraction.
- `session.py`: `create()` and the swap path branch on the flag; `swap_source()` accepts `text2music`.

Transport + SDK:
- `ws_adapter.py` synthesizes the silent source server-side for both the init handshake and t2m swaps; no audio frame crosses the wire.
- `web/sdk/protocol.ts`: `connect()` skips the PCM frame when `config.text2music` is set; new `sendSwapTextToMusic()`.

Web app:
- A client-side sentinel source name (`web/lib/text2music.ts`) so every picker surface works unchanged: pinned TEXT TO MUSIC sleeve in the crate fan, pinned entry in the CORE-tab track picker, option in the lite select.
- `useStartSession` / `useFixtureSwap` translate the sentinel into the wire flag; text sessions bypass the hear-the-source-first denoise gate (the source is silence) and snap denoise to 1.0.

Testing
`tests/unit` (149 passed; contract drift guards included), `npm run typecheck`, `npm run build`.

🤖 Generated with Claude Code
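The web-app behavior listed above (sentinel source name translated into the wire flag, denoise gate bypassed, denoise snapped to 1.0) might look roughly like the sketch below. It is in Python rather than the app's TypeScript, and everything in it is illustrative: the sentinel value, the `source` key, and both function names are hypothetical, and the gate is assumed to hold denoise at 0.0 until the source has been heard, which this PR does not spell out.

```python
# Hypothetical sentinel value; the real one is defined in web/lib/text2music.ts.
TEXT2MUSIC_SENTINEL = "__text2music__"


def source_request(selected_name: str) -> dict:
    """Translate a picker selection into the fields sent to the backend."""
    if selected_name == TEXT2MUSIC_SENTINEL:
        return {"text2music": True}  # wire flag from the PR; no source name needed
    return {"source": selected_name}  # hypothetical key for ordinary sources


def effective_denoise(selected_name: str, requested: float, source_heard: bool) -> float:
    """Apply the denoise gate, except for text sessions, which snap to 1.0."""
    if selected_name == TEXT2MUSIC_SENTINEL:
        # The source is silence, so there is nothing to hear first.
        return 1.0
    if not source_heard:
        # Assumed gate behavior: keep the untouched source audible first.
        return 0.0
    return min(1.0, max(0.0, requested))
```

Keeping the sentinel purely client-side means the pickers can treat text-to-music as one more named source, while the backend only ever sees the boolean flag.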