minimal audio generation with codec + transformer. basically a tiny version of what AudioLM/MusicGen do.
the idea is:
- train a codec to compress audio into discrete tokens (like how JPEG compresses images, but learnable and for audio)
- train a transformer to predict the next token
- generate new audio by sampling from the transformer and decoding
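the sample-and-decode step boils down to a standard autoregressive loop. here's a minimal sketch (names like `lm`, `codec`, and the exact shapes are assumptions, not the actual inference.py API):

```python
import torch

@torch.no_grad()
def generate(lm, codec, prompt, n_new, temperature=1.0):
    """sample n_new codec tokens from the lm, then decode to a waveform."""
    tokens = prompt                                   # (1, seq) of token ids
    for _ in range(n_new):
        logits = lm(tokens)[:, -1] / temperature      # logits for the next token
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, nxt], dim=1)      # append and keep going
    return codec.decode(tokens)                       # token ids -> waveform
```

temperature just rescales the logits before sampling: higher = more random audio, lower = closer to greedy decoding.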
the codec uses residual vector quantization (RVQ) - 8 codebooks with 1024 codes each. at 16khz with 320x downsampling that's 50 frames per second, and with 8 codebooks per frame you get 400 tokens per second of audio.
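the mechanics of RVQ are simple: each codebook quantizes whatever residual the previous codebook left behind, and decoding sums the picked code vectors. a toy sketch with untrained random codebooks (the latent dim here is made up, and this is not the actual codec.py code):

```python
import torch

num_codebooks, codebook_size, dim = 8, 1024, 64    # dim is a made-up latent size

# untrained random codebooks, just to show the mechanics
codebooks = [torch.randn(codebook_size, dim) for _ in range(num_codebooks)]

def rvq_encode(z):                                  # z: (frames, dim) encoder output
    residual, codes = z, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest code per frame
        codes.append(idx)
        residual = residual - cb[idx]               # next codebook sees the residual
    return torch.stack(codes)                       # (num_codebooks, frames)

def rvq_decode(codes):
    # reconstruction = sum of the selected code vectors across all codebooks
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

z = torch.randn(50, dim)                            # 1s of audio: 16000 / 320 = 50 frames
codes = rvq_encode(z)
print(codes.shape)                                  # torch.Size([8, 50]) -> 400 tokens/s
```

in the real codec the codebooks are learned (and typically updated with EMA), so each level actually shrinks the residual instead of just relabeling it.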
- transformer.py - the transformer, has RoPE and GQA
- codec.py - encoder/decoder + RVQ, based on encodec/soundstream
- audio_data.py - dataloaders for librispeech
- train_codec.py - trains the codec
- train_audio_lm.py - tokenizes audio then trains transformer on the tokens
- inference.py - generates audio
- main.py - sanity check that the transformer works (just memorizes a sequence)
pip install torch torchaudio
# train codec (will download librispeech, it's ~6gb)
python train_codec.py --epochs 100
# tokenize the dataset
python train_audio_lm.py --mode tokenize --codec-path ./checkpoints/codec_epoch_100.pt
# train the LM
python train_audio_lm.py --mode train
# generate
python inference.py --mode generate --codec-path ... --transformer-path ...
you probably want a gpu for this. cpu is painfully slow.
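the tokenize pass is conceptually just running each clip through the frozen codec and saving the codes. a hypothetical sketch (assumed API and token ordering - the real logic lives in train_audio_lm.py):

```python
import torch

def tokenize_clip(codec, wav):         # wav: (1, samples) mono audio at 16khz
    """encode one clip to a flat stream of codec token ids for LM training."""
    with torch.no_grad():
        codes = codec.encode(wav)      # (num_codebooks, frames) of token ids
    # frame-major interleaving (all 8 codes for frame 0, then frame 1, ...);
    # the ordering is a design choice, not something the codec dictates
    return codes.t().flatten()
```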