izaakrogan/audio-transformer

# audio transformer

minimal audio generation with codec + transformer. basically a tiny version of what AudioLM/MusicGen do.

## what is this

the idea is:

  1. train a codec to compress audio into discrete tokens (like how JPEG compresses images, but learnable and for audio)
  2. train a transformer to predict the next token
  3. generate new audio by sampling from the transformer and decoding
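in spirit (ignoring the neural nets entirely), the three steps look like this toy version, where a lookup table stands in for both the codec and the transformer — everything here is illustrative, none of it is the repo's actual code:

```python
# toy pipeline: "codec" = char<->int vocab, "transformer" = bigram count table
from collections import Counter, defaultdict

audio = "abcabcabc"  # stand-in for a waveform

# 1. "codec": compress to discrete tokens (here just a vocab lookup)
vocab = sorted(set(audio))
encode = {c: i for i, c in enumerate(vocab)}
decode = {i: c for c, i in encode.items()}
tokens = [encode[c] for c in audio]

# 2. "transformer": learn next-token prediction (here bigram counts)
counts = defaultdict(Counter)
for a, b in zip(tokens, tokens[1:]):
    counts[a][b] += 1

# 3. generate: sample the next token (greedily here) and decode back
out = [tokens[0]]
for _ in range(8):
    out.append(counts[out[-1]].most_common(1)[0][0])
generated = "".join(decode[t] for t in out)
# → "abcabcabc"
```

the real thing swaps the vocab lookup for a learned encoder/decoder and the count table for a transformer, but the generate loop has the same shape: feed tokens in, sample one out, append, repeat.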

the codec uses residual vector quantization (RVQ): 8 codebooks with 1024 codes each. at 16kHz with 320x downsampling that's 50 frames per second, and 50 frames × 8 codebooks = 400 tokens per second of audio.
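a minimal sketch of the RVQ idea — quantize with the first codebook, then quantize the leftover residual with the next, and so on. random codebooks here, not the trained ones; shapes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_codebooks, codebook_size, dim = 8, 1024, 64  # matches the config above
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))

def rvq_encode(x, codebooks):
    """quantize x stage by stage, each codebook handling the previous residual."""
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        # nearest code by euclidean distance
        idx = int(np.argmin(((cb - residual) ** 2).sum(-1)))
        tokens.append(idx)
        residual = residual - cb[idx]
    return tokens, x - residual  # 8 tokens + reconstruction (sum of chosen codes)

x = rng.normal(size=dim)
tokens, x_hat = rvq_encode(x, codebooks)

# the token-rate arithmetic from above
frames_per_sec = 16000 // 320          # 50
tokens_per_sec = frames_per_sec * num_codebooks  # 400
```

each frame of audio becomes 8 tokens (one per codebook), and later codebooks capture what earlier ones missed, which is why more codebooks means better reconstruction at the cost of more tokens.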

## files

- `transformer.py` - the transformer, has RoPE and GQA
- `codec.py` - encoder/decoder + RVQ, based on encodec/soundstream
- `audio_data.py` - dataloaders for librispeech
- `train_codec.py` - trains the codec
- `train_audio_lm.py` - tokenizes audio then trains transformer on the tokens
- `inference.py` - generates audio
- `main.py` - sanity check that the transformer works (just memorizes a sequence)

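for the curious: RoPE encodes position by rotating consecutive pairs of query/key dimensions by position-dependent angles, so attention scores depend on relative position. `transformer.py`'s actual implementation may differ; this is a generic numpy sketch using the "split halves" layout, one of the two common conventions:

```python
import numpy as np

def rope(x, base=10000.0):
    """rotary position embedding: rotate dim pairs by angles that grow with position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)     # one frequency per dim pair
    angles = np.outer(np.arange(seq_len), freqs)  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # standard 2d rotation applied to each (x1_i, x2_i) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(16, 64))  # (seq_len, head_dim)
q_rot = rope(q)
# rotations preserve vector norms, and position 0 (angle 0) is left unchanged
```

since it's a pure rotation, dot products between a rotated query and rotated key only depend on the *difference* of their positions, which is the whole trick.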
## usage

```
pip install torch torchaudio
```

```
# train codec (will download librispeech, it's ~6gb)
python train_codec.py --epochs 100

# tokenize the dataset
python train_audio_lm.py --mode tokenize --codec-path ./checkpoints/codec_epoch_100.pt

# train the LM
python train_audio_lm.py --mode train

# generate
python inference.py --mode generate --codec-path ... --transformer-path ...
```

you probably want a gpu for this; on cpu, training is painfully slow.
