
[Performance] Improve MiMo-Audio tokenizer decoding performance#2183

Open
qibaoyuan wants to merge 47 commits into vllm-project:main from qibaoyuan:tok_cg

Conversation

Contributor

@qibaoyuan qibaoyuan commented Mar 25, 2026

Purpose

The audio tokenizer in the MiMo-Audio model is invoked frequently in asynchronous decoding scenarios, so its performance is critical. This PR accelerates the tokenizer's decode path with CUDA Graphs, which require static tensor shapes; the new `forward_fixed` methods provide a fixed-shape, mask-free execution path that can be captured and replayed.

Key changes include:

  • Attention.forward_fixed — Replaces flash_attn_varlen_func with F.scaled_dot_product_attention, operating on 3D tensors [B, L, D], thereby avoiding variable-length packing.
  • TransformerLayer.forward_fixed — Combines self_attn.forward_fixed with the feed-forward network (FFN).
  • CausalConvTranspose1d.forward_fixed — Applies transposed convolution directly on 3D tensors without using masked_select.
  • TransformerVocos.forward_fixed — Implements a mask-free forward path for the vocoder.
  • AudioDecoder.forward_fixed — Constructs the full decoder pipeline: dconv1 → transformer layers → dconv2 → vocoder.
  • MiMoAudioTokenizer.decode_fixed — Wraps the complete decoding process, including decode_vq, padding, and decoder.forward_fixed.
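The core idea behind `Attention.forward_fixed` — dense `[B, L, D]` batches through `F.scaled_dot_product_attention` instead of variable-length packing — can be sketched as follows. This is a minimal illustration, not the PR's actual code; the projection layers and head count here are hypothetical:

```python
import torch
import torch.nn.functional as F

def attention_forward_fixed(x, qkv_proj, out_proj, num_heads):
    """Fixed-shape causal self-attention on a dense [B, L, D] batch.

    Unlike flash_attn_varlen_func, no unpadding/packing step is needed:
    every sequence in the batch shares the same (padded) length L, so
    all tensor shapes stay static, which is what CUDA Graph capture
    requires.
    """
    B, L, D = x.shape
    head_dim = D // num_heads
    q, k, v = qkv_proj(x).chunk(3, dim=-1)  # each [B, L, D]
    # Reshape to [B, H, L, head_dim], the layout SDPA expects.
    q = q.view(B, L, num_heads, head_dim).transpose(1, 2)
    k = k.view(B, L, num_heads, head_dim).transpose(1, 2)
    v = v.view(B, L, num_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = out.transpose(1, 2).reshape(B, L, D)
    return out_proj(out)
```

Padding every sequence to a common length trades a little wasted compute for static shapes; for the short code sequences an audio tokenizer decodes, that trade is usually favorable.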

Test Plan

export MIMO_AUDIO_TOKENIZER_PATH="/to/path/MiMo-Audio-Tokenizer"

python3 -u ./examples/offline_inference/mimo_audio/audio_tokenizer_example.py \
    --tokenizer-path ${MIMO_AUDIO_TOKENIZER_PATH} \
    --audio-path ./examples/offline_inference/mimo_audio/beijing.mp3

Test Result

Non-streaming decoding

Baseline: 0.056s
With CUDA Graph: 0.006s (~9.3× speedup)

Streaming decoding

Baseline: 0.036s
With CUDA Graph: 0.019s (~1.9× speedup)
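The speedups above come from replaying a captured graph instead of relaunching each kernel from Python. A rough sketch of the capture/replay pattern (hypothetical, not the PR's actual implementation; it falls back to eager execution when CUDA is unavailable):

```python
import torch

class GraphedDecoder:
    """Capture a fixed-shape decode function once, then replay it.

    Sketch of the standard CUDA Graph pattern: inputs are copied into
    static buffers, the graph is replayed, and the static output is
    read back. Real code would also warm up the function on a side
    stream before capture.
    """

    def __init__(self, fn):
        self.fn = fn
        self.graph = None
        self.static_in = None
        self.static_out = None

    def __call__(self, x):
        if not torch.cuda.is_available():
            return self.fn(x)  # eager fallback for CPU-only runs
        if self.graph is None:
            # First call: allocate static buffers and capture.
            self.static_in = x.clone()
            torch.cuda.synchronize()
            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph):
                self.static_out = self.fn(self.static_in)
        # Replay: copy the new input into the captured buffer.
        self.static_in.copy_(x)
        self.graph.replay()
        return self.static_out.clone()
```

Because replay skips all Python-side launch overhead, the win is largest for small, launch-bound workloads like this decoder, which matches the 9.3x non-streaming result above.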

Details of the output:

[2/5] Loading audio: ./examples/offline_inference/mimo_audio/beijing.mp3
      Duration: 0.81s, samples: 19476

[3/5] Encoding audio → discrete codes ...
      codes shape: torch.Size([20, 21])  (num_quantizers × tokens)
      output_length: tensor([21], device='cuda:0')
      Compression ratio: 927.4x
      Encode time: 0.191s

[4/5] Decoding codes → waveform (non-streaming)...
      Reconstructed length: 20160 samples
      Decode time: **0.056s**
      Saved: reconstructed.wav

[5/7] CUDA-graph decode codes → waveform (non-streaming)...
padded_size in decode 24 actual_code_shape 21
using graph
      CUDA-graph reconstructed length: 20160 samples
      CUDA-graph decode time: **0.006s**
      CUDA-graph saved: reconstructed.wav.cg.wav

[6/7] Streaming decode (chunk_size=50)...
  chunk 1/1: generated 20160 samples
      Streaming total length: 20160 samples
      Streaming decode time: **0.036s**
      Saved: reconstructed_streaming.wav

[7/7] CUDA-graph streaming decode (chunk_size=50)...
  chunk 1/1: generated 20160 samples
      CUDA-graph streaming total length: 20160 samples
      CUDA-graph streaming decode time: **0.019s**
      CUDA-graph saved: reconstructed_streaming.wav.cg.wav
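The `padded_size in decode 24 actual_code_shape 21` line reflects padding the code length up to a fixed bucket so the captured graph's static shapes can be reused across calls. A minimal sketch of such rounding (the multiple of 8 is an assumption inferred from the log, not confirmed from the code):

```python
def pad_to_bucket(length: int, multiple: int = 8) -> int:
    """Round a sequence length up to the next multiple, so that decode
    calls with nearby lengths hit the same captured CUDA graph instead
    of forcing a re-capture."""
    return ((length + multiple - 1) // multiple) * multiple
```

With these assumptions, a 21-token code sequence pads to 24, matching the log; the extra positions are masked out or trimmed after decoding.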

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc
  • [x] The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


qibaoyuan and others added 30 commits March 6, 2026 15:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47a7bfe901


