
[Performance] Improve MiMo-Audio tokenizer decoding performance#2183

Open
qibaoyuan wants to merge 47 commits into vllm-project:main from qibaoyuan:tok_cg

Conversation

Contributor

@qibaoyuan qibaoyuan commented Mar 25, 2026

Purpose

The audio tokenizer in the MiMo-Audio model is invoked frequently in asynchronous decoding scenarios, so its performance is critical. This PR accelerates the tokenizer's decode path with CUDA Graphs, which require static tensor shapes; the new `forward_fixed` methods provide a fixed-shape, mask-free execution path that can be captured and replayed.

Key changes include:

  • Attention.forward_fixed — Replaces flash_attn_varlen_func with F.scaled_dot_product_attention, operating on 3D tensors [B, L, D], thereby avoiding variable-length packing.
  • TransformerLayer.forward_fixed — Combines self_attn.forward_fixed with the feed-forward network (FFN).
  • CausalConvTranspose1d.forward_fixed — Applies transposed convolution directly on 3D tensors without using masked_select.
  • TransformerVocos.forward_fixed — Implements a mask-free forward path for the vocoder.
  • AudioDecoder.forward_fixed — Constructs the full decoder pipeline: dconv1 → transformer layers → dconv2 → vocoder.
  • MiMoAudioTokenizer.decode_fixed — Wraps the complete decoding process, including decode_vq, padding, and decoder.forward_fixed.
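The core idea behind `Attention.forward_fixed` — dense `[B, L, D]` batches through `F.scaled_dot_product_attention` instead of variable-length packing — can be sketched as follows. This is a minimal illustration, not the PR's actual code; the projection layers and head count here are hypothetical:

```python
import torch
import torch.nn.functional as F

def attention_forward_fixed(x, qkv_proj, out_proj, num_heads):
    """Fixed-shape causal self-attention on a dense [B, L, D] batch.

    Unlike flash_attn_varlen_func, no unpadding/packing step is needed:
    every sequence in the batch shares the same (padded) length L, so
    all tensor shapes stay static, which is what CUDA Graph capture
    requires.
    """
    B, L, D = x.shape
    head_dim = D // num_heads
    q, k, v = qkv_proj(x).chunk(3, dim=-1)  # each [B, L, D]
    # Reshape to [B, H, L, head_dim], the layout SDPA expects.
    q = q.view(B, L, num_heads, head_dim).transpose(1, 2)
    k = k.view(B, L, num_heads, head_dim).transpose(1, 2)
    v = v.view(B, L, num_heads, head_dim).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = out.transpose(1, 2).reshape(B, L, D)
    return out_proj(out)
```

Padding every sequence to a common length trades a little wasted compute for static shapes; for the short code sequences an audio tokenizer decodes, that trade is usually favorable.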

Test Plan

export MIMO_AUDIO_TOKENIZER_PATH="/to/path/MiMo-Audio-Tokenizer"

python3 -u ./examples/offline_inference/mimo_audio/audio_tokenizer_example.py \
    --tokenizer-path ${MIMO_AUDIO_TOKENIZER_PATH} \
    --audio-path ./examples/offline_inference/mimo_audio/beijing.mp3

Test Result

Non-streaming decoding

Baseline: 0.056s
With CUDA Graph: 0.006s (~9.3× speedup)

Streaming decoding

Baseline: 0.036s
With CUDA Graph: 0.019s (~1.9× speedup)
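The speedups above come from replaying a captured graph instead of relaunching each kernel from Python. A rough sketch of the capture/replay pattern (hypothetical, not the PR's actual implementation; it falls back to eager execution when CUDA is unavailable):

```python
import torch

class GraphedDecoder:
    """Capture a fixed-shape decode function once, then replay it.

    Sketch of the standard CUDA Graph pattern: inputs are copied into
    static buffers, the graph is replayed, and the static output is
    read back. Real code would also warm up the function on a side
    stream before capture.
    """

    def __init__(self, fn):
        self.fn = fn
        self.graph = None
        self.static_in = None
        self.static_out = None

    def __call__(self, x):
        if not torch.cuda.is_available():
            return self.fn(x)  # eager fallback for CPU-only runs
        if self.graph is None:
            # First call: allocate static buffers and capture.
            self.static_in = x.clone()
            torch.cuda.synchronize()
            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph):
                self.static_out = self.fn(self.static_in)
        # Replay: copy the new input into the captured buffer.
        self.static_in.copy_(x)
        self.graph.replay()
        return self.static_out.clone()
```

Because replay skips all Python-side launch overhead, the win is largest for small, launch-bound workloads like this decoder, which matches the 9.3x non-streaming result above.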

Details of the output:

[2/5] Loading audio: ./examples/offline_inference/mimo_audio/beijing.mp3
      Duration: 0.81s, samples: 19476

[3/5] Encoding audio → discrete codes ...
      codes shape: torch.Size([20, 21])  (num_quantizers × tokens)
      output_length: tensor([21], device='cuda:0')
      Compression ratio: 927.4x
      Encode time: 0.191s

[4/5] Decoding codes → waveform (non-streaming)...
      Reconstructed length: 20160 samples
      Decode time: **0.056s**
      Saved: reconstructed.wav

[5/7] CUDA-graph decode codes → waveform (non-streaming)...
padded_size in decode 24 actual_code_shape 21
using graph
      CUDA-graph reconstructed length: 20160 samples
      CUDA-graph decode time: **0.006s**
      CUDA-graph saved: reconstructed.wav.cg.wav

[6/7] Streaming decode (chunk_size=50)...
  chunk 1/1: generated 20160 samples
      Streaming total length: 20160 samples
      Streaming decode time: **0.036s**
      Saved: reconstructed_streaming.wav

[7/7] CUDA-graph streaming decode (chunk_size=50)...
  chunk 1/1: generated 20160 samples
      CUDA-graph streaming total length: 20160 samples
      CUDA-graph streaming decode time: **0.019s**
      CUDA-graph saved: reconstructed_streaming.wav.cg.wav
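The `padded_size in decode 24 actual_code_shape 21` line reflects padding the code length up to a fixed bucket so the captured graph's static shapes can be reused across calls. A minimal sketch of such rounding (the multiple of 8 is an assumption inferred from the log, not confirmed from the code):

```python
def pad_to_bucket(length: int, multiple: int = 8) -> int:
    """Round a sequence length up to the next multiple, so that decode
    calls with nearby lengths hit the same captured CUDA graph instead
    of forcing a re-capture."""
    return ((length + multiple - 1) // multiple) * multiple
```

With these assumptions, a 21-token code sequence pads to 24, matching the log; the extra positions are masked out or trimmed after decoding.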

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the test style doc
  • [x] The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.


qibaoyuan and others added 30 commits March 6, 2026 15:30

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47a7bfe901


