
[Spec-Decode] Add Ascend adaptation for universal speculative decoding (TLI)#7664

Open
wan-danfeng wants to merge 2 commits into vllm-project:main from wan-danfeng:feat/universal-draft-tli

Conversation


@wan-danfeng wan-danfeng commented Mar 26, 2026

What this PR does / why we need it?

This PR adds Ascend NPU support for the universal speculative decoding method introduced in upstream vllm-project/vllm#38174. It enables speculative decoding with heterogeneous vocabularies — the draft and target models can use different tokenizers (e.g., Llama-8B target + Qwen-0.5B draft).

The core algorithm uses Token-Level Intersection (TLI): the draft model generates tokens constrained to the vocabulary intersection, and a VocabMapping layer translates token IDs between the two vocabularies. Rejection sampling runs unchanged, preserving the target model's output distribution (provably lossless).
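As a rough illustration of the TLI idea (toy vocabularies and hypothetical helper names; the real implementation lives upstream in vllm.v1.spec_decode.vocab_mapping), the mapping pairs token strings shared by both tokenizers, and draft logits outside the intersection are masked out:

```python
import math

def build_vocab_mapping(draft_vocab, target_vocab):
    """Pair the token strings shared by both tokenizers.

    Returns (draft_to_target, target_to_draft) ID maps that cover only
    the vocabulary intersection.
    """
    shared = draft_vocab.keys() & target_vocab.keys()
    draft_to_target = {draft_vocab[tok]: target_vocab[tok] for tok in shared}
    target_to_draft = {t: d for d, t in draft_to_target.items()}
    return draft_to_target, target_to_draft

def mask_to_intersection(draft_logits, draft_to_target):
    """Set logits of draft tokens outside the intersection to -inf."""
    return [x if i in draft_to_target else -math.inf
            for i, x in enumerate(draft_logits)]

# Toy example: two tokenizers that share only "b" and "c".
draft_vocab = {"a": 0, "b": 1, "c": 2}
target_vocab = {"b": 5, "c": 6, "d": 7}
d2t, t2d = build_vocab_mapping(draft_vocab, target_vocab)
masked = mask_to_intersection([4.0, 2.0, 1.0], d2t)  # "a" (id 0) masked out
```

Because rejection sampling only ever sees tokens from the intersection with their IDs translated into the target vocabulary, the target model's output distribution is preserved.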

Currently, AscendDraftModelProposer enforces verify_equal_vocab_size_if_draft_model(), blocking heterogeneous model pairs. This PR adds a new AscendUniversalDraftProposer that lifts this restriction.

Closes #7663

Changes

File | Change
---- | ------
vllm_ascend/spec_decode/universal_draft_proposer.py | New. AscendUniversalDraftProposer extending AscendDraftModelProposer with TLI support
vllm_ascend/spec_decode/__init__.py | Register universal_draft method in get_spec_decode_method()
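As a sketch of the registration step (the classes below are trivial stand-ins, not the real vllm_ascend implementations, whose signatures differ), get_spec_decode_method() can be pictured as a name-to-class registry gaining a "universal_draft" entry:

```python
# Placeholder proposer classes standing in for the real Ascend ones.
class AscendDraftModelProposer:
    method = "draft_model"

class AscendUniversalDraftProposer(AscendDraftModelProposer):
    method = "universal_draft"  # lifts the equal-vocab restriction

# Name -> proposer-class registry, as the PR's new entry implies.
_SPEC_DECODE_METHODS = {
    "draft_model": AscendDraftModelProposer,
    "universal_draft": AscendUniversalDraftProposer,
}

def get_spec_decode_method(method: str):
    try:
        return _SPEC_DECODE_METHODS[method]
    except KeyError:
        raise ValueError(f"unknown speculative decoding method: {method!r}")
```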

Key implementation details:

  • Inherits from AscendDraftModelProposer (which inherits from SpecDecodeBaseProposer)
  • Skips vocab-size-mismatch check to allow heterogeneous model pairs
  • Initializes VocabMapping from upstream vLLM (vllm.v1.spec_decode.vocab_mapping) after model loading
  • In _propose: maps target→draft IDs on input, wraps compute_logits to constrain draft logits to intersection vocabulary, maps draft→target IDs on output
  • Minimal code footprint (~135 lines) — reuses the existing multi-step drafting loop without duplication
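The per-step flow the bullets describe might look like this in schematic form (hypothetical names, a greedy pick instead of the real sampling, and no multi-step loop):

```python
import math

def propose_with_tli(target_ids, draft_step, t2d, d2t):
    """One TLI draft step: translate IDs in, constrain logits, translate out.

    target_ids: context token IDs in the target vocabulary
    draft_step: callable mapping draft-vocab IDs to draft-vocab logits
    t2d / d2t:  target->draft and draft->target ID maps (intersection only)
    """
    # 1. Map target-vocab IDs into the draft vocabulary on input.
    draft_ids = [t2d[t] for t in target_ids]

    # 2. Constrain draft logits to the intersection vocabulary.
    logits = draft_step(draft_ids)
    masked = [x if i in d2t else -math.inf for i, x in enumerate(logits)]

    # 3. Pick a token (greedy here, for brevity) and map it back.
    chosen = max(range(len(masked)), key=masked.__getitem__)
    return d2t[chosen]

# Toy maps: draft IDs {1, 2} correspond to target IDs {5, 6};
# draft ID 0 has no target counterpart and must never be proposed.
t2d = {5: 1, 6: 2}
d2t = {1: 5, 2: 6}
proposed = propose_with_tli([5], lambda ids: [9.0, 1.0, 3.0], t2d, d2t)
```

Note how the highest raw logit (draft ID 0) is outside the intersection, so the masked argmax falls on draft ID 2, which is returned as target ID 6.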

Dependencies

  • Upstream vLLM PR: vllm-project/vllm#38174 must be merged first (provides VocabMapping and universal_draft config support)

Usage

python -m vllm.entrypoints.openai.api_server \
    --model Qwen2.5-7B-Instruct \
    --speculative-config '{"model": "Qwen2.5-0.5B-Instruct", "method": "universal_draft", "num_speculative_tokens": 5}'

Testing

Tested the TLI algorithm on Ascend 910B (64GB HBM) with an earlier vllm-ascend adaptation:

Configuration | Vocab Intersection | Mean Acceptance Length
------------- | ------------------ | ----------------------
Qwen2.5-7B + Qwen2.5-0.5B | 151,665 / 151,936 (99.8%) | 2.72 / 3.0

On GPU (A800), the same algorithm achieves 49-65% acceptance rate with significant throughput improvement. Full NPU benchmarks will be added once the upstream dependency is merged.

Related

…g (TLI)

Add AscendUniversalDraftProposer that extends AscendDraftModelProposer
with heterogeneous vocabulary support via Token-Level Intersection (TLI).
This allows the draft and target models to use different tokenizers
(e.g., Llama-8B target + Qwen-0.5B draft).

Key changes:
- New universal_draft_proposer.py with AscendUniversalDraftProposer
  that skips the vocab-size-mismatch check, initialises VocabMapping
  after model loading, and wraps _propose to translate token IDs
  between draft and target vocabularies.
- Register "universal_draft" method in get_spec_decode_method().

Depends on upstream vllm PR: vllm-project/vllm#38174

Signed-off-by: wan-danfeng <wandanfeng0802@gmail.com>
@wan-danfeng wan-danfeng requested a review from wangxiyuan as a code owner March 26, 2026 03:19
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfils the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances speculative decoding capabilities on Ascend NPUs by integrating universal speculative decoding. It addresses the limitation of requiring identical vocabularies between draft and target models, allowing for heterogeneous model pairs. By employing the Token-Level Intersection (TLI) algorithm, the system can now translate token IDs and constrain logits to the vocabulary intersection, ensuring the integrity of the target model's output distribution while improving inference speed.

Highlights

  • Ascend NPU Support for Universal Speculative Decoding: This PR introduces support for universal speculative decoding on Ascend NPUs, enabling more flexible and efficient inference.
  • Heterogeneous Vocabulary Support (TLI): The implementation allows speculative decoding with draft and target models that use different tokenizers and vocabulary sizes, leveraging the Token-Level Intersection (TLI) algorithm for provably lossless rejection sampling.
  • New Proposer Class: A new AscendUniversalDraftProposer class has been added, which extends AscendDraftModelProposer to handle VocabMapping and bypass previous vocabulary size mismatch checks.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new speculative decoding method, universal_draft, for Ascend. This method, implemented in AscendUniversalDraftProposer, supports heterogeneous vocabularies by using VocabMapping to translate token IDs between draft and target models and constrain logits to the vocabulary intersection. A review comment highlights a potential maintenance issue in the AscendUniversalDraftProposer's __init__ method, suggesting a more robust approach by calling super().__init__ and overriding the _raise_if_vocab_size_mismatch method.

Comment on lines +34 to +51
    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        # Intentionally skip AscendDraftModelProposer.__init__ to avoid
        # the vocab-size-mismatch check. We still need the base class
        # initialisation and the TP-mismatch check.
        SpecDecodeBaseProposer.__init__(
            self,
            vllm_config=vllm_config,
            device=device,
            pass_hidden_states_to_model=False,
            runner=runner,
        )
        self._raise_if_draft_tp_mismatch()
        self.vocab_mapping: VocabMapping | None = None
Contributor


high

The current __init__ implementation is brittle as it circumvents the parent's __init__ by directly calling the grandparent's. This can lead to maintenance issues if the parent AscendDraftModelProposer's __init__ method is updated in the future, as this class would not inherit the changes.

A more robust, object-oriented approach is to call super().__init__ and override the specific behavior you want to change. In this case, you can override _raise_if_vocab_size_mismatch to prevent the vocabulary size check.

Please replace the current __init__ with the one in the suggestion, and add the following method to the class for clarity and improved maintainability:

    def _raise_if_vocab_size_mismatch(self) -> None:
        """
        Override parent method to allow different vocabulary sizes between
        draft and target models, which is the purpose of universal
        speculative decoding.
        """
        pass
Suggested change

Before:

    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        # Intentionally skip AscendDraftModelProposer.__init__ to avoid
        # the vocab-size-mismatch check. We still need the base class
        # initialisation and the TP-mismatch check.
        SpecDecodeBaseProposer.__init__(
            self,
            vllm_config=vllm_config,
            device=device,
            pass_hidden_states_to_model=False,
            runner=runner,
        )
        self._raise_if_draft_tp_mismatch()
        self.vocab_mapping: VocabMapping | None = None

After:

    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        super().__init__(vllm_config, device, runner)
        self.vocab_mapping: VocabMapping | None = None

…port

- Replace direct SpecDecodeBaseProposer.__init__ call with
  super().__init__ + _raise_if_vocab_size_mismatch override (Gemini)
- Remove unused logger import (ruff F401)
- Run ruff format

Signed-off-by: wan-danfeng <wandanfeng0802@gmail.com>


Development

Successfully merging this pull request may close these issues.

[Feature] Support universal speculative decoding for heterogeneous vocabularies (TLI)

1 participant