
[Spec-Decode] Add Ascend adaptation for universal speculative decoding (TLI)#7664

Open
wan-danfeng wants to merge 2 commits into vllm-project:main from wan-danfeng:feat/universal-draft-tli

Conversation


@wan-danfeng wan-danfeng commented Mar 26, 2026

What this PR does / why we need it?

This PR adds Ascend NPU support for the universal speculative decoding method introduced in upstream vllm-project/vllm#38174. It enables speculative decoding with heterogeneous vocabularies — the draft and target models can use different tokenizers (e.g., Llama-8B target + Qwen-0.5B draft).

The core algorithm uses Token-Level Intersection (TLI): the draft model generates tokens constrained to the vocabulary intersection, and a VocabMapping layer translates token IDs between the two vocabularies. Rejection sampling runs unchanged, preserving the target model's output distribution (provably lossless).
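As a rough illustration of the TLI idea (toy vocabularies and hypothetical helper names; the real implementation lives upstream in vllm.v1.spec_decode.vocab_mapping), the mapping pairs token strings shared by both tokenizers, and draft logits outside the intersection are masked out:

```python
import math

def build_vocab_mapping(draft_vocab, target_vocab):
    """Pair the token strings shared by both tokenizers.

    Returns (draft_to_target, target_to_draft) ID maps that cover only
    the vocabulary intersection.
    """
    shared = draft_vocab.keys() & target_vocab.keys()
    draft_to_target = {draft_vocab[tok]: target_vocab[tok] for tok in shared}
    target_to_draft = {t: d for d, t in draft_to_target.items()}
    return draft_to_target, target_to_draft

def mask_to_intersection(draft_logits, draft_to_target):
    """Set logits of draft tokens outside the intersection to -inf."""
    return [x if i in draft_to_target else -math.inf
            for i, x in enumerate(draft_logits)]

# Toy example: two tokenizers that share only "b" and "c".
draft_vocab = {"a": 0, "b": 1, "c": 2}
target_vocab = {"b": 5, "c": 6, "d": 7}
d2t, t2d = build_vocab_mapping(draft_vocab, target_vocab)
masked = mask_to_intersection([4.0, 2.0, 1.0], d2t)  # "a" (id 0) masked out
```

Because rejection sampling only ever sees tokens from the intersection with their IDs translated into the target vocabulary, the target model's output distribution is preserved.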

Currently, AscendDraftModelProposer enforces verify_equal_vocab_size_if_draft_model(), blocking heterogeneous model pairs. This PR adds a new AscendUniversalDraftProposer that lifts this restriction.

Closes #7663

Changes

File | Change
---- | ------
vllm_ascend/spec_decode/universal_draft_proposer.py | New. AscendUniversalDraftProposer extending AscendDraftModelProposer with TLI support
vllm_ascend/spec_decode/__init__.py | Register universal_draft method in get_spec_decode_method()
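As a sketch of the registration step (the classes below are trivial stand-ins, not the real vllm_ascend implementations, whose signatures differ), get_spec_decode_method() can be pictured as a name-to-class registry gaining a "universal_draft" entry:

```python
# Placeholder proposer classes standing in for the real Ascend ones.
class AscendDraftModelProposer:
    method = "draft_model"

class AscendUniversalDraftProposer(AscendDraftModelProposer):
    method = "universal_draft"  # lifts the equal-vocab restriction

# Name -> proposer-class registry, as the PR's new entry implies.
_SPEC_DECODE_METHODS = {
    "draft_model": AscendDraftModelProposer,
    "universal_draft": AscendUniversalDraftProposer,
}

def get_spec_decode_method(method: str):
    try:
        return _SPEC_DECODE_METHODS[method]
    except KeyError:
        raise ValueError(f"unknown speculative decoding method: {method!r}")
```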

Key implementation details:

  • Inherits from AscendDraftModelProposer (which inherits from SpecDecodeBaseProposer)
  • Skips vocab-size-mismatch check to allow heterogeneous model pairs
  • Initializes VocabMapping from upstream vLLM (vllm.v1.spec_decode.vocab_mapping) after model loading
  • In _propose: maps target→draft IDs on input, wraps compute_logits to constrain draft logits to intersection vocabulary, maps draft→target IDs on output
  • Minimal code footprint (~135 lines) — reuses the existing multi-step drafting loop without duplication
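The per-step flow the bullets describe might look like this in schematic form (hypothetical names, a greedy pick instead of the real sampling, and no multi-step loop):

```python
import math

def propose_with_tli(target_ids, draft_step, t2d, d2t):
    """One TLI draft step: translate IDs in, constrain logits, translate out.

    target_ids: context token IDs in the target vocabulary
    draft_step: callable mapping draft-vocab IDs to draft-vocab logits
    t2d / d2t:  target->draft and draft->target ID maps (intersection only)
    """
    # 1. Map target-vocab IDs into the draft vocabulary on input.
    draft_ids = [t2d[t] for t in target_ids]

    # 2. Constrain draft logits to the intersection vocabulary.
    logits = draft_step(draft_ids)
    masked = [x if i in d2t else -math.inf for i, x in enumerate(logits)]

    # 3. Pick a token (greedy here, for brevity) and map it back.
    chosen = max(range(len(masked)), key=masked.__getitem__)
    return d2t[chosen]

# Toy maps: draft IDs {1, 2} correspond to target IDs {5, 6};
# draft ID 0 has no target counterpart and must never be proposed.
t2d = {5: 1, 6: 2}
d2t = {1: 5, 2: 6}
proposed = propose_with_tli([5], lambda ids: [9.0, 1.0, 3.0], t2d, d2t)
```

Note how the highest raw logit (draft ID 0) is outside the intersection, so the masked argmax falls on draft ID 2, which is returned as target ID 6.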

Dependencies

  • Upstream vLLM PR: vllm-project/vllm#38174 must be merged first (provides VocabMapping and universal_draft config support)

Usage

python -m vllm.entrypoints.openai.api_server \
    --model Qwen2.5-7B-Instruct \
    --speculative-config '{"model": "Qwen2.5-0.5B-Instruct", "method": "universal_draft", "num_speculative_tokens": 5}'

Testing

Tested the TLI algorithm on Ascend 910B (64GB HBM) with an earlier vllm-ascend adaptation:

Configuration | Vocab Intersection | Mean Acceptance Length
------------- | ------------------ | ----------------------
Qwen2.5-7B + Qwen2.5-0.5B | 151,665 / 151,936 (99.8%) | 2.72 / 3.0

On GPU (A800), the same algorithm achieves 49-65% acceptance rate with significant throughput improvement. Full NPU benchmarks will be added once the upstream dependency is merged.

Related

…g (TLI)

Add AscendUniversalDraftProposer that extends AscendDraftModelProposer
with heterogeneous vocabulary support via Token-Level Intersection (TLI).
This allows the draft and target models to use different tokenizers
(e.g., Llama-8B target + Qwen-0.5B draft).

Key changes:
- New universal_draft_proposer.py with AscendUniversalDraftProposer
  that skips the vocab-size-mismatch check, initialises VocabMapping
  after model loading, and wraps _propose to translate token IDs
  between draft and target vocabularies.
- Register "universal_draft" method in get_spec_decode_method().

Depends on upstream vllm PR: vllm-project/vllm#38174

Signed-off-by: wan-danfeng <wandanfeng0802@gmail.com>
@wan-danfeng wan-danfeng requested a review from wangxiyuan as a code owner March 26, 2026 03:19
@github-actions
Contributor

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a commit message that fulfils the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to the Contributing and Testing guides.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances speculative decoding capabilities on Ascend NPUs by integrating universal speculative decoding. It addresses the limitation of requiring identical vocabularies between draft and target models, allowing for heterogeneous model pairs. By employing the Token-Level Intersection (TLI) algorithm, the system can now translate token IDs and constrain logits to the vocabulary intersection, ensuring the integrity of the target model's output distribution while improving inference speed.

Highlights

  • Ascend NPU Support for Universal Speculative Decoding: This PR introduces support for universal speculative decoding on Ascend NPUs, enabling more flexible and efficient inference.
  • Heterogeneous Vocabulary Support (TLI): The implementation allows speculative decoding with draft and target models that use different tokenizers and vocabulary sizes, leveraging the Token-Level Intersection (TLI) algorithm for provably lossless rejection sampling.
  • New Proposer Class: A new AscendUniversalDraftProposer class has been added, which extends AscendDraftModelProposer to handle VocabMapping and bypass previous vocabulary size mismatch checks.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new speculative decoding method, universal_draft, for Ascend. This method, implemented in AscendUniversalDraftProposer, supports heterogeneous vocabularies by using VocabMapping to translate token IDs between draft and target models and constrain logits to the vocabulary intersection. A review comment highlights a potential maintenance issue in the AscendUniversalDraftProposer's __init__ method, suggesting a more robust approach by calling super().__init__ and overriding the _raise_if_vocab_size_mismatch method.

Comment on lines +34 to +51
    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        # Intentionally skip AscendDraftModelProposer.__init__ to avoid
        # the vocab-size-mismatch check. We still need the base class
        # initialisation and the TP-mismatch check.
        SpecDecodeBaseProposer.__init__(
            self,
            vllm_config=vllm_config,
            device=device,
            pass_hidden_states_to_model=False,
            runner=runner,
        )
        self._raise_if_draft_tp_mismatch()
        self.vocab_mapping: VocabMapping | None = None
Contributor


high

The current __init__ implementation is brittle as it circumvents the parent's __init__ by directly calling the grandparent's. This can lead to maintenance issues if the parent AscendDraftModelProposer's __init__ method is updated in the future, as this class would not inherit the changes.

A more robust, object-oriented approach is to call super().__init__ and override the specific behavior you want to change. In this case, you can override _raise_if_vocab_size_mismatch to prevent the vocabulary size check.

Please replace the current __init__ with the one in the suggestion, and add the following method to the class for clarity and improved maintainability:

    def _raise_if_vocab_size_mismatch(self) -> None:
        """
        Override parent method to allow different vocabulary sizes between
        draft and target models, which is the purpose of universal
        speculative decoding.
        """
        pass
Suggested change

Before:

    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        # Intentionally skip AscendDraftModelProposer.__init__ to avoid
        # the vocab-size-mismatch check. We still need the base class
        # initialisation and the TP-mismatch check.
        SpecDecodeBaseProposer.__init__(
            self,
            vllm_config=vllm_config,
            device=device,
            pass_hidden_states_to_model=False,
            runner=runner,
        )
        self._raise_if_draft_tp_mismatch()
        self.vocab_mapping: VocabMapping | None = None

After:

    def __init__(
        self,
        vllm_config: VllmConfig,
        device: torch.device,
        runner=None,
    ):
        super().__init__(vllm_config, device, runner)
        self.vocab_mapping: VocabMapping | None = None

…port

- Replace direct SpecDecodeBaseProposer.__init__ call with
  super().__init__ + _raise_if_vocab_size_mismatch override (Gemini)
- Remove unused logger import (ruff F401)
- Run ruff format

Signed-off-by: wan-danfeng <wandanfeng0802@gmail.com>


Development

Successfully merging this pull request may close these issues.

[Feature] Support universal speculative decoding for heterogeneous vocabularies (TLI)

1 participant