[Spec-Decode] Add Ascend adaptation for universal speculative decoding (TLI) #7664

wan-danfeng wants to merge 2 commits into vllm-project:main
Conversation
…g (TLI)

Add AscendUniversalDraftProposer that extends AscendDraftModelProposer with heterogeneous vocabulary support via Token-Level Intersection (TLI). This allows the draft and target models to use different tokenizers (e.g., Llama-8B target + Qwen-0.5B draft).

Key changes:
- New universal_draft_proposer.py with AscendUniversalDraftProposer that skips the vocab-size-mismatch check, initialises VocabMapping after model loading, and wraps _propose to translate token IDs between draft and target vocabularies.
- Register "universal_draft" method in get_spec_decode_method().

Depends on upstream vllm PR: vllm-project/vllm#38174

Signed-off-by: wan-danfeng <wandanfeng0802@gmail.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances speculative decoding capabilities on Ascend NPUs by integrating universal speculative decoding. It addresses the limitation of requiring identical vocabularies between draft and target models, allowing for heterogeneous model pairs. By employing the Token-Level Intersection (TLI) algorithm, the system can now translate token IDs and constrain logits to the vocabulary intersection, ensuring the integrity of the target model's output distribution while improving inference speed.

Highlights
Code Review
This pull request introduces a new speculative decoding method, universal_draft, for Ascend. This method, implemented in AscendUniversalDraftProposer, supports heterogeneous vocabularies by using VocabMapping to translate token IDs between draft and target models and constrain logits to the vocabulary intersection. A review comment highlights a potential maintenance issue in the AscendUniversalDraftProposer's __init__ method, suggesting a more robust approach by calling super().__init__ and overriding the _raise_if_vocab_size_mismatch method.
```python
def __init__(
    self,
    vllm_config: VllmConfig,
    device: torch.device,
    runner=None,
):
    # Intentionally skip AscendDraftModelProposer.__init__ to avoid
    # the vocab-size-mismatch check. We still need the base class
    # initialisation and the TP-mismatch check.
    SpecDecodeBaseProposer.__init__(
        self,
        vllm_config=vllm_config,
        device=device,
        pass_hidden_states_to_model=False,
        runner=runner,
    )
    self._raise_if_draft_tp_mismatch()
    self.vocab_mapping: VocabMapping | None = None
```
The current `__init__` implementation is brittle, as it circumvents the parent's `__init__` by directly calling the grandparent's. This can lead to maintenance issues if the parent `AscendDraftModelProposer`'s `__init__` method is updated in the future, as this class would not inherit the changes.

A more robust, object-oriented approach is to call `super().__init__` and override the specific behavior you want to change. In this case, you can override `_raise_if_vocab_size_mismatch` to prevent the vocabulary size check.

Please replace the current `__init__` with the one in the suggestion, and add the following method to the class for clarity and improved maintainability:
```python
def _raise_if_vocab_size_mismatch(self) -> None:
    """
    Override parent method to allow different vocabulary sizes between
    draft and target models, which is the purpose of universal
    speculative decoding.
    """
    pass
```

Suggested change to `__init__`:

```python
def __init__(
    self,
    vllm_config: VllmConfig,
    device: torch.device,
    runner=None,
):
    super().__init__(vllm_config, device, runner)
    self.vocab_mapping: VocabMapping | None = None
```
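The reviewer's pattern — keeping `super().__init__` intact and neutralising one check via an override — can be illustrated with a self-contained sketch. The class names below are stand-ins for illustration only, not the actual vllm-ascend classes:

```python
class BaseProposer:
    def __init__(self):
        self.checks_run = []


class DraftModelProposer(BaseProposer):
    def __init__(self):
        super().__init__()
        # Python's MRO dispatches these to the most-derived override,
        # so a subclass can disable a single check without skipping
        # the rest of this initialisation.
        self._raise_if_vocab_size_mismatch()
        self._raise_if_draft_tp_mismatch()

    def _raise_if_vocab_size_mismatch(self):
        self.checks_run.append("vocab")

    def _raise_if_draft_tp_mismatch(self):
        self.checks_run.append("tp")


class UniversalDraftProposer(DraftModelProposer):
    def __init__(self):
        super().__init__()  # inherits all parent initialisation
        self.vocab_mapping = None

    def _raise_if_vocab_size_mismatch(self):
        # Heterogeneous vocabularies are the point of universal
        # speculative decoding, so the check becomes a no-op.
        pass


p = UniversalDraftProposer()
# Only the TP check ran; the vocab check was neutralised by the override.
```

If `DraftModelProposer.__init__` later gains new setup steps, `UniversalDraftProposer` picks them up automatically, which is exactly what the grandparent-skipping version would miss.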
…port

- Replace direct SpecDecodeBaseProposer.__init__ call with super().__init__ + _raise_if_vocab_size_mismatch override (Gemini)
- Remove unused logger import (ruff F401)
- Run ruff format

Signed-off-by: wan-danfeng <wandanfeng0802@gmail.com>
What this PR does / why we need it?
This PR adds Ascend NPU support for the universal speculative decoding method introduced in upstream vllm-project/vllm#38174. It enables speculative decoding with heterogeneous vocabularies — the draft and target models can use different tokenizers (e.g., Llama-8B target + Qwen-0.5B draft).
The core algorithm uses Token-Level Intersection (TLI): the draft model generates tokens constrained to the vocabulary intersection, and a `VocabMapping` layer translates token IDs between the two vocabularies. Rejection sampling runs unchanged, preserving the target model's output distribution (provably lossless).

Currently, `AscendDraftModelProposer` enforces `verify_equal_vocab_size_if_draft_model()`, blocking heterogeneous model pairs. This PR adds a new `AscendUniversalDraftProposer` that lifts this restriction.

Closes #7663
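The TLI idea can be sketched in isolation. This is a hypothetical, minimal illustration (not the upstream `VocabMapping` implementation): given two tokenizer vocabularies, build the token-string intersection and the two ID-translation tables.

```python
def build_vocab_mapping(draft_vocab: dict, target_vocab: dict):
    """Build ID-translation tables over the token-string intersection.

    Returns (target_to_draft, draft_to_target) dicts. Tokens outside the
    intersection have no entry and must be masked out of the draft logits.
    """
    shared = draft_vocab.keys() & target_vocab.keys()
    target_to_draft = {target_vocab[t]: draft_vocab[t] for t in shared}
    draft_to_target = {draft_vocab[t]: target_vocab[t] for t in shared}
    return target_to_draft, draft_to_target


# Toy vocabularies standing in for two different tokenizers
draft_vocab = {"the": 0, "cat": 1, "sat": 2, "<draft_only>": 3}
target_vocab = {"the": 10, "cat": 11, "dog": 12, "sat": 13}

t2d, d2t = build_vocab_mapping(draft_vocab, target_vocab)
# "dog" (target-only) and "<draft_only>" are excluded from both tables
```

Because both tables are restricted to the shared token strings, every draft proposal translates to a valid target ID and vice versa, which is what lets rejection sampling run unchanged.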
Changes

- `vllm_ascend/spec_decode/universal_draft_proposer.py`: new `AscendUniversalDraftProposer` extending `AscendDraftModelProposer` with TLI support
- `vllm_ascend/spec_decode/__init__.py`: registers the `universal_draft` method in `get_spec_decode_method()`

Key implementation details:

- Inherits from `AscendDraftModelProposer` (which inherits from `SpecDecodeBaseProposer`)
- Initialises `VocabMapping` from upstream vLLM (`vllm.v1.spec_decode.vocab_mapping`) after model loading
- Wraps `_propose`: maps target→draft IDs on input, wraps `compute_logits` to constrain draft logits to the intersection vocabulary, maps draft→target IDs on output
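The logits-constraining step can be sketched without the vLLM machinery (hypothetical helper, plain Python for clarity): every draft-vocabulary position with no counterpart in the target vocabulary is forced to -inf before sampling, so the draft model can only propose tokens from the intersection.

```python
import math


def constrain_to_intersection(logits, draft_to_target):
    """Mask draft logits so only shared-vocabulary tokens can be sampled.

    Positions absent from draft_to_target get -inf, which gives them
    zero probability after softmax.
    """
    return [
        logit if token_id in draft_to_target else -math.inf
        for token_id, logit in enumerate(logits)
    ]


# Draft vocab of size 4; token 3 exists only in the draft tokenizer
draft_to_target = {0: 10, 1: 11, 2: 13}
masked = constrain_to_intersection([1.0, 2.0, 0.5, 3.0], draft_to_target)
```

In the real proposer this masking would be done with tensor operations on the device; the list version above just shows the rule.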
Dependencies

Depends on upstream vllm-project/vllm#38174 (`VocabMapping` and `universal_draft` config support).

Usage
```shell
python -m vllm.entrypoints.openai.api_server \
    --model Qwen2.5-7B-Instruct \
    --speculative-config '{"model": "Qwen2.5-0.5B-Instruct", "method": "universal_draft", "num_speculative_tokens": 5}'
```
Tested the TLI algorithm on Ascend 910B (64GB HBM) with an earlier vllm-ascend adaptation:
On GPU (A800), the same algorithm achieves 49-65% acceptance rate with significant throughput improvement. Full NPU benchmarks will be added once the upstream dependency is merged.
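The losslessness claim rests on standard speculative-decoding rejection sampling: a draft token x is accepted with probability min(1, p_target(x)/p_draft(x)), and on rejection a replacement is drawn from the normalised residual max(0, p_target - p_draft), which recovers the target distribution exactly. A minimal numeric sketch with toy distributions (not model outputs) shows how the expected acceptance rate falls out:

```python
def acceptance_prob(p_target: float, p_draft: float) -> float:
    """Probability of accepting a draft token under rejection sampling."""
    if p_draft == 0.0:
        return 0.0
    return min(1.0, p_target / p_draft)


# Toy distributions over a 3-token shared vocabulary
p_target = [0.6, 0.3, 0.1]
p_draft = [0.5, 0.4, 0.1]

# Expected overall acceptance rate:
#   sum_x q(x) * min(1, p(x)/q(x)) = sum_x min(p(x), q(x))
rate = sum(min(p, q) for p, q in zip(p_target, p_draft))
```

The closer the draft distribution tracks the target on the shared vocabulary, the higher this rate, which is why acceptance-rate numbers like those above are the key metric for TLI pairs.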
Related