[Model] Support Llama-4-Scout-17B-16E-Instruct on Ascend NPU#7702
liyifu-2026 wants to merge 2 commits into `vllm-project:v0.11.0-dev`
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request enables official support for the Llama-4-Scout-17B-16E-Instruct model on Ascend NPU.
Code Review
Suggested PR Title:
[Attention][Ops][Feature] Add support for Llama-4-Scout-17B-16E-Instruct on Ascend
Suggested PR Summary:
### What this PR does / why we need it?
This PR adds support for the Llama-4-Scout-17B-16E-Instruct model on the vLLM-Ascend platform. Key changes include adding deployment documentation, E2E test configurations, and kernel-level adaptations in the attention and MoE modules. Specifically, it updates `attention_v1.py` to handle Llama-4's routing and modifies expert selection by removing the `global_num_experts` argument.
Feedback: A workaround for `actual_seq_lengths_q` in the general attention path was identified as model-specific. It is recommended to make this logic conditional based on the model type to prevent potential regressions for other models.
### Does this PR introduce _any_ user-facing change?
Yes, users can now deploy and run inference for Llama-4-Scout models on Atlas A2 hardware using vLLM-Ascend.
### How was this patch tested?
The changes were verified using the GSM8K dataset with EvalScope, achieving an accuracy score of 0.94. An E2E test configuration has been added to the repository.

```python
# Workaround: Use flattened query length for Llama-4 MoE compatibility on Ascend.
# This ensures the fused attention kernel handles the batch dimensions correctly.
actual_seq_lengths_q = torch.tensor([query.shape[0]], dtype=torch.int32, device=query.device)
```
This workaround for `actual_seq_lengths_q` is specific to the Llama-4 model. Hardcoding it in a general attention implementation could cause regressions for other models that use this code path.

To make this more robust, I suggest making this change conditional on the model type. This would likely require passing the `layer` object down to `_forward_v1_style` to access the model configuration.
Here's a potential implementation approach:
- Update the `_forward_v1_style` signature to accept the `layer`:

  ```python
  def _forward_v1_style(
      self,
      query: torch.Tensor,
      attn_metadata: AscendMetadata,
      output: Optional[torch.Tensor] = None,
      layer: Optional[AttentionLayer] = None,
  ) -> torch.Tensor:
  ```

- Pass the `layer` when calling it from `forward`:

  ```python
  output = self._forward_v1_style(query, attn_metadata, output, layer)
  ```

- Implement the conditional logic inside `_forward_v1_style`:

  ```python
  is_llama4_model = layer and "llama-4" in getattr(layer.config, "model_type", "").lower()
  if is_llama4_model:
      # Workaround for Llama-4
      actual_seq_lengths_q = torch.tensor([query.shape[0]],
                                          dtype=torch.int32,
                                          device=query.device)
  else:
      actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q

  output, _ = torch_npu.npu_fused_infer_attention_score(
      # ...
      actual_seq_lengths=actual_seq_lengths_q,
      # ...
  )
  ```
This would isolate the model-specific workaround and improve the maintainability of the attention backend.
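For intuition on what the workaround changes: in the flattened (TND) layout, `actual_seq_lengths_q` normally carries per-request sequence information, while the workaround collapses it to a single entry equal to the total flattened query length (`query.shape[0]`). A minimal framework-free sketch (plain Python with hypothetical lengths; the cumulative-offset interpretation of the metadata is an assumption, not taken from this PR):

```python
# Hypothetical batch: three requests with query lengths 5, 3, and 8 tokens.
query_lens = [5, 3, 8]

def cumulative_seq_lengths(lens):
    """Per-request cumulative end offsets, as attention metadata
    typically encodes sequence boundaries in a flattened layout."""
    out, total = [], 0
    for n in lens:
        total += n
        out.append(total)
    return out

per_request = cumulative_seq_lengths(query_lens)  # [5, 8, 16]

# The Llama-4 workaround instead passes a single entry: the total
# flattened query length, i.e. query.shape[0] after batch flattening.
flattened = [sum(query_lens)]                     # [16]

print(per_request, flattened)
```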
Force-pushed from `76fdcbb` to `6de51be` (compare).
- Fix ACL 507034 and MoE signature mismatch.
- Add E2E config and tutorial.
- Verified 0.94 accuracy on GSM8K (limit=100).

Fixes vllm-project#1972

Signed-off-by: liyifu-2026 <yifu@isrc.iscas.ac.cn>
Force-pushed from `e52f1d1` to `4611360` (compare).
What this PR does / why we need it?
This PR provides official support and verification for the `meta-llama/Llama-4-Scout-17B-16E-Instruct` model on the vLLM-Ascend platform.

Key improvements and bug fixes:

- Updated `vllm_ascend/ops/moe/experts_selector.py` to remove the redundant `global_num_experts` parameter, resolving the `TypeError` caused by a signature mismatch with Llama-4's native routing function.
- Set `sparse_mode=0` and implemented a workaround for `actual_seq_lengths_q` to accommodate Llama-4's TND layout on Ascend.
- Extended `AscendAttentionMetadataBuilder` with `**kwargs` and `getattr()` safety checks for `attn_mask` and `attn_state` to prevent attribute-related crashes.
- Reshaped output to `(-1, self.hidden_size)` to ensure tensor alignment during distributed inference.
- Added an E2E test configuration: `tests/e2e/models/configs/llama4_scout_17b.yaml`.
- Added a deployment tutorial: `docs/source/tutorials/models/llama4_scout.md`.

Fixes #1972
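The `getattr()` safety checks mentioned above follow a simple defensive pattern. A minimal sketch (a hypothetical `build_metadata` helper with a stand-in metadata object, not the actual `AscendAttentionMetadataBuilder` code):

```python
from types import SimpleNamespace

def build_metadata(common_metadata, **kwargs):
    # Tolerate metadata objects that lack the newer fields: fall back
    # to None instead of raising AttributeError.
    attn_mask = getattr(common_metadata, "attn_mask", None)
    attn_state = getattr(common_metadata, "attn_state", None)
    return {"attn_mask": attn_mask, "attn_state": attn_state, **kwargs}

# A metadata object missing `attn_state` no longer crashes the builder.
meta = SimpleNamespace(attn_mask="causal")
result = build_metadata(meta, num_tokens=16)
print(result["attn_state"])  # None
```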
Does this PR introduce any user-facing change?
Yes, this PR adds support for the new `Llama-4-Scout-17B-16E-Instruct` model. No existing APIs or interfaces are modified.

How was this patch tested?
Verified on Huawei Ascend (Atlas A2) hardware:
- Inference in `bfloat16` (with `--enforce-eager`)
- Accuracy evaluated with `evalscope` on the GSM8K dataset (limit 100)

Reproduction steps:

```shell
evalscope eval --model llama4-scout --api-url http://localhost:8000/v1 --datasets gsm8k --limit 100
```
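The `evalscope` command above assumes a vLLM-Ascend OpenAI-compatible server is already listening on `localhost:8000`. A launch command along these lines would start one (the served name, dtype, and port are illustrative assumptions matching the evaluation command, not copied from the tutorial):

```shell
# Serve the model under the alias "llama4-scout" used by the evalscope command.
vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --served-model-name llama4-scout \
  --dtype bfloat16 \
  --enforce-eager \
  --port 8000
```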