
[Model] Support Llama-4-Scout-17B-16E-Instruct on Ascend NPU #7702

Open
liyifu-2026 wants to merge 2 commits into vllm-project:v0.11.0-dev from liyifu-2026:v0.11.0-dev

Conversation

@liyifu-2026

What this PR does / why we need it?

This PR provides official support and verification for the meta-llama/Llama-4-Scout-17B-16E-Instruct model on the vLLM-Ascend platform.

Key improvements and bug fixes:

  • MoE Routing Alignment: Patched vllm_ascend/ops/moe/experts_selector.py to remove the redundant global_num_experts parameter, resolving the TypeError caused by signature mismatch with Llama-4's native routing function.
  • Attention Kernel Stability:
    • Resolved ACL Error 507034 (stream synchronization failure) by setting sparse_mode=0 and implementing a workaround for actual_seq_lengths_q to accommodate Llama-4's TND layout on Ascend.
    • Enhanced AscendAttentionMetadataBuilder with **kwargs and getattr() safety checks for attn_mask and attn_state to prevent attribute-related crashes.
    • Adjusted output view dimensions to (-1, self.hidden_size) to ensure tensor alignment during distributed inference.
  • New Deliverables:
    • Added E2E test configuration: tests/e2e/models/configs/llama4_scout_17b.yaml.
    • Added deployment tutorial: docs/source/tutorials/models/llama4_scout.md.
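The defensive-access pattern described above for `AscendAttentionMetadataBuilder` can be sketched as follows. This is a minimal illustration with hypothetical class and field names, not the real vllm-ascend API; it only shows the `**kwargs` + `getattr()` idea used to avoid attribute-related crashes:

```python
# Sketch of the defensive-access pattern: accept extra keyword arguments
# and fall back to None when optional metadata fields are absent.
# Class and field names here are illustrative, not the real vllm-ascend API.

class AscendAttentionMetadataBuilderSketch:
    def __init__(self, **kwargs):
        # **kwargs lets callers pass model-specific options (e.g. for Llama-4)
        # without breaking builders that do not know about them.
        self.extra = kwargs

    def build(self, common_metadata):
        # getattr() with a default prevents AttributeError when a model
        # does not populate attn_mask / attn_state.
        attn_mask = getattr(common_metadata, "attn_mask", None)
        attn_state = getattr(common_metadata, "attn_state", None)
        return {"attn_mask": attn_mask, "attn_state": attn_state}


class _CommonMeta:  # stand-in metadata object with only attn_mask set
    attn_mask = "mask"

builder = AscendAttentionMetadataBuilderSketch(sparse_mode=0)
meta = builder.build(_CommonMeta())
```

The same pattern generalizes to any optional metadata field: unknown kwargs are retained rather than rejected, and missing attributes degrade to `None` instead of raising.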

Fixes #1972

Does this PR introduce any user-facing change?

Yes, this PR adds support for the new Llama-4-Scout-17B-16E-Instruct model. No existing APIs or interfaces are modified.

How was this patch tested?

Verified on Huawei Ascend (Atlas A2) hardware:

  • Parallelism: Tensor Parallel 4 (TP4)
  • Precision: bfloat16
  • Inference Mode: Eager Mode (--enforce-eager)
  • Benchmark: Evaluated via evalscope on the GSM8K dataset (limit 100).
  • Metric: Achieved a mean accuracy of 0.94.

Reproduction Steps:

  1. Configure environment variables (HCCL, NPU_LIB_DIR, etc.) as specified in the added tutorial.
  2. Launch the server:
    python3 -m vllm.entrypoints.openai.api_server --model /data/models/llama4-scout --tensor-parallel-size 4 --dtype bfloat16 --enforce-eager
  3. Run the evaluation:
    evalscope eval --model llama4-scout --api-url http://localhost:8000/v1 --datasets gsm8k --limit 100
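After step 2, a quick smoke test can also be run directly against the OpenAI-compatible endpoint before launching the full evaluation. The sketch below only builds the request payload; the URL and model path follow the launch command above, and the actual POST is left commented out since it requires a running server:

```python
import json
from urllib import request

def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    # Payload shape follows the OpenAI-compatible /v1/completions API
    # exposed by vLLM's api_server.
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

payload = build_completion_request("/data/models/llama4-scout",
                                   "What is 12 * 7?")
body = json.dumps(payload).encode()

# To actually query the server started above, uncomment:
# req = request.Request("http://localhost:8000/v1/completions", data=body,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```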

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables official support for the Llama-4-Scout-17B-16E-Instruct Mixture-of-Experts model on the vLLM-Ascend platform. It includes critical adaptations to the MoE routing and attention kernels to ensure stable and optimal performance on Ascend NPUs, alongside new documentation and end-to-end testing to facilitate deployment and verification.

Highlights

  • MoE Routing Alignment: Resolved a TypeError in experts_selector.py by removing the global_num_experts parameter, aligning with Llama-4's native routing function.
  • Attention Kernel Stability: Addressed ACL Error 507034 by setting sparse_mode=0 and implementing a workaround for actual_seq_lengths_q to support Llama-4's TND layout on Ascend. Enhanced AscendAttentionMetadataBuilder with **kwargs and getattr() safety checks for attn_mask and attn_state, and adjusted output tensor dimensions for distributed inference.
  • New Deliverables: Introduced an E2E test configuration (llama4_scout_17b.yaml) and a deployment tutorial (llama4_scout.md) for the new model.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

Suggested PR Title:

[Attention][Ops][Feature] Add support for Llama-4-Scout-17B-16E-Instruct on Ascend

Suggested PR Summary:

### What this PR does / why we need it?
This PR adds support for the Llama-4-Scout-17B-16E-Instruct model on the vLLM-Ascend platform. Key changes include adding deployment documentation, E2E test configurations, and kernel-level adaptations in the attention and MoE modules. Specifically, it updates `attention_v1.py` to handle Llama-4's routing and modifies expert selection by removing the `global_num_experts` argument.

Feedback: A workaround for `actual_seq_lengths_q` in the general attention path was identified as model-specific. It is recommended to make this logic conditional based on the model type to prevent potential regressions for other models.

### Does this PR introduce _any_ user-facing change?
Yes, users can now deploy and run inference for Llama-4-Scout models on Atlas A2 hardware using vLLM-Ascend.

### How was this patch tested?
The changes were verified using the GSM8K dataset with EvalScope, achieving an accuracy score of 0.94. An E2E test configuration has been added to the repository.

Comment on lines +551 to +553
# Workaround: Use flattened query length for Llama-4 MoE compatibility on Ascend.
# This ensures the fused attention kernel handles the batch dimensions correctly.
actual_seq_lengths_q = torch.tensor([query.shape[0]], dtype=torch.int32, device=query.device)

Severity: high

This workaround for actual_seq_lengths_q is specific to the Llama-4 model. Hardcoding it in a general attention implementation could cause regressions for other models that use this code path.

To make this more robust, I suggest making this change conditional based on the model type. This would likely require passing the layer object down to _forward_v1_style to access the model configuration.

Here's a potential implementation approach:

  1. Update _forward_v1_style signature to accept the layer:

    def _forward_v1_style(
        self,
        query: torch.Tensor,
        attn_metadata: AscendMetadata,
        output: Optional[torch.Tensor] = None,
        layer: Optional[AttentionLayer] = None,
    ) -> torch.Tensor:
  2. Pass the layer when calling it from forward:

    output = self._forward_v1_style(query, attn_metadata, output, layer)
  3. Implement the conditional logic inside _forward_v1_style:

    model_type = getattr(getattr(layer, "config", None), "model_type", "")
    is_llama4_model = "llama4" in model_type.replace("-", "").lower()
    if is_llama4_model:
        # Workaround for Llama-4
        actual_seq_lengths_q = torch.tensor([query.shape[0]], dtype=torch.int32, device=query.device)
    else:
        actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q
    
    output, _ = torch_npu.npu_fused_infer_attention_score(
        # ...
        actual_seq_lengths=actual_seq_lengths_q,
        # ...
    )

This would isolate the model-specific workaround and improve the maintainability of the attention backend.
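The gating logic in step 3 can be exercised in isolation without NPU hardware. The sketch below factors the model check into a helper; attribute names mirror the suggestion above, but the real layer and config objects in vLLM differ, so treat this as an assumption-laden illustration:

```python
from types import SimpleNamespace

def is_llama4(layer) -> bool:
    # Normalize the HF model_type (e.g. "llama4" or "llama-4") before
    # matching, and tolerate layers that carry no config at all.
    config = getattr(layer, "config", None)
    model_type = getattr(config, "model_type", "") if config else ""
    return "llama4" in model_type.replace("-", "").lower()

# Stand-in layer objects; real vLLM layers are not SimpleNamespace.
llama4_layer = SimpleNamespace(config=SimpleNamespace(model_type="llama4"))
other_layer = SimpleNamespace(config=SimpleNamespace(model_type="qwen2"))

assert is_llama4(llama4_layer)
assert not is_llama4(other_layer)
assert not is_llama4(None)
```

Keeping the check in a small pure function like this makes it unit-testable in CI, so the Llama-4-only workaround cannot silently leak into other models' attention paths.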

@liyifu-2026 liyifu-2026 force-pushed the v0.11.0-dev branch 9 times, most recently from 76fdcbb to 6de51be on March 27, 2026 06:29
- Fix ACL 507034 and MoE signature mismatch.
- Add E2E config and tutorial.
- Verified 0.94 accuracy on GSM8K (limit=100).

Fixes vllm-project#1972

Signed-off-by: liyifu-2026 <yifu@isrc.iscas.ac.cn>