
[Model] Support Llama-4-Scout-17B-16E-Instruct on Ascend NPU #7702

Open
liyifu-2026 wants to merge 2 commits into vllm-project:v0.11.0-dev from liyifu-2026:v0.11.0-dev

Conversation

@liyifu-2026

What this PR does / why we need it?

This PR provides official support and verification for the meta-llama/Llama-4-Scout-17B-16E-Instruct model on the vLLM-Ascend platform.

Key improvements and bug fixes:

  • MoE Routing Alignment: Patched vllm_ascend/ops/moe/experts_selector.py to remove the redundant global_num_experts parameter, resolving the TypeError caused by signature mismatch with Llama-4's native routing function.
  • Attention Kernel Stability:
    • Resolved ACL Error 507034 (stream synchronization failure) by setting sparse_mode=0 and implementing a workaround for actual_seq_lengths_q to accommodate Llama-4's TND layout on Ascend.
    • Enhanced AscendAttentionMetadataBuilder with **kwargs and getattr() safety checks for attn_mask and attn_state to prevent attribute-related crashes.
    • Adjusted output view dimensions to (-1, self.hidden_size) to ensure tensor alignment during distributed inference.
  • New Deliverables:
    • Added E2E test configuration: tests/e2e/models/configs/llama4_scout_17b.yaml.
    • Added deployment tutorial: docs/source/tutorials/models/llama4_scout.md.
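The defensive-access pattern described above for `AscendAttentionMetadataBuilder` can be sketched as follows. This is a minimal illustration with hypothetical class and field names, not the real vllm-ascend API; it only shows the `**kwargs` + `getattr()` idea used to avoid attribute-related crashes:

```python
# Sketch of the defensive-access pattern: accept extra keyword arguments
# and fall back to None when optional metadata fields are absent.
# Class and field names here are illustrative, not the real vllm-ascend API.

class AscendAttentionMetadataBuilderSketch:
    def __init__(self, **kwargs):
        # **kwargs lets callers pass model-specific options (e.g. for Llama-4)
        # without breaking builders that do not know about them.
        self.extra = kwargs

    def build(self, common_metadata):
        # getattr() with a default prevents AttributeError when a model
        # does not populate attn_mask / attn_state.
        attn_mask = getattr(common_metadata, "attn_mask", None)
        attn_state = getattr(common_metadata, "attn_state", None)
        return {"attn_mask": attn_mask, "attn_state": attn_state}


class _CommonMeta:  # stand-in metadata object with only attn_mask set
    attn_mask = "mask"

builder = AscendAttentionMetadataBuilderSketch(sparse_mode=0)
meta = builder.build(_CommonMeta())
```

The same pattern generalizes to any optional metadata field: unknown kwargs are retained rather than rejected, and missing attributes degrade to `None` instead of raising.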

Fixes #1972

Does this PR introduce any user-facing change?

Yes, this PR adds support for the new Llama-4-Scout-17B-16E-Instruct model. No existing APIs or interfaces are modified.

How was this patch tested?

Verified on Huawei Ascend (Atlas A2) hardware:

  • Parallelism: Tensor Parallel 4 (TP4)
  • Precision: bfloat16
  • Inference Mode: Eager Mode (--enforce-eager)
  • Benchmark: Evaluated via evalscope on the GSM8K dataset (limit 100).
  • Metric: Achieved a mean accuracy of 0.94.

Reproduction Steps:

  1. Configure environment variables (HCCL, NPU_LIB_DIR, etc.) as specified in the added tutorial.
  2. Launch the server:
    python3 -m vllm.entrypoints.openai.api_server --model /data/models/llama4-scout --tensor-parallel-size 4 --dtype bfloat16 --enforce-eager
  3. Run the evaluation:
    evalscope eval --model llama4-scout --api-url http://localhost:8000/v1 --datasets gsm8k --limit 100
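After step 2, a quick smoke test can also be run directly against the OpenAI-compatible endpoint before launching the full evaluation. The sketch below only builds the request payload; the URL and model path follow the launch command above, and the actual POST is left commented out since it requires a running server:

```python
import json
from urllib import request

def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    # Payload shape follows the OpenAI-compatible /v1/completions API
    # exposed by vLLM's api_server.
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

payload = build_completion_request("/data/models/llama4-scout",
                                   "What is 12 * 7?")
body = json.dumps(payload).encode()

# To actually query the server started above, uncomment:
# req = request.Request("http://localhost:8000/v1/completions", data=body,
#                       headers={"Content-Type": "application/json"})
# print(request.urlopen(req).read().decode())
```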

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enables official support for the Llama-4-Scout-17B-16E-Instruct Mixture-of-Experts model on the vLLM-Ascend platform. It includes critical adaptations to the MoE routing and attention kernels to ensure stable and optimal performance on Ascend NPUs, alongside new documentation and end-to-end testing to facilitate deployment and verification.

Highlights

  • MoE Routing Alignment: Resolved a TypeError in experts_selector.py by removing the global_num_experts parameter, aligning with Llama-4's native routing function.
  • Attention Kernel Stability: Addressed ACL Error 507034 by setting sparse_mode=0 and implementing a workaround for actual_seq_lengths_q to support Llama-4's TND layout on Ascend. Enhanced AscendAttentionMetadataBuilder with **kwargs and getattr() safety checks for attn_mask and attn_state, and adjusted output tensor dimensions for distributed inference.
  • New Deliverables: Introduced an E2E test configuration (llama4_scout_17b.yaml) and a deployment tutorial (llama4_scout.md) for the new model.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

Suggested PR Title:

[Attention][Ops][Feature] Add support for Llama-4-Scout-17B-16E-Instruct on Ascend

Suggested PR Summary:

### What this PR does / why we need it?
This PR adds support for the Llama-4-Scout-17B-16E-Instruct model on the vLLM-Ascend platform. Key changes include adding deployment documentation, E2E test configurations, and kernel-level adaptations in the attention and MoE modules. Specifically, it updates `attention_v1.py` to handle Llama-4's routing and modifies expert selection by removing the `global_num_experts` argument.

Feedback: A workaround for `actual_seq_lengths_q` in the general attention path was identified as model-specific. It is recommended to make this logic conditional based on the model type to prevent potential regressions for other models.

### Does this PR introduce _any_ user-facing change?
Yes, users can now deploy and run inference for Llama-4-Scout models on Atlas A2 hardware using vLLM-Ascend.

### How was this patch tested?
The changes were verified using the GSM8K dataset with EvalScope, achieving an accuracy score of 0.94. An E2E test configuration has been added to the repository.

Comment on lines +551 to +553
# Workaround: Use flattened query length for Llama-4 MoE compatibility on Ascend.
# This ensures the fused attention kernel handles the batch dimensions correctly.
actual_seq_lengths_q = torch.tensor([query.shape[0]], dtype=torch.int32, device=query.device)

Severity: high

This workaround for actual_seq_lengths_q is specific to the Llama-4 model. Hardcoding it in a general attention implementation could cause regressions for other models that use this code path.

To make this more robust, I suggest making this change conditional based on the model type. This would likely require passing the layer object down to _forward_v1_style to access the model configuration.

Here's a potential implementation approach:

  1. Update _forward_v1_style signature to accept the layer:

    def _forward_v1_style(
        self,
        query: torch.Tensor,
        attn_metadata: AscendMetadata,
        output: Optional[torch.Tensor] = None,
        layer: Optional[AttentionLayer] = None,
    ) -> torch.Tensor:
  2. Pass the layer when calling it from forward:

    output = self._forward_v1_style(query, attn_metadata, output, layer)
  3. Implement the conditional logic inside _forward_v1_style:

    model_type = getattr(getattr(layer, "config", None), "model_type", "")
    is_llama4_model = "llama4" in model_type.replace("-", "").lower()
    if is_llama4_model:
        # Workaround for Llama-4
        actual_seq_lengths_q = torch.tensor([query.shape[0]], dtype=torch.int32, device=query.device)
    else:
        actual_seq_lengths_q = attn_metadata.actual_seq_lengths_q
    
    output, _ = torch_npu.npu_fused_infer_attention_score(
        # ...
        actual_seq_lengths=actual_seq_lengths_q,
        # ...
    )

This would isolate the model-specific workaround and improve the maintainability of the attention backend.
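The gating logic in step 3 can be exercised in isolation without NPU hardware. The sketch below factors the model check into a helper; attribute names mirror the suggestion above, but the real layer and config objects in vLLM differ, so treat this as an assumption-laden illustration:

```python
from types import SimpleNamespace

def is_llama4(layer) -> bool:
    # Normalize the HF model_type (e.g. "llama4" or "llama-4") before
    # matching, and tolerate layers that carry no config at all.
    config = getattr(layer, "config", None)
    model_type = getattr(config, "model_type", "") if config else ""
    return "llama4" in model_type.replace("-", "").lower()

# Stand-in layer objects; real vLLM layers are not SimpleNamespace.
llama4_layer = SimpleNamespace(config=SimpleNamespace(model_type="llama4"))
other_layer = SimpleNamespace(config=SimpleNamespace(model_type="qwen2"))

assert is_llama4(llama4_layer)
assert not is_llama4(other_layer)
assert not is_llama4(None)
```

Keeping the check in a small pure function like this makes it unit-testable in CI, so the Llama-4-only workaround cannot silently leak into other models' attention paths.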

@liyifu-2026 liyifu-2026 force-pushed the v0.11.0-dev branch 9 times, most recently from 76fdcbb to 6de51be on March 27, 2026 06:29
- Fix ACL 507034 and MoE signature mismatch.
- Add E2E config and tutorial.
- Verified 0.94 accuracy on GSM8K (limit=100).

Fixes vllm-project#1972

Signed-off-by: liyifu-2026 <yifu@isrc.iscas.ac.cn>