**Llama-4-Scout-17B-16E-Instruct** is the latest generation of Meta's Mixture-of-Experts (MoE) models, featuring a **16-expert architecture**. It provides state-of-the-art reasoning and multilingual capabilities for complex inference tasks.
This document outlines the deployment and verification process on the **vLLM-Ascend** platform. To support Llama-4's unique MoE routing, kernel-level adaptations have been implemented to ensure stability and optimal performance on **Huawei Ascend Atlas A2** hardware.
Configure the following variables to ensure HCCL communication stability and proper operator binding. Replace `/path/to/...` with your actual directory if different:
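A typical setup might look like the following sketch. The exact variables depend on your CANN/driver installation, so treat the values and paths here as illustrative assumptions rather than required settings:

```shell
# Illustrative only -- adjust device IDs, timeouts, and paths for your system.

# Load the Ascend CANN toolkit environment (path is a placeholder).
source /path/to/ascend-toolkit/set_env.sh

# Pin the four NPUs used for TP4 inference.
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3

# Lengthen the HCCL connect timeout (seconds) so first-time graph
# compilation does not trip communication-init failures.
export HCCL_CONNECT_TIMEOUT=3600
```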
Llama-4-Scout-17B-16E requires 4 NPUs (TP4) for stable inference with a 1024 context length.
```bash
#!/bin/bash
# Save as start_llama4.sh
python3 -m vllm.entrypoints.openai.api_server \
    --model /data/models/llama4-scout \
    --served-model-name llama4-scout \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 1024 \
    --gpu-memory-utilization 0.90 \
    --enforce-eager \
    --trust-remote-code \
    --block-size 128
```
> **Note:**
> **Critical Kernel Patch:** This model requires `attention_v1.py` to be configured with `sparse_mode=0` and a flattened `actual_seq_lengths_q` workaround. These changes resolve **ACL Error 507034** (stream synchronization failure) caused by Llama-4's TND layout on Ascend NPUs.
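As a purely illustrative sketch (not the actual vLLM-Ascend patch), the "flattened" form of `actual_seq_lengths_q` can be thought of as cumulative end offsets over the concatenated token dimension of the TND layout. The helper name below is hypothetical:

```python
def flatten_actual_seq_lengths_q(seq_lens):
    """Hypothetical illustration: convert per-request query lengths,
    e.g. [3, 5, 2], into cumulative end offsets [3, 8, 10], which is
    the flat boundary array a TND-layout attention kernel expects."""
    offsets, total = [], 0
    for n in seq_lens:
        total += n
        offsets.append(total)
    return offsets

print(flatten_actual_seq_lengths_q([3, 5, 2]))  # [3, 8, 10]
```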
## Functional Verification
### Chat Completion API
Test the deployment using a standard OpenAI-compatible request:
```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama4-scout",
        "messages": [{"role": "user", "content": "Write a Python script for quicksort."}],
        "temperature": 0
    }'
```
## Accuracy Evaluation (GSM8K)
The reasoning capabilities of Llama-4-Scout were verified on the GSM8K benchmark using **EvalScope**.
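EvalScope handles answer matching internally; as a hedged illustration of how GSM8K-style scoring typically extracts the final numeric answer from a completion (the helper name and regexes below are assumptions, not EvalScope's code):

```python
import re

def extract_gsm8k_answer(text):
    """Pull the final numeric answer from a GSM8K-style completion.
    GSM8K reference answers end with '#### <number>'; if that marker
    is absent, fall back to the last number in the response."""
    m = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", text)
    if m:
        return m.group(1).replace(",", "")
    nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return nums[-1] if nums else None

print(extract_gsm8k_answer("so she makes 9 * 2 = 18. #### 18"))  # 18
```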