[310P][Feat.]: 310P support W8A8 dynamic linear method #7725
YangShuai52 wants to merge 8 commits into vllm-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces support for W8A8 dynamic linear quantization on the Ascend 310P platform. The new scheme aims to improve the efficiency of linear layers by dynamically quantizing activations to 8-bit integers while keeping weights at 8 bits, leveraging NPU-specific operations for the computation.
Code Review
This pull request introduces the AscendW8A8DynamicLinearMethod310 class to support W8A8 dynamic quantization for linear layers on Ascend 310P hardware. The implementation includes weight allocation, per-channel parameter management, and a forward pass utilizing NPU-specific operations such as npu_dynamic_quant and npu_quant_matmul. Review feedback correctly identified that the weight_offset parameter is redundant because the current implementation only supports symmetric quantization, and it should be removed from both the allocation and post-loading processing logic.
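Hardware operators aside, the arithmetic that per-token W8A8 dynamic quantization performs can be sketched in plain NumPy. This is an illustrative sketch only: the function names below are not from the PR, and the real implementation runs `torch_npu` kernels (`npu_dynamic_quant`, `npu_quant_matmul`) on NPU tensors rather than NumPy arrays.

```python
import numpy as np

def dynamic_quant_per_token(x):
    # Symmetric per-token quantization to int8: each row gets its own
    # scale so that its max magnitude maps to 127.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero rows
    x_q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return x_q, scale

def w8a8_dynamic_linear(x, w_q, w_scale):
    # x: float activations [tokens, in_features]
    # w_q: int8 weights [out_features, in_features] (quantized offline)
    # w_scale: per-channel weight scales [out_features]
    x_q, x_scale = dynamic_quant_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # int32 accumulation
    return acc.astype(np.float32) * x_scale * w_scale     # dequantize

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)).astype(np.float32)
w = rng.standard_normal((8, 16)).astype(np.float32)
# Symmetric per-channel weight quantization, as the review notes:
# no weight_offset is needed.
w_scale = np.abs(w).max(axis=1) / 127.0
w_q = np.clip(np.round(w / w_scale[:, None]), -128, 127).astype(np.int8)
out = w8a8_dynamic_linear(x, w_q, w_scale)
# out approximates x @ w.T up to quantization error
```

Because activations are quantized on the fly per token, no activation calibration pass is needed, which is the main appeal of the dynamic scheme.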
Suggested PR Title:
[310P][Ops][Feature] Add W8A8 dynamic linear quantization for Ascend 310P

Suggested PR Summary:
### What this PR does / why we need it?
This PR adds the `AscendW8A8DynamicLinearMethod310` class to provide W8A8 dynamic quantization support for linear layers on Ascend 310P. It implements weight initialization, per-channel scaling, and the execution of quantized matrix multiplication via `torch_npu`. This is necessary to enable efficient dynamic quantization on 310P hardware.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new quantization scheme "W8A8_DYNAMIC" for linear layers on Ascend 310P.
### How was this patch tested?
CI passed with existing tests; the implementation utilizes standard NPU-specific operators.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
```python
) -> dict[str, Any]:
    return {"weight": torch.empty(output_size, input_size, dtype=torch.int8)}

def get_perchannel_param(self, output_size: int, params_dtype: torch.dtype) -> dict[str, Any]:
```
Let's keep this consistent with the other branch as well:

```python
def get_perchannel_param(
    self,
    output_size: int,
    params_dtype: torch.dtype,
) -> dict[str, Any]:
```

This reads more clearly.
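Putting the review suggestions together, the two allocation methods might look roughly as follows. This is a hypothetical sketch: NumPy stands in for torch so the shapes can be inspected off-NPU, the `get_weight` signature and the `weight_scale` key and its `(output_size, 1)` shape are assumptions, and only the `get_perchannel_param` signature mirrors the diff above. Per the review feedback, no `weight_offset` entry is allocated, since only symmetric quantization is supported.

```python
from typing import Any
import numpy as np

class AscendW8A8DynamicLinearMethod310Sketch:
    """Illustrative sketch; the real class allocates torch tensors."""

    def get_weight(self, input_size: int, output_size: int) -> dict[str, Any]:
        # int8 weight matrix, one row per output channel
        return {"weight": np.empty((output_size, input_size), dtype=np.int8)}

    def get_perchannel_param(
        self,
        output_size: int,
        params_dtype: np.dtype,
    ) -> dict[str, Any]:
        # Symmetric quantization: a per-channel scale is sufficient,
        # so no weight_offset parameter is created.
        return {"weight_scale": np.empty((output_size, 1), dtype=params_dtype)}

method = AscendW8A8DynamicLinearMethod310Sketch()
params = method.get_perchannel_param(8, np.float32)
# params holds only "weight_scale", shaped (8, 1)
```

Dropping `weight_offset` from both allocation and post-load processing keeps the parameter dict minimal and avoids loading a tensor that the symmetric kernel would never read.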
Signed-off-by: YangShuai52 <yangshuai153@huawei.com>
Force-pushed from a02438f to ca66455