
[WIP][Quantization][Feature] Add AWQ quantization for Ascend #7672

Open
ZhongsJie wants to merge 5 commits into vllm-project:main from ZhongsJie:feat_awq_quant

Conversation

ZhongsJie commented Mar 26, 2026

What this PR does / why we need it?

Reimplement AWQ quantization based on PR #4316 to align with the refactored code structure introduced in #5738.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: ZhongsJie <zhongsjie@gmail.com>
ZhongsJie requested a review from wangxiyuan as a code owner on March 26, 2026 06:54
gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request re-establishes AWQ quantization support for Ascend NPUs within the vLLM framework. It introduces a dedicated configuration and specialized methods for handling 4-bit weight, 16-bit activation quantization in both standard linear layers and Mixture-of-Experts architectures, ensuring compatibility with Ascend's hardware capabilities. The changes involve integrating these new components into the platform's quantization pipeline and providing robust testing.

Highlights

  • AWQ Quantization Reimplementation: Reimplemented AWQ quantization for Ascend NPUs, aligning with a refactored code structure to ensure compatibility and leverage Ascend-specific optimizations.
  • Dedicated AWQ Configuration: Introduced a new AWQConfig class specifically for Ascend, which manages AWQ quantization parameters and overrides vLLM's native AWQ registration to route linear and MoE layers through Ascend-optimized implementations.
  • Ascend-Specific AWQ Methods: Added AscendW4A16AWQLinearMethod and AscendW4A16AWQFusedMoEMethod to handle 4-bit weight, 16-bit activation AWQ quantization in both standard linear layers and Mixture-of-Experts (MoE) layers on Ascend hardware.
  • Weight Unpacking Utilities: Implemented utility functions unpack_qzero_from_int32 and unpack_weight_from_int32 to correctly process and transform AWQ weights and zero-points from their packed int32 format into a format suitable for Ascend NPUs.
  • Platform Integration: Integrated the new AWQ quantization method into the Ascend platform's list of supported quantization techniques, making it discoverable and usable within the framework.
  • Comprehensive Unit Tests: Added extensive unit tests for the new AWQ configuration, linear, and MoE quantization methods, along with the weight/zero-point unpacking utilities, to ensure correctness and functionality.
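
The int32 unpacking the highlights mention can be sketched as below. This is an illustrative stand-in for the PR's unpack_weight_from_int32 / unpack_qzero_from_int32 helpers, not their actual code; the interleaved nibble order is an assumption based on the layout commonly used by AWQ packing.

```python
# Illustrative sketch of AWQ's int32 nibble packing/unpacking; NOT the PR's
# actual helper code. AWQ_PACK_ORDER is the interleaved nibble order commonly
# used by AWQ when packing eight 4-bit values into one int32 (an assumption).
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack_int4_to_int32(vals):
    """Pack eight unsigned 4-bit values into one signed int32 word."""
    packed = 0
    for shift_idx, src_idx in enumerate(AWQ_PACK_ORDER):
        packed |= (vals[src_idx] & 0xF) << (4 * shift_idx)
    # Reinterpret the 32-bit pattern as signed, as stored in checkpoints.
    return packed - (1 << 32) if packed >= (1 << 31) else packed

def unpack_int32_to_int4(packed):
    """Inverse of pack_int4_to_int32: recover the eight 4-bit values."""
    raw = packed & 0xFFFFFFFF
    out = [0] * 8
    for shift_idx, dst_idx in enumerate(AWQ_PACK_ORDER):
        out[dst_idx] = (raw >> (4 * shift_idx)) & 0xF
    return out
```

The real helpers additionally reshape whole qweight/qzeros tensors and cast them into the layout the NPU kernels expect; the roundtrip above only shows the per-word bit manipulation.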


github-actions (Contributor) commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot left a comment

Code Review

This pull request introduces W4A16 AWQ quantization support for Ascend NPUs. It includes a new AWQConfig, dedicated linear and FusedMoE methods for NPU-specific weight handling and operations, and comprehensive unit tests. A high-severity issue was identified in the unit tests: num_groups for AWQ quantization is incorrectly derived from out_features instead of hidden_size, which needs to be corrected for consistency with the standard AWQ implementation.

group_size = 128

# Original vLLM AWQ format weights
num_groups = out_features // group_size
gemini-code-assist (Contributor) commented — severity: high

The number of groups for AWQ quantization should be calculated based on the input dimension (hidden_size), not the output dimension (out_features). The grouping is applied to the input channels of the weights.

This is inconsistent with the _build_layer helper method and the standard AWQ implementation, where the number of groups is input_size // group_size.

Suggested change
num_groups = out_features // group_size
num_groups = hidden_size // group_size
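
A quick shape sanity check makes the reviewer's point concrete: in the standard AWQ checkpoint layout, grouping runs along the input channels, so num_groups comes from the input dimension. The sketch below is illustrative (the dimension values are hypothetical, and these are not the PR's test helpers), but the shapes match the usual AWQ layout for 4-bit weights packed eight-per-int32.

```python
# Shape sanity check for the standard AWQ checkpoint layout.
# Dimension values are hypothetical; names are not the PR's test helpers.
hidden_size = 4096       # input dimension of the linear layer
out_features = 11008     # output dimension
group_size = 128
pack_factor = 32 // 4    # eight 4-bit values per int32

# Grouping applies along the *input* channels, as the review notes,
# so the group count is derived from hidden_size, not out_features.
num_groups = hidden_size // group_size

qweight_shape = (hidden_size, out_features // pack_factor)  # packed int32
qzeros_shape = (num_groups, out_features // pack_factor)    # packed int32
scales_shape = (num_groups, out_features)                   # fp16
```

With these shapes, using out_features // group_size in the test would silently build qzeros/scales with the wrong leading dimension whenever the layer is non-square.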

ZhongsJie (Author) replied:

Modified

# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Collaborator commented:
Adding an E2E test will help us better test this scenario.

ZhongsJie (Author) replied:
Sure — I’ll add an e2e test similar to #6644.

from .methods.w4a16_awq import AscendW4A16AWQLinearMethod


def _remove_quantization_method():
Collaborator commented:
There is no need to remove the original item; register_quantization_config can override it now.

ZhongsJie (Author) replied:
Thanks for pointing this out.

self.packed_modules_mapping,
skip_with_substr=True,
):
from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
Collaborator commented:
I think there is no need for a lazy import here. :)

ZhongsJie (Author) replied:
I reviewed #6644 and think using patches to reuse as much of vLLM’s logic as possible is a better approach.

I’ll take a closer look to see if this part can be removed. If not, I’ll update the current lazy-loading implementation accordingly.


extra_weight_attrs.update({"quant_method": FusedMoeWeightScaleSupported.CHANNEL.value})
per_group_param = ["weight_scale_second", "weight_offset_second", "scale_bias"] + (
per_group_param = ["weight_scale_second", "weight_offset_second", "scale_bias", "qzeros", "scales"] + (
Collaborator commented:
All the changes in this file seem unsuitable. I understand you are having trouble with the current scheme; that is also what I want to refactor.

ZhongsJie (Author) replied:
Any suggestions or thoughts?

Signed-off-by: ZhongsJie <zhongsjie@gmail.com>