
[WIP][Quantization][Feature] Add AWQ quantization for Ascend #7672

Open
ZhongsJie wants to merge 5 commits into vllm-project:main from ZhongsJie:feat_awq_quant

Conversation

ZhongsJie commented Mar 26, 2026

What this PR does / why we need it?

Reimplement AWQ quantization based on PR #4316 to align with the refactored code structure introduced in #5738.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: ZhongsJie <zhongsjie@gmail.com>
ZhongsJie requested a review from wangxiyuan as a code owner on March 26, 2026 06:54
gemini-code-assist (Contributor) commented

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request re-establishes AWQ quantization support for Ascend NPUs within the vLLM framework. It introduces a dedicated configuration and specialized methods for handling 4-bit weight, 16-bit activation quantization in both standard linear layers and Mixture-of-Experts architectures, ensuring compatibility with Ascend's hardware capabilities. The changes involve integrating these new components into the platform's quantization pipeline and providing robust testing.

Highlights

  • AWQ Quantization Reimplementation: Reimplemented AWQ quantization for Ascend NPUs, aligning with a refactored code structure to ensure compatibility and leverage Ascend-specific optimizations.
  • Dedicated AWQ Configuration: Introduced a new AWQConfig class specifically for Ascend, which manages AWQ quantization parameters and overrides vLLM's native AWQ registration to route linear and MoE layers through Ascend-optimized implementations.
  • Ascend-Specific AWQ Methods: Added AscendW4A16AWQLinearMethod and AscendW4A16AWQFusedMoEMethod to handle 4-bit weight, 16-bit activation AWQ quantization in both standard linear layers and Mixture-of-Experts (MoE) layers on Ascend hardware.
  • Weight Unpacking Utilities: Implemented utility functions unpack_qzero_from_int32 and unpack_weight_from_int32 to correctly process and transform AWQ weights and zero-points from their packed int32 format into a format suitable for Ascend NPUs.
  • Platform Integration: Integrated the new AWQ quantization method into the Ascend platform's list of supported quantization techniques, making it discoverable and usable within the framework.
  • Comprehensive Unit Tests: Added extensive unit tests for the new AWQ configuration, linear, and MoE quantization methods, along with the weight/zero-point unpacking utilities, to ensure correctness and functionality.
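
The int32 unpacking the highlights mention can be sketched as below. This is an illustrative stand-in for the PR's unpack_weight_from_int32 / unpack_qzero_from_int32 helpers, not their actual code; the interleaved nibble order is an assumption based on the layout commonly used by AWQ packing.

```python
# Illustrative sketch of AWQ's int32 nibble packing/unpacking; NOT the PR's
# actual helper code. AWQ_PACK_ORDER is the interleaved nibble order commonly
# used by AWQ when packing eight 4-bit values into one int32 (an assumption).
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

def pack_int4_to_int32(vals):
    """Pack eight unsigned 4-bit values into one signed int32 word."""
    packed = 0
    for shift_idx, src_idx in enumerate(AWQ_PACK_ORDER):
        packed |= (vals[src_idx] & 0xF) << (4 * shift_idx)
    # Reinterpret the 32-bit pattern as signed, as stored in checkpoints.
    return packed - (1 << 32) if packed >= (1 << 31) else packed

def unpack_int32_to_int4(packed):
    """Inverse of pack_int4_to_int32: recover the eight 4-bit values."""
    raw = packed & 0xFFFFFFFF
    out = [0] * 8
    for shift_idx, dst_idx in enumerate(AWQ_PACK_ORDER):
        out[dst_idx] = (raw >> (4 * shift_idx)) & 0xF
    return out
```

The real helpers additionally reshape whole qweight/qzeros tensors and cast them into the layout the NPU kernels expect; the roundtrip above only shows the per-word bit manipulation.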


github-actions (Contributor) commented

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write a clear commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

gemini-code-assist bot left a comment

Code Review

This pull request introduces W4A16 AWQ quantization support for Ascend NPUs. It includes a new AWQConfig, dedicated linear and FusedMoE methods for NPU-specific weight handling and operations, and comprehensive unit tests. A high-severity issue was identified in the unit tests: num_groups for AWQ quantization is incorrectly derived from out_features instead of hidden_size, which needs to be corrected for consistency with the standard AWQ implementation.

group_size = 128

# Original vLLM AWQ format weights
num_groups = out_features // group_size
gemini-code-assist (Contributor) commented — severity: high

The number of groups for AWQ quantization should be calculated based on the input dimension (hidden_size), not the output dimension (out_features). The grouping is applied to the input channels of the weights.

This is inconsistent with the _build_layer helper method and the standard AWQ implementation, where the number of groups is input_size // group_size.

Suggested change
num_groups = out_features // group_size
num_groups = hidden_size // group_size
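
A quick shape sanity check makes the reviewer's point concrete: in the standard AWQ checkpoint layout, grouping runs along the input channels, so num_groups comes from the input dimension. The sketch below is illustrative (the dimension values are hypothetical, and these are not the PR's test helpers), but the shapes match the usual AWQ layout for 4-bit weights packed eight-per-int32.

```python
# Shape sanity check for the standard AWQ checkpoint layout.
# Dimension values are hypothetical; names are not the PR's test helpers.
hidden_size = 4096       # input dimension of the linear layer
out_features = 11008     # output dimension
group_size = 128
pack_factor = 32 // 4    # eight 4-bit values per int32

# Grouping applies along the *input* channels, as the review notes,
# so the group count is derived from hidden_size, not out_features.
num_groups = hidden_size // group_size

qweight_shape = (hidden_size, out_features // pack_factor)  # packed int32
qzeros_shape = (num_groups, out_features // pack_factor)    # packed int32
scales_shape = (num_groups, out_features)                   # fp16
```

With these shapes, using out_features // group_size in the test would silently build qzeros/scales with the wrong leading dimension whenever the layer is non-square.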

ZhongsJie (Author) replied:

Modified

# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Collaborator commented:
Adding an E2E test will help us better test this scenario.

ZhongsJie (Author) replied:
Sure — I’ll add an e2e test similar to #6644.

from .methods.w4a16_awq import AscendW4A16AWQLinearMethod


def _remove_quantization_method():
Collaborator commented:
There is no need to remove the original item; register_quantization_config can override it now.

ZhongsJie (Author) replied:
Thanks for pointing this out.

self.packed_modules_mapping,
skip_with_substr=True,
):
from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
Collaborator commented:
I think there is no need for a lazy import here. :)

ZhongsJie (Author) replied:
I reviewed #6644 and think using patches to reuse as much of vLLM’s logic as possible is a better approach.

I’ll take a closer look to see if this part can be removed. If not, I’ll update the current lazy-loading implementation accordingly.


extra_weight_attrs.update({"quant_method": FusedMoeWeightScaleSupported.CHANNEL.value})
per_group_param = ["weight_scale_second", "weight_offset_second", "scale_bias"] + (
per_group_param = ["weight_scale_second", "weight_offset_second", "scale_bias", "qzeros", "scales"] + (
Collaborator commented:
All the changes in this file seem unsuitable. I understand you are having trouble with the current scheme; that is also what I want to refactor.

ZhongsJie (Author) replied:
Any suggestions or thoughts?

Signed-off-by: ZhongsJie <zhongsjie@gmail.com>