[WIP][Quantization][Feature] Add AWQ quantization for Ascend#7672
ZhongsJie wants to merge 5 commits into vllm-project:main from
Conversation
Signed-off-by: ZhongsJie <zhongsjie@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request re-establishes AWQ quantization support for Ascend NPUs within the vLLM framework. It introduces a dedicated configuration and specialized methods for handling 4-bit weight, 16-bit activation quantization in both standard linear layers and Mixture-of-Experts architectures, ensuring compatibility with Ascend's hardware capabilities. The changes involve integrating these new components into the platform's quantization pipeline and providing robust testing. Highlights
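For reviewers unfamiliar with the W4A16 layout mentioned above, here is a minimal numpy sketch of AWQ-style dequantization. It assumes the common vLLM AWQ shapes (packed qweight of shape (in_features, out_features // 8), per-group scales and zeros along the input dimension) and uses a simplified sequential nibble order rather than AWQ's interleaved pack order; the function name and shapes are illustrative, not this PR's actual kernels.

```python
import numpy as np

def awq_dequantize(qweight, scales, zeros, group_size=128):
    """Unpack 4-bit weights and dequantize to fp16 (simplified sketch).

    qweight: (in_features, out_features // 8) int32, 8 nibbles per int32
    scales:  (in_features // group_size, out_features) fp16
    zeros:   (in_features // group_size, out_features) int, already unpacked
    """
    in_features = qweight.shape[0]
    # Unpack 8 x 4-bit values from each int32 along the output dimension.
    # Real AWQ uses an interleaved nibble order; sequential order is used
    # here purely for clarity.
    shifts = np.arange(0, 32, 4)
    unpacked = ((qweight[:, :, None] >> shifts) & 0xF).reshape(in_features, -1)
    # Broadcast per-group scales/zeros over the input dimension: row i of
    # the weight uses group i // group_size.
    g = np.repeat(np.arange(in_features // group_size), group_size)
    return ((unpacked - zeros[g]) * scales[g]).astype(np.float16)
```

Note how the group index is derived from the input (row) dimension, which is also the crux of the review comment below about num_groups.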
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces W4A16 AWQ quantization support for Ascend NPUs. It includes a new AWQConfig, dedicated linear and FusedMoE methods for NPU-specific weight handling and operations, and comprehensive unit tests. A high-severity issue was identified in the unit tests: num_groups for AWQ quantization is incorrectly derived from out_features instead of hidden_size, which should be corrected for consistency with the standard AWQ implementation.
group_size = 128

# Original vLLM AWQ format weights
num_groups = out_features // group_size
The number of groups for AWQ quantization should be calculated based on the input dimension (hidden_size), not the output dimension (out_features). The grouping is applied to the input channels of the weights.
This is inconsistent with the _build_layer helper method and the standard AWQ implementation, where the number of groups is input_size // group_size.
- num_groups = out_features // group_size
+ num_groups = hidden_size // group_size
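To make the grouping direction concrete, a small sketch with hypothetical dimensions, showing that the scale tensor is indexed by input-channel groups (the values 512/1024/128 are illustrative, not taken from the PR):

```python
# Hypothetical shapes illustrating input-channel grouping in AWQ.
hidden_size, out_features, group_size = 512, 1024, 128

num_groups = hidden_size // group_size              # grouping follows input channels
scales_shape = (num_groups, out_features)           # one scale per (group, out_channel)
qweight_shape = (hidden_size, out_features // 8)    # 8 int4 nibbles packed per int32

assert num_groups == 4
```

Deriving num_groups from out_features would only coincide with this when the layer happens to be square, which is why the tests pass for some shapes and silently disagree with _build_layer for others.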
Signed-off-by: ZhongsJie <zhongsjie@gmail.com>
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Adding an E2E test would help us better cover this scenario.
from .methods.w4a16_awq import AscendW4A16AWQLinearMethod

def _remove_quantization_method():
There is no need to remove the original item; register_quantization_config can now override existing entries.
Thanks for pointing this out.
self.packed_modules_mapping,
skip_with_substr=True,
):
from vllm_ascend.ops.linear import AscendUnquantizedLinearMethod
I think there is no need to lazy import here. :)
I reviewed #6644 and think using patches to reuse as much of vLLM’s logic as possible is a better approach.
I’ll take a closer look to see if this part can be removed. If not, I’ll update the current lazy-loading implementation accordingly.
extra_weight_attrs.update({"quant_method": FusedMoeWeightScaleSupported.CHANNEL.value})
- per_group_param = ["weight_scale_second", "weight_offset_second", "scale_bias"] + (
+ per_group_param = ["weight_scale_second", "weight_offset_second", "scale_bias", "qzeros", "scales"] + (
All the changes in this file seem unsuitable. I understand you have trouble dealing with the current scheme; that is also what I want to refactor.
Signed-off-by: ZhongsJie <zhongsjie@gmail.com>
What this PR does / why we need it?
Reimplement AWQ quantization based on PR #4316 to align with the refactored code structure introduced in #5738.
Does this PR introduce any user-facing change?
How was this patch tested?