[RFC] HuggingFace compatible yet flexible WeightOnlyQuantization format for IPEX and INC #1424

@ftian1

Description

This RFC is to propose a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, and then the model quantized by INC can be loaded by IPEX for further inference optimization.

Feature, Motivation and Pitch

As we know, WOQ is getting more attention from the industry. There are already many quantized WOQ models, such as Llama-2-7B-Chat-GPTQ, whose format is becoming the de facto standard WOQ storage format. Therefore, we propose a Hugging Face-compatible yet flexible WOQ format definition. With this, we can leverage the community effort behind those WOQ models and also easily extend the format for new WOQ algorithms that may further improve the accuracy of LLMs.

Design

The WOQ quantized model is usually saved on the Hugging Face model hub with a layout like the following:
(figure: example WOQ model repository layout on the Hugging Face model hub)

The user needs a quantization_config to know which group_size, desc_act, and sym settings were used when generating such a WOQ model; however, these fields can also be derived from the WOQ checkpoint's content.

So the WOQ checkpoint format is the key factor to consider. It mainly consists of two parts:

  1. checkpoint attributes such as the packed weight, scale, zero_points, and group_idx (the de facto standard in HuggingFace WOQ models)
  2. how the packed weight is compressed, e.g. the compression dimension and zero-point dimension (hardcoded in HuggingFace WOQ models, but INC can be more flexible when generating such packed models)

NOTE: The fields marked in bold are the ones missing in the current IPEX code.

In the industry, the common practice is to save only the first part in the model checkpoint. For the second part, the default behavior is to use the output channel as the compression dimension and the input channel as the zero-point dimension. INC extends the second part to also support the input channel as the compression dimension and the output channel as the zero-point dimension; this extended layout can be converted back to the default one.
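To make the first part concrete, below is an illustrative sketch of the tensor names and shapes commonly seen in a 4-bit AutoGPTQ-style checkpoint. The concrete sizes are assumptions for a Linear layer with in_features = out_features = 4096 and group_size = 128; this is not taken from INC or IPEX code.

```python
# Illustrative only: common AutoGPTQ-style layout for one 4-bit Linear
# layer (in_features = out_features = 4096, group_size = 128).
# Eight 4-bit values are packed into each int32 word.
in_ch, out_ch, group_size, pack = 4096, 4096, 128, 32 // 4

gptq_layer = {
    "qweight": (in_ch // pack, out_ch),                # int32, packed weight
    "qzeros":  (in_ch // group_size, out_ch // pack),  # int32, packed zero points
    "scales":  (in_ch // group_size, out_ch),          # fp16, per-group scales
    "g_idx":   (in_ch,),                               # int32, group index per input channel
}

for name, shape in gptq_layer.items():
    print(name, shape)
```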

Solutions

Solution 1 (Recommended)

Enhance INC to export a converted model format that can be identified by the current IPEX implementation.

### INC export interface
def export_compressed_model(woq_model, ipex_format=True):
    ### convert the WOQ model into an IPEX-compatible one
    #
    # the converted checkpoint attributes include:
    #
    # 1. 'qweight' renamed to 'packed_weight'
    # 2. 'scales' renamed to 'scale'
    # 3. 'qzeros' renamed to 'packed_zp'
    # 4. 'g_idx' kept to support HF GPTQ models
    ...

### Usage from User View ###
import torch
from neural_compressor import export_compressed_model
compressed_model = export_compressed_model('TheBloke/Llama-2-7B-Chat-GPTQ', ipex_format=True)
torch.save(compressed_model.state_dict(), "/path/to/model.pt")

import intel_extension_for_pytorch as ipex
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
### `model` is the corresponding float model (e.g. loaded via transformers)
ipex.optimize_transformers(model.eval(), quantization_config=qconfig, low_precision_checkpoint='/path/to/model.pt')
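A minimal sketch of the state-dict key renaming such an export could perform is shown below. Both rename_woq_state_dict and the mapping table are hypothetical helpers built from the attribute names listed in this RFC, not the actual INC implementation.

```python
# Hypothetical sketch of the attribute renaming described above;
# not the actual INC implementation.
HF_TO_IPEX_KEYS = {
    "qweight": "packed_weight",
    "scales": "scale",
    "qzeros": "packed_zp",
}

def rename_woq_state_dict(state_dict):
    """Map HF GPTQ-style tensor names onto the names IPEX expects,
    keeping g_idx and any other entries untouched."""
    out = {}
    for key, value in state_dict.items():
        prefix, _, leaf = key.rpartition(".")
        leaf = HF_TO_IPEX_KEYS.get(leaf, leaf)
        out[f"{prefix}.{leaf}" if prefix else leaf] = value
    return out

sd = {"model.layers.0.self_attn.q_proj.qweight": "w",
      "model.layers.0.self_attn.q_proj.scales": "s",
      "model.layers.0.self_attn.q_proj.qzeros": "z",
      "model.layers.0.self_attn.q_proj.g_idx": "g"}
print(sorted(rename_woq_state_dict(sd)))
```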

This approach has minimal impact on IPEX's current WOQ implementation. However, to support GPTQ-like models, IPEX lacks g_idx support when group_size != -1, as well as the corresponding kernel. This is an existing feature gap in IPEX.
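To make the g_idx gap concrete, here is a hedged sketch in plain Python (hypothetical helper, not IPEX kernel code) of what g_idx encodes: with desc_act, each input channel carries its own group index, so groups are not contiguous runs of group_size channels.

```python
# Hypothetical sketch: g_idx[k] gives the quantization group of input
# channel k, so dequantization must look the group up per channel.
def dequantize_channel(qweight_col, g_idx, scales, zeros):
    """Dequantize one output channel: w[k] = (q[k] - zp[g]) * scale[g]."""
    return [(q - zeros[g_idx[k]]) * scales[g_idx[k]]
            for k, q in enumerate(qweight_col)]

# 8 input channels, group_size = 4 -> 2 groups, but act-order has
# interleaved the channels of the two groups
g_idx  = [0, 1, 0, 1, 0, 1, 0, 1]
scales = [0.5, 2.0]
zeros  = [8, 8]
qcol   = [9, 9, 10, 10, 7, 7, 8, 8]
print(dequantize_channel(qcol, g_idx, scales, zeros))
# -> [0.5, 2.0, 1.0, 4.0, -0.5, -2.0, 0.0, 0.0]
```

Without g_idx, a kernel can only assume contiguous groups, which is exactly what breaks on desc_act GPTQ checkpoints.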

In INC, it will internally convert compression_dim and zp_dim to the default format IPEX supports, i.e., compressing the weight along the input channel and storing the zero point along the output channel.
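As a hedged illustration of that conversion (plain Python, not the actual INC code), 4-bit values can be unpacked from one compression dimension and repacked along the other; the two layouts carry identical values:

```python
# Hypothetical sketch, not INC code: pack 4-bit values 8:1 into 32-bit
# words along either matrix dimension, and convert between the layouts.

def pack8(vals):
    """Pack 8 consecutive 4-bit values (0..15) into one 32-bit word."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xF) << (4 * i)
    return word

def unpack8(word):
    """Inverse of pack8: recover the 8 nibbles of a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def pack_matrix(w, dim):
    """Compress a 2-D list of 4-bit ints 8:1 along the given dimension."""
    if dim == 1:  # compress along input channel (columns)
        return [[pack8(row[j:j + 8]) for j in range(0, len(row), 8)]
                for row in w]
    # compress along output channel (rows): work column by column
    cols = [[pack8(col[j:j + 8]) for j in range(0, len(col), 8)]
            for col in map(list, zip(*w))]
    return [list(r) for r in zip(*cols)]

def unpack_matrix(p, dim):
    """Inverse of pack_matrix for the same dim."""
    if dim == 1:
        return [[v for word in row for v in unpack8(word)] for row in p]
    cols = [[v for word in col for v in unpack8(word)]
            for col in map(list, zip(*p))]
    return [list(r) for r in zip(*cols)]

# 8 output channels x 8 input channels of 4-bit values
w = [[(r * 8 + c) % 16 for c in range(8)] for r in range(8)]
along_out = pack_matrix(w, dim=0)  # 1 x 8: compressed along output channel
along_in = pack_matrix(w, dim=1)   # 8 x 1: compressed along input channel

# converting one layout to the other round-trips through the raw values
assert unpack_matrix(along_out, dim=0) == w
assert pack_matrix(unpack_matrix(along_out, dim=0), dim=1) == along_in
```

Scales and zero points would need the matching transpose, but the principle is the same: the extended layout loses no information and can always be normalized to the default one.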

Solution 2

Enhance IPEX to be directly compatible with the latest and most popular WOQ model formats.

class IpexWoqLinear(nn.Module):
    @classmethod
    def from_float_and_int4_weight(
        cls, mod, qweight, scales, zero_points, bias=None, group_size=-1,
        group_idx=-1, compression_dim=0, zero_point_dim=1  ### new args to be supported by IPEX
    ):
        ...

DEFAULT_LOWP_CHECKPOINT_CONFIG = {
    "name": "default",
    "weight_key": "packed_weight",  ### need to be updated as 'qweight'
    "scale_key": "scale",           ### need to be updated as 'scales'
    "zero_point_key": "packed_zp",  ### need to be updated as 'qzeros'
    "bias_key": "bias",
    "g_idx_key": "g_idx"            ### new attribute to be supported
}

### Usage from User View ###
model.load_state_dict(torch.load(PATH))
model.eval()
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
optimized_model = ipex.optimize_transformers(model, quantization_config=qconfig, low_precision_checkpoint='/path/to/woq/checkpoint')

In this solution, IPEX needs to be updated to work with the latest and most popular WOQ formats in the industry.
