[RFC] HuggingFace compatible yet flexible WeightOnlyQuantization format for IPEX and INC #1424

@ftian1

Description

This RFC is to propose a Hugging Face-compatible yet flexible Weight Only Quantization (WOQ) format in INC, and then the model quantized by INC can be loaded by IPEX for further inference optimization.

Feature, Motivation and Pitch

As we know, WOQ is getting more attention from the industry. There are already many quantized WOQ models, such as Llama-2-7B-Chat-GPTQ, whose format is becoming the de facto standard WOQ storage format. Therefore, we propose a Hugging Face-compatible yet flexible WOQ format definition. With this, we can leverage the community effort behind those WOQ models and also easily extend the format for new WOQ algorithms that may further improve the accuracy of LLMs.

Design

The WOQ quantized model is usually saved on the Hugging Face model hub with a layout like the following:
(figure: example WOQ model repository layout on the Hugging Face model hub)

The user needs a quantization_config to know which group_size, desc_act, and sym settings were used when generating such a WOQ model; however, these fields can also be derived from the WOQ checkpoint's content.

So the WOQ checkpoint format is the key factor to consider. It mainly consists of two parts:

  1. checkpoint attributes such as the packed weight, scale, zero_points, and group_idx (the de facto standard in HuggingFace WOQ models)
  2. how the packed weight is compressed, e.g. the compression dimension and zero-point dimension (hardcoded in HuggingFace WOQ models, but INC can be more flexible when generating such packed models)

NOTE: The fields marked in bold are the ones missing in the current IPEX code.

In the industry, the common practice is to save only the first part in the model checkpoint. For the second part, the default behavior is to use the output channel as the compression dimension and the input channel as the zero-point dimension. INC extends the second part to also support the input channel as the compression dimension and the output channel as the zero-point dimension; this extended layout can be converted back to the default one.
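To make the first part concrete, below is an illustrative sketch of the tensor names and shapes commonly seen in a 4-bit AutoGPTQ-style checkpoint. The concrete sizes are assumptions for a Linear layer with in_features = out_features = 4096 and group_size = 128; this is not taken from INC or IPEX code.

```python
# Illustrative only: common AutoGPTQ-style layout for one 4-bit Linear
# layer (in_features = out_features = 4096, group_size = 128).
# Eight 4-bit values are packed into each int32 word.
in_ch, out_ch, group_size, pack = 4096, 4096, 128, 32 // 4

gptq_layer = {
    "qweight": (in_ch // pack, out_ch),                # int32, packed weight
    "qzeros":  (in_ch // group_size, out_ch // pack),  # int32, packed zero points
    "scales":  (in_ch // group_size, out_ch),          # fp16, per-group scales
    "g_idx":   (in_ch,),                               # int32, group index per input channel
}

for name, shape in gptq_layer.items():
    print(name, shape)
```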

Solutions

Solution 1 (Recommended)

Enhance INC to export a converted model format that can be identified by the current IPEX implementation.

### INC export interface
def export_compressed_model(woq_model, ipex_format=True):
    ### convert the WOQ model into an IPEX-compatible one
    #
    # the converted checkpoint attributes include:
    #
    # 1. 'qweight' renamed to 'packed_weight'
    # 2. 'scales' renamed to 'scale'
    # 3. 'qzeros' renamed to 'packed_zp'
    # 4. 'g_idx' kept to support HF GPTQ models
    ...

### Usage from User View ###
import torch
from neural_compressor import export_compressed_model
compressed_model = export_compressed_model('TheBloke/Llama-2-7B-Chat-GPTQ', ipex_format=True)
torch.save(compressed_model.state_dict(), "/path/to/model.pt")

import intel_extension_for_pytorch as ipex
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
### `model` is the corresponding float model (e.g. loaded via transformers)
ipex.optimize_transformers(model.eval(), quantization_config=qconfig, low_precision_checkpoint='/path/to/model.pt')
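A minimal sketch of the state-dict key renaming such an export could perform is shown below. Both rename_woq_state_dict and the mapping table are hypothetical helpers built from the attribute names listed in this RFC, not the actual INC implementation.

```python
# Hypothetical sketch of the attribute renaming described above;
# not the actual INC implementation.
HF_TO_IPEX_KEYS = {
    "qweight": "packed_weight",
    "scales": "scale",
    "qzeros": "packed_zp",
}

def rename_woq_state_dict(state_dict):
    """Map HF GPTQ-style tensor names onto the names IPEX expects,
    keeping g_idx and any other entries untouched."""
    out = {}
    for key, value in state_dict.items():
        prefix, _, leaf = key.rpartition(".")
        leaf = HF_TO_IPEX_KEYS.get(leaf, leaf)
        out[f"{prefix}.{leaf}" if prefix else leaf] = value
    return out

sd = {"model.layers.0.self_attn.q_proj.qweight": "w",
      "model.layers.0.self_attn.q_proj.scales": "s",
      "model.layers.0.self_attn.q_proj.qzeros": "z",
      "model.layers.0.self_attn.q_proj.g_idx": "g"}
print(sorted(rename_woq_state_dict(sd)))
```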

This approach has minimal impact on IPEX's current WOQ implementation. However, to support GPTQ-like models, IPEX lacks g_idx support when group_size != -1, as well as the corresponding kernel. This is an existing feature gap in IPEX.
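To make the g_idx gap concrete, here is a hedged sketch in plain Python (hypothetical helper, not IPEX kernel code) of what g_idx encodes: with desc_act, each input channel carries its own group index, so groups are not contiguous runs of group_size channels.

```python
# Hypothetical sketch: g_idx[k] gives the quantization group of input
# channel k, so dequantization must look the group up per channel.
def dequantize_channel(qweight_col, g_idx, scales, zeros):
    """Dequantize one output channel: w[k] = (q[k] - zp[g]) * scale[g]."""
    return [(q - zeros[g_idx[k]]) * scales[g_idx[k]]
            for k, q in enumerate(qweight_col)]

# 8 input channels, group_size = 4 -> 2 groups, but act-order has
# interleaved the channels of the two groups
g_idx  = [0, 1, 0, 1, 0, 1, 0, 1]
scales = [0.5, 2.0]
zeros  = [8, 8]
qcol   = [9, 9, 10, 10, 7, 7, 8, 8]
print(dequantize_channel(qcol, g_idx, scales, zeros))
# -> [0.5, 2.0, 1.0, 4.0, -0.5, -2.0, 0.0, 0.0]
```

Without g_idx, a kernel can only assume contiguous groups, which is exactly what breaks on desc_act GPTQ checkpoints.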

In INC, it will internally convert compression_dim and zp_dim to the default format IPEX supports, i.e., compressing the weight along the input channel and storing the zero point along the output channel.
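As a hedged illustration of that conversion (plain Python, not the actual INC code), 4-bit values can be unpacked from one compression dimension and repacked along the other; the two layouts carry identical values:

```python
# Hypothetical sketch, not INC code: pack 4-bit values 8:1 into 32-bit
# words along either matrix dimension, and convert between the layouts.

def pack8(vals):
    """Pack 8 consecutive 4-bit values (0..15) into one 32-bit word."""
    word = 0
    for i, v in enumerate(vals):
        word |= (v & 0xF) << (4 * i)
    return word

def unpack8(word):
    """Inverse of pack8: recover the 8 nibbles of a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def pack_matrix(w, dim):
    """Compress a 2-D list of 4-bit ints 8:1 along the given dimension."""
    if dim == 1:  # compress along input channel (columns)
        return [[pack8(row[j:j + 8]) for j in range(0, len(row), 8)]
                for row in w]
    # compress along output channel (rows): work column by column
    cols = [[pack8(col[j:j + 8]) for j in range(0, len(col), 8)]
            for col in map(list, zip(*w))]
    return [list(r) for r in zip(*cols)]

def unpack_matrix(p, dim):
    """Inverse of pack_matrix for the same dim."""
    if dim == 1:
        return [[v for word in row for v in unpack8(word)] for row in p]
    cols = [[v for word in col for v in unpack8(word)]
            for col in map(list, zip(*p))]
    return [list(r) for r in zip(*cols)]

# 8 output channels x 8 input channels of 4-bit values
w = [[(r * 8 + c) % 16 for c in range(8)] for r in range(8)]
along_out = pack_matrix(w, dim=0)  # 1 x 8: compressed along output channel
along_in = pack_matrix(w, dim=1)   # 8 x 1: compressed along input channel

# converting one layout to the other round-trips through the raw values
assert unpack_matrix(along_out, dim=0) == w
assert pack_matrix(unpack_matrix(along_out, dim=0), dim=1) == along_in
```

Scales and zero points would need the matching transpose, but the principle is the same: the extended layout loses no information and can always be normalized to the default one.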

Solution 2

Enhance IPEX to be directly compatible with the latest and most popular WOQ model formats.

class IpexWoqLinear(nn.Module):
    @classmethod
    def from_float_and_int4_weight(
        cls, mod, qweight, scales, zero_points, bias=None, group_size=-1,
        group_idx=-1, compression_dim=0, zero_point_dim=1  ### new args to be supported by IPEX
    ):
        ...

DEFAULT_LOWP_CHECKPOINT_CONFIG = {
    "name": "default",
    "weight_key": "packed_weight",  ### need to be updated as 'qweight'
    "scale_key": "scale",           ### need to be updated as 'scales'
    "zero_point_key": "packed_zp",  ### need to be updated as 'qzeros'
    "bias_key": "bias",
    "g_idx_key": "g_idx"            ### new attribute to be supported
}

### Usage from User View ###
model.load_state_dict(torch.load(PATH))
model.eval()
qconfig = ipex.quantization.get_weight_only_quant_qconfig_mapping()
optimized_model = ipex.optimize_transformers(model, quantization_config=qconfig, low_precision_checkpoint='/path/to/woq/checkpoint')

In this solution, IPEX needs to be updated to work with the latest and most popular WOQ formats in the industry.
