Enables the per_tensor lowering patterns for weight per_packing #2391
choudhary-devang wants to merge 3 commits into pytorch:main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2391
Note: Links to docs will display an error until the docs builds have been completed.
❌ 9 New Failures as of commit bd15048 with merge base ce07646. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from c698531 to 67d4a79.
Hi @jerryzh168, @fadara01, @Xia-Weiwen, can you please review this PR?
Thanks, can you add some tests in https://github.com/pytorch/ao/tree/main/test/quantization/pt2e?
Force-pushed from 67d4a79 to d863085.
Hi @jerryzh168,
Force-pushed from 2caf61d to e51e9ec.
Thanks for your PR!
Hi @fadara01, thanks for the response. To recreate the experiment: quant script / current setup.
Ahhh that's amazing! I remember doing a PoC for this exact thing back in the day and I had to tweak qlinear/qconv, hence my question. |
Hi @jerryzh168, @fadara01, can you please approve and merge this change?
@pytorchbot rebase

@pytorchbot rebase
Force-pushed from b5a6358 to ab75a9b.
Hi @jerryzh168, @fadara01, can you please approve and merge this change?
Force-pushed from ab75a9b to ad1ff8d.
Hi @jerryzh168, @fadara01, can you please approve and merge this change?
Review comment on the diff:

```python
    X86InductorQuantizer,
)

if TORCH_VERSION_AT_LEAST_2_7:
```

this is deprecated btw, please use
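The suggestion is cut off above, but the updated code later in this thread gates on `torch_version_at_least("2.8.0")`, so the deprecated `TORCH_VERSION_AT_LEAST_2_7` constant is presumably meant to be replaced by the callable form. A minimal sketch (the `torchao.utils` import path is an assumption):

```python
# Assumed import path for the version-check helper used later in this thread.
from torchao.utils import torch_version_at_least

if torch_version_at_least("2.7.0"):
    ...  # register the version-gated passes here
```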
Review comment on the diff:

```python
)

if TORCH_VERSION_AT_LEAST_2_7:
    torch._inductor.config.pre_grad_custom_pass = quant_lift_up
```

what happens when multiple backends set this one?
> what happens when multiple backends set this one?

Previously the last writer won, so we now chain passes instead of overwriting them; multiple backends can safely coexist.

Details:

Previously:

```python
torch._inductor.config.pre_grad_custom_pass = quant_lift_up
```

A single global callable meant the last assignment silently overwrote any prior pass.

Change: chain ARM's pass after any existing pass instead of overwriting it, which guarantees both passes run in a deterministic order. Added a helper function that chains rather than overwrites:

```python
def _chain_pregrad_pass(new_pass):
    prev = getattr(torch._inductor.config, "pre_grad_custom_pass", None)
    if prev is None or prev is new_pass:
        return new_pass

    def _chained(gm):
        # Run the previous pass first, then ours (conservative ordering).
        prev(gm)
        new_pass(gm)

    return _chained
```

Replacing the direct assignment with chaining:

```python
if torch_version_at_least("2.8.0"):
    torch._inductor.config.pre_grad_custom_pass = _chain_pregrad_pass(quant_lift_up)
```

Now both passes (prev -> ARM) will execute.
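To make the chaining behavior concrete, here is a minimal, self-contained demonstration; the two toy passes are hypothetical stand-ins for the x86 and ARM pre-grad passes, not code from this PR:

```python
import torch._inductor.config  # ensure the inductor config submodule is loaded

def _chain_pregrad_pass(new_pass):
    prev = getattr(torch._inductor.config, "pre_grad_custom_pass", None)
    if prev is None or prev is new_pass:
        return new_pass

    def _chained(gm):
        prev(gm)       # run the previously registered pass first
        new_pass(gm)   # then the newly registered one

    return _chained

calls = []

def x86_pass(gm):  # stand-in for the existing x86 pass
    calls.append("x86")

def arm_pass(gm):  # stand-in for the new ARM pass
    calls.append("arm")

# Each backend registers through the helper instead of assigning directly.
torch._inductor.config.pre_grad_custom_pass = _chain_pregrad_pass(x86_pass)
torch._inductor.config.pre_grad_custom_pass = _chain_pregrad_pass(arm_pass)

# Inductor invokes this on the pre-grad graph; a dummy argument suffices here.
torch._inductor.config.pre_grad_custom_pass(None)
assert calls == ["x86", "arm"]  # both passes ran, in registration order

# The `prev is new_pass` guard avoids double-wrapping an already-registered pass.
torch._inductor.config.pre_grad_custom_pass = None
torch._inductor.config.pre_grad_custom_pass = _chain_pregrad_pass(arm_pass)
assert _chain_pregrad_pass(arm_pass) is arm_pass
```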
Force-pushed from ad1ff8d to c9417fa.
Hi @jerryzh168, can you check this once?
Review comment on the diff:

```python
from torchao.quantization.pt2e.inductor_passes.arm import (
    _register_quantization_weight_pack_pass,
)
from torchao.quantization.pt2e.inductor_passes.x86 import (
```

this seems to be introducing a dependency between arm and x86, is it possible to remove?

if you are really reusing this, it might be better to refactor it into a separate file and have both x86 and arm depend on it, I think
I’ve removed the ARM→x86 import and refactored quant_lift_up into a shared file (utils.py), so both backends depend on a neutral module instead of each other.
Path: ao/torchao/quantization/pt2e/inductor_passes/utils.py
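So, after the refactor, both backend modules share a single backend-neutral import (only the utils.py path above is confirmed; this line just illustrates the dependency direction):

```python
# In both arm.py and x86.py after the refactor:
from torchao.quantization.pt2e.inductor_passes.utils import quant_lift_up
```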
Review comment on the diff:

```python
    _register_quantization_weight_pack_pass,
)
from torchao.quantization.pt2e.inductor_passes.x86 import (
    quant_lift_up,
```

I thought this is a prev_pass?
In the chaining helper, prev is the existing torch._inductor.config.pre_grad_custom_pass (if any). We now set:

```python
torch._inductor.config.pre_grad_custom_pass = _chain_pregrad_pass(quant_lift_up)
```

which composes prev (if present) with quant_lift_up (the new pass). If prev is already quant_lift_up, we skip wrapping to avoid running it twice.
Force-pushed from c9417fa to 9f01b51.
Force-pushed from 6be5e87 to 9f01b51.
Force-pushed from 9f01b51 to bd15048.
This PR is an extension of PR #2139.

Major changes:
1) Introduced lowering patterns for "per_tensor" quantized weights.
2) Modified the original API get_default_arm_inductor_quantization_config to add a user choice between "per_tensor" and "per_channel" granularity for weight quantization.

Supported shapes:
Tested and verified for different models:
Example script for reference:
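The original script is collapsed in the rendered page. As a stand-in, here is a minimal sketch of the intended PT2E flow; the ArmInductorQuantizer class name, the import paths, and the is_per_channel keyword are assumptions made by analogy with the x86 flow, not confirmed by this PR:

```python
import torch
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

# Hypothetical import path and class name, mirroring X86InductorQuantizer;
# only get_default_arm_inductor_quantization_config is named in this PR.
from torchao.quantization.pt2e.arm_inductor_quantizer import (
    ArmInductorQuantizer,
    get_default_arm_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()
example_inputs = (torch.randn(8, 64),)

quantizer = ArmInductorQuantizer()
# The PR adds a per_tensor/per_channel choice for weights; the exact
# keyword name is assumed here.
quantizer.set_global(
    get_default_arm_inductor_quantization_config(is_per_channel=False)
)

exported = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # calibration
converted = convert_pt2e(prepared)

# torch.compile lowers the quantized graph; with per_tensor weights the new
# weight-prepack patterns from this PR should match.
compiled = torch.compile(converted)
compiled(*example_inputs)
```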
Results
All times are in seconds, taken on an AWS Graviton 3E 32-core instance.
Pip list
cc: @jerryzh168, @fadara01, @Xia-Weiwen