[X86] intmm: Use u8s8 when only support avx512-vnni#4103

Merged
Xia-Weiwen merged 14 commits into pytorch:main from cyxlily:u8s8
Mar 30, 2026
Conversation

@cyxlily
Contributor

@cyxlily cyxlily commented Mar 18, 2026

Use u8s8 matmul for intmm on X86 CPU when only avx512-vnni is supported, for better performance.

cyxlily added 2 commits March 17, 2026 18:44
Convert from int8 to uint8 and compute with u8s8
when the platform only supports avx512-vnni.

Signed-off-by: Cui, Lily <lily.cui@intel.com>
Signed-off-by: Cui, Lily <lily.cui@intel.com>
@pytorch-bot

pytorch-bot bot commented Mar 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/4103

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (8 Unrelated Failures)

As of commit 78170f6 with merge base 1087d59:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 18, 2026
@Xia-Weiwen Xia-Weiwen added the module: not user facing Use this tag if you don't want this PR to show up in release notes label Mar 18, 2026
cyxlily added 2 commits March 17, 2026 23:51
Signed-off-by: Cui, Lily <lily.cui@intel.com>
For platforms with AVX512_VNNI support but without AMX,
we convert to u8s8 to use AVX512_VNNI instructions for better performance.
For other platforms,
s8s8 computation is done with AMX or the reference implementation.

Signed-off-by: Cui, Lily <lily.cui@intel.com>
Contributor

Copilot AI left a comment


Pull request overview

Adds a CPU-specific implementation for int_scaled_matmul that switches to a u8*s8 (uint8 x int8) compute path when AVX512-VNNI is available but AMX tile is not, aiming to improve performance on those CPUs.

Changes:

  • Introduce _int_scaled_matmul_cpu helper that conditionally uses a u8*s8 path with compensation for a’s zero-point shift.
  • Route the CPU code path in int_scaled_matmul through the new helper (and avoid expanding scales1 on CPU).
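The zero-point compensation behind the u8*s8 path can be illustrated with a minimal sketch. This is not the PR's actual kernel (the function name is hypothetical, and the u8*s8 accumulation is emulated in int32 rather than issued as a VNNI GEMM), but it shows the identity the compensation relies on: shifting `a` by +128 maps int8 onto uint8, and `(a + 128) @ b == a @ b + 128 * b.sum(dim=0)`.

```python
import torch

def u8s8_matmul_sketch(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Illustrative only: compute a(int8) @ b(int8) via a u8*s8 product.

    a: int8 of shape (M, K); b: int8 of shape (K, N).
    """
    # Zero-point shift: int8 [-128, 127] -> uint8 [0, 255].
    a_u8 = (a.to(torch.int16) + 128).to(torch.uint8)
    # Compensation term: 128 * column sums of b, shape (N,).
    comp = 128 * b.to(torch.int32).sum(dim=0)
    # Emulate the u8*s8 accumulation in int32 for portability; a real
    # kernel would invoke an AVX512-VNNI u8*s8 GEMM here instead.
    acc = (a_u8.to(torch.int32).unsqueeze(-1) * b.to(torch.int32).unsqueeze(0)).sum(dim=1)
    # Undo the shift to recover the s8*s8 result.
    return acc - comp
```

Subtracting `comp` exactly cancels the +128 shift, so the result matches a plain s8*s8 matmul in int32.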


@Xia-Weiwen
Collaborator

@claude review

@claude

This comment was marked as resolved.

cyxlily added 2 commits March 20, 2026 00:20
Signed-off-by: Cui, Lily <lily.cui@intel.com>
Signed-off-by: Cui, Lily <lily.cui@intel.com>
@cyxlily
Contributor Author

cyxlily commented Mar 20, 2026

@claude I have updated the code. Please review again.

@Xia-Weiwen
Collaborator

@claude The code has been updated per your (and Copilot's) comments. Please review again.

@claude

This comment was marked as resolved.

@Xia-Weiwen Xia-Weiwen changed the title Use u8s8 when only support avx512-vnni [X86] intmm: Use u8s8 when only support avx512-vnni Mar 20, 2026
cyxlily added 3 commits March 22, 2026 23:52
Signed-off-by: Cui, Lily <lily.cui@intel.com>
Don't check zero point when SYMMETRIC because zero point
may not be None, but rather a tensor with values of 0.

Signed-off-by: Cui, Lily <lily.cui@intel.com>
Move to another PR.

Signed-off-by: Cui, Lily <lily.cui@intel.com>
assert 1 == scales1.size(1)
assert scales1.is_contiguous()
scales1 = scales1.expand((M, N))
assert scales1.dim() == 2
Contributor

@jerryzh168 jerryzh168 Mar 25, 2026


Should the assert be moved as well? Previously this was after the expand.

Contributor Author


The CPU path also needs to check that scales1.dim() == 2; otherwise, calculation with a 3-D scales1 would also be incorrect. And if scales1 is 1-D, assert 1 == scales1.size(1) raises an error, so I think it's better to move the scales1.dim() == 2 check to the beginning.
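The ordering issue can be seen in a small sketch (check_scales is a hypothetical helper, not the PR's actual code): on a 1-D tensor, scales1.size(1) raises an IndexError rather than failing an assert, so the dimensionality check has to come first.

```python
import torch

def check_scales(scales1: torch.Tensor, M: int, N: int) -> torch.Tensor:
    # Check dimensionality first: on a 1-D tensor, scales1.size(1)
    # would raise IndexError before any assert could fire cleanly.
    assert scales1.dim() == 2
    assert scales1.size(1) == 1
    assert scales1.is_contiguous()
    # Broadcast the per-row scales across the N output columns.
    return scales1.expand((M, N))
```

With the dim check first, a 1-D scales1 fails with a clear AssertionError instead of an IndexError from size(1).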

Tests for the CPU-specific paths inside _int_scaled_matmul_cpu.
Because the u8s8 VNNI branch is gated on runtime CPU feature detection,
CI machines are unlikely to exercise it naturally. We monkeypatch the
two helper functions so each branch can be tested on any machine.
Contributor


This seems a bit confusing. You mean you have to monkeypatch _cpu_is_amx_tile_supported and _cpu_is_vnni_supported to run the test? I thought these have to reflect what the hardware actually supports. What are the flags on the CI machines before the monkeypatch? And did you only change the flags from True to False to test the reference path?

Collaborator


Thanks for reviewing. This is to ensure both paths are tested, since the CI hardware most likely does not have these ISAs. Both paths work on all platforms, but performance is best with the corresponding ISA support.

Contributor


So if the hardware doesn't have instruction support:
_cpu_is_amx_tile_supported = False and _cpu_is_vnni_supported = False

And if you set one of these to True, e.g. _cpu_is_amx_tile_supported = True, what happens?

Can you provide more details on these in each of the tests?

Collaborator


Setting these flags in different ways leads to the s8s8 or u8s8 path (according to the rules defined in torchao/kernel/intmm.py). Both paths work on all platforms because torch._int_mm for CPU calls oneDNN under the hood, and oneDNN takes care of everything. This is for testing only, to ensure both paths work as expected in terms of functionality.

Collaborator


Otherwise only one path will be tested in CI. The purpose is to test both paths in CI to ensure they work.
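The flag-flipping scheme described in this thread can be sketched as follows. Here cpu_features and choose_int_mm_path are hypothetical stand-ins for torchao's runtime ISA checks and the dispatch rule in torchao/kernel/intmm.py, not the actual code; the point is that patching the checks lets both branches run on any CI machine.

```python
from unittest import mock

class cpu_features:
    """Stand-in for the runtime ISA checks (hypothetical names)."""
    @staticmethod
    def is_amx_tile_supported() -> bool:
        return False
    @staticmethod
    def is_vnni_supported() -> bool:
        return False

def choose_int_mm_path() -> str:
    # u8s8 only when VNNI is available but AMX tile is not;
    # otherwise s8s8 (AMX or the reference implementation).
    if cpu_features.is_vnni_supported() and not cpu_features.is_amx_tile_supported():
        return "u8s8"
    return "s8s8"

# Force the VNNI-only branch regardless of what the host CPU reports:
with mock.patch.object(cpu_features, "is_vnni_supported", return_value=True):
    assert choose_int_mm_path() == "u8s8"

# Outside the patch, the host's (here: no-ISA) flags apply again.
assert choose_int_mm_path() == "s8s8"
```

Patching only changes which branch is selected; both branches produce correct results everywhere because the underlying torch._int_mm works on all CPUs.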

Signed-off-by: Cui, Lily <lily.cui@intel.com>
@cyxlily cyxlily requested a review from jerryzh168 March 25, 2026 05:17
@cyxlily
Contributor Author

cyxlily commented Mar 25, 2026

@jerryzh168 Could you review again?

scales1 = scales1.expand((M, N))
assert scales1.dim() == 2

if check_cpu_version(scales1.device):
Contributor


check_cpu_version seems too vague, I feel; maybe just something like is_device_type(scales1.device, "cpu")

Collaborator


check_cpu_version seems too vague, I feel; maybe just something like is_device_type(scales1.device, "cpu")

Thanks for the suggestion. However, this utility is not added by this PR. It is defined here:

def check_cpu_version(device, version="2.6.0"):

How about fixing it in another PR?

Contributor


yeah it should be fixed in a separate PR
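For reference, the suggested helper could look like this minimal sketch (is_device_type is hypothetical here, the name proposed in review rather than an existing torchao utility):

```python
import torch

def is_device_type(device, device_type: str) -> bool:
    # Accepts a torch.device or a device string such as "cuda:0" and
    # compares only the device type, ignoring any index.
    return torch.device(device).type == device_type
```

Unlike check_cpu_version, the name states exactly what is checked, and the version comparison could live in a separately named helper.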

"""
M, K = a.shape
K, N = b.shape
assert scales1.dim() == 2
Contributor


Are you sure about checking this before the expand? Does the original op work with a 1-D scale?

Collaborator

@Xia-Weiwen Xia-Weiwen Mar 27, 2026


Are you sure about checking this before the expand? Does the original op work with a 1-D scale?

It does not work with a 1-D scale. The assert 1 == scales1.size(1) below was not added by us; it assumes scales1 is 2-D, but previously that was only checked after the expand. So we think it is better to check it at the beginning.

cyxlily added 2 commits March 26, 2026 19:05
Signed-off-by: Cui, Lily <lily.cui@intel.com>
Signed-off-by: Cui, Lily <lily.cui@intel.com>
@Xia-Weiwen
Collaborator

CI failures are unrelated.

@Xia-Weiwen Xia-Weiwen merged commit d5814ae into pytorch:main Mar 30, 2026
11 of 19 checks passed
@cyxlily cyxlily deleted the u8s8 branch March 30, 2026 06:14