Matmul kernel preference support for Int8Tensor #3558
namgyu-youn wants to merge 22 commits into pytorch:main from

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3558
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot label "topic: new feature" |
|
@pytorchbot label "topic: improvement" |
|
check out https://github.com/pytorch/ao/blob/main/torchao/prototype/quantized_training/int8_mm.py, maybe we can improve the existing one instead? |
Thanks, I didn't know about it before. Updated the PR entirely |
Int8Tensor
|
@vkuzo could you please take another look at this updated PR? also cc @jerryzh168 |
| Non-Tensor Attributes: | ||
| granularity: the granularity for quantization (e.g., PerRow(), PerTensor()) | ||
| act_quant_kwargs: flags for dynamic activation quantization | ||
| mm_config: Matmul kernel to use - "pytorch" (default) or "triton" |
it would be better to follow this design:
Actually, I didn't follow the Float8Tensor pattern. The differences can be summarized as:

- Float8Tensor: (1) default is None, (2) uses `mm_config` and `kernel_preference` for kernel selection
- Int8Tensor: (1) default is 'pytorch' (preserves existing behavior), (2) uses `mm_config` only
In my opinion, (1) having a default helps with simpler logic, and (2) only one config is needed for kernel selection. Could you please check again?
Personally, I don't understand why Float8Tensor needs 3 configs to route the kernel.
It seems to work like: 1) kernel_preference is defined by the user, 2) kernel_choice is determined at runtime, 3) mm_config is resolved after the runtime dispatch? Is it possible to make it simpler?
for what you are adding here, kernel_preference is the existing abstraction we have, so I'm recommending to use that to stay consistent across the codebase. mm_config shouldn't be related to this, and kernel_choice should not matter because it's not in the BC surface.
Understood, I will update to use https://github.com/pytorch/ao/blob/main/torchao/quantization/quantize_/common/kernel_preference.py then. Thanks for pointing it out.
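As a rough illustration of the kind of routing a single `KernelPreference`-style enum enables, here is a minimal sketch. The enum values mirror the preferences discussed in this thread, but the `choose_kernel` helper and its signature are hypothetical, not torchao's actual implementation:

```python
from enum import Enum


class KernelPreference(Enum):
    # Illustrative stand-in for torchao's kernel_preference abstraction
    AUTO = "auto"      # let the library pick the best available kernel
    TORCH = "torch"    # always use the PyTorch int8 matmul path
    TRITON = "triton"  # always use the Triton scaled_int8_mm path


def choose_kernel(pref: KernelPreference, on_cuda: bool, is_rowwise: bool) -> str:
    """Resolve a user-facing preference to a concrete kernel at dispatch time."""
    if pref is KernelPreference.TORCH:
        return "torch"
    if pref is KernelPreference.TRITON:
        if not is_rowwise:
            # Triton int8 mm path assumed rowwise-only in this sketch
            raise ValueError("Triton int8 mm requires rowwise quantization")
        return "triton"
    # AUTO: prefer Triton only where it is expected to work
    return "triton" if on_cuda and is_rowwise else "torch"
```

The point of the sketch is that the user expresses one preference; the concrete choice stays an internal dispatch detail rather than a second config surface.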
vkuzo left a comment:
let's be consistent with Float8Tensor for this logic
Force-pushed from 4383ae7 to b02cf0e
| tmp, w_vals_int8_t, x_scales.reshape(-1, 1).to(intermediate_dtype) | ||
| ).to(output_dtype) | ||
| y = y_dot_scaled * w_scales.flatten() | ||
| else: |
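For reference, the epilogue in the diff above (an int8 matmul followed by per-row activation scales and per-column weight scales) can be sketched in plain Python. The function name and arguments below are illustrative; the real code uses `torch._int_mm` or the Triton kernel rather than Python loops:

```python
def scaled_int8_mm_ref(x_int8, w_int8_t, x_scales, w_scales):
    """Reference for the scaled int8 matmul epilogue:
    y = (x_int8 @ w_int8_t) * x_scales[:, None] * w_scales[None, :]
    x_int8: M x K int8 values, w_int8_t: K x N int8 values,
    x_scales: per-row (M) floats, w_scales: per-column (N) floats."""
    M, K = len(x_int8), len(x_int8[0])
    N = len(w_int8_t[0])
    y = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            # integer accumulate, as an int32 accumulator would in the real kernel
            acc = sum(x_int8[i][k] * w_int8_t[k][j] for k in range(K))
            # apply activation row scale and weight column scale
            y[i][j] = acc * x_scales[i] * w_scales[j]
    return y
```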
AUTO should always be supported
torchao/quantization/quant_api.py (Outdated)
| set_inductor_config: bool = True - If True, adjusts `torchinductor` settings to recommended values | ||
| for better performance with this quantization scheme. | ||
| version (int): the version of the config, version 1 is using AffineQuantizedTensor that we plan to deprecate/split, version 2 is using Int8Tensor | ||
| kernel_preference (KernelPreference): Kernel preference for matmul operations. TORCH uses int_scaled_matmul, |
torchao/quantization/quant_api.py (Outdated)
| granularity: Granularity = PerRow() | ||
| set_inductor_config: bool = True | ||
| version: int = 1 | ||
| kernel_preference: KernelPreference = KernelPreference.TORCH |
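The config fields quoted above can be pictured as a small dataclass. This is an illustrative stand-in, not the actual `quant_api.py` definition; field names follow the diff, defaults are assumptions:

```python
from dataclasses import dataclass
from enum import Enum


class KernelPreference(Enum):
    AUTO = "auto"
    TORCH = "torch"
    TRITON = "triton"


@dataclass
class Int8WeightConfig:
    # Hypothetical sketch of the config surface under review
    set_inductor_config: bool = True           # tune torchinductor settings
    version: int = 2                            # version 2 uses Int8Tensor
    kernel_preference: KernelPreference = KernelPreference.TORCH
```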
Int8Tensor
|
It seems that the CI failure was caused by Triton not being installed. Should I (1) add Triton as a dependency, or (2) make the unit test skip if Triton is not installed? |
|
Just updated the unit test to skip if Triton is not installed. This would be safer for CI. |
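A common pattern for this kind of conditional skip is shown below; the test class and its body are placeholders, only the skip mechanism is the point:

```python
import importlib.util
import unittest


def has_triton() -> bool:
    """True if Triton is importable; checking the spec avoids importing
    the package at module scope, which would itself fail on CI."""
    return importlib.util.find_spec("triton") is not None


@unittest.skipUnless(has_triton(), "Triton not installed")
class TestTritonInt8Mm(unittest.TestCase):
    # Placeholder: real tests would exercise the Triton kernel path
    def test_placeholder(self):
        self.assertTrue(True)
```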
|
@vkuzo could you please take another look at this? |
|
Updated to conditionally import Triton |
|
@vkuzo could you please take another look at this? |
Int8Tensor
| # Verify correctness by comparing with reference | ||
| output_ref = torch.nn.functional.linear(input_tensor, weight_ref) | ||
| sqnr = compute_error(output_ref, output) | ||
| self.assertGreater(sqnr, 20, f"SQNR is too low: {sqnr} dB (expected > 20 dB)") |
we should also check numerical consistency between kernel preferences I think, can you add a new test similar to
sure, added `test_kernel_preference_numerical_equivalence` to check numerical consistency.
| kernel_choice = ( | ||
| "triton" if tmp.device.type == "cuda" and is_rowwise else "torch" | ||
| ) | ||
| elif weight_tensor.kernel_preference == KernelPreference.TRITON: |
if Triton only supports rowwise, we should add a check here and error out when the user is not using rowwise, I think
I'm not sure this op has the same numerics as the other int8 mm ops actually, we should check
Do you mean "numerics", not hardware behavior? Can we check under test_kernel_preference_numerical_equivalence?
yes, we can check numerics with test_kernel_preference_numerical_equivalence
|
@pytorchbot label "module: inference" |
|
@jerryzh168 could you please review this PR again? |
Summary:
Add kernel routing support (`kernel_preference`) for Int8Tensor: "auto", "pytorch", and "triton".

Motivation:
`torch._int_mm` (the INT8 matmul kernel internal to PyTorch) requires M (batch size) > 16, which fails CUDA graph capture in consumers like vLLM — https://gist.github.com/vkuzo/5bf389079442bb9851ef315cdcb797b4. For better vLLM integration and performance, I would like to support an alternative INT8 matmul kernel: the Triton-based `scaled_int8_mm`, implemented by @gau-nernst.

Example:
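The M > 16 constraint described in the motivation can be sketched as a tiny dispatcher. The function name and the preference strings below are illustrative, not the PR's actual API:

```python
def pick_int8_mm_kernel(m: int, prefer: str = "auto") -> str:
    """Illustrative dispatch around the constraint described above:
    torch._int_mm needs the batch dimension M > 16, so small-M calls
    (common under CUDA graph capture, e.g. in vLLM) need a fallback."""
    if prefer == "pytorch":
        return "torch._int_mm"
    if prefer == "triton" or m <= 16:
        # Triton scaled_int8_mm has no M > 16 restriction in this sketch
        return "scaled_int8_mm"
    return "torch._int_mm"
```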
Test plan: