Matmul kernel preference support for Int8Tensor #3558
namgyu-youn wants to merge 22 commits into pytorch:main from

Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3558
Note: Links to docs will display an error until the docs builds have been completed. This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot label "topic: new feature" |
|
@pytorchbot label "topic: improvement" |
|
check out https://github.com/pytorch/ao/blob/main/torchao/prototype/quantized_training/int8_mm.py, maybe we can improve the existing one instead? |
Thanks, I didn't know about it before. Updated the PR entirely |
Int8Tensor
|
@vkuzo could you please take another look at this updated PR? also cc @jerryzh168 |
| Non-Tensor Attributes: | ||
| granularity: the granularity for quantization (e.g., PerRow(), PerTensor()) | ||
| act_quant_kwargs: flags for dynamic activation quantization | ||
| mm_config: Matmul kernel to use - "pytorch" (default) or "triton" |
it would be better to follow this design:
Actually, I didn't follow the Float8Tensor pattern. The differences can be summarized as:

- Float8Tensor: (1) default is None, (2) uses `mm_config` and `kernel_preference` for kernel selection
- Int8Tensor: (1) default is 'pytorch' (preserves existing behavior), (2) uses `mm_config` only
In my opinion, (1) having a default helps with simpler logic, and (2) only one config is needed for kernel selection. Could you please check again?
Personally, I don't understand why Float8Tensor needs 3 configs to route the kernel.
It seems to work like: 1) kernel_preference is defined by the user, 2) kernel_choice is determined at runtime, 3) mm_config is resolved after the runtime dispatch? Is it possible to make it simpler?
for what you are adding here, kernel_preference is the existing abstraction we have, so I'm recommending to use that to stay consistent across the codebase. mm_config shouldn't be related to this, and kernel_choice should not matter because it's not in the BC surface.
Understood, I will update to use https://github.com/pytorch/ao/blob/main/torchao/quantization/quantize_/common/kernel_preference.py then. Thanks for pointing it out.
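As a rough illustration of the kind of routing a single `KernelPreference`-style enum enables, here is a minimal sketch. The enum values mirror the preferences discussed in this thread, but the `choose_kernel` helper and its signature are hypothetical, not torchao's actual implementation:

```python
from enum import Enum


class KernelPreference(Enum):
    # Illustrative stand-in for torchao's kernel_preference abstraction
    AUTO = "auto"      # let the library pick the best available kernel
    TORCH = "torch"    # always use the PyTorch int8 matmul path
    TRITON = "triton"  # always use the Triton scaled_int8_mm path


def choose_kernel(pref: KernelPreference, on_cuda: bool, is_rowwise: bool) -> str:
    """Resolve a user-facing preference to a concrete kernel at dispatch time."""
    if pref is KernelPreference.TORCH:
        return "torch"
    if pref is KernelPreference.TRITON:
        if not is_rowwise:
            # Triton int8 mm path assumed rowwise-only in this sketch
            raise ValueError("Triton int8 mm requires rowwise quantization")
        return "triton"
    # AUTO: prefer Triton only where it is expected to work
    return "triton" if on_cuda and is_rowwise else "torch"
```

The point of the sketch is that the user expresses one preference; the concrete choice stays an internal dispatch detail rather than a second config surface.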
vkuzo left a comment:
let's be consistent with Float8Tensor for this logic
Force-pushed from 4383ae7 to b02cf0e
| tmp, w_vals_int8_t, x_scales.reshape(-1, 1).to(intermediate_dtype) | ||
| ).to(output_dtype) | ||
| y = y_dot_scaled * w_scales.flatten() | ||
| else: |
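For reference, the epilogue in the diff above (an int8 matmul followed by per-row activation scales and per-column weight scales) can be sketched in plain Python. The function name and arguments below are illustrative; the real code uses `torch._int_mm` or the Triton kernel rather than Python loops:

```python
def scaled_int8_mm_ref(x_int8, w_int8_t, x_scales, w_scales):
    """Reference for the scaled int8 matmul epilogue:
    y = (x_int8 @ w_int8_t) * x_scales[:, None] * w_scales[None, :]
    x_int8: M x K int8 values, w_int8_t: K x N int8 values,
    x_scales: per-row (M) floats, w_scales: per-column (N) floats."""
    M, K = len(x_int8), len(x_int8[0])
    N = len(w_int8_t[0])
    y = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            # integer accumulate, as an int32 accumulator would in the real kernel
            acc = sum(x_int8[i][k] * w_int8_t[k][j] for k in range(K))
            # apply activation row scale and weight column scale
            y[i][j] = acc * x_scales[i] * w_scales[j]
    return y
```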
AUTO should always be supported
torchao/quantization/quant_api.py (Outdated)
| set_inductor_config: bool = True - If True, adjusts `torchinductor` settings to recommended values | ||
| for better performance with this quantization scheme. | ||
| version (int): the version of the config, version 1 is using AffineQuantizedTensor that we plan to deprecate/split, version 2 is using Int8Tensor | ||
| kernel_preference (KernelPreference): Kernel preference for matmul operations. TORCH uses int_scaled_matmul, |
torchao/quantization/quant_api.py (Outdated)
| granularity: Granularity = PerRow() | ||
| set_inductor_config: bool = True | ||
| version: int = 1 | ||
| kernel_preference: KernelPreference = KernelPreference.TORCH |
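The config fields quoted above can be pictured as a small dataclass. This is an illustrative stand-in, not the actual `quant_api.py` definition; field names follow the diff, defaults are assumptions:

```python
from dataclasses import dataclass
from enum import Enum


class KernelPreference(Enum):
    AUTO = "auto"
    TORCH = "torch"
    TRITON = "triton"


@dataclass
class Int8WeightConfig:
    # Hypothetical sketch of the config surface under review
    set_inductor_config: bool = True           # tune torchinductor settings
    version: int = 2                            # version 2 uses Int8Tensor
    kernel_preference: KernelPreference = KernelPreference.TORCH
```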
Int8Tensor
|
It seems that the CI failure was caused by Triton not being installed. Should I (1) add Triton as a dependency, or (2) make the unit test skip if Triton is not installed? |
|
Just updated the unit test to skip if Triton is not installed. This would be safer for CI. |
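A common pattern for this kind of conditional skip is shown below; the test class and its body are placeholders, only the skip mechanism is the point:

```python
import importlib.util
import unittest


def has_triton() -> bool:
    """True if Triton is importable; checking the spec avoids importing
    the package at module scope, which would itself fail on CI."""
    return importlib.util.find_spec("triton") is not None


@unittest.skipUnless(has_triton(), "Triton not installed")
class TestTritonInt8Mm(unittest.TestCase):
    # Placeholder: real tests would exercise the Triton kernel path
    def test_placeholder(self):
        self.assertTrue(True)
```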
|
@vkuzo could you please take another look at this? |
|
Updated to conditionally import Triton |
|
@vkuzo could you please take another look at this? |
Int8Tensor
| # Verify correctness by comparing with reference | ||
| output_ref = torch.nn.functional.linear(input_tensor, weight_ref) | ||
| sqnr = compute_error(output_ref, output) | ||
| self.assertGreater(sqnr, 20, f"SQNR is too low: {sqnr} dB (expected > 20 dB)") |
we should also check numerical consistency between kernel preferences I think, can you add a new test similar to
sure, added `test_kernel_preference_numerical_equivalence` to check numerical consistency.
| kernel_choice = ( | ||
| "triton" if tmp.device.type == "cuda" and is_rowwise else "torch" | ||
| ) | ||
| elif weight_tensor.kernel_preference == KernelPreference.TRITON: |
if Triton only supports rowwise, we should add a check here and error out when the user is not using rowwise, I think
I'm not sure this op has the same numerics as the other int8 mm ops actually, we should check
Do you mean "numerics", not hardware behavior? Can we check under test_kernel_preference_numerical_equivalence?
yes, we can check numerics with test_kernel_preference_numerical_equivalence
|
@pytorchbot label "module: inference" |
|
@jerryzh168 could you please review this PR again? |
Summary:
Add kernel routing support (`kernel_preference`) for Int8Tensor: "auto", "pytorch", and "triton".

Motivation:
`torch._int_mm` (the INT8 matmul kernel internal to PyTorch) requires M (batch size) > 16, which fails CUDA graph capture in consumers like vLLM — https://gist.github.com/vkuzo/5bf389079442bb9851ef315cdcb797b4. For better vLLM integration and performance, I would like to support an alternative INT8 matmul kernel: the Triton-based `scaled_int8_mm`, implemented by @gau-nernst.

Example:
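The M > 16 constraint described in the motivation can be sketched as a tiny dispatcher. The function name and the preference strings below are illustrative, not the PR's actual API:

```python
def pick_int8_mm_kernel(m: int, prefer: str = "auto") -> str:
    """Illustrative dispatch around the constraint described above:
    torch._int_mm needs the batch dimension M > 16, so small-M calls
    (common under CUDA graph capture, e.g. in vLLM) need a fallback."""
    if prefer == "pytorch":
        return "torch._int_mm"
    if prefer == "triton" or m <= 16:
        # Triton scaled_int8_mm has no M > 16 restriction in this sketch
        return "scaled_int8_mm"
    return "torch._int_mm"
```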
Test plan: