
Add mixed precision support for LoRA expand & shrink kernels#230

Open
chaojun-zhang wants to merge 1 commit into vllm-project:main from chaojun-zhang:lora_ops_opt

Conversation

Contributor

@chaojun-zhang chaojun-zhang commented Mar 27, 2026


Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Add mixed precision support for the LoRA expand & shrink kernels (used in vLLM), allowing float32 inputs with float16 weights.

Test Plan

pytest -s -v tests/test_lora_ops.py::test_kernels_mixed_precision
vLLM side: pytest -s -v tests/lora/test_layers.py

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
Copilot AI review requested due to automatic review settings March 27, 2026 07:54
@chaojun-zhang chaojun-zhang changed the title Support mixed precision for LoRA kernels Add mixed precision support for LoRA expand & shrink kernels Mar 27, 2026

Copilot AI left a comment


Pull request overview

Adds mixed-precision support to the XPU LoRA BGMV kernels so they can accept float32 inputs with fp16/bf16 LoRA weights, and introduces a new test variant to exercise this path.

Changes:

  • Extend bgmv_shrink / bgmv_expand_slice kernels to dispatch input and weight dtypes independently (allowing float32 inputs with fp16/bf16 weights).
  • Update test data generation utilities to allow separate input_dtype vs weight_dtype.
  • Add a new test_kernels_mixed_precision test matrix and mini-profiler params.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
tests/test_lora_ops.py Adds mixed-precision test coverage and threads input_dtype through kernel checks.
tests/lora/utils.py Allows generating inputs with a different dtype than the LoRA weights.
csrc/xpu/lora/lora_shrink.cpp Enables mixed dtype dispatch for shrink (independent input/weight types) and updates validation.
csrc/xpu/lora/lora_expand.cpp Enables mixed dtype dispatch for expand-slice and updates validation.


Comment on lines +528 to +531
"""
Tests LoRA kernels with mixed precision:
input=float32, weight=float16/bfloat16.
"""

Copilot AI Mar 27, 2026


This PR’s description template is still unfilled (missing Purpose and Test Result details). Please update the PR description so reviewers can understand intent and validate changes more quickly.

Comment on lines +228 to +229
// Use the minimum vec_size so both types can be vectorized
uint32_t vec_size = std::min(input_vec_size, weight_vec_size);

Copilot AI Mar 27, 2026


This file now uses std::min to compute vec_size, but it does not include <algorithm>. Please add the direct header include to avoid relying on transitive includes (which can break builds under different toolchains).

Comment on lines +391 to +393
using weight_t = std::remove_const_t<
std::remove_pointer_t<decltype(weight_ptr)>>;
VLLM_DISPATCH_FLOATING_TYPES(

Copilot AI Mar 27, 2026


The new dispatch logic uses std::remove_const_t / std::remove_pointer_t but this file does not include <type_traits>. Add the explicit include to avoid build fragility from transitive includes.

Comment on lines +282 to +283
// Use the minimum vec_size so both types can be vectorized
uint32_t vec_size = std::min(input_vec_size, weight_vec_size);

Copilot AI Mar 27, 2026


This file now uses std::min to compute vec_size, but it does not include <algorithm>. Please add the direct header include to avoid relying on transitive includes (which can break builds under different toolchains).

Comment on lines +456 to +459
auto dispatch_output = [&](auto* weight_ptr) {
using weight_t = std::remove_const_t<
std::remove_pointer_t<decltype(weight_ptr)>>;
switch (outputs.scalar_type()) {

Copilot AI Mar 27, 2026


The new dispatch logic uses std::remove_const_t / std::remove_pointer_t but this file does not include <type_traits>. Add the explicit include to avoid build fragility from transitive includes.

Comment on lines +528 to +531
"""
Tests LoRA kernels with mixed precision:
input=float32, weight=float16/bfloat16.
"""

Copilot AI Mar 27, 2026


The new mixed-precision test sets input_dtype=torch.float32, but the Torch reference path (tests/lora/torch_ops.py) casts inputs to output_tensor.dtype (fp16/bf16) before computing. That means this test may not actually validate the float32-input execution path (it may just compare against a fp16/bf16 reference). Consider updating the reference calculation (or adding a special-case in this test) so the reference keeps float32 inputs and only casts the final result to the output dtype as needed.
