Conversation
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
Adds Qwen3.5 support to the Python export pipeline and the C++ runtime by introducing Qwen3.5-specific builder logic (including linear attention + auxiliary state handling) and extending config/runtime plumbing to recognize new model types and state tensors.
Changes:
- Add `Qwen35Model` builder with Qwen3.5 hybrid full/linear attention support and related config normalization.
- Extend the Python builder and the generated `genai_config.json` to include auxiliary decoder state I/O templates and additional decoder metadata.
- Extend C++ runtime config/model typing, image processor parameters, and KV cache to support auxiliary decoder state caches.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/python/py/models/builders/qwen.py | Introduces Qwen35Model with Qwen3.5-specific RoPE/RMSNorm behavior, attention gating, and linear-attention export graph. |
| src/python/py/models/builders/base.py | Adds robust config field access, tokenizer loading fallback, special token ID resolution, RoPE interleaving handling, and auxiliary decoder state wiring into inputs/outputs + config. |
| src/python/py/models/builders/__init__.py | Exports Qwen35Model. |
| src/python/py/models/builder.py | Wires HF architecture Qwen3_5ForConditionalGeneration to Qwen35Model and sets model_type. |
| src/models/qwen2_5_vl_image_processor.h | Stores patch/temporal patch sizes in QwenImageProcessor. |
| src/models/qwen2_5_vl_image_processor.cpp | Uses configurable patch sizes instead of hard-coded constants. |
| src/models/model_type.h | Adds qwen3_5_text (LLM) and qwen3_5 (VLM) model-type recognition. |
| src/models/model.cpp | Enables qwen3_5 pipeline/VLM processor registration. |
| src/models/kv_cache.h | Adds AuxiliaryStateSet and auxiliary state cache tracking in DefaultKeyValueCache. |
| src/models/kv_cache.cpp | Implements auxiliary decoder state cache allocation/update/beam-picking and introduces runtime restrictions (no dynamic batching/sliding window/NvTensorRtRtx). |
| src/config.h | Adds vision patch sizes and decoder fields for rotary dim + linear-attention auxiliary state templates and dims. |
| src/config.cpp | Parses new vision/decoder fields and decoder input/output template strings. |
| build.py | Adds ORT include/lib path resolution for building examples from --ort_home or downloaded dependencies. |
From `build.py` (excerpt truncated in the review view):

```python
if util.is_windows():
    library_names = ["onnxruntime.lib", "onnxruntime.dll"]
elif util.is_mac():
    library_names = ["libonnxruntime.dylib"]
elif util.is_aix():
    library_names = ["libonnxruntime.a"]
else:
    library_names = ["libonnxruntime.so"]

lib_candidates = [ort_home / "lib", ort_home]
lib_candidates.extend(sorted(ort_home.glob("runtimes/*/native")))
lib_candidates.extend(sorted(ort_home.glob("jni/*")))

lib_dir = next(
    (
        candidate
        for candidate in lib_candidates
        if candidate.is_dir() and any((candidate / library_name).exists() for library_name in library_names)
    ),
```
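The excerpt above probes a fixed list of candidate directories and takes the first one that actually contains one of the expected ORT library files. A minimal standalone sketch of that search pattern (the helper name and flat structure are mine, not the PR's; the PR's version lives inside build.py and uses its own utilities):

```python
from pathlib import Path
import tempfile

def find_library_dir(ort_home, library_names):
    # Hypothetical standalone version of the build.py search: probe
    # ort_home/lib, ort_home itself, then nuget/Maven-style layouts,
    # and return the first directory holding any expected library file.
    candidates = [ort_home / "lib", ort_home]
    candidates.extend(sorted(ort_home.glob("runtimes/*/native")))
    candidates.extend(sorted(ort_home.glob("jni/*")))
    return next(
        (
            candidate
            for candidate in candidates
            if candidate.is_dir()
            and any((candidate / name).exists() for name in library_names)
        ),
        None,  # default: no candidate matched
    )

# Usage: a fake ORT home where the library sits under runtimes/<rid>/native.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    native = home / "runtimes" / "linux-x64" / "native"
    native.mkdir(parents=True)
    (native / "libonnxruntime.so").touch()
    print(find_library_dir(home, ["libonnxruntime.so"]) == native)  # True
```

Using `next(generator, None)` keeps the "first match or nothing" logic in one expression, which is why the diff ends mid-call: the default argument follows the generator shown.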
From `src/models/kv_cache.cpp` (disjoint hunks as shown in the review; the last one is truncated):

```cpp
}

}  // namespace

template <typename ScoreType>
void DefaultKeyValueCache::PickPastAuxiliaryState(DeviceSpan<int32_t> beam_indices_device, AuxiliaryStateSet& state_set, int index) {
  std::unique_ptr<OrtValue> past_value = OrtValue::CreateTensor<ScoreType>(Allocator(), tensor_shape);

  auto past_span = WrapTensor<ScoreType>(Device(), *past_value);
  auto present_span = WrapTensor<ScoreType>(Device(), present_value);

  for (size_t j = 0; j < beam_indices.size(); j++) {
    int32_t beam_index = beam_indices[j];
    auto present = present_span.subspan(beam_index * block_size_per_beam, block_size_per_beam);
    auto past = past_span.subspan(j * block_size_per_beam, block_size_per_beam);
    past.CopyFrom(present);
  }

  pasts_[index] = std::move(past_value);
}

void DefaultKeyValueCache::PickPastState(DeviceSpan<int32_t> beam_indices, int index) {
  if (type_ == Ort::TypeToTensorType<float>) {
    PickPastState<float>(beam_indices, index);
  } else {
    PickPastState<Ort::Float16_t>(beam_indices, index);
  }
}

namespace {

int64_t GetElementsPerBeam(const AuxiliaryStateSet& state_set) {
  static std::mutex mutex;
  static std::unordered_map<const AuxiliaryStateSet*, int64_t> cache;

  const auto* key = &state_set;
```
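The beam-picking loop in the hunk above gathers, for each surviving beam `j`, the present-state block of the source beam it was forked from (`beam_indices[j]`) into slot `j` of a freshly allocated past buffer. A minimal Python sketch of that reordering, with flat lists standing in for device tensors (names and shapes are illustrative only):

```python
def pick_past_state(present, beam_indices, block_size_per_beam):
    # For each new beam j, copy the present block of its source beam
    # (beam_indices[j]) into position j of the past buffer. This is the
    # same gather the C++ loop performs with subspan + CopyFrom.
    past = [0] * (len(beam_indices) * block_size_per_beam)
    for j, beam_index in enumerate(beam_indices):
        src = beam_index * block_size_per_beam
        dst = j * block_size_per_beam
        past[dst:dst + block_size_per_beam] = present[src:src + block_size_per_beam]
    return past

# Two beams with 3 elements each; both surviving beams fork from beam 1.
present = [1, 1, 1, 2, 2, 2]
print(pick_past_state(present, [1, 1], 3))  # [2, 2, 2, 2, 2, 2]
```

The same mechanism applies to both the regular KV cache (`PickPastState`) and the new auxiliary state caches (`PickPastAuxiliaryState`); only the per-beam block size differs.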
From `src/python/py/models/builders/base.py` (excerpt):

```python
    return None

bos_token_id = resolve_special_token_id("bos_token_id")
if bos_token_id is None:
```
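The excerpt above is part of the special token ID resolution added in base.py: the helper returns `None` when an ID cannot be found, and the caller then applies a fallback. A hedged sketch of one plausible resolution chain; the sources probed, their order, and the sample ID are assumptions for illustration, not taken from the PR:

```python
def resolve_special_token_id(name, config, generation_config=None, tokenizer=None):
    # Probe several config-like objects in order and return the first
    # integer ID found; return None so the caller can fall back.
    # (Which sources are probed, and in what order, is assumed here.)
    for source in (config, generation_config, tokenizer):
        value = getattr(source, name, None) if source is not None else None
        if isinstance(value, int):
            return value
    return None

class Cfg:
    bos_token_id = 151643  # hypothetical ID for illustration

print(resolve_special_token_id("bos_token_id", Cfg()))  # 151643
print(resolve_special_token_id("eos_token_id", Cfg()))  # None
```

Returning `None` rather than raising keeps the builder usable for checkpoints whose configs omit some special token IDs, matching the "robust config field access" described in the file summary.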
From `src/models/model_type.h`:

```cpp
inline static bool IsQwen25VL(const std::string& model_type) {
  // Qwen VL specific check for 3D position IDs (MRoPE)
  return model_type == "fara" || model_type == "qwen2_5_vl" || model_type == "qwen3_5";
}
```
Excerpt (builder code):

```python
)
return f"{zero_name}/output_0"


def make_attention_input_proj(self, layer_id, attention, root_input, **kwargs):
```
Check notice (Code scanning / CodeQL, severity: Note): Explicit returns mixed with implicit (fall through) returns.
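The CodeQL note flags functions that return a value on some paths but fall off the end of the function (implicitly returning `None`) on others, which makes the `None` case easy to miss at call sites. A minimal illustration of the flagged pattern and the usual fix (hypothetical functions, not the PR's code):

```python
def lookup_flagged(mapping, key):
    # Flagged pattern: explicit return mixed with implicit fall-through.
    if key in mapping:
        return mapping[key]
    # execution falls off the end here -> implicitly returns None

def lookup_fixed(mapping, key):
    # Fix: make the None return explicit on every path.
    if key in mapping:
        return mapping[key]
    return None

print(lookup_flagged({}, "a"))  # None (implicit)
print(lookup_fixed({}, "a"))    # None (explicit)
```

Both behave identically at runtime; the explicit form simply documents that `None` is an intended return value rather than an oversight.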
Excerpt (builder code):

```python
    layer_id, attention.v_proj, "v_proj", root_input, kv_shape
)


def make_attention_output_proj(self, layer_id, attention, root_input, **kwargs):
```
Check notice (Code scanning / CodeQL, severity: Note): Explicit returns mixed with implicit (fall through) returns.
Add Qwen3.5 support