
Add Qwen3.5 support #2025

Open
kinfey wants to merge 45 commits into microsoft:main from kinfey:main

Conversation

@kinfey
Contributor

@kinfey kinfey commented Mar 13, 2026

Add Qwen3.5 support

kinfey and others added 3 commits March 13, 2026 19:46
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 13, 2026 12:54
Contributor

Copilot AI left a comment


Pull request overview

Adds Qwen3.5 support to the Python export pipeline and the C++ runtime by introducing Qwen3.5-specific builder logic (including linear attention + auxiliary state handling) and extending config/runtime plumbing to recognize new model types and state tensors.

Changes:

  • Add Qwen35Model builder with Qwen3.5 hybrid full/linear attention support and related config normalization.
  • Extend Python builder + generated genai_config.json to include auxiliary decoder state I/O templates and additional decoder metadata.
  • Extend C++ runtime config/model typing, image processor parameters, and KV cache to support auxiliary decoder state caches.
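
The linear-attention path mentioned above replaces quadratic softmax attention with a recurrent state update, which is why the runtime needs an auxiliary state cache alongside the ordinary KV cache. A minimal NumPy sketch of the idea (the function name, shapes, and decay factor are illustrative, not the PR's actual implementation):

```python
import numpy as np

def linear_attention_step(state, q, k, v, decay=0.99):
    """One decoding step of (gated) linear attention.

    state: (d_k, d_v) running summary of all past tokens
    q, k:  (d_k,) query/key for the current token
    v:     (d_v,) value for the current token
    """
    # Fold the current token into the recurrent state instead of
    # attending over the full history -- O(1) work per step.
    state = decay * state + np.outer(k, v)
    # Read out with the query; cost is independent of sequence length.
    out = q @ state
    return out, state

# The auxiliary cache persists `state` across steps, much like a KV cache.
d_k, d_v = 4, 4
state = np.zeros((d_k, d_v))
for _ in range(3):
    q = k = v = np.ones(d_k)
    out, state = linear_attention_step(state, q, k, v)
```

Because only the fixed-size `state` tensor survives between steps, the cache footprint stays constant with sequence length, unlike the per-token growth of a full-attention KV cache.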

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/python/py/models/builders/qwen.py | Introduces Qwen35Model with Qwen3.5-specific RoPE/RMSNorm behavior, attention gating, and a linear-attention export graph. |
| src/python/py/models/builders/base.py | Adds robust config field access, a tokenizer loading fallback, special token ID resolution, RoPE interleaving handling, and auxiliary decoder state wiring into inputs/outputs and config. |
| src/python/py/models/builders/__init__.py | Exports Qwen35Model. |
| src/python/py/models/builder.py | Wires the HF architecture Qwen3_5ForConditionalGeneration to Qwen35Model and sets model_type. |
| src/models/qwen2_5_vl_image_processor.h | Stores patch/temporal patch sizes in QwenImageProcessor. |
| src/models/qwen2_5_vl_image_processor.cpp | Uses configurable patch sizes instead of hard-coded constants. |
| src/models/model_type.h | Adds qwen3_5_text (LLM) and qwen3_5 (VLM) model-type recognition. |
| src/models/model.cpp | Enables qwen3_5 pipeline/VLM processor registration. |
| src/models/kv_cache.h | Adds AuxiliaryStateSet and auxiliary state cache tracking in DefaultKeyValueCache. |
| src/models/kv_cache.cpp | Implements auxiliary decoder state cache allocation/update/beam-picking and introduces runtime restrictions (no dynamic batching, sliding window, or NvTensorRtRtx). |
| src/config.h | Adds vision patch sizes and decoder fields for rotary dim plus linear-attention auxiliary state templates and dims. |
| src/config.cpp | Parses the new vision/decoder fields and decoder input/output template strings. |
| build.py | Adds ORT include/lib path resolution for building examples from --ort_home or downloaded dependencies. |
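
The builder.py wiring described above maps a Hugging Face architecture string to a builder class. A hedged sketch of that dispatch pattern (the registry name and class skeletons are illustrative, not the PR's code):

```python
class Model:
    """Minimal stand-in for the base builder class."""
    def __init__(self, model_type):
        self.model_type = model_type

class Qwen35Model(Model):
    def __init__(self):
        super().__init__("qwen3_5")

# Hypothetical registry mirroring the architecture -> builder wiring.
ARCHITECTURE_TO_BUILDER = {
    "Qwen3_5ForConditionalGeneration": Qwen35Model,
}

def create_builder(architecture):
    try:
        return ARCHITECTURE_TO_BUILDER[architecture]()
    except KeyError:
        raise NotImplementedError(f"Unsupported architecture: {architecture}")

builder = create_builder("Qwen3_5ForConditionalGeneration")
```

A table-driven registry like this keeps adding a new model to a one-line change, which is essentially what the PR does for Qwen3.5.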


Comment on lines +55 to +73
if util.is_windows():
    library_names = ["onnxruntime.lib", "onnxruntime.dll"]
elif util.is_mac():
    library_names = ["libonnxruntime.dylib"]
elif util.is_aix():
    library_names = ["libonnxruntime.a"]
else:
    library_names = ["libonnxruntime.so"]

lib_candidates = [ort_home / "lib", ort_home]
lib_candidates.extend(sorted(ort_home.glob("runtimes/*/native")))
lib_candidates.extend(sorted(ort_home.glob("jni/*")))

lib_dir = next(
    (
        candidate
        for candidate in lib_candidates
        if candidate.is_dir() and any((candidate / library_name).exists() for library_name in library_names)
    ),
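
The excerpt above is truncated at the `next(...)` call. A self-contained sketch of the same "first candidate directory containing any expected library name" pattern, exercised against a temporary directory rather than a real ORT layout (the helper name is illustrative):

```python
import tempfile
from pathlib import Path

def find_lib_dir(ort_home, library_names):
    # Same search order as the snippet: lib/, the root, then nested layouts.
    candidates = [ort_home / "lib", ort_home]
    candidates.extend(sorted(ort_home.glob("runtimes/*/native")))
    candidates.extend(sorted(ort_home.glob("jni/*")))
    return next(
        (
            c for c in candidates
            if c.is_dir() and any((c / name).exists() for name in library_names)
        ),
        None,  # default when no candidate holds a library
    )

# Demo: fake an ort_home whose library lives under runtimes/linux-x64/native.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    native = home / "runtimes" / "linux-x64" / "native"
    native.mkdir(parents=True)
    (native / "libonnxruntime.so").touch()
    lib_dir = find_lib_dir(home, ["libonnxruntime.so"])
    found = lib_dir is not None and lib_dir.name == "native"
```

Passing a default to `next()` avoids a `StopIteration` when nothing matches, so the caller can report a clean "ORT libraries not found" error instead.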
Comment on lines +597 to +602
}

} // namespace

template <typename ScoreType>
void DefaultKeyValueCache::PickPastAuxiliaryState(DeviceSpan<int32_t> beam_indices_device, AuxiliaryStateSet& state_set, int index) {
Comment on lines 542 to +571
  std::unique_ptr<OrtValue> past_value = OrtValue::CreateTensor<ScoreType>(Allocator(), tensor_shape);

  auto past_span = WrapTensor<ScoreType>(Device(), *past_value);
  auto present_span = WrapTensor<ScoreType>(Device(), present_value);

  for (size_t j = 0; j < beam_indices.size(); j++) {
    int32_t beam_index = beam_indices[j];
    auto present = present_span.subspan(beam_index * block_size_per_beam, block_size_per_beam);
    auto past = past_span.subspan(j * block_size_per_beam, block_size_per_beam);
    past.CopyFrom(present);
  }

  pasts_[index] = std::move(past_value);
}

void DefaultKeyValueCache::PickPastState(DeviceSpan<int32_t> beam_indices, int index) {
  if (type_ == Ort::TypeToTensorType<float>) {
    PickPastState<float>(beam_indices, index);
  } else {
    PickPastState<Ort::Float16_t>(beam_indices, index);
  }
}

namespace {

int64_t GetElementsPerBeam(const AuxiliaryStateSet& state_set) {
  static std::mutex mutex;
  static std::unordered_map<const AuxiliaryStateSet*, int64_t> cache;

  const auto* key = &state_set;
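
The PickPastState copy loop above gathers each surviving beam's block of past state by index. The same reordering expressed in NumPy terms (shapes and the function name are illustrative):

```python
import numpy as np

def pick_past_state(present, beam_indices, block_size_per_beam):
    """Reorder flattened per-beam state blocks, mirroring the C++ copy loop."""
    past = np.empty(len(beam_indices) * block_size_per_beam, dtype=present.dtype)
    for j, beam_index in enumerate(beam_indices):
        # Slot j of the new past comes from whichever beam survived into slot j.
        src = present[beam_index * block_size_per_beam:(beam_index + 1) * block_size_per_beam]
        past[j * block_size_per_beam:(j + 1) * block_size_per_beam] = src
    return past

# Two beams of 3 elements each; beam 1 survives into both slots.
present = np.array([0, 0, 0, 1, 1, 1], dtype=np.float32)
past = pick_past_state(present, [1, 1], block_size_per_beam=3)
```

Note a beam index may repeat (one parent spawning several children), which is why the loop copies into a fresh buffer instead of permuting in place.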
Comment on lines +645 to +648
        return None

bos_token_id = resolve_special_token_id("bos_token_id")
if bos_token_id is None:
Comment on lines +28 to 32
inline static bool IsQwen25VL(const std::string& model_type) {
  // Qwen VL specific check for 3D position IDs (MRoPE)
  return model_type == "fara" || model_type == "qwen2_5_vl" || model_type == "qwen3_5";
}

        )
        return f"{zero_name}/output_0"

    def make_attention_input_proj(self, layer_id, attention, root_input, **kwargs):

Check notice — Code scanning / CodeQL: "Explicit returns mixed with implicit (fall through) returns." Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
            layer_id, attention.v_proj, "v_proj", root_input, kv_shape
        )

    def make_attention_output_proj(self, layer_id, attention, root_input, **kwargs):

Check notice — Code scanning / CodeQL: "Explicit returns mixed with implicit (fall through) returns." Mixing implicit and explicit returns may indicate an error, as implicit returns always return None.
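
The CodeQL notices flag functions that sometimes return a value explicitly and otherwise fall off the end, implicitly returning None. A minimal illustration of the flagged pattern and the conventional fix (function names are illustrative):

```python
def flagged(x):
    # Flagged pattern: explicit return mixed with an implicit fall-through.
    # Callers silently receive None when x <= 0, which may be unintended.
    if x > 0:
        return f"{x}/output_0"
    # implicit `return None` here

def fixed(x):
    if x > 0:
        return f"{x}/output_0"
    return None  # make the None path explicit (or raise instead)
```

When None is never a valid result, raising an exception on the fall-through path is usually safer than returning it explicitly.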
