Conversation
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
Adds Qwen3.5 support to the Python export pipeline and the C++ runtime by introducing Qwen3.5-specific builder logic (including linear attention + auxiliary state handling) and extending config/runtime plumbing to recognize new model types and state tensors.
Changes:
- Add `Qwen35Model` builder with Qwen3.5 hybrid full/linear attention support and related config normalization.
- Extend the Python builder and the generated `genai_config.json` to include auxiliary decoder state I/O templates and additional decoder metadata.
- Extend C++ runtime config/model typing, image processor parameters, and KV cache to support auxiliary decoder state caches.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/python/py/models/builders/qwen.py | Introduces Qwen35Model with Qwen3.5-specific RoPE/RMSNorm behavior, attention gating, and linear-attention export graph. |
| src/python/py/models/builders/base.py | Adds robust config field access, tokenizer loading fallback, special token ID resolution, RoPE interleaving handling, and auxiliary decoder state wiring into inputs/outputs + config. |
| src/python/py/models/builders/__init__.py | Exports Qwen35Model. |
| src/python/py/models/builder.py | Wires HF architecture Qwen3_5ForConditionalGeneration to Qwen35Model and sets model_type. |
| src/models/qwen2_5_vl_image_processor.h | Stores patch/temporal patch sizes in QwenImageProcessor. |
| src/models/qwen2_5_vl_image_processor.cpp | Uses configurable patch sizes instead of hard-coded constants. |
| src/models/model_type.h | Adds qwen3_5_text (LLM) and qwen3_5 (VLM) model-type recognition. |
| src/models/model.cpp | Enables qwen3_5 pipeline/VLM processor registration. |
| src/models/kv_cache.h | Adds AuxiliaryStateSet and auxiliary state cache tracking in DefaultKeyValueCache. |
| src/models/kv_cache.cpp | Implements auxiliary decoder state cache allocation/update/beam-picking and introduces runtime restrictions (no dynamic batching/sliding window/NvTensorRtRtx). |
| src/config.h | Adds vision patch sizes and decoder fields for rotary dim + linear-attention auxiliary state templates and dims. |
| src/config.cpp | Parses new vision/decoder fields and decoder input/output template strings. |
| build.py | Adds ORT include/lib path resolution for building examples from --ort_home or downloaded dependencies. |
From `build.py` (excerpt truncated in the review view):

```python
if util.is_windows():
    library_names = ["onnxruntime.lib", "onnxruntime.dll"]
elif util.is_mac():
    library_names = ["libonnxruntime.dylib"]
elif util.is_aix():
    library_names = ["libonnxruntime.a"]
else:
    library_names = ["libonnxruntime.so"]

lib_candidates = [ort_home / "lib", ort_home]
lib_candidates.extend(sorted(ort_home.glob("runtimes/*/native")))
lib_candidates.extend(sorted(ort_home.glob("jni/*")))

lib_dir = next(
    (
        candidate
        for candidate in lib_candidates
        if candidate.is_dir() and any((candidate / library_name).exists() for library_name in library_names)
    ),
```
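The excerpt above probes a fixed list of candidate directories and takes the first one that actually contains one of the expected ORT library files. A minimal standalone sketch of that search pattern (the helper name and flat structure are mine, not the PR's; the PR's version lives inside build.py and uses its own utilities):

```python
from pathlib import Path
import tempfile

def find_library_dir(ort_home, library_names):
    # Hypothetical standalone version of the build.py search: probe
    # ort_home/lib, ort_home itself, then nuget/Maven-style layouts,
    # and return the first directory holding any expected library file.
    candidates = [ort_home / "lib", ort_home]
    candidates.extend(sorted(ort_home.glob("runtimes/*/native")))
    candidates.extend(sorted(ort_home.glob("jni/*")))
    return next(
        (
            candidate
            for candidate in candidates
            if candidate.is_dir()
            and any((candidate / name).exists() for name in library_names)
        ),
        None,  # default: no candidate matched
    )

# Usage: a fake ORT home where the library sits under runtimes/<rid>/native.
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    native = home / "runtimes" / "linux-x64" / "native"
    native.mkdir(parents=True)
    (native / "libonnxruntime.so").touch()
    print(find_library_dir(home, ["libonnxruntime.so"]) == native)  # True
```

Using `next(generator, None)` keeps the "first match or nothing" logic in one expression, which is why the diff ends mid-call: the default argument follows the generator shown.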
From `src/models/kv_cache.cpp` (disjoint hunks as shown in the review; the last one is truncated):

```cpp
}

}  // namespace

template <typename ScoreType>
void DefaultKeyValueCache::PickPastAuxiliaryState(DeviceSpan<int32_t> beam_indices_device, AuxiliaryStateSet& state_set, int index) {
  std::unique_ptr<OrtValue> past_value = OrtValue::CreateTensor<ScoreType>(Allocator(), tensor_shape);

  auto past_span = WrapTensor<ScoreType>(Device(), *past_value);
  auto present_span = WrapTensor<ScoreType>(Device(), present_value);

  for (size_t j = 0; j < beam_indices.size(); j++) {
    int32_t beam_index = beam_indices[j];
    auto present = present_span.subspan(beam_index * block_size_per_beam, block_size_per_beam);
    auto past = past_span.subspan(j * block_size_per_beam, block_size_per_beam);
    past.CopyFrom(present);
  }

  pasts_[index] = std::move(past_value);
}

void DefaultKeyValueCache::PickPastState(DeviceSpan<int32_t> beam_indices, int index) {
  if (type_ == Ort::TypeToTensorType<float>) {
    PickPastState<float>(beam_indices, index);
  } else {
    PickPastState<Ort::Float16_t>(beam_indices, index);
  }
}

namespace {

int64_t GetElementsPerBeam(const AuxiliaryStateSet& state_set) {
  static std::mutex mutex;
  static std::unordered_map<const AuxiliaryStateSet*, int64_t> cache;

  const auto* key = &state_set;
```
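The beam-picking loop in the hunk above gathers, for each surviving beam `j`, the present-state block of the source beam it was forked from (`beam_indices[j]`) into slot `j` of a freshly allocated past buffer. A minimal Python sketch of that reordering, with flat lists standing in for device tensors (names and shapes are illustrative only):

```python
def pick_past_state(present, beam_indices, block_size_per_beam):
    # For each new beam j, copy the present block of its source beam
    # (beam_indices[j]) into position j of the past buffer. This is the
    # same gather the C++ loop performs with subspan + CopyFrom.
    past = [0] * (len(beam_indices) * block_size_per_beam)
    for j, beam_index in enumerate(beam_indices):
        src = beam_index * block_size_per_beam
        dst = j * block_size_per_beam
        past[dst:dst + block_size_per_beam] = present[src:src + block_size_per_beam]
    return past

# Two beams with 3 elements each; both surviving beams fork from beam 1.
present = [1, 1, 1, 2, 2, 2]
print(pick_past_state(present, [1, 1], 3))  # [2, 2, 2, 2, 2, 2]
```

The same mechanism applies to both the regular KV cache (`PickPastState`) and the new auxiliary state caches (`PickPastAuxiliaryState`); only the per-beam block size differs.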
From `src/python/py/models/builders/base.py` (excerpt):

```python
    return None

bos_token_id = resolve_special_token_id("bos_token_id")
if bos_token_id is None:
```
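The excerpt above is part of the special token ID resolution added in base.py: the helper returns `None` when an ID cannot be found, and the caller then applies a fallback. A hedged sketch of one plausible resolution chain; the sources probed, their order, and the sample ID are assumptions for illustration, not taken from the PR:

```python
def resolve_special_token_id(name, config, generation_config=None, tokenizer=None):
    # Probe several config-like objects in order and return the first
    # integer ID found; return None so the caller can fall back.
    # (Which sources are probed, and in what order, is assumed here.)
    for source in (config, generation_config, tokenizer):
        value = getattr(source, name, None) if source is not None else None
        if isinstance(value, int):
            return value
    return None

class Cfg:
    bos_token_id = 151643  # hypothetical ID for illustration

print(resolve_special_token_id("bos_token_id", Cfg()))  # 151643
print(resolve_special_token_id("eos_token_id", Cfg()))  # None
```

Returning `None` rather than raising keeps the builder usable for checkpoints whose configs omit some special token IDs, matching the "robust config field access" described in the file summary.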
From `src/models/model_type.h`:

```cpp
inline static bool IsQwen25VL(const std::string& model_type) {
  // Qwen VL specific check for 3D position IDs (MRoPE)
  return model_type == "fara" || model_type == "qwen2_5_vl" || model_type == "qwen3_5";
}
```
Excerpt (builder code):

```python
)
return f"{zero_name}/output_0"


def make_attention_input_proj(self, layer_id, attention, root_input, **kwargs):
```
Check notice (Code scanning / CodeQL, severity: Note): Explicit returns mixed with implicit (fall through) returns.
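The CodeQL note flags functions that return a value on some paths but fall off the end of the function (implicitly returning `None`) on others, which makes the `None` case easy to miss at call sites. A minimal illustration of the flagged pattern and the usual fix (hypothetical functions, not the PR's code):

```python
def lookup_flagged(mapping, key):
    # Flagged pattern: explicit return mixed with implicit fall-through.
    if key in mapping:
        return mapping[key]
    # execution falls off the end here -> implicitly returns None

def lookup_fixed(mapping, key):
    # Fix: make the None return explicit on every path.
    if key in mapping:
        return mapping[key]
    return None

print(lookup_flagged({}, "a"))  # None (implicit)
print(lookup_fixed({}, "a"))    # None (explicit)
```

Both behave identically at runtime; the explicit form simply documents that `None` is an intended return value rather than an oversight.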
Excerpt (builder code):

```python
    layer_id, attention.v_proj, "v_proj", root_input, kv_shape
)


def make_attention_output_proj(self, layer_id, attention, root_input, **kwargs):
```
Check notice (Code scanning / CodeQL, severity: Note): Explicit returns mixed with implicit (fall through) returns.
Add Qwen3.5 support