```python
intermediate_size = int(config.block_ffn_dim_multiplier * intermediate_size)
multiple_of = getattr(config, "block_multiple_of", 1)
# Round intermediate_size up to the nearest multiple of multiple_of
intermediate_size = multiple_of * ((intermediate_size + multiple_of - 1) // multiple_of)
self.intermediate_size = intermediate_size
```
> **Code scanning / CodeQL warning:** Overwriting attribute in super-class or sub-class
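For reference, the rounding in the snippet above is the standard ceil-to-a-multiple idiom. A minimal standalone sketch (the function name and values here are illustrative, not taken from the model builder):

```python
def round_up_to_multiple(value: int, multiple_of: int) -> int:
    # Integer ceiling division of value by multiple_of, scaled back up,
    # i.e. the smallest multiple of multiple_of that is >= value.
    return multiple_of * ((value + multiple_of - 1) // multiple_of)

print(round_up_to_multiple(2730, 256))  # 2816: smallest multiple of 256 >= 2730
print(round_up_to_multiple(2816, 256))  # 2816: already a multiple, unchanged
```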
```python
conv_cache_shape = ["batch_size", self.hidden_size, self.conv_L_cache]

# Add conv cache input
past_conv_name = f"past_conv.{i}"
```
Is this the standard naming convention for past/present convolution caches? Could we use a format such as `past.{i}.conv` and `present.{i}.conv` to more closely match the KV cache naming format?
Yeah, the original choice of name was due to the mismatch between `past_key_values.{i}.key` and `present.{i}.key` (meaning we would name this one `past_conv.{i}.conv` and `present_conv.{i}.conv`... leading instead to `past_conv.{i}` and `present_conv.{i}`).
Unfortunately, there are many LFM2-based ONNX models already in use which follow this convention 😅
A similar naming issue exists for Mamba-based models, which introduce an SSM cache mechanism.

If we had a do-over, this would probably work best:

```
past.{i}.key
past.{i}.value
past.{i}.conv
past.{i}.ssm
present.{i}.key
present.{i}.value
present.{i}.conv
present.{i}.ssm
```
We can still name new models that are produced by the model builder in this format. The GenAI config will contain a mapping to the input category and the input format.
For example, the current models on Hugging Face can use

```json
"past_conv_names": "past_conv.%d"
```

while newly produced models can use the following:

```json
"past_conv_names": "past.%d.conv"
```

In this way, both name formats are supported.
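For illustration, here is how such a `%d` template expands into per-layer tensor names (a minimal sketch; `expand_cache_names` is a hypothetical helper for demonstration, not part of ORT GenAI):

```python
def expand_cache_names(template: str, num_layers: int) -> list[str]:
    # The "%d" placeholder in the GenAI config is filled with the layer index.
    return [template % i for i in range(num_layers)]

print(expand_cache_names("past_conv.%d", 3))  # ['past_conv.0', 'past_conv.1', 'past_conv.2']
print(expand_cache_names("past.%d.conv", 3))  # ['past.0.conv', 'past.1.conv', 'past.2.conv']
```

Either template resolves to one unique input name per layer, which is why both formats can be supported side by side via the config mapping.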
Apologies for the delay; only returned to the PR now :) Addressed all comments. The one point of discussion is the naming of the I/O. There are currently 38 models on the HF hub that use the current naming convention, and there are other non-LFM2 models too, like those with Mamba layers, that use this naming convention as well. I'm not strictly opposed to changing it to the suggested format, but this would affect naming for those existing models.
@kunal-vaishnavi While we're talking about naming, I came across an interesting case that required splitting names differently for a different kind of model. onnx-community/Olmo-Hybrid-Instruct-SFT-7B-ONNX requires `past_conv` to have separate query, key, and value caches. So, if we wanted to use the suggested format, the convolution caches and the attention caches would end up with the same names, where one could refer to a linear cache and another could refer to an attention cache, but each would share names. WDYT?
I looked at the ONNX model and the forward pass for this model in Transformers. The attention layer is standard and follows typical conventions for KV cache updating; in the ONNX model, the GQA op is handling this as expected. The convolution layer is also doing a similar concatenation, but its shape differs from the KV cache concatenation.

Using per-layer shapes to represent different-sized inputs and outputs that share the same template name is possible inside ORT GenAI, as long as the cache names are identified in a separate section in the GenAI config. For example:

```json
"past_conv_key_names": "past.%d.key",
"past_conv_value_names": "past.%d.value",
"past_key_names": "past.%d.key",
"past_value_names": "past.%d.value",
```

Inside ORT GenAI, we would either have to specify layer indices to distinguish which cache to pre-allocate for a given layer index, or read the shapes from the inference session. Currently, layer indices are specified for distinguishing between sliding attention and full attention for the TRT-RTX EP. If we want to avoid sharing names, and therefore avoid specifying layer indices, the convolution caches could be named differently.

As a separate note, the convolution layer appears to have separate Q, K, V depth-wise convolutions. I believe these can be packed together into one depth-wise convolution in the ONNX model, just like the traditional packed QKV projection:

```python
qkv = qkv_proj(hidden_states)
qkv, new_state = qkv_conv1d(qkv, cache=packed_state, use_precomputed=..., output_final_state=use_cache)
q, k, v = torch.split(qkv, [...], dim=-1)
# store new_state in conv_states_qkv[layer_idx]
```
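The packing claim can be sanity-checked without ONNX or PyTorch: a depth-wise convolution processes each channel independently, so concatenating the Q, K, and V channel groups and their kernels yields exactly the concatenation of the three separate convolutions. A minimal pure-Python sketch (toy data; `depthwise_conv1d` is a hypothetical reference implementation, not the actual op):

```python
def depthwise_conv1d(x, w):
    # x: list of per-channel sequences; w: list of per-channel kernels.
    # Each output channel depends only on its own input channel ("valid" conv).
    out = []
    for xc, wc in zip(x, w):
        k = len(wc)
        out.append([sum(xc[t + j] * wc[j] for j in range(k))
                    for t in range(len(xc) - k + 1)])
    return out

# Toy single-channel Q/K/V groups and kernels (values are illustrative)
xq, xk, xv = [[1., 2., 3., 4.]], [[2., 0., 1., 3.]], [[0., 1., 1., 2.]]
wq, wk, wv = [[1., 1.]], [[2., 1.]], [[1., 0.]]

separate = depthwise_conv1d(xq, wq) + depthwise_conv1d(xk, wk) + depthwise_conv1d(xv, wv)
packed = depthwise_conv1d(xq + xk + xv, wq + wk + wv)
assert packed == separate  # packing the channel groups is exact
```

This is why the three depth-wise Conv1d weights can simply be concatenated along the channel axis in the ONNX graph, with a single split afterwards to recover Q, K, and V.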

This PR adds support for the LFM2 series of LLMs from LiquidAI: https://huggingface.co/models?other=lfm2&sort=trending&search=LiquidAI%2F
For sample inference code, you can refer to the model card I uploaded for the converted models: https://huggingface.co/onnx-community/LFM2-350M-ONNX#onnxruntime
I know there are still a couple of things missing from the PR (this PR only adds model builder support)... so hopefully @kunal-vaishnavi can finalize the last few things on the checklist 😇 (I've enabled edits by maintainers).
cc @ykhrustalev too 🥳