
Add model builder support for LFM2 #1979

Open
xenova wants to merge 10 commits into microsoft:main from xenova:add-lfm2

Conversation

@xenova (Contributor) commented Feb 14, 2026

This PR adds support for the LFM2 series of LLMs from LiquidAI: https://huggingface.co/models?other=lfm2&sort=trending&search=LiquidAI%2F

For sample inference code, you can refer to the model card I uploaded for the converted models: https://huggingface.co/onnx-community/LFM2-350M-ONNX#onnxruntime

I know there are still a couple of things missing from the PR (this PR only adds model builder support)... so hopefully @kunal-vaishnavi can finalize the last few things on the checklist 😇 (I've enabled edits by maintainers).

cc @ykhrustalev too 🥳

intermediate_size = int(config.block_ffn_dim_multiplier * intermediate_size)
multiple_of = getattr(config, "block_multiple_of", 1)
# Round up to the nearest multiple of `multiple_of`
intermediate_size = multiple_of * ((intermediate_size + multiple_of - 1) // multiple_of)
self.intermediate_size = intermediate_size

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class (Warning): Assignment overwrites attribute intermediate_size, which was previously defined in superclass Model.
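The rounding in the snippet above is a standard ceil-to-multiple. A minimal sketch with illustrative values (the multiplier and multiple are not taken from a real LFM2 config):

```python
def round_up_to_multiple(n: int, multiple_of: int) -> int:
    # Smallest multiple of `multiple_of` that is >= n (ceiling division).
    return multiple_of * ((n + multiple_of - 1) // multiple_of)

# Illustrative values only:
intermediate_size = int(4.0 / 3 * 2048)              # 2730
print(round_up_to_multiple(intermediate_size, 256))  # 2816
print(round_up_to_multiple(intermediate_size, 1))    # 2730 (multiple_of defaults to 1)
```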

@ykhrustalev ykhrustalev left a comment


lgtm

conv_cache_shape = ["batch_size", self.hidden_size, self.conv_L_cache]

# Add conv cache input
past_conv_name = f"past_conv.{i}"
Contributor

Is this the standard naming convention for past/present convolution caches? Could we use a format such as past.{i}.conv and present.{i}.conv to more closely match the KV cache naming format?

@xenova (Contributor Author) commented Feb 20, 2026

Yeah, the original choice of name was due to the mismatch between past_key_values.{i}.key and present.{i}.key (meaning we would name this one past_conv.{i}.conv and present_conv.{i}.conv... and then leading to past_conv.{i} and present_conv.{i}).

Unfortunately, there are many LFM2-based ONNX models already in use which follow this convention 😅

@xenova (Contributor Author)

A similar naming issue exists with Mamba-based models, which introduce an SSM cache mechanism.
(screenshot of model cache inputs)

If we had a do-over, probably this would work best:

past.{i}.key
past.{i}.value
past.{i}.conv
past.{i}.ssm

present.{i}.key
present.{i}.value
present.{i}.conv
present.{i}.ssm

Contributor

We can still name new models that are produced by the model builder in this format. The GenAI config will contain a mapping to the input category and the input format.

For example, the current models on Hugging Face can use

"past_conv_names": "past_conv.%d"

while newly produced models can use the following

"past_conv_names": "past.%d.conv"

In this way, both name formats are supported.
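These fields are printf-style templates, so resolving a per-layer input name is a single `%` substitution. A hypothetical sketch (the function name is illustrative, not an ORT GenAI API):

```python
def expand_cache_name(template: str, layer_idx: int) -> str:
    # Expand a printf-style name template (e.g. from "past_conv_names")
    # into the concrete input name for one layer.
    return template % layer_idx

# Existing models on the Hub:
print(expand_cache_name("past_conv.%d", 2))   # past_conv.2
# Newly produced models:
print(expand_cache_name("past.%d.conv", 2))   # past.2.conv
```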

@xenova (Contributor Author) commented Feb 26, 2026

Apologies for the delay; I only returned to the PR now :) I've addressed all comments.

The one point of discussion is the naming of the I/O. There are currently 38 models on the HF Hub that use the current naming convention (past_conv.{i} and present_conv.{i}), so it would be easier to maintain this way (downstream libraries like transformers.js make assumptions about the naming). For example, LFM2-24B-A2B-ONNX:

past_conv.0: float16[batch_size,2048,3]
past_conv.1: float16[batch_size,2048,3]

There are other non-LFM2 models, like those with Mamba layers, that use this naming convention too:

past_conv.0: float32[batch_size,1792,4]
past_ssm.0: float32[batch_size,48,32,128]

I'm not strictly opposed to changing this to something like

past.{i}.key
past.{i}.value
past.{i}.conv
past.{i}.ssm

present.{i}.key
present.{i}.value
present.{i}.conv
present.{i}.ssm

but this would affect naming for past_key_values too (this was the original reason for naming things past_X)

@xenova (Contributor Author) commented Mar 6, 2026

@kunal-vaishnavi While we're talking about naming, I came across an interesting case that required splitting names differently for a different kind of model. onnx-community/Olmo-Hybrid-Instruct-SFT-7B-ONNX requires past_conv to have separate query, key, and value caches. So, the names look like:
(screenshot of model cache inputs)

So, if we wanted to use the suggested format, the same names would be shared where one could refer to a linear (convolution) cache and another to an attention cache:

past.0.query // conv
past.0.key  // conv
past.0.value  // conv
past.0.recurrent  // recurrent
...
past.3.key  // attn
past.3.value  // attn

WDYT?

@kunal-vaishnavi (Contributor)

I looked at the ONNX model and the forward pass for this model in Transformers.

The attention layer is standard and follows typical conventions for KV cache updating. In the ONNX model, the GQA op is handling this as expected. The convolution layer is also doing a similar concatenation but the shape differs from the KV cache concatenation.

Using per-layer shapes to represent different-sized inputs and outputs that share the same template name is possible inside ORT GenAI as long as the cache names are identified in a separate section in the GenAI config. For example:

"past_conv_key_names": "past.%d.key",
"past_conv_value_names": "past.%d.value",
"past_key_names": "past.%d.key",
"past_value_names": "past.%d.value",

Inside ORT GenAI, we would either have to specify layer indices to distinguish which cache to pre-allocate for a given layer index, or read the shapes from the inference session. Currently, layer indices are specified to distinguish between sliding attention and full attention for the TRT-RTX EP. If we want to avoid sharing names, and therefore avoid specifying layer indices, the convolution caches could be named past.{i}.q_conv, past.{i}.k_conv, and past.{i}.v_conv. Thus, it is doable to add, but it will be easier if the names are distinguished.
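To illustrate the layer-index approach, a hedged sketch of dispatching cache pre-allocation on layer kind (all names, shapes, and the hybrid layout below are illustrative, not ORT GenAI internals):

```python
# Illustrative sketch only: when conv and attention caches share a name
# template ("past.%d.key"), the runtime needs per-layer kind information
# to pre-allocate the right shape for each layer index.
CONV, ATTN = "conv", "attn"
layer_kinds = [CONV, CONV, ATTN, ATTN]  # hypothetical hybrid layout

def cache_shape(layer_idx, *, batch, num_kv_heads, head_size, hidden, conv_L, max_seq):
    if layer_kinds[layer_idx] == CONV:
        return (batch, hidden, conv_L)                 # convolution cache
    return (batch, num_kv_heads, max_seq, head_size)   # KV cache

kw = dict(batch=1, num_kv_heads=8, head_size=64, hidden=2048, conv_L=3, max_seq=128)
print(cache_shape(0, **kw))  # (1, 2048, 3)
print(cache_shape(3, **kw))  # (1, 8, 128, 64)
```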

As a separate note, the convolution layer appears to have separate Q, K, V depth-wise convolutions. I believe these can be packed together into one depth-wise convolution in the ONNX model just like the traditional q_proj, k_proj, and v_proj packing (which can also be done on the ONNX model). The forward pass could look like the following pseudocode.

qkv = qkv_proj(hidden_states)
qkv, new_state = qkv_conv1d(qkv, cache=packed_state, use_precomputed=..., output_final_state=use_cache)
q, k, v = torch.split(qkv, [...], dim=-1)
# store new_state in conv_states_qkv[layer_idx]
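The packing claim can be checked numerically: a depth-wise convolution acts on each channel independently, so concatenating Q, K, and V along the channel axis and running one depth-wise convolution matches three separate ones. A NumPy sketch with illustrative shapes (not the actual Transformers or ONNX implementation):

```python
import numpy as np

def depthwise_conv1d(x, w):
    # Causal depth-wise 1D convolution: each channel is convolved with its
    # own length-K filter, independently of all other channels.
    C, L = x.shape
    K = w.shape[1]
    pad = np.concatenate([np.zeros((C, K - 1)), x], axis=1)
    out = np.zeros((C, L))
    for c in range(C):
        for t in range(L):
            out[c, t] = pad[c, t:t + K] @ w[c]
    return out

rng = np.random.default_rng(0)
L, K, C = 8, 3, 4  # illustrative sequence length, kernel size, channels per projection
q, k, v = (rng.standard_normal((C, L)) for _ in range(3))
wq, wk, wv = (rng.standard_normal((C, K)) for _ in range(3))

# Three separate depth-wise convolutions (as in the current forward pass)...
separate = np.concatenate(
    [depthwise_conv1d(q, wq), depthwise_conv1d(k, wk), depthwise_conv1d(v, wv)]
)
# ...equal one depth-wise convolution over the packed channels.
packed = depthwise_conv1d(np.concatenate([q, k, v]), np.concatenate([wq, wk, wv]))
assert np.allclose(separate, packed)
```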

3 participants