
Add model builder support for LFM2 #1979

Open
xenova wants to merge 10 commits into microsoft:main from xenova:add-lfm2

Conversation

@xenova (Contributor) commented Feb 14, 2026

This PR adds support for the LFM2 series of LLMs from LiquidAI: https://huggingface.co/models?other=lfm2&sort=trending&search=LiquidAI%2F

For sample inference code, you can refer to the model card I uploaded for the converted models: https://huggingface.co/onnx-community/LFM2-350M-ONNX#onnxruntime

I know there are still a couple of things missing from the PR (this PR only adds model builder support)... so hopefully @kunal-vaishnavi can finalize the last few things on the checklist 😇 (I've enabled edits by maintainers).

cc @ykhrustalev too 🥳

intermediate_size = int(config.block_ffn_dim_multiplier * intermediate_size)
multiple_of = getattr(config, "block_multiple_of", 1)
# Round up to the nearest multiple of `multiple_of`
intermediate_size = multiple_of * ((intermediate_size + multiple_of - 1) // multiple_of)
self.intermediate_size = intermediate_size

Check warning

Code scanning / CodeQL

Overwriting attribute in super-class or sub-class (Warning): Assignment overwrites attribute intermediate_size, which was previously defined in superclass Model.
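The rounding in the snippet above is a standard ceil-to-multiple. A minimal sketch with illustrative values (the multiplier and multiple are not taken from a real LFM2 config):

```python
def round_up_to_multiple(n: int, multiple_of: int) -> int:
    # Smallest multiple of `multiple_of` that is >= n (ceiling division).
    return multiple_of * ((n + multiple_of - 1) // multiple_of)

# Illustrative values only:
intermediate_size = int(4.0 / 3 * 2048)              # 2730
print(round_up_to_multiple(intermediate_size, 256))  # 2816
print(round_up_to_multiple(intermediate_size, 1))    # 2730 (multiple_of defaults to 1)
```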

@ykhrustalev ykhrustalev left a comment


lgtm

conv_cache_shape = ["batch_size", self.hidden_size, self.conv_L_cache]

# Add conv cache input
past_conv_name = f"past_conv.{i}"
Contributor

Is this the standard naming convention for past/present convolution caches? Could we use a format such as past.{i}.conv and present.{i}.conv to more closely match the KV cache naming format?

@xenova (Contributor Author) commented Feb 20, 2026

Yeah, the original choice of name was due to the mismatch between past_key_values.{i}.key and present.{i}.key (meaning we would name this one past_conv.{i}.conv and present_conv.{i}.conv... and then leading to past_conv.{i} and present_conv.{i}).

Unfortunately, there are many LFM2-based ONNX models already in use which follow this convention 😅

@xenova (Contributor Author)

A similar naming issue exists with Mamba-based models, which introduce an SSM cache mechanism.
(screenshot of model cache inputs)

If we had a do-over, probably this would work best:

past.{i}.key
past.{i}.value
past.{i}.conv
past.{i}.ssm

present.{i}.key
present.{i}.value
present.{i}.conv
present.{i}.ssm

Contributor

We can still name new models that are produced by the model builder in this format. The GenAI config will contain a mapping to the input category and the input format.

For example, the current models on Hugging Face can use

"past_conv_names": "past_conv.%d"

while newly produced models can use the following

"past_conv_names": "past.%d.conv"

In this way, both name formats are supported.
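These fields are printf-style templates, so resolving a per-layer input name is a single `%` substitution. A hypothetical sketch (the function name is illustrative, not an ORT GenAI API):

```python
def expand_cache_name(template: str, layer_idx: int) -> str:
    # Expand a printf-style name template (e.g. from "past_conv_names")
    # into the concrete input name for one layer.
    return template % layer_idx

# Existing models on the Hub:
print(expand_cache_name("past_conv.%d", 2))   # past_conv.2
# Newly produced models:
print(expand_cache_name("past.%d.conv", 2))   # past.2.conv
```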

@xenova (Contributor Author) commented Feb 26, 2026

Apologies for the delay; I only returned to the PR now :) I've addressed all comments.

The one point of discussion is the naming of the I/O. There are currently 38 models on the HF Hub that use the current naming convention (past_conv.{i} and present_conv.{i}), so it would be easier to maintain this way (downstream libraries like transformers.js make assumptions about the naming). For example, LFM2-24B-A2B-ONNX:

past_conv.0: float16[batch_size,2048,3]
past_conv.1: float16[batch_size,2048,3]

There are other non-LFM2 models, like those with Mamba layers, that use this naming convention too:

past_conv.0: float32[batch_size,1792,4]
past_ssm.0: float32[batch_size,48,32,128]

I'm not strictly opposed to changing this to something like

past.{i}.key
past.{i}.value
past.{i}.conv
past.{i}.ssm

present.{i}.key
present.{i}.value
present.{i}.conv
present.{i}.ssm

but this would affect naming for past_key_values too (this was the original reason for naming things past_X)

@xenova (Contributor Author) commented Mar 6, 2026

@kunal-vaishnavi While we're talking about naming, I came across an interesting case that required splitting names differently for a different kind of model. onnx-community/Olmo-Hybrid-Instruct-SFT-7B-ONNX requires past_conv to have separate query, key, and value caches. So, the names look like:
(screenshot of model cache inputs)

So, if we wanted to use the suggested format, the same names would be shared where one could refer to a linear (convolution) cache and another to an attention cache:

past.0.query // conv
past.0.key  // conv
past.0.value  // conv
past.0.recurrent  // recurrent
...
past.3.key  // attn
past.3.value  // attn

WDYT?

@kunal-vaishnavi (Contributor)

I looked at the ONNX model and the forward pass for this model in Transformers.

The attention layer is standard and follows typical conventions for KV cache updating. In the ONNX model, the GQA op is handling this as expected. The convolution layer is also doing a similar concatenation but the shape differs from the KV cache concatenation.

Using per-layer shapes to represent different-sized inputs and outputs that share the same template name is possible inside ORT GenAI as long as the cache names are identified in a separate section in the GenAI config. For example:

"past_conv_key_names": "past.%d.key",
"past_conv_value_names": "past.%d.value",
"past_key_names": "past.%d.key",
"past_value_names": "past.%d.value",

Inside ORT GenAI, we would either have to specify layer indices to distinguish which cache to pre-allocate for a given layer index, or read the shapes from the inference session. Currently, layer indices are specified to distinguish between sliding attention and full attention for the TRT-RTX EP. If we want to avoid sharing names, and therefore avoid specifying layer indices, the convolution caches could be named past.{i}.q_conv, past.{i}.k_conv, and past.{i}.v_conv. Thus, it is doable to add, but it will be easier if the names are distinguished.
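To illustrate the layer-index approach, a hedged sketch of dispatching cache pre-allocation on layer kind (all names, shapes, and the hybrid layout below are illustrative, not ORT GenAI internals):

```python
# Illustrative sketch only: when conv and attention caches share a name
# template ("past.%d.key"), the runtime needs per-layer kind information
# to pre-allocate the right shape for each layer index.
CONV, ATTN = "conv", "attn"
layer_kinds = [CONV, CONV, ATTN, ATTN]  # hypothetical hybrid layout

def cache_shape(layer_idx, *, batch, num_kv_heads, head_size, hidden, conv_L, max_seq):
    if layer_kinds[layer_idx] == CONV:
        return (batch, hidden, conv_L)                 # convolution cache
    return (batch, num_kv_heads, max_seq, head_size)   # KV cache

kw = dict(batch=1, num_kv_heads=8, head_size=64, hidden=2048, conv_L=3, max_seq=128)
print(cache_shape(0, **kw))  # (1, 2048, 3)
print(cache_shape(3, **kw))  # (1, 8, 128, 64)
```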

As a separate note, the convolution layer appears to have separate Q, K, V depth-wise convolutions. I believe these can be packed together into one depth-wise convolution in the ONNX model just like the traditional q_proj, k_proj, and v_proj packing (which can also be done on the ONNX model). The forward pass could look like the following pseudocode.

qkv = qkv_proj(hidden_states)
qkv, new_state = qkv_conv1d(qkv, cache=packed_state, use_precomputed=..., output_final_state=use_cache)
q, k, v = torch.split(qkv, [...], dim=-1)
# store new_state in conv_states_qkv[layer_idx]
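The packing claim can be checked numerically: a depth-wise convolution acts on each channel independently, so concatenating Q, K, and V along the channel axis and running one depth-wise convolution matches three separate ones. A NumPy sketch with illustrative shapes (not the actual Transformers or ONNX implementation):

```python
import numpy as np

def depthwise_conv1d(x, w):
    # Causal depth-wise 1D convolution: each channel is convolved with its
    # own length-K filter, independently of all other channels.
    C, L = x.shape
    K = w.shape[1]
    pad = np.concatenate([np.zeros((C, K - 1)), x], axis=1)
    out = np.zeros((C, L))
    for c in range(C):
        for t in range(L):
            out[c, t] = pad[c, t:t + K] @ w[c]
    return out

rng = np.random.default_rng(0)
L, K, C = 8, 3, 4  # illustrative sequence length, kernel size, channels per projection
q, k, v = (rng.standard_normal((C, L)) for _ in range(3))
wq, wk, wv = (rng.standard_normal((C, K)) for _ in range(3))

# Three separate depth-wise convolutions (as in the current forward pass)...
separate = np.concatenate(
    [depthwise_conv1d(q, wq), depthwise_conv1d(k, wk), depthwise_conv1d(v, wv)]
)
# ...equal one depth-wise convolution over the packed channels.
packed = depthwise_conv1d(np.concatenate([q, k, v]), np.concatenate([wq, wk, wv]))
assert np.allclose(separate, packed)
```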

3 participants