Question: where would a flash-resident tiny language runtime fit relative to on-device GenAI runtimes? #2033
Alpha-Guardian started this conversation in Show and tell
Hi ONNX Runtime GenAI folks,
I wanted to share a small on-device language-runtime experiment and ask how systems like this should be viewed relative to more familiar on-device GenAI runtimes.
We built a public demo called Engram and deployed it on a commodity ESP32-C3.
Current public numbers:

Host-side benchmark capability:
- LogiQA = 0.392523
- IFEval = 0.780037

Published board proof:
- LogiQA 642 = 249 / 642 = 0.3878504672897196
- host_full_match = 642 / 642
- 1,380,771 bytes

Important scope note:
This is not presented as unrestricted, open-input native LLM generation on an MCU.
The board-side path is closer to a flash-resident, table-driven runtime (a minimal illustrative sketch follows below).
So this is not a standard portable GenAI runtime story. It is closer to a task-specialized language runtime whose behavior has been compiled into a very constrained execution form.
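For readers unfamiliar with the pattern, here is a minimal sketch in C of what a flash-resident, table-driven lookup runtime can look like. Everything here is a hypothetical illustration, not Engram's actual format: the names (`engram_entry_t`, `ENGRAM_TABLE`, `engram_lookup`), the FNV-1a hash, and the table layout are all assumptions made for the example.

```c
/*
 * Illustrative sketch of a flash-resident, table-driven runtime.
 * All names, hashes, and the table layout are hypothetical; the
 * real Engram table format may differ entirely.
 */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One precompiled entry: a hashed prompt key and a canned response.
 * Declared const so the toolchain places it in flash (memory-mapped
 * rodata on ESP32-C3) rather than in scarce RAM. */
typedef struct {
    uint32_t key_hash;      /* hash of the normalized input */
    const char *response;   /* response string, also flash-resident */
} engram_entry_t;

/* Placeholder hashes; a real table would be generated offline
 * from the task corpus. */
static const engram_entry_t ENGRAM_TABLE[] = {
    { 0xd58a7485u, "example response A" },
    { 0x3a9f11c2u, "example response B" },
};
#define ENGRAM_TABLE_LEN (sizeof(ENGRAM_TABLE) / sizeof(ENGRAM_TABLE[0]))

/* FNV-1a: a common tiny string hash, chosen only for illustration. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 0x811c9dc5u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 0x01000193u;
    }
    return h;
}

/* The "very constrained execution form": one hash plus a scan over
 * a flash table. No weights, no token-by-token decoding. */
const char *engram_lookup(const char *prompt) {
    uint32_t h = fnv1a(prompt);
    for (size_t i = 0; i < ENGRAM_TABLE_LEN; i++) {
        if (ENGRAM_TABLE[i].key_hash == h) {
            /* a real system would also verify the full key
             * to guard against hash collisions */
            return ENGRAM_TABLE[i].response;
        }
    }
    return NULL;  /* out-of-table input: refuse, no fallback generation */
}

int main(void) {
    const char *r = engram_lookup("hello");
    puts(r ? r : "(input not in table; no fallback generation)");
    return 0;
}
```

The design tradeoff this sketch is meant to surface: such a runtime gives deterministic, bounded-latency behavior on known inputs at the cost of refusing anything outside the compiled table, which is a different contract from a general model-execution engine.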
Repo:
https://github.com/Alpha-Guardian/Engram
What I’m curious about is how systems like this should be categorized relative to established on-device GenAI runtimes.
Would love to hear any thoughts.