[Arm] Enable Gather MatMul with KleidiAI Microkernels by abhijain1204fujitsu · Pull Request #34303 · openvinotoolkit/openvino

abhijain1204fujitsu · 2026-02-25T03:14:52Z

[ About ]

Enable GatherMatmul and GatherMatmul-Compressed on ARM.
Bug fix related to KleidiAI execution of MatMul in F32 precision: a reorder is required before packing if the weight matrix is transposed.
Additional memory consumption due to weight duplication is managed.
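
The reorder fix can be pictured with a small sketch. It assumes the packing step expects weights in row-major K x N layout and that a transposed weight arrives as row-major N x K; both layouts are assumptions made for illustration. `reorder_transposed_weights` is a hypothetical helper, not a function from OpenVINO or KleidiAI.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: convert a transposed weight matrix stored row-major as
// [N x K] into the row-major [K x N] layout that the packing step is assumed
// to expect. This is an illustration, not the plugin's actual reorder.
std::vector<float> reorder_transposed_weights(const std::vector<float>& w_nk,
                                              std::size_t n,
                                              std::size_t k) {
    std::vector<float> w_kn(k * n);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < k; ++col) {
            // Element (row, col) of the N x K input lands at (col, row)
            // of the K x N output.
            w_kn[col * n + row] = w_nk[row * k + col];
        }
    }
    return w_kn;
}
```

Without such a step, a packer that assumes K x N would read the N x K buffer with the wrong strides and produce incorrect results rather than an error.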

[ Design ]

Scratchpad control is moved from the GatherMatmul node level to the executor level to support NUMA-based expert parallelism in the future.

[Benchmark Results]

Results are measured on a single-socket Graviton4 machine (96 cores).
KleidiAI support is enabled and tested for F32, INT8, and INT4 precisions. For F32, oneDNN is made the default.

This work is contributed by @ashwins990 and @abhijain1204fujitsu

maxnick · 2026-03-12T15:39:38Z

src/plugins/intel_cpu/src/nodes/gathermatmul.h

+#else
+
+    ov::element::Type getRuntimePrecision() const override;
+    Algorithm algorithm = Algorithm::GatherMatmulDefault;
+    size_t numExperts = 0;
+
+    std::vector<ExecutorPtr> executor;
+    std::vector<MemoryArgs> memArgsFC;
+
+    MemoryPtr m_weightsMemory = nullptr;
+    MemoryPtr m_tmpInpBuffer = nullptr;
+    MemoryDescPtr m_tmpInputDesc = nullptr;
+    MemoryDescPtr m_tmpOutputDesc = nullptr;
+
+#endif


Some fields are clearly duplicated between the #if and #else branches. Should we narrow the scope of the conditional?
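
A minimal sketch of what narrowing the scope could look like. Which fields are actually shared between the two branches is an assumption here, and the stand-in types and the `SKETCH_ARM_BACKEND` macro are hypothetical, not the plugin's real definitions.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the plugin's types; only the layout idea matters.
struct Memory {};
using MemoryPtr = std::shared_ptr<Memory>;
struct Executor {};
using ExecutorPtr = std::shared_ptr<Executor>;

class GatherMatmulSketch {
public:
    // Fields assumed to be needed by every backend are declared once,
    // outside the preprocessor conditional.
    std::size_t numExperts = 0;
    MemoryPtr m_weightsMemory = nullptr;

#if defined(SKETCH_ARM_BACKEND)  // hypothetical macro name
    // Only state that is truly backend specific stays under the define.
    std::vector<ExecutorPtr> executor;
    MemoryPtr m_tmpInpBuffer = nullptr;
#endif
};
```

With this shape, a change to a shared field only has to be made in one place, and the conditional documents exactly what differs per backend.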

maxnick · 2026-03-12T15:41:01Z

src/plugins/intel_cpu/src/nodes/gathermatmul_arm.cpp

I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.

maxnick · 2026-03-12T15:49:44Z

src/plugins/intel_cpu/src/nodes/gathermatmul_arm.cpp

+                continue;
+            }
+
+            parallel_for(num_valid_rows, [&](size_t m) {


It's better to use CpuParallel class in such contexts to align the implementation with the x64 approach.

maxnick · 2026-03-12T16:07:57Z

src/plugins/intel_cpu/src/transformations/cpu_opset/common/pass/moe_matmuls_fusion.cpp

Do we really need to keep the keep_dims=true followed by Squeeze variation under a define?
It looks like it won't be a problem to support this variation in a generic fashion, just to reduce the code complexity.
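
The reason a generic treatment seems feasible can be shown at the shape level: a reduction with keep_dims=true followed by a Squeeze on the same axis yields the same shape as the reduction with keep_dims=false. The sketch below assumes the op in question is a single-axis reduction; `reduce_shape` and `squeeze_shape` are hypothetical helpers, not OpenVINO functions.

```cpp
#include <cstddef>
#include <vector>

using Shape = std::vector<std::size_t>;

// Output shape of a single-axis reduction. With keep_dims the reduced axis
// is kept with extent 1; without it the axis is dropped.
Shape reduce_shape(const Shape& in, std::size_t axis, bool keep_dims) {
    Shape out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i != axis) {
            out.push_back(in[i]);
        } else if (keep_dims) {
            out.push_back(1);
        }
    }
    return out;
}

// Output shape of squeezing one axis; the axis is removed only if its extent is 1.
Shape squeeze_shape(const Shape& in, std::size_t axis) {
    Shape out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (!(i == axis && in[i] == 1)) {
            out.push_back(in[i]);
        }
    }
    return out;
}
```

For an input shape of {4, 8, 16} reduced over axis 1, both paths end at {4, 16}, so a matcher could in principle accept either form without a platform-specific branch.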

maxnick · 2026-03-12T16:08:55Z

src/plugins/intel_cpu/tests/functional/custom/subgraph_tests/src/arm/moe.cpp

Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?

ashwins990 · 2026-03-12T16:30:07Z

Hi @maxnick. Thanks for the comments.

I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file, "gathermatmul.cpp", and reused the existing x86 code.

I will update this PR with the new refactored logic in the coming week, once it's approved internally.

Will move the relevant tests to common scope as well.

ashwins990 · 2026-03-25T16:06:29Z

Hi @maxnick, I have made the requested changes and fixed the test cases, please review the PR. Thanks!

abhijain1204fujitsu requested review from a team as code owners February 25, 2026 03:14

github-actions bot added the category: CPU OpenVINO CPU plugin label Feb 25, 2026

sys-openvino-ci added the ExternalPR External contributor label Feb 25, 2026

alvoron self-assigned this Mar 4, 2026

maxnick modified the milestones: 2026.0, 2026.1 Mar 4, 2026

maxnick self-assigned this Mar 4, 2026

maxnick requested changes Mar 12, 2026


praasz modified the milestones: 2026.1, 2026.2 Mar 20, 2026

ashwins990 added 2 commits March 23, 2026 13:36

initial commit + added tests for GatherMatmul op ARM

3fc2fe9

support transformation in generic fashion

71109bc

abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from c4f4a52 to 71109bc Compare March 24, 2026 11:47

Updated transformation tests

4a59819

abhijain1204fujitsu wants to merge 3 commits into openvinotoolkit:master from MonakaResearch:GatherMatmul-on-ARM-with-Kleidiai
