[Arm] Enable Gather MatMul with KleidiAI Microkernels #34303

Open
abhijain1204fujitsu wants to merge 3 commits into openvinotoolkit:master from MonakaResearch:GatherMatmul-on-ARM-with-Kleidiai

@abhijain1204fujitsu abhijain1204fujitsu commented Feb 25, 2026

[ About ]

  • Enable GatherMatmul and GatherMatmul-Compressed on ARM.
  • Fix a bug in KleidiAI execution of MatMul in F32 precision: a transposed weight matrix requires a reorder before packing.
  • Manage the additional memory consumption caused by weight duplication.
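
As a rough illustration of the first two points, here is a minimal reference sketch in plain C++. It is not the PR's implementation and does not use KleidiAI; the function names, the per-row expert selection, and the row-major layouts are all assumptions made for the example. It shows a weight matrix stored transposed ([N x K]) being reordered to [K x N] before it would be handed to a packing step, followed by a naive gather matmul in which each input row is multiplied by the weights of its selected expert.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): reorder weights stored transposed as
// [N x K] into plain row-major [K x N] before any packing step consumes them.
std::vector<float> reorder_transposed_weights(const std::vector<float>& w_nk,
                                              std::size_t n,
                                              std::size_t k) {
    assert(w_nk.size() == n * k);
    std::vector<float> w_kn(k * n);
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t col = 0; col < k; ++col) {
            w_kn[col * n + row] = w_nk[row * k + col];
        }
    }
    return w_kn;
}

// Reference gather matmul (assumed semantics): row m of the input [M x K] is
// multiplied by the [K x N] weights of the expert selected by expert_ids[m].
std::vector<float> gather_matmul_ref(const std::vector<float>& input,
                                     const std::vector<std::vector<float>>& expert_weights,
                                     const std::vector<std::size_t>& expert_ids,
                                     std::size_t k,
                                     std::size_t n) {
    const std::size_t rows = expert_ids.size();
    assert(input.size() == rows * k);
    std::vector<float> out(rows * n, 0.0f);
    for (std::size_t m = 0; m < rows; ++m) {
        const std::vector<float>& w = expert_weights.at(expert_ids[m]);
        assert(w.size() == k * n);
        for (std::size_t i = 0; i < k; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                out[m * n + j] += input[m * k + i] * w[i * n + j];
            }
        }
    }
    return out;
}
```

Skipping the reorder would make the inner loop read the [N x K] data as if it were [K x N], which is the kind of silent layout mismatch the bug-fix bullet describes.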

[ Design ]

  • Scratchpad control is moved from the GatherMatmul node level to the executor level to support NUMA-based Expert Parallelism in the future.
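
The ownership change can be pictured with a small sketch. Everything here is hypothetical: the class names, the byte-vector scratchpad, and the one-executor-per-expert arrangement are assumptions for illustration, not the types used in the OpenVINO CPU plugin. The point is only that each executor grows and reuses its own scratch buffer, so a later scheme could place each executor's buffer independently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical executor that owns its scratchpad instead of borrowing a
// single buffer managed by the node.
class ExpertExecutorSketch {
public:
    // Returns a buffer of at least `bytes` bytes; grows only when needed.
    unsigned char* scratch(std::size_t bytes) {
        if (m_scratch.size() < bytes) {
            m_scratch.resize(bytes);
        }
        return m_scratch.data();
    }

    std::size_t scratch_capacity() const {
        return m_scratch.size();
    }

private:
    std::vector<unsigned char> m_scratch;
};

// Hypothetical node: it only holds executors and no longer manages scratch
// memory itself.
struct GatherMatmulNodeSketch {
    explicit GatherMatmulNodeSketch(std::size_t num_experts) : executors(num_experts) {}
    std::vector<ExpertExecutorSketch> executors;
};
```

With this split, requesting scratch space from one executor leaves the others untouched, which is what would let buffers be allocated per NUMA node later.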

[Benchmark Results]
(Two benchmark charts were attached here as images.)

**Results were measured on a single-socket Graviton4 machine (96 cores).
KleidiAI support is enabled and tested for F32, INT8, and INT4 precisions. For F32, oneDNN is the default.

This work is contributed by @ashwins990 and @abhijain1204fujitsu

@abhijain1204fujitsu abhijain1204fujitsu requested review from a team as code owners February 25, 2026 03:14
@github-actions github-actions bot added the category: CPU OpenVINO CPU plugin label Feb 25, 2026
@sys-openvino-ci sys-openvino-ci added the ExternalPR External contributor label Feb 25, 2026
@alvoron alvoron self-assigned this Mar 4, 2026
@maxnick maxnick modified the milestones: 2026.0, 2026.1 Mar 4, 2026
@maxnick maxnick self-assigned this Mar 4, 2026
Comment on lines +80 to +94
#else

ov::element::Type getRuntimePrecision() const override;
Algorithm algorithm = Algorithm::GatherMatmulDefault;
size_t numExperts = 0;

std::vector<ExecutorPtr> executor;
std::vector<MemoryArgs> memArgsFC;

MemoryPtr m_weightsMemory = nullptr;
MemoryPtr m_tmpInpBuffer = nullptr;
MemoryDescPtr m_tmpInputDesc = nullptr;
MemoryDescPtr m_tmpOutputDesc = nullptr;

#endif

Some fields are clearly duplicated between the #if and #else branches. Should we narrow the scope?

I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.

continue;
}

parallel_for(num_valid_rows, [&](size_t m) {

It's better to use the CpuParallel class in such contexts to align the implementation with the x64 approach.

Do we really need to keep keep_dims=true followed by a Squeeze operation under a define?
It looks like it won't be a problem to support this variation in a generic fashion, just to reduce the code complexity.

Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?

@ashwins990

Hi @maxnick. Thanks for the comment.

I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file, "gathermatmul.cpp", and reused the existing x86 code.

I will update this PR with the refactored logic in the coming week, once it's approved internally.

I will also move the relevant tests to the common scope.

@praasz praasz modified the milestones: 2026.1, 2026.2 Mar 20, 2026
@abhijain1204fujitsu abhijain1204fujitsu force-pushed the GatherMatmul-on-ARM-with-Kleidiai branch from c4f4a52 to 71109bc Compare March 24, 2026 11:47
@ashwins990

Hi @maxnick, I have made the requested changes and fixed the test cases. Please review the PR. Thanks!
