[Arm] Enable Gather MatMul with KleidiAI Microkernels #34303
abhijain1204fujitsu wants to merge 3 commits into openvinotoolkit:master
#else

    ov::element::Type getRuntimePrecision() const override;
    Algorithm algorithm = Algorithm::GatherMatmulDefault;
    size_t numExperts = 0;

    std::vector<ExecutorPtr> executor;
    std::vector<MemoryArgs> memArgsFC;

    MemoryPtr m_weightsMemory = nullptr;
    MemoryPtr m_tmpInpBuffer = nullptr;
    MemoryDescPtr m_tmpInputDesc = nullptr;
    MemoryDescPtr m_tmpOutputDesc = nullptr;

#endif
Some fields are clearly duplicated between the #if and #else branches. Should we narrow the scope of the conditional?
I assume this file is a temporary solution, and the ARM-specific implementation will be moved to the corresponding executor.
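To illustrate what narrowing the scope could look like, here is a minimal sketch, not the PR's actual code. The macro name `SKETCH_ARM_BUILD`, the struct `GatherMatmulSketch`, the field `m_packedWeights`, and the choice of which fields are shared are all hypothetical; the excerpt above does not show the `#if` branch, so the real split may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the plugin's real memory types.
struct Memory {};
using MemoryPtr = std::shared_ptr<Memory>;

// SKETCH_ARM_BUILD is a made-up guard; the macro used in the PR is not shown in the excerpt.
struct GatherMatmulSketch {
    // Fields needed by both code paths are declared once, outside the conditional.
    std::size_t numExperts = 0;
    MemoryPtr m_weightsMemory = nullptr;

#if defined(SKETCH_ARM_BUILD)
    // Only state that is truly ARM-specific stays under the guard (hypothetical field).
    std::vector<MemoryPtr> m_packedWeights;
#else
    // Only state that is truly specific to the other path stays here.
    MemoryPtr m_tmpInpBuffer = nullptr;
#endif
};
```

With this layout, a field such as `numExperts` has a single declaration and a single default value, so the two branches cannot drift apart.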
continue;
}

parallel_for(num_valid_rows, [&](size_t m) {
It's better to use the CpuParallel class in such contexts to align the implementation with the x64 approach.
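For context on the loop being discussed, the following is a reference sketch of a per-expert, per-row gather matmul. It is an illustration under stated assumptions, not the PR's implementation: the data layouts, the function `gather_matmul_ref`, and the serial helper `for_each_index` (standing in for whichever parallel primitive is chosen) are hypothetical, and each row is assumed to be routed to a single expert.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical serial stand-in for a parallel loop helper; a real build would
// hand these iterations to the plugin's threading layer.
inline void for_each_index(std::size_t n, const std::function<void(std::size_t)>& body) {
    for (std::size_t i = 0; i < n; ++i) {
        body(i);
    }
}

// Reference gather matmul. Layouts are assumptions made for this sketch:
//   input:   [rows, K], row-major
//   weights: weights[e] is an [N, K] row-major matrix for expert e
//   routed:  routed[e] lists the input rows assigned to expert e (unique indices)
//   output:  [rows, N], row-major
inline void gather_matmul_ref(const std::vector<float>& input,
                              const std::vector<std::vector<float>>& weights,
                              const std::vector<std::vector<std::size_t>>& routed,
                              std::size_t K,
                              std::size_t N,
                              std::vector<float>& output) {
    for (std::size_t e = 0; e < weights.size(); ++e) {
        const std::vector<std::size_t>& rows = routed[e];
        if (rows.empty()) {
            continue;  // no rows were routed to this expert
        }
        assert(weights[e].size() == N * K);
        const std::vector<float>& w = weights[e];
        // Row indices within one expert are unique, so each iteration writes
        // its own output row and the iterations are independent.
        for_each_index(rows.size(), [&](std::size_t m) {
            const std::size_t row = rows[m];
            for (std::size_t n = 0; n < N; ++n) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < K; ++k) {
                    acc += input[row * K + k] * w[n * K + k];
                }
                output[row * N + n] = acc;
            }
        });
    }
}
```

Because the loop body only depends on the row index, swapping the helper for a different parallel primitive should not require changes to the body itself.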
Do we really need to keep keep_dims=true followed by a Squeeze operation under a define?
It looks like it won't be a problem to support this variation in a generic fashion, just to reduce the code complexity.
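As a small illustration of why the two variations can be treated uniformly, the sketch below computes output shapes only. It assumes a single-axis reduction; the helper names `reduce_shape` and `squeeze_axis` are hypothetical and do not correspond to OpenVINO functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Shape = std::vector<std::size_t>;

// Output shape of a single-axis reduction (hypothetical helper).
// keep_dims=true leaves a size-1 dimension in place of the reduced axis.
Shape reduce_shape(const Shape& in, std::size_t axis, bool keep_dims) {
    assert(axis < in.size());
    Shape out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (i != axis) {
            out.push_back(in[i]);
        } else if (keep_dims) {
            out.push_back(1);
        }
    }
    return out;
}

// Removes one size-1 dimension (hypothetical helper).
Shape squeeze_axis(const Shape& in, std::size_t axis) {
    assert(axis < in.size() && in[axis] == 1);
    Shape out = in;
    out.erase(out.begin() + static_cast<std::ptrdiff_t>(axis));
    return out;
}
```

Since `squeeze_axis(reduce_shape(s, a, true), a)` and `reduce_shape(s, a, false)` give the same shape, a matcher that accepts either form would not need a platform-specific define for this part.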
Can't we reuse the existing x64 test by moving it to the common scope and enabling the corresponding instances for ARM?
Hi @maxnick. Thanks for the comment. I have modified the implementation, moving some of the GatherMatmul logic to KleidiAIExecutor while keeping the executor interface light, as discussed. I have also integrated the logic into the same file, "gathermatmul.cpp", and reused the existing x86 code. I will update this PR with the refactored logic in the coming week, once it is approved internally. I will move the relevant tests to the common scope as well.
Branch updated from c4f4a52 to 71109bc.
Hi @maxnick, I have made the requested changes and fixed the test cases. Please review the PR. Thanks!
[ About ]
[ Design ]
[ Benchmark Results ]

Results are measured on a single-socket Graviton4 machine (96 cores).
KleidiAI support is enabled and tested for F32, INT8, and INT4 precisions. For F32, oneDNN is the default.
This work is contributed by @ashwins990 and @abhijain1204fujitsu