Add RVV support for sdpa operator #6557
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           master    #6557      +/-   ##
==========================================
+ Coverage   93.18%   93.42%   +0.24%
==========================================
  Files         832      764      -68
  Lines      266714   257359    -9355
==========================================
- Hits       248545   240448    -8097
+ Misses      18169    16911    -1258
```
Pull request overview
This PR adds a RISC-V architecture-specific implementation of the Scaled Dot-Product Attention (SDPA) operator. The implementation follows the existing architectural pattern used in ncnn, where arch-specific SDPA layers delegate computational work to optimized Gemm and Softmax layers rather than implementing SIMD intrinsics directly in the SDPA layer itself. The claimed 5.9x speedup comes from leveraging existing RVV-optimized Gemm and Softmax implementations.
Changes:
- Adds SDPA_riscv class that extends the base SDPA layer
- Implements elempack > 1 handling not present in the x86 version
- Delegates QK^T and AttnV matrix multiplications to Gemm layers and softmax to Softmax layer
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/layer/riscv/sdpa_riscv.h | Header file defining the SDPA_riscv class with pipeline management and forward methods |
| src/layer/riscv/sdpa_riscv.cpp | Implementation that creates and manages Gemm/Softmax sub-layers, includes elempack fallback logic |
src/layer/riscv/sdpa_riscv.cpp (Outdated)

```cpp
#if __riscv_vector
#include <riscv_vector.h>
#endif
```
The RVV header is included but never used in this file. The actual RVV optimizations come from delegating to RVV-optimized Gemm and Softmax layers. Consider removing this unused include or adding a comment explaining that RVV acceleration is achieved through the underlying layers (Gemm, Softmax) rather than direct intrinsics in this file.
Suggested change: delete the unused `#if __riscv_vector` / `#include <riscv_vector.h>` block.
src/layer/riscv/sdpa_riscv.cpp (Outdated)

```cpp
packing->forward(bottom_blobs[i], unpacked_bottom_blobs[i], opt);
delete packing;
```
The return value from packing->forward is not checked. If the packing operation fails, the error will go undetected. Consider checking the return value and propagating the error appropriately.
src/layer/riscv/sdpa_riscv.cpp (Outdated)

```cpp
packing->forward(unpacked_top_blobs[i], top_blobs[i], _opt); // use original allocator for output
delete packing;
```
The return value from packing->forward is not checked. If the repacking operation fails, the error will go undetected. Consider checking the return value and propagating the error appropriately.
```cpp
if (elempack > 1)
{
    // fallback: unpack -> forward -> repack
    Option opt = _opt;
    opt.blob_allocator = _opt.workspace_allocator;

    std::vector<Mat> unpacked_bottom_blobs(bottom_blobs.size());
    for (size_t i = 0; i < bottom_blobs.size(); i++)
    {
        if (bottom_blobs[i].empty()) continue;

        if (bottom_blobs[i].elempack == 1)
        {
            unpacked_bottom_blobs[i] = bottom_blobs[i];
        }
        else
        {
            ncnn::Layer* packing = ncnn::create_layer_cpu(ncnn::LayerType::Packing);
            ncnn::ParamDict pd;
            pd.set(0, 1); // out_elempack
            packing->load_param(pd);
            packing->forward(bottom_blobs[i], unpacked_bottom_blobs[i], opt);
            delete packing;
        }
    }

    std::vector<Mat> unpacked_top_blobs(top_blobs.size());

    // call forward with elempack=1
    int ret = forward(unpacked_bottom_blobs, unpacked_top_blobs, _opt);
    if (ret != 0) return ret;

    // repack outputs
    for (size_t i = 0; i < top_blobs.size(); i++)
    {
        if (unpacked_top_blobs[i].empty()) continue;

        ncnn::Layer* packing = ncnn::create_layer_cpu(ncnn::LayerType::Packing);
        ncnn::ParamDict pd;
        pd.set(0, elempack); // out_elempack
        packing->load_param(pd);
        packing->forward(unpacked_top_blobs[i], top_blobs[i], _opt); // use original allocator for output
        delete packing;
    }

    return 0;
}
```
dead code block: elempack will always be 1 if the layer does not support packing
the code has been changed.
* drop out pad
Signed-off-by: ihb2032 <hebome@foxmail.com>
* Downgrade actions/cache from v5 to v4
Bumps [codecov/codecov-action](https://github.com/codecov/codecov-action) from 5 to 6.
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v5...v6)

updated-dependencies:
- dependency-name: codecov/codecov-action
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…t bf16 storage, skip mha bf16 tests (Tencent#6623)
…/dropout/quantize/dequantize/bnll x86 support bf16 storage (Tencent#6624)
This PR implements the sdpa operator for the RISC-V backend using RISC-V Vector (RVV) intrinsics.
Performance:
The RVV implementation provides up to a 5.9x speedup over the existing C++ scalar implementation.
Performance test environment: BananaPi (VLEN=256bit)
Correctness: verified on BananaPi, MusePi, and K230 (VLEN=128bit).