
Deformerconv vk shader #6618

Open
futz12 wants to merge 7 commits into Tencent:master from futz12:deformerconv-vk

Conversation

@futz12
Contributor

@futz12 futz12 commented Mar 28, 2026

No description provided.

@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@nihui nihui closed this Mar 30, 2026
@nihui nihui reopened this Mar 30, 2026
@codecov-commenter

codecov-commenter commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 89.07923% with 51 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.45%. Comparing base (033fe3f) to head (3f91431).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
src/layer/vulkan/deformableconv2d_vulkan.cpp 89.07% 51 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6618      +/-   ##
==========================================
+ Coverage   93.42%   93.45%   +0.02%     
==========================================
  Files         873      875       +2     
  Lines      275524   280881    +5357     
==========================================
+ Hits       257406   262491    +5085     
- Misses      18118    18390     +272     

☔ View full report in Codecov by Sentry.

@nihui
Member

nihui commented Mar 30, 2026

  PR #6618 Review: DeformableConv2D Vulkan Shader Implementation

  Overview

  This PR adds Vulkan GPU acceleration for DeformableConv2D with three execution paths: packed (basic), packed GEMM, and cooperative matrix (CM) GEMM. It spans ~2,680 lines across 5 files.

  ---
  🐛 Potential Logic Bugs

  1. Inconsistent boundary check in bilinear interpolation between packed and CM shaders

  - deformableconv2d_packed.comp (line ~124): Uses strict inequality:
  bool cond = h_im > -1.f && w_im > -1.f && h_im < float(psc(h)) && w_im < float(psc(w));
  - deformableconv2d_gemm_cm.comp (line ~87): Uses <= / >=:
  if (h_im <= -1.f || w_im <= -1.f || h_im >= float(psc(h)) || w_im >= float(psc(w)))
      return 0.f;
  - These are actually logically equivalent (> -1 vs not <= -1), so this is fine, but the style inconsistency makes review harder.
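  The equivalence is just De Morgan's law over the four comparisons. A standalone C++ sketch of the two predicates (not shader code; the 16x16 bounds are arbitrary) confirms they agree everywhere, including exactly on the -1 and h/w edges:

  ```cpp
  #include <assert.h>

  // Packed shader keeps the bilinear sample when this holds:
  bool keep_packed(float h_im, float w_im, float h, float w) {
      return h_im > -1.f && w_im > -1.f && h_im < h && w_im < w;
  }

  // CM shader rejects the sample (returns 0) when this holds:
  bool reject_cm(float h_im, float w_im, float h, float w) {
      return h_im <= -1.f || w_im <= -1.f || h_im >= h || w_im >= w;
  }

  int main(void) {
      // Sweep a grid crossing both the -1 and the h/w boundaries exactly.
      for (float y = -2.f; y <= 17.f; y += 0.25f)
          for (float x = -2.f; x <= 17.f; x += 0.25f)
              assert(keep_packed(y, x, 16.f, 16.f) == !reject_cm(y, x, 16.f, 16.f));
      return 0;
  }
  ```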

  2. out_elempack == 1 output indexing in packed shader may be wrong ⚠

  In deformableconv2d_packed.comp, the scalar output path:
  const int channel_step = psc(outcstep) / 4;
  const int gi = gz * psc(outcstep) + gy * psc(outw) + gx;
  Here gi is computed using the packed outcstep (the stride for out_elempack=4 blobs), while channel_step is derived by dividing outcstep by 4. The base index gi = gz * psc(outcstep) treats gz as a pack4 group index, writing sum.r to gi, sum.g to gi + channel_step, etc. This works out because gz ranges over outc (which is num_output_packed/4) and, for out_elempack == 1, the C++ code sets psc(outcstep) to top_blob.cstep * 4, so channel_step = outcstep / 4 = top_blob.cstep. This appears correct but is fragile and non-obvious.

  3. weight_data_packed is padded to num_output_packed, but the CM weight path relies on its own tile-size padding ⚠

  In create_pipeline():
  - Standard path: num_output_packed = (num_output + 3) / 4 * 4 — weights are padded to a multiple of 4.
  - CM path: weight_data_cm_packed is sized based on coopmat_N * UNROLL_SG_N * UNROLL_WG_N blocks, which is correct for CM tile alignment. But the boundary check gni < num_output is only in the weight packing loop — if the shader reads beyond packed weight bounds due to tile size rounding, it could read garbage. The *p++ = 0.f zero-fill in the host-side packing handles this, so it's safe.

  4. CM shader bias loading broadcasts row-major but bias is 1D ⚠

  coopMatLoad(bias, bias_data, ((wgni * UNROLL_WG_N + sgni) * UNROLL_SG_N + zn) * Nd4, 0, gl_CooperativeMatrixLayoutRowMajor);
  The bias is a 1D vector, loaded with stride 0 and RowMajor layout. This means every row of the MxN cooperative matrix gets the same N-element bias row, which is the correct broadcast for convolution bias. ✅

  5. Missing weight_data.release() for non-CM path 🐛

  In upload_model():
  cmd.record_upload(weight_data_packed, weight_data_gpu, opt);
  weight_data_packed.release();
  After uploading, weight_data_packed is released but weight_data itself (the original CPU weight data from the parent class) is never released. In the CM path, neither weight_data is released. In similar ncnn layers (e.g., convolution_vulkan.cpp), weight_data.release() is called to free CPU memory after GPU upload. This is a memory leak for the original weight data on CPU side.

  6. bias_data.release() only called when bias_term is true

  This is correct — bias_data only has content when bias_term == 1. ✅

  ---
  ⚠ Suboptimal Implementation Issues

  7. Redundant per-invocation offset/mask reads in CM shader

  In the cooperative matrix shader, each thread independently reads offset and mask data via get_offset() / get_mask() for every spatial position in its tile. For a cooperative matrix tile of size M, this means many threads in the same subgroup read the same offset values for the same spatial position but different input channels. The offset/mask only depends on (sy, sx, kernel_position), not on the input channel.

  Suggestion: Precompute offset+mask values per spatial position and cache in shared memory, then reuse across channel iterations.
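  A back-of-envelope model of the savings (tile sizes are hypothetical):

  ```cpp
  #include <assert.h>

  // Offset/mask fetches if every channel iteration re-reads them per (position, tap).
  long naive_offset_reads(int positions, int maxk, int channel_iters) {
      return (long)positions * maxk * channel_iters;
  }

  // Fetches if offsets are staged in shared memory once per (position, tap).
  long cached_offset_reads(int positions, int maxk) {
      return (long)positions * maxk;
  }

  int main(void) {
      // Hypothetical tile: 16 spatial positions, 3x3 kernel, 64 channel iterations.
      // Caching cuts offset traffic by exactly the channel-iteration factor.
      assert(naive_offset_reads(16, 9, 64) / cached_offset_reads(16, 9) == 64);
      return 0;
  }
  ```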

  8. Massive code duplication in CM shader

  The cooperative matrix shader has ~1385 lines with 3 near-identical code paths:
  - kk >= UNROLL_SG_K * 2 (prefetch pipeline)
  - kk >= UNROLL_SG_K (single batch)
  - k < kk (tail loop)

  Each duplicates the full tile loading, offset computation, and cooperative matrix multiply logic. This makes maintenance error-prone. Consider extracting common logic into helper functions or macros.

  9. elempack == 4 weight repacking for CM has strided reads with poor cache behavior

  for (int i = 0; i < num_output; i++)
      for (int j = 0; j < num_input / 4; j++)
          for (int k = 0; k < maxk; k++)
              weight_data_r2[...] = weight_data[(i * num_input + j * 4) * maxk + k];
  The inner loop strides through weight_data with step maxk (for different input channels j4, j4+1, etc.), causing cache thrashing for large weight tensors. This only runs once during initialization, so it's not a hot path, but could be improved.
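  One possible improvement is a plain loop interchange so the innermost reads walk the source sequentially. A sketch with a made-up destination layout (dst_index is illustrative, not the PR's actual packing) showing the interchange preserves the repacked result:

  ```cpp
  #include <assert.h>
  #include <vector>

  // Hypothetical destination layout: pack4 input-channel lanes stored contiguously.
  static int dst_index(int i, int j, int k, int lane, int num_input, int maxk) {
      return ((i * (num_input / 4) + j) * maxk + k) * 4 + lane;
  }

  // PR-style loop order: the inner lane reads jump maxk elements apart.
  std::vector<float> repack_strided(const std::vector<float>& w,
                                    int num_output, int num_input, int maxk) {
      std::vector<float> out(w.size());
      for (int i = 0; i < num_output; i++)
          for (int j = 0; j < num_input / 4; j++)
              for (int k = 0; k < maxk; k++)
                  for (int lane = 0; lane < 4; lane++)
                      out[dst_index(i, j, k, lane, num_input, maxk)]
                          = w[(i * num_input + j * 4 + lane) * maxk + k];
      return out;
  }

  // Interchanged loops: the inner k loop now reads the source sequentially.
  std::vector<float> repack_sequential(const std::vector<float>& w,
                                       int num_output, int num_input, int maxk) {
      std::vector<float> out(w.size());
      for (int i = 0; i < num_output; i++)
          for (int j = 0; j < num_input / 4; j++)
              for (int lane = 0; lane < 4; lane++)
                  for (int k = 0; k < maxk; k++)
                      out[dst_index(i, j, k, lane, num_input, maxk)]
                          = w[(i * num_input + j * 4 + lane) * maxk + k];
      return out;
  }

  int main(void) {
      std::vector<float> w(4 * 8 * 9);
      for (size_t n = 0; n < w.size(); n++) w[n] = (float)n;
      // Same repacked bytes, friendlier source traversal.
      assert(repack_strided(w, 4, 8, 9) == repack_sequential(w, 4, 8, 9));
      return 0;
  }
  ```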

  10. GEMM dispatcher rounding: (top_blob.w * top_blob.h + 3) / 4

  In the GEMM forward path:
  dispatcher.w = (top_blob.w * top_blob.h + 3) / 4;
  This means up to 3 extra work items are dispatched. The shader must handle out-of-bounds checks for these extra items. This is standard practice, but it should be verified that the shader does check bounds (it likely does via gx >= outw*outh style checks).
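  The rounding contract can be stated as a tiny check (plane sizes are arbitrary):

  ```cpp
  #include <assert.h>

  // Work groups along x when each invocation covers 4 spatial positions.
  int dispatch_w(int outw, int outh) {
      return (outw * outh + 3) / 4;
  }

  int main(void) {
      const int sizes[][2] = { {1, 1}, {5, 3}, {4, 4}, {17, 1}, {640, 481} };
      for (int n = 0; n < 5; n++) {
          const int total = sizes[n][0] * sizes[n][1];
          const int covered = dispatch_w(sizes[n][0], sizes[n][1]) * 4;
          assert(covered >= total);      // every position gets an invocation
          assert(covered - total <= 3);  // at most 3 out-of-range extras, which
                                         // the shader must reject with a bounds check
      }
      return 0;
  }
  ```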

  11. channels variable uses wrong value for GEMM condition check

  In forward():
  int channels = bottom_blob_bordered.c;
  // ...
  if (opt.use_sgemm_convolution && channels * maxk >= 8 && num_output >= 8 ...)
  Here channels is the packed channel count (num_input / elempack), not the raw input channels. The condition channels * maxk >= 8 is checking packed channels × kernel size, which is a different threshold than what might be intended. For elempack=4, a model with 8 raw input channels would have channels=2, and with a 1×1 kernel, channels * maxk = 2, which would skip the GEMM path even though there are enough operations to benefit from it. create_pipeline() uses num_input * maxk >= 8 (raw channels), creating an inconsistency. This is a logic bug — the GEMM pipeline might be created but never used, or the condition could mismatch.
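  A minimal model of the mismatch, using the two conditions as quoted above (the helper functions and constants are hypothetical, not code from the PR):

  ```cpp
  #include <assert.h>

  bool gemm_pipeline_created(int num_input, int maxk, int num_output) {
      return num_input * maxk >= 8 && num_output >= 8;  // create_pipeline(): raw channels
  }

  bool gemm_path_taken(int num_input, int elempack, int maxk, int num_output) {
      const int channels = num_input / elempack;        // VkMat::c of the packed blob
      return channels * maxk >= 8 && num_output >= 8;   // forward(): packed channels
  }

  int main(void) {
      // 8 raw channels, pack4, 1x1 kernel: pipeline is built but never dispatched.
      assert(gemm_pipeline_created(8, 1, 16));
      assert(!gemm_path_taken(8, 4, 1, 16));
      return 0;
  }
  ```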

  12. No VkImageMemoryBarrier / explicit synchronization for convert_packing

  The forward() function calls vkdev->convert_packing() on offset and mask tensors but doesn't appear to insert explicit pipeline barriers between the packing conversion and the main dispatch. This should be handled internally by convert_packing and record_pipeline, but it's worth verifying.

  13. Shared memory sizing in CM shader uses specialization constants

  shared uvec2 tmp_v[UNROLL_WG_M][UNROLL_SG_M * UNROLL_SG_K * M * K / 4];
  shared uvec2 tmp_k[UNROLL_WG_N][UNROLL_SG_N * UNROLL_SG_K * K * N / 4];
  shared uvec2 tmp_o[UNROLL_WG_N * UNROLL_WG_M][UNROLL_SG_N * UNROLL_SG_M * M * N / 4];
  With default values (all 2, M=N=K=1), this computes to very small sizes. But with actual runtime values (e.g., M=N=16, K=16, UNROLL=2), this could exceed the shared memory limit (typically 32KB or 48KB). There's no compile-time or runtime check for this.
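  A host-side sanity check could catch this before pipeline creation. A sketch of the arithmetic (tile and unroll values are hypothetical; uvec2 is 8 bytes):

  ```cpp
  #include <assert.h>

  // Shared bytes for the three tmp_* arrays: element counts times sizeof(uvec2) == 8.
  long shared_bytes(int M, int N, int K, int wg_m, int wg_n,
                    int sg_m, int sg_n, int sg_k) {
      long v = (long)wg_m * (sg_m * sg_k * M * K / 4) * 8;  // tmp_v
      long k = (long)wg_n * (sg_n * sg_k * K * N / 4) * 8;  // tmp_k
      long o = (long)(wg_n * wg_m) * (sg_n * sg_m * M * N / 4) * 8;  // tmp_o
      return v + k + o;
  }

  int main(void) {
      // All unrolls at 2 with 16x16x16 tiles: 16 KiB, the Vulkan minimum guarantee.
      assert(shared_bytes(16, 16, 16, 2, 2, 2, 2, 2) == 16 * 1024);
      // Subgroup unrolls at 4: 64 KiB, over the common 32/48 KiB limits.
      assert(shared_bytes(16, 16, 16, 2, 2, 4, 4, 4) == 64 * 1024);
      return 0;
  }
  ```

  In the layer, a computation like this could gate the CM path against VkPhysicalDeviceLimits::maxComputeSharedMemorySize.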

  ---
  Summary of Key Issues

  ┌─────┬──────────┬───────────────────────────────────────────────────────────────────────────────────────┐
  │  #  │ Severity │                                         Issue                                         │
  ├─────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ 5   │ 🐛 Bug   │ weight_data CPU memory not released after GPU upload                                  │
  ├─────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ 11  │ 🐛 Bug   │ GEMM condition in forward() uses packed channels vs raw channels in create_pipeline() │
  ├─────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ 7   │ ⚠ Perf  │ Redundant offset/mask reads per-thread in CM shader                                   │
  ├─────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ 8   │ ⚠ Maint │ ~1000 lines of duplicated code in CM shader                                           │
  ├─────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ 13  │ ⚠ Risk  │ No shared memory overflow guard in CM shader                                          │
  ├─────┼──────────┼───────────────────────────────────────────────────────────────────────────────────────┤
  │ 9   │ 💡 Minor │ Cache-unfriendly weight repacking loop                                                │
  └─────┴──────────┴───────────────────────────────────────────────────────────────────────────────────────┘

Contributor

Copilot AI left a comment


Pull request overview

Adds a Vulkan backend for DeformableConv2D, including packed and GEMM-based compute shaders (with an optional cooperative-matrix path) and the corresponding Vulkan layer implementation.

Changes:

  • Introduces Vulkan compute shaders for deformable conv2d in packed, packed+GEMM, and cooperative-matrix GEMM variants.
  • Adds DeformableConv2D_vulkan layer implementation with pipeline creation, weight packing/upload, and runtime dispatch (mask/no-mask, GEMM vs non-GEMM, coop-mat vs non-coop).
  • Implements weight packing for both standard packed and cooperative-matrix layouts.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/layer/vulkan/shader/deformableconv2d_packed.comp Packed direct deformable-conv compute shader (mask optional).
src/layer/vulkan/shader/deformableconv2d_packed_gemm.comp Packed GEMM-style deformable-conv shader processing 4 spatial positions per invocation (mask optional).
src/layer/vulkan/shader/deformableconv2d_gemm_cm.comp Cooperative-matrix GEMM deformable-conv shader (fp16 storage path, mask optional).
src/layer/vulkan/deformableconv2d_vulkan.h Declares the Vulkan layer, pipelines, and cooperative-matrix tuning parameters.
src/layer/vulkan/deformableconv2d_vulkan.cpp Implements pipeline creation, weight packing/upload, padding handling, and runtime dispatch across shader variants.


return -100;

const int num_output_packed = (num_output + 3) / 4 * 4;
const int c_iterations = channels / elempack;

Copilot AI Mar 30, 2026


channels is already the packed channel count (VkMat::c) for the bordered input. Dividing by elempack makes c_iterations incorrect when elempack>1 (e.g., pack4), and it no longer matches the value baked into the shader specialization constants (num_input / elempack). This can break execution if psc(c) falls back to push constants (e.g., dynamic specialization), and it’s also misleading even if currently unused. Use the packed channel count directly (or derive from the known num_input / elempack) instead of channels / elempack.

Suggested change
const int c_iterations = channels / elempack;
const int c_iterations = channels;

