
conv1d gemm with cm#6587

Open
futz12 wants to merge 4 commits into Tencent:master from futz12:conv1d-opt

Conversation

@futz12
Contributor

@futz12 futz12 commented Mar 8, 2026

No description provided.

@nihui nihui closed this Mar 8, 2026
@nihui nihui reopened this Mar 8, 2026
@codecov-commenter

codecov-commenter commented Mar 8, 2026

Codecov Report

❌ Patch coverage is 50.62893% with 157 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.15%. Comparing base (2490d63) to head (22a7d4d).

Files with missing lines Patch % Lines
src/layer/vulkan/convolution1d_vulkan.cpp 50.62% 157 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6587      +/-   ##
==========================================
- Coverage   93.22%   93.15%   -0.07%     
==========================================
  Files         904      904              
  Lines      283731   284182     +451     
==========================================
+ Hits       264495   264723     +228     
- Misses      19236    19459     +223     

☔ View full report in Codecov by Sentry.

@futz12 futz12 changed the title conv1d gemm [WIP]conv1d gemm Mar 8, 2026
@futz12
Contributor Author

futz12 commented Apr 2, 2026

@nihui run ci please ^_^

@tencent-adm
Member

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@futz12 futz12 changed the title [WIP]conv1d gemm conv1d gemm with gemm & cm Apr 2, 2026
@futz12 futz12 changed the title conv1d gemm with gemm & cm conv1d gemm with cm Apr 2, 2026
@nihui
Member

nihui commented Apr 2, 2026

  ---
  PR Overview

  This PR adds GEMM-based and 1x1s1d1 optimized paths for Convolution1D on the Vulkan backend, with cooperative matrix (CM) support. It modifies convolution1d_vulkan.cpp/.h and adds 4 new shader files.

  ---
  Potential Logic Errors

  1. Missing assertion / no fallback when cooperative matrix query returns zeros

  In both the GEMM and 1x1s1d1 CM paths:
  // assert coopmat_M != 0 && coopmat_N != 0 && coopmat_K != 0
  This is only a comment, not an actual assertion or check. If get_optimal_cooperative_matrix_mnk returns zeros (e.g., no supported configuration), the code proceeds to divide by zero in UNROLL_SG_* calculations and creates invalid
  pipelines. This should be a real check with a fallback to the non-CM path.
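  A minimal sketch of what a real guard could look like (the helper name is illustrative, not from the PR):

```cpp
#include <cassert>

// Illustrative guard: treat a zero in any queried dimension as "no supported
// cooperative matrix configuration" so the caller can fall back safely.
static bool coopmat_config_valid(int coopmat_M, int coopmat_N, int coopmat_K)
{
    return coopmat_M > 0 && coopmat_N > 0 && coopmat_K > 0;
}
```

  In create_pipeline this would gate the CM path, e.g. setting `use_cooperative_matrix = false` when the check fails, so the non-CM pipelines are created instead of dividing by zero.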

  2. Weight repacking in the GEMM CM path uses weight_data directly when elempack != 4

  if (elempack == 4) {
      // repack from maxk-inch-outch to 4-maxk-inch/4-outch
      weight_data_r2.create(4 * maxk * (num_input / 4) * num_output);
      ...
  } else {
      weight_data_r2 = weight_data;  // flat, not reshaped yet
  }
  // from maxk-inch-outch to inch-maxk-outch
  Mat weight_data_r2_r = weight_data_r2.reshape(maxk, num_input, num_output);

  When elempack == 4, weight_data_r2 is created with dimensions 4 * maxk * (num_input / 4) * num_output which equals maxk * num_input * num_output -- the same total element count. But the memory layout has been changed to
  4-maxk-inch/4-outch, so the subsequent .reshape(maxk, num_input, num_output) interprets this rearranged data as if it were still in the original layout. The reshape doesn't re-interpret correctly after the repack -- the data is in
  [outch][inch/4][maxk][4] order but reshape treats it as [outch][inch][maxk]. This is likely a data layout mismatch bug that would produce incorrect weight values in the CM path when elempack == 4.
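  The mismatch can be seen by comparing the flat offsets the two layouts assign to the same weight element (a hypothetical index sketch, not code from the PR):

```cpp
// Offset of weight (kx, ic, oc) in the repacked 4-maxk-inch/4-outch layout,
// i.e. [outch][inch/4][maxk][4].
static int offset_repacked(int kx, int ic, int oc, int maxk, int inch)
{
    return ((oc * (inch / 4) + ic / 4) * maxk + kx) * 4 + ic % 4;
}

// Offset the subsequent reshape(maxk, num_input, num_output) assumes,
// i.e. the original [outch][inch][maxk] layout.
static int offset_reshape(int kx, int ic, int oc, int maxk, int inch)
{
    return (oc * inch + ic) * maxk + kx;
}
```

  For maxk = 3, inch = 8, element (kx=1, ic=0, oc=0) sits at offset 4 after the repack but the reshape reads it from offset 1, so most weights land in the wrong place.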

  3. Hardcoded size = 1024 for cooperative matrix tile calculation

  int size = 1024;  // used for UNROLL_SG_M and UNROLL_WG_M

  This is used as a proxy for the spatial dimension (outw) in the M-dimension tile calculations. However, the actual output width (outw) is only known at forward time, not create_pipeline time. This means:
  - If outw is much smaller than 1024, the unrolling factors will be too aggressive, causing many wasted threads processing out-of-bounds elements.
  - If outw is much larger than 1024, the tile factors might be suboptimal (though capped at 2, so less harmful).

  This is arguably a design trade-off, but the magic number should at minimum be documented.

  4. elempack calculated differently at create_pipeline vs. forward time

  In create_pipeline:
  int elempack = num_input % 4 == 0 ? 4 : 1;

  In forward:
  int elempack = bottom_blob.elempack;  // comes from the actual input blob

  Then num_input in forward is:
  const int num_input = bottom_blob_bordered.h * elempack;

  If the runtime elempack differs from the one assumed at pipeline creation, the weight layout and specialization constants will be inconsistent with the actual data layout, producing incorrect results. There's no guard checking that
  the runtime elempack matches expectations.
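  A sketch of such a guard, assuming the create_pipeline formula is kept as the single source of truth:

```cpp
// Elempack the pipelines were specialized for at create_pipeline time
// (same formula the PR uses in create_pipeline).
static int expected_elempack(int num_input)
{
    return num_input % 4 == 0 ? 4 : 1;
}
```

  forward could then bail out to a fallback path (or repack the blob) when `bottom_blob.elempack != expected_elempack(num_input)` instead of silently computing with a mismatched layout.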

  5. Non-CM GEMM path: dispatcher dimensions don't match workgroup logic

  // Non-CM GEMM forward:
  VkMat dispatcher;
  dispatcher.w = (top_blob.w + 3) / 4;
  dispatcher.h = num_output_packed / 4;
  dispatcher.c = 1;

  But the local size was set as:
  pipeline_convolution1d_gemm->set_local_size_xyz(8, 8, 1);
  // or
  pipeline_convolution1d_gemm->set_local_size_xyz(16, std::min(4, num_output_packed / 4), 1);

  The dispatcher divides top_blob.w by 4 (rounding up), which implies each invocation processes 4 output elements spatially, while the local size in x is 8 or 16, so a single workgroup spans 32 or 64 output elements. Whether this matches the shader logic is unclear without the shader, but the asymmetry between the (w + 3) / 4 dispatch and the 8/16 local sizes is suspicious and may cause either redundant computation or missing coverage.

  6. Bindings inconsistency in non-CM paths

  // Non-CM GEMM forward:
  std::vector<VkMat> bindings(6);
  bindings[0] = bottom_blob_bordered;
  bindings[1] = top_blob;
  bindings[2] = bottom_blob_bordered;  // duplicate of bindings[0]
  bindings[3] = top_blob;              // duplicate of bindings[1]
  bindings[4] = weight_data_gpu;
  bindings[5] = bias_data_gpu;

  Bindings 0-1 and 2-3 are duplicated. This is a common pattern in ncnn for supporting different storage types (image vs. buffer), but it's worth verifying these 6 bindings match what the convolution1d_packed_gemm shader expects. If the
   shader only declares 4 descriptor bindings, this would be a mismatch.

  ---
  Suboptimal Implementation Issues

  7. Massive code duplication between GEMM and 1x1s1d1 CM paths

  The cooperative matrix pipeline creation code for GEMM and 1x1s1d1 is nearly identical (~100 lines each):
  - Same UNROLL calculation logic
  - Same weight packing loop structure (with minor dimension differences)
  - Same specialization constant setup
  - Same pipeline creation

  This should be refactored into a shared helper function with parameters for the dimension differences.
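  One possible shape for that helper, sketched with hypothetical names (the real field set would follow whatever the two paths actually vary):

```cpp
// Hypothetical parameter bundle for a shared CM pipeline-creation helper.
// create_pipeline would build one of these per path and call the helper
// twice instead of duplicating ~100 lines.
struct CoopmatPipelineDesc
{
    int coopmat_M, coopmat_N, coopmat_K; // queried tile shape
    int maxk;                            // kernel extent, 1 for the 1x1s1d1 path
    int num_input;
    int num_output;
    int shader_index;                    // which of the 4 new shaders to bind
};
```

  The weight-packing loop and specialization-constant setup would then read their dimensions from the struct rather than from path-specific locals.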

  8. Weight packing has O(n^4) nested loops with poor cache locality

  In the CM weight packing:
  for (int q = 0; q < num_output_packed; q += coopmat_N * UNROLL_SG_N * UNROLL_WG_N) {
      for (int p = 0; p < num_input_packed * maxk; p += coopmat_K * UNROLL_SG_K) {
          for (int n = 0; n < coopmat_N * UNROLL_SG_N * UNROLL_WG_N; n++) {
              for (int k = 0; k < coopmat_K * UNROLL_SG_K; k++) {
                  // random access into weight_data_r2_r via .channel().row()[]

  The innermost loop accesses weight_data_r2_r.channel(q + n).row(w)[kx] which involves channel + row pointer arithmetic on every iteration. This has poor spatial locality and could be significantly improved by hoisting the channel/row
  lookups.
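  A hoisted version of the inner loops might look like this, sketched against a plain array standing in for weight_data_r2_r:

```cpp
#include <vector>

// Copy one packed block row by row, computing the source and destination
// base pointers once per row instead of re-deriving channel()/row() for
// every element in the innermost loop.
static void pack_block(const std::vector<float>& src, std::vector<float>& dst,
                       int rows, int cols)
{
    for (int n = 0; n < rows; n++)
    {
        const float* s = src.data() + n * cols; // hoisted row base pointer
        float* d = dst.data() + n * cols;
        for (int k = 0; k < cols; k++)
            d[k] = s[k]; // contiguous access, good spatial locality
    }
}
```

  In the actual packing code the hoisted pointer would come from one `channel(q + n).row(w)` call per row rather than one per element.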

  9. Redundant num_input_packed / num_output_packed calculations

  These are computed identically ((x + 3) / 4 * 4) in multiple branches. They should be computed once at the top of create_pipeline and reused.
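  The repeated expression is just a round-up to the next multiple of 4; hoisting it into a small helper (illustrative) also documents the intent:

```cpp
// Round x up to the next multiple of 4 (the padded pack count),
// equivalent to the repeated (x + 3) / 4 * 4 expression.
static int align_up4(int x)
{
    return (x + 3) / 4 * 4;
}
```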

  10. No UNROLL_WG_K parameter

  The unrolling scheme defines UNROLL_WG_M and UNROLL_WG_N for the workgroup level but only UNROLL_SG_K for the K dimension. There's no UNROLL_WG_K. This is asymmetric -- on hardware where the K dimension is large (e.g., large num_input * maxk), having a workgroup-level K unroll could improve occupancy. This may be intentional but limits performance for K-dominant configs.

  11. destroy_pipeline resets cooperative matrix state unnecessarily

  use_cooperative_matrix = false;
  coopmat_M = 0; coopmat_N = 0; coopmat_K = 0;
  ...
  UNROLL_SG_M = 1; UNROLL_SG_N = 1; ...

  These member variables are always fully set in create_pipeline. Resetting them in destroy_pipeline is harmless but unnecessary and adds maintenance burden.

  12. Missing elempack == 8 support

  The code only handles elempack of 1 or 4. Many modern Vulkan implementations support 8-wide packing. This limits performance on hardware that could benefit from wider packing.

  ---
  Summary

  ┌──────┬──────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
  │      │ Severity │ Issue                                                                                    │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 1    │ High     │ No real check when coopmat MNK query returns 0 -- division by zero risk                  │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 2    │ High     │ Weight repack for elempack==4 in GEMM CM path likely produces wrong layout after reshape │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 3    │ Medium   │ Hardcoded size=1024 may cause suboptimal tiling for very small/large spatial dims        │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 4    │ Medium   │ elempack mismatch between create_pipeline and forward has no runtime guard               │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 5    │ Medium   │ Dispatcher vs. local_size arithmetic mismatch in non-CM GEMM path                        │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 6    │ Low      │ Duplicate bindings should be verified against shader descriptor layout                   │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 7    │ Low      │ Significant code duplication between GEMM and 1x1s1d1 CM paths                           │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 8    │ Low      │ Poor cache locality in CM weight packing inner loops                                     │
  ├──────┼──────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
  │ 9-12 │ Low      │ Minor: redundant calculations, missing UNROLL_WG_K, unnecessary resets, no elempack=8    │
  └──────┴──────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

  Issues #1 and #2 are likely correctness bugs that should be addressed before merge.
