
Add 2D indexing matmul test case #81

Open
jackalcooper wants to merge 8 commits into main from
allow-gpu-mod-merge-another-mod

Conversation

jackalcooper (Contributor) commented Nov 25, 2025

Summary by CodeRabbit

  • New Features

    • Added GPU vector-add workflow, a reusable KernelUtil for copying terms to float buffers, and a new 2D-indexing matmul variant.
  • Refactor

    • Reorganized matmul into Index1D/Index2D modules and centralized grid/block normalization; JIT cloning now preserves GPU container metadata; added an extra optimization pass in the IR pipeline.
  • Tests

    • Updated tests to match module reorganization and removed the duplicated VecAddKernel test module.
  • Chores

    • Minor cleanups replacing temporary zero variables with literals.


coderabbitai bot (Contributor) commented Nov 25, 2025

Walkthrough

A new KernelUtil.copy_terms_as_floats helper was added. MatMul kernels were reorganized into MatMulKernel.Index1D, MatMulKernel.Index2D, and MatMulKernel.Square.Index1D. A new VecAddKernel was added. GPU launch dim normalization and JIT cloning for gpu.module/gpu.container_module were introduced. Tests updated to new module paths.

Changes

Cohort / File(s) Summary
KernelUtil Module
bench/gpu/kernel_util.ex
New module providing copy_terms_as_floats(env, tail :: Pointer.t(Term.t()), arr :: Pointer.t(f32())) that iterates a Term list, converts elements to float (f32) and writes them into a device/host pointer buffer.
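The f64-to-f32 narrowing this helper performs can be illustrated host-side in plain Elixir by round-tripping a 64-bit float through a 32-bit IEEE-754 binary. This is a hypothetical sketch only — the real `copy_terms_as_floats` runs inside a Charms kernel and writes into a device/host pointer buffer; `F32Demo` and its functions are names invented for illustration:

```elixir
defmodule F32Demo do
  # Round-trip a 64-bit Elixir float through a 32-bit IEEE-754 binary
  # to observe the precision lost by narrowing to f32.
  def to_f32(x) when is_float(x) do
    <<y::float-32>> = <<x::float-32>>
    y
  end

  # Copy a list of terms into a "buffer" (here just a list) as f32 values,
  # mirroring the convert-then-write loop described above.
  def copy_terms_as_floats(terms) do
    Enum.map(terms, fn t -> t |> to_float() |> to_f32() end)
  end

  defp to_float(t) when is_integer(t), do: t * 1.0
  defp to_float(t) when is_float(t), do: t
end
```

For example, `F32Demo.to_f32(0.1)` yields a value close to but not equal to `0.1`, which is exactly the precision trade-off the review discusses below.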
MatMulKernel Reorganization
bench/gpu/matmul.ex
Split/renamed kernels: 1D and square kernels moved to MatMulKernel.Index1D and MatMulKernel.Square.Index1D; added MatMulKernel.Index2D. Replaced prior in-module term-copy helper with KernelUtil.copy_terms_as_floats and adjusted launch/memory logic for 2D grid/block.
VecAddKernel Module
bench/gpu/vec_add.ex
New VecAddKernel offering defk vec_add/3, defk noop/0, defk barrier/0, and defm main/2; allocates GPU memory, uses KernelUtil.copy_terms_as_floats for inputs, launches kernels, copies results back, and returns an Elixir list; includes random_floats/0.
GPU Launch Normalization
lib/charms/gpu.ex
Added private normalize_dims(dims, ir) and replaced inline grid/block normalization in GPU.launch with this helper.
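A plausible shape for such a normalizer, sketched in plain Elixir: pad whatever spec the caller passes (integer, list, or tuple) out to a full `{x, y, z}` triple. This is an assumption about the helper's behavior — the real `normalize_dims/2` in lib/charms/gpu.ex is private and also threads an MLIR `ir` argument, omitted here:

```elixir
defmodule DimDemo do
  # Normalize a user-facing dim spec into a full {x, y, z} triple,
  # padding missing axes with 1 (hypothetical sketch).
  def normalize_dims(dims) do
    case dims do
      {x, y, z} -> {x, y, z}
      {x, y} -> {x, y, 1}
      {x} -> {x, 1, 1}
      [x, y, z] -> {x, y, z}
      [x, y] -> {x, y, 1}
      [x] -> {x, 1, 1}
      val when is_integer(val) -> {val, 1, 1}
    end
  end
end
```

Note this sketch already includes the single-element tuple case `{x}` that a nitpick below suggests adding to the real helper.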
JIT Cloning Enhancements
lib/charms/jit.ex
Propagates gpu.container_module into cloned module when present and includes gpu.module in the set of MLIR ops cloned (expanded from only func.func and memref.global).
Tests: removal and updates
test/defk_test.exs, test/matmul_test.exs
Removed VecAddKernel definitions from test/defk_test.exs. Updated test/matmul_test.exs to reference MatMulKernel.Index1D, MatMulKernel.Square.Index1D, and new MatMulKernel.Index2D APIs and adjusted 2D signatures and call sites.
Pointer/Allocator API extension
lib/charms/pointer.ex
Added extra_arguments \\ [] parameter to private do_allocate/6 and passed it into allocator invocations (first argument) to support extended allocator metadata.
Definition pass insertion
lib/charms/definition.ex
Inserted Beaver.Composer.nested("func.func", "promote-buffers-to-stack") into the composer chain (new nested Beaver composer pass after canonicalization).
Small local refactors
bench/sort_util.ex, bench/vec_add_int_list.ex
Replaced temporary typed-zero locals with literal 0 assignments in a couple of places; no public API changes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Host as Host/Test
    participant VM as Elixir VM (defm)
    participant Util as KernelUtil
    participant GPU as GPU Runtime

    Host->>VM: call main(env, l_a, l_b)
    activate VM

    VM->>GPU: GPU.alloc (device a,b,c and host_buf)
    GPU-->>VM: device pointers

    VM->>Util: copy_terms_as_floats(env, l_a_tail, a_ptr)
    activate Util
    Util->>Util: traverse list -> Term.to_f64! -> truncate -> f32
    Util->>GPU: write f32 values into a_ptr
    Util-->>VM: done
    deactivate Util

    VM->>Util: copy_terms_as_floats(env, l_b_tail, b_ptr)
    activate Util
    Util->>GPU: write f32 values into b_ptr
    Util-->>VM: done
    deactivate Util

    VM->>GPU: GPU.launch(kernel, normalized_dims)
    activate GPU
    GPU->>GPU: kernel compute (vec_add / matmul)
    GPU-->>VM: kernel complete
    deactivate GPU

    VM->>GPU: GPU.memcpy(device c_ptr -> host_buf)
    GPU-->>VM: host_buf populated

    VM->>VM: build Elixir list from host_buf
    VM-->>Host: return result list
    deactivate VM

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Pay extra attention to:
    • Indexing, boundary checks, and address calculations in MatMulKernel.Index2D and MatMulKernel.Index1D.
    • Numeric conversion, truncation, and pointer writes in KernelUtil.copy_terms_as_floats.
    • Correctness of normalize_dims assumptions and its interaction with existing launch code.
    • JIT cloning logic around gpu.container_module and cloning of gpu.module operations.
    • Tests updated to new module paths and any mismatches in expected signatures.

Poem

🐰 I hop through lists and make them light,
I turn each term into a float just right.
Kernels leap in rows and grids,
buffers hum where data bids—
a tiny rabbit smiles at bytes ✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning — The PR title accurately reflects the main objective: adding 2D indexing support for matmul testing. However, the changes are significantly broader, including new GPU modules (VecAddKernel, KernelUtil), architectural refactoring (1D/2D kernel organization), IR optimization updates, and infrastructure improvements across multiple files. Consider a more comprehensive title like 'Refactor GPU kernels with 2D indexing support and add KernelUtil module' or 'Reorganize matmul kernels into 1D/2D variants and extract common utilities' to better reflect the substantial scope of changes.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 12.50%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d29fd16 and cbcfac6.

📒 Files selected for processing (2)
  • lib/charms/definition.ex (1 hunks)
  • lib/charms/pointer.ex (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and test (1.17)
  • GitHub Check: Build and test (1.18)
🔇 Additional comments (2)
lib/charms/definition.ex (1)

407-407: The pass name is valid and properly placed; the connection to 2D matmul test objectives is justified.

The "promote-buffers-to-stack" pass is a legitimate MLIR bufferization optimization that converts heap allocations to stack allocations when safe. This directly benefits GPU kernels like the 2D matmul implementation in bench/gpu/matmul.ex, where reducing heap allocations improves performance. The pipeline placement after canonicalization is correct, and this optimization is appropriate for the MatMul kernel reorganization mentioned in the PR summary.

lib/charms/pointer.ex (1)

11-11: The extra_arguments parameter design has already been validated by existing tests.

This is a new module added with the extra_arguments feature intentionally designed in from the start (not a retrofit). The test file test/string_test.exs exercises this code path by calling Pointer.allocate(Term.t()), which internally invokes do_allocate with the default extra_arguments=[]. Since the code compiles and tests pass, Beaver v0.4.7's MemRef.alloc/1 and MemRef.alloca/1 functions already support this calling convention. No additional verification is needed.



coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
lib/charms/gpu.ex (1)

41-56: Missing single-element tuple pattern {x}.

The helper handles [x] but not {x}. For consistency with other tuple patterns ({x, y}, {x, y, z}), consider adding the single-element tuple case:

         [x, y] -> {x, y, 1}
         [x] -> {x, 1, 1}
+        {x} -> {x, 1, 1}
         val -> {val, 1, 1}
bench/gpu/matmul.ex (1)

116-118: Consider consolidating random_list/1 into KernelUtil.

The random_list/1 function is duplicated in both Index1D and Index2D. Since KernelUtil already serves as a shared utility module, this helper could be moved there to reduce duplication.

# In bench/gpu/kernel_util.ex
def random_list(size) do
  Enum.map(1..size, fn _ -> :rand.uniform() end)
end

Also applies to: 359-361

bench/gpu/vec_add.ex (2)

5-6: Prefer integer math for @grid_size to avoid float/ceil/1 subtleties

@grid_size ceil(@size / @block_size) is mathematically fine but goes through float division and depends on ceil/1 being in scope for module attributes. A purely integer, “GPU-style” formula is more robust and avoids any surprises with numeric types:

-  @size 200_000
-  @block_size 1024
+  @size 200_000
+  @block_size 1024
@@
-  @grid_size ceil(@size / @block_size)
+  @grid_size div(@size + @block_size - 1, @block_size)

This keeps @grid_size as an integer, matches the classic ceil_div pattern, and removes the float dependency.

Please double‑check this against your target Elixir/Charms versions to ensure it matches how other kernels compute grid sizes.

Also applies to: 23-23


15-22: Clarify the necessity of noop and barrier kernel launches

Defining noop/0 and barrier/0 as tiny kernels and launching them after vec_add/3 is harmless, but it’s not obvious from this file what behavior they’re verifying (host-level sync vs. in-kernel control-flow semantics).

If they’re only here for experimentation or internal benchmarking, consider adding a short comment, or drop the extra launches if they’re not needed to keep the benchmark focused on vec_add/3.

Please confirm that these extra launches are intentional for exercising Charms/GPU behavior rather than leftover scaffolding.

Also applies to: 51-53

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a5daf50 and fe9c3b7.

📒 Files selected for processing (7)
  • bench/gpu/kernel_util.ex (1 hunks)
  • bench/gpu/matmul.ex (9 hunks)
  • bench/gpu/vec_add.ex (1 hunks)
  • lib/charms/gpu.ex (2 hunks)
  • lib/charms/jit.ex (1 hunks)
  • test/defk_test.exs (0 hunks)
  • test/matmul_test.exs (3 hunks)
💤 Files with no reviewable changes (1)
  • test/defk_test.exs
🧰 Additional context used
🧬 Code graph analysis (1)
test/matmul_test.exs (2)
bench/gpu/matmul.ex (3)
  • random_matrix (225-227)
  • random_list (116-118)
  • random_list (359-361)
test/support/cuda_test_helper.ex (1)
  • run_cuda_test (4-41)
🔇 Additional comments (11)
lib/charms/gpu.ex (1)

69-73: LGTM!

Good refactoring. The normalize_dims helper centralizes dimension handling, making the launch function cleaner and more maintainable.

lib/charms/jit.ex (1)

73-79: LGTM!

The changes correctly propagate the gpu.container_module attribute and expand the operation set to include gpu.module. This ensures GPU modules are properly cloned when merging MLIR modules.

test/matmul_test.exs (1)

73-87: LGTM!

The new 2D indexing matmul test follows the same pattern as existing tests and correctly:

  • Extracts dimensions at compile-time using module attributes
  • Generates appropriately-sized random input matrices
  • Validates against the CPU reference implementation with suitable tolerance
bench/gpu/matmul.ex (2)

251-254: Grid dimension calculation may cause incomplete coverage.

The grid dimensions are calculated as:

  • @grid_dim_x = ceil(@m / @block_dim_x) → maps to rows
  • @grid_dim_y = ceil(@n / @block_dim_y) → maps to columns

With @m=64, @block_dim_x=16: grid_dim_x = 4 ✓
With @n=32, @block_dim_y=16: grid_dim_y = 2 ✓

The current values work correctly. However, ensure this pattern holds for non-evenly-divisible dimensions in future changes.
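The classic integer ceil-division formula guarantees the coverage property this comment asks about: `grid_dim * block_dim >= n` for any positive `n` and `block_dim`. A minimal illustrative check in plain Elixir (the function names here are hypothetical, not part of the codebase):

```elixir
# Integer ceil-division: the grid-size formula recommended in this review.
ceil_div = fn n, d -> div(n + d - 1, d) end

# Coverage check: a grid of ceil_div(n, block) blocks of `block` threads
# reaches every index below n, including non-evenly-divisible cases.
covers? = fn n, block -> ceil_div.(n, block) * block >= n end

covers?.(33, 16) # a non-evenly-divisible case: 3 blocks cover 33 elements
```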


257-296: LGTM!

The 2D matmul kernel correctly implements:

  • 2D thread/block coordinate mapping
  • Proper bounds checking with row < @m && col < @n
  • Row-major memory access patterns for A, B, and C matrices
  • Correct dot-product accumulation over the K dimension
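The row-major address math the kernel uses — `a[row*k + t]` and `b[t*n + col]` — can be mirrored by a host-side CPU reference, which is also how a test like matmul_test.exs can validate GPU output. This is an illustrative sketch, not the project's actual reference implementation:

```elixir
defmodule MatMulRef do
  # CPU reference for row-major matmul: a is m x k, b is k x n,
  # both given as flat row-major lists of floats.
  def matmul(a, b, m, k, n) do
    a = List.to_tuple(a)
    b = List.to_tuple(b)

    for row <- 0..(m - 1), col <- 0..(n - 1) do
      # Same address calculations as the kernel: a[row*k + t], b[t*n + col]
      Enum.reduce(0..(k - 1), 0.0, fn t, acc ->
        acc + elem(a, row * k + t) * elem(b, t * n + col)
      end)
    end
  end
end
```

The result comes back as a flat row-major list of length `m * n`, matching the layout the kernel writes into C.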
bench/gpu/kernel_util.ex (2)

6-17: LGTM!

The copy_terms_as_floats function correctly:

  • Traverses an Elixir linked list using NIF semantics
  • Converts each term from f64 to f32 via truncation
  • Tracks array index for sequential writes

The callers (matmul kernels) allocate appropriately-sized buffers before calling this function, so bounds checking is handled at the call site.


14-14: Precision loss from f64→f32 truncation is intentional.

The conversion path Term.to_f64! → arith.truncf → f32 intentionally trades precision for GPU performance. This is appropriate for the matmul use case.

bench/gpu/vec_add.ex (4)

7-13: Vector-add kernel indexing and bounds check look correct

The i = GPU.block_id() * @block_size + GPU.thread_id() pattern combined with if i < @size correctly guards against overrun when @grid_size * @block_size > @size, and the c[i] = a[i] + b[i] write is straightforward.

No issues from a correctness perspective.


24-49: Host–device allocation and copy pipeline is consistent and resource-safe

The flow in main/3—allocate a/b/c plus a host_shared buffer, defer-deallocate all of them, reuse a movable_list_ptr to feed KernelUtil.copy_terms_as_floats/3, and then GPU.memcpy(... ) |> GPU.await() into a and b—is coherent and appears resource-safe.

Reusing the same buffer for both inputs is a nice touch; just ensure upstream callers always provide lists compatible with @size/buffer semantics as expected by KernelUtil.copy_terms_as_floats/3.

Please verify that KernelUtil.copy_terms_as_floats/3 guards against overreading when the input list is shorter than @size and against overwriting when it’s longer.


55-68: Output copy-back and Term/list conversion looks type-correct

  • GPU.memcpy(buffer, c) |> GPU.await() mirrors the earlier host→device copies.
  • for_loop {element, i} <- {buffer, size} followed by arith.extf to f64() and enif_make_double/2 is a sensible f32→f64 path.
  • Using new! Term.t(), size with defer free! arr plus enif_make_list_from_array(env, arr, size) ensures BEAM owns the final list and native memory is freed.

Using arith.trunci(size) :: i32() for the final length is also safe given @size = 200_000.

Looks solid overall.
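The f32-to-f64 copy-back direction has a neat host-side analogue in plain Elixir: a bitstring comprehension that decodes a native-endian f32 buffer into a list of (64-bit) Elixir floats. This is illustrative only — the real path runs inside a Charms defm and builds the list via enif_make_double and enif_make_list_from_array:

```elixir
# Decode a native-endian f32 buffer into a list of Elixir floats,
# widening each 32-bit value to Elixir's 64-bit float on the way out.
decode_f32_buffer = fn buffer when is_binary(buffer) ->
  for <<x::float-32-native <- buffer>>, do: x
end

# Build a sample 3-element f32 buffer and decode it back.
buffer = for x <- [1.0, 2.5, 4.0], into: <<>>, do: <<x::float-32-native>>
decode_f32_buffer.(buffer)
```

The sample values are all exactly representable in f32, so they round-trip without the precision loss discussed for the input path.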


70-72: random_floats/0 helper is simple and appropriate for a bench

Generating @size random floats in [0.0, 10.0) via Enum.map/2 and :rand.uniform/0 is fine for a benchmark helper. No issues here.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
bench/sort_util.ex (1)

41-48: Consider unifying zero initialization style in merge/4

Here i_ptr[0] is initialized with literal 0, while j_ptr[0] uses the zero = const 0 :: i32() temp which is now single‑use. For consistency and to drop an extra temp, you could either:

  • Inline 0 for j_ptr as well and remove zero, or
  • Use zero for both i_ptr and j_ptr.

Purely cosmetic; behavior is fine as‑is.

bench/vec_add_int_list.ex (1)

9-9: Minor style inconsistency: mixing literal and typed constant.

Line 9 now uses the literal 0 while lines 12-19 still use the typed zero constant for arithmetic. Consider using zero consistently throughout for clarity, or document why the literal is preferred for initialization.

If you prefer consistency, apply this diff:

-    set! i_ptr[0], 0
+    set! i_ptr[0], zero
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa0eaa2 and 8221370.

📒 Files selected for processing (3)
  • bench/gpu/kernel_util.ex (1 hunks)
  • bench/sort_util.ex (2 hunks)
  • bench/vec_add_int_list.ex (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • bench/gpu/kernel_util.ex
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and test (1.18)
  • GitHub Check: Build and test (1.17)
🔇 Additional comments (1)
bench/sort_util.ex (1)

6-23: Literal 0 initialization in copy_terms/3 looks good

Using set! i_ptr[0], 0 directly simplifies the setup without changing behavior; the loop logic remains correct and clear.
