
Add 2D indexing matmul test case #81

Open
jackalcooper wants to merge 8 commits into main from
allow-gpu-mod-merge-another-mod

Conversation

jackalcooper (Contributor) commented Nov 25, 2025

Summary by CodeRabbit

  • New Features

    • Added GPU vector-add workflow, a reusable KernelUtil for copying terms to float buffers, and a new 2D-indexing matmul variant.
  • Refactor

    • Reorganized matmul into Index1D/Index2D modules and centralized grid/block normalization; JIT cloning now preserves GPU container metadata; added an extra optimization pass in the IR pipeline.
  • Tests

    • Updated tests to match module reorganization and removed the duplicated VecAddKernel test module.
  • Chores

    • Minor cleanups replacing temporary zero variables with literals.


coderabbitai bot (Contributor) commented Nov 25, 2025

Walkthrough

A new KernelUtil.copy_terms_as_floats helper was added. MatMul kernels were reorganized into MatMulKernel.Index1D, MatMulKernel.Index2D, and MatMulKernel.Square.Index1D. A new VecAddKernel was added. GPU launch dim normalization and JIT cloning for gpu.module/gpu.container_module were introduced. Tests updated to new module paths.

Changes

Cohort / File(s) Summary
KernelUtil Module
bench/gpu/kernel_util.ex
New module providing copy_terms_as_floats(env, tail :: Pointer.t(Term.t()), arr :: Pointer.t(f32())) that iterates a Term list, converts elements to float (f32) and writes them into a device/host pointer buffer.
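The f64-to-f32 narrowing this helper performs can be illustrated host-side in plain Elixir by round-tripping a 64-bit float through a 32-bit IEEE-754 binary. This is a hypothetical sketch only — the real `copy_terms_as_floats` runs inside a Charms kernel and writes into a device/host pointer buffer; `F32Demo` and its functions are names invented for illustration:

```elixir
defmodule F32Demo do
  # Round-trip a 64-bit Elixir float through a 32-bit IEEE-754 binary
  # to observe the precision lost by narrowing to f32.
  def to_f32(x) when is_float(x) do
    <<y::float-32>> = <<x::float-32>>
    y
  end

  # Copy a list of terms into a "buffer" (here just a list) as f32 values,
  # mirroring the convert-then-write loop described above.
  def copy_terms_as_floats(terms) do
    Enum.map(terms, fn t -> t |> to_float() |> to_f32() end)
  end

  defp to_float(t) when is_integer(t), do: t * 1.0
  defp to_float(t) when is_float(t), do: t
end
```

For example, `F32Demo.to_f32(0.1)` yields a value close to but not equal to `0.1`, which is exactly the precision trade-off the review discusses below.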
MatMulKernel Reorganization
bench/gpu/matmul.ex
Split/renamed kernels: 1D and square kernels moved to MatMulKernel.Index1D and MatMulKernel.Square.Index1D; added MatMulKernel.Index2D. Replaced prior in-module term-copy helper with KernelUtil.copy_terms_as_floats and adjusted launch/memory logic for 2D grid/block.
VecAddKernel Module
bench/gpu/vec_add.ex
New VecAddKernel offering defk vec_add/3, defk noop/0, defk barrier/0, and defm main/2; allocates GPU memory, uses KernelUtil.copy_terms_as_floats for inputs, launches kernels, copies results back, and returns an Elixir list; includes random_floats/0.
GPU Launch Normalization
lib/charms/gpu.ex
Added private normalize_dims(dims, ir) and replaced inline grid/block normalization in GPU.launch with this helper.
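A plausible shape for such a normalizer, sketched in plain Elixir: pad whatever spec the caller passes (integer, list, or tuple) out to a full `{x, y, z}` triple. This is an assumption about the helper's behavior — the real `normalize_dims/2` in lib/charms/gpu.ex is private and also threads an MLIR `ir` argument, omitted here:

```elixir
defmodule DimDemo do
  # Normalize a user-facing dim spec into a full {x, y, z} triple,
  # padding missing axes with 1 (hypothetical sketch).
  def normalize_dims(dims) do
    case dims do
      {x, y, z} -> {x, y, z}
      {x, y} -> {x, y, 1}
      {x} -> {x, 1, 1}
      [x, y, z] -> {x, y, z}
      [x, y] -> {x, y, 1}
      [x] -> {x, 1, 1}
      val when is_integer(val) -> {val, 1, 1}
    end
  end
end
```

Note this sketch already includes the single-element tuple case `{x}` that a nitpick below suggests adding to the real helper.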
JIT Cloning Enhancements
lib/charms/jit.ex
Propagates gpu.container_module into cloned module when present and includes gpu.module in the set of MLIR ops cloned (expanded from only func.func and memref.global).
Tests: removal and updates
test/defk_test.exs, test/matmul_test.exs
Removed VecAddKernel definitions from test/defk_test.exs. Updated test/matmul_test.exs to reference MatMulKernel.Index1D, MatMulKernel.Square.Index1D, and new MatMulKernel.Index2D APIs and adjusted 2D signatures and call sites.
Pointer/Allocator API extension
lib/charms/pointer.ex
Added extra_arguments \\ [] parameter to private do_allocate/6 and passed it into allocator invocations (first argument) to support extended allocator metadata.
Definition pass insertion
lib/charms/definition.ex
Inserted Beaver.Composer.nested("func.func", "promote-buffers-to-stack") into the composer chain (new nested Beaver composer pass after canonicalization).
Small local refactors
bench/sort_util.ex, bench/vec_add_int_list.ex
Replaced temporary typed-zero locals with literal 0 assignments in a couple of places; no public API changes.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Host as Host/Test
    participant VM as Elixir VM (defm)
    participant Util as KernelUtil
    participant GPU as GPU Runtime

    Host->>VM: call main(env, l_a, l_b)
    activate VM

    VM->>GPU: GPU.alloc (device a,b,c and host_buf)
    GPU-->>VM: device pointers

    VM->>Util: copy_terms_as_floats(env, l_a_tail, a_ptr)
    activate Util
    Util->>Util: traverse list -> Term.to_f64! -> truncate -> f32
    Util->>GPU: write f32 values into a_ptr
    Util-->>VM: done
    deactivate Util

    VM->>Util: copy_terms_as_floats(env, l_b_tail, b_ptr)
    activate Util
    Util->>GPU: write f32 values into b_ptr
    Util-->>VM: done
    deactivate Util

    VM->>GPU: GPU.launch(kernel, normalized_dims)
    activate GPU
    GPU->>GPU: kernel compute (vec_add / matmul)
    GPU-->>VM: kernel complete
    deactivate GPU

    VM->>GPU: GPU.memcpy(device c_ptr -> host_buf)
    GPU-->>VM: host_buf populated

    VM->>VM: build Elixir list from host_buf
    VM-->>Host: return result list
    deactivate VM

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

  • Pay extra attention to:
    • Indexing, boundary checks, and address calculations in MatMulKernel.Index2D and MatMulKernel.Index1D.
    • Numeric conversion, truncation, and pointer writes in KernelUtil.copy_terms_as_floats.
    • Correctness of normalize_dims assumptions and its interaction with existing launch code.
    • JIT cloning logic around gpu.container_module and cloning of gpu.module operations.
    • Tests updated to new module paths and any mismatches in expected signatures.

Poem

🐰 I hop through lists and make them light,
I turn each term into a float just right.
Kernels leap in rows and grids,
buffers hum where data bids—
a tiny rabbit smiles at bytes ✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning — The PR title accurately reflects the main objective: adding 2D indexing support for matmul testing. However, the changes are significantly broader, including new GPU modules (VecAddKernel, KernelUtil), architectural refactoring (1D/2D kernel organization), IR optimization updates, and infrastructure improvements across multiple files. Consider a more comprehensive title like 'Refactor GPU kernels with 2D indexing support and add KernelUtil module' or 'Reorganize matmul kernels into 1D/2D variants and extract common utilities' to better reflect the substantial scope of changes.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 12.50%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d29fd16 and cbcfac6.

📒 Files selected for processing (2)
  • lib/charms/definition.ex (1 hunks)
  • lib/charms/pointer.ex (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and test (1.17)
  • GitHub Check: Build and test (1.18)
🔇 Additional comments (2)
lib/charms/definition.ex (1)

407-407: The pass name is valid and properly placed; the connection to 2D matmul test objectives is justified.

The "promote-buffers-to-stack" pass is a legitimate MLIR bufferization optimization that converts heap allocations to stack allocations when safe. This directly benefits GPU kernels like the 2D matmul implementation in bench/gpu/matmul.ex, where reducing heap allocations improves performance. The pipeline placement after canonicalization is correct, and this optimization is appropriate for the MatMul kernel reorganization mentioned in the PR summary.

lib/charms/pointer.ex (1)

11-11: The extra_arguments parameter design has already been validated by existing tests.

This is a new module added with the extra_arguments feature intentionally designed in from the start (not a retrofit). The test file test/string_test.exs exercises this code path by calling Pointer.allocate(Term.t()), which internally invokes do_allocate with the default extra_arguments=[]. Since the code compiles and tests pass, Beaver v0.4.7's MemRef.alloc/1 and MemRef.alloca/1 functions already support this calling convention. No additional verification is needed.



coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (4)
lib/charms/gpu.ex (1)

41-56: Missing single-element tuple pattern {x}.

The helper handles [x] but not {x}. For consistency with other tuple patterns ({x, y}, {x, y, z}), consider adding the single-element tuple case:

         [x, y] -> {x, y, 1}
         [x] -> {x, 1, 1}
+        {x} -> {x, 1, 1}
         val -> {val, 1, 1}
bench/gpu/matmul.ex (1)

116-118: Consider consolidating random_list/1 into KernelUtil.

The random_list/1 function is duplicated in both Index1D and Index2D. Since KernelUtil already serves as a shared utility module, this helper could be moved there to reduce duplication.

# In bench/gpu/kernel_util.ex
def random_list(size) do
  Enum.map(1..size, fn _ -> :rand.uniform() end)
end

Also applies to: 359-361

bench/gpu/vec_add.ex (2)

5-6: Prefer integer math for @grid_size to avoid float/ceil/1 subtleties

@grid_size ceil(@size / @block_size) is mathematically fine but goes through float division and depends on ceil/1 being in scope for module attributes. A purely integer, “GPU-style” formula is more robust and avoids any surprises with numeric types:

-  @size 200_000
-  @block_size 1024
+  @size 200_000
+  @block_size 1024
@@
-  @grid_size ceil(@size / @block_size)
+  @grid_size div(@size + @block_size - 1, @block_size)

This keeps @grid_size as an integer, matches the classic ceil_div pattern, and removes the float dependency.

Please double‑check this against your target Elixir/Charms versions to ensure it matches how other kernels compute grid sizes.

Also applies to: 23-23


15-22: Clarify the necessity of noop and barrier kernel launches

Defining noop/0 and barrier/0 as tiny kernels and launching them after vec_add/3 is harmless, but it’s not obvious from this file what behavior they’re verifying (host-level sync vs. in-kernel control-flow semantics).

If they’re only here for experimentation or internal benchmarking, consider adding a short comment, or drop the extra launches if they’re not needed to keep the benchmark focused on vec_add/3.

Please confirm that these extra launches are intentional for exercising Charms/GPU behavior rather than leftover scaffolding.

Also applies to: 51-53

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a5daf50 and fe9c3b7.

📒 Files selected for processing (7)
  • bench/gpu/kernel_util.ex (1 hunks)
  • bench/gpu/matmul.ex (9 hunks)
  • bench/gpu/vec_add.ex (1 hunks)
  • lib/charms/gpu.ex (2 hunks)
  • lib/charms/jit.ex (1 hunks)
  • test/defk_test.exs (0 hunks)
  • test/matmul_test.exs (3 hunks)
💤 Files with no reviewable changes (1)
  • test/defk_test.exs
🧰 Additional context used
🧬 Code graph analysis (1)
test/matmul_test.exs (2)
bench/gpu/matmul.ex (3)
  • random_matrix (225-227)
  • random_list (116-118)
  • random_list (359-361)
test/support/cuda_test_helper.ex (1)
  • run_cuda_test (4-41)
🔇 Additional comments (11)
lib/charms/gpu.ex (1)

69-73: LGTM!

Good refactoring. The normalize_dims helper centralizes dimension handling, making the launch function cleaner and more maintainable.

lib/charms/jit.ex (1)

73-79: LGTM!

The changes correctly propagate the gpu.container_module attribute and expand the operation set to include gpu.module. This ensures GPU modules are properly cloned when merging MLIR modules.

test/matmul_test.exs (1)

73-87: LGTM!

The new 2D indexing matmul test follows the same pattern as existing tests and correctly:

  • Extracts dimensions at compile-time using module attributes
  • Generates appropriately-sized random input matrices
  • Validates against the CPU reference implementation with suitable tolerance
bench/gpu/matmul.ex (2)

251-254: Grid dimension calculation may cause incomplete coverage.

The grid dimensions are calculated as:

  • @grid_dim_x = ceil(@m / @block_dim_x) → maps to rows
  • @grid_dim_y = ceil(@n / @block_dim_y) → maps to columns

With @m=64, @block_dim_x=16: grid_dim_x = 4 ✓
With @n=32, @block_dim_y=16: grid_dim_y = 2 ✓

The current values work correctly. However, ensure this pattern holds for non-evenly-divisible dimensions in future changes.
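The classic integer ceil-division formula guarantees the coverage property this comment asks about: `grid_dim * block_dim >= n` for any positive `n` and `block_dim`. A minimal illustrative check in plain Elixir (the function names here are hypothetical, not part of the codebase):

```elixir
# Integer ceil-division: the grid-size formula recommended in this review.
ceil_div = fn n, d -> div(n + d - 1, d) end

# Coverage check: a grid of ceil_div(n, block) blocks of `block` threads
# reaches every index below n, including non-evenly-divisible cases.
covers? = fn n, block -> ceil_div.(n, block) * block >= n end

covers?.(33, 16) # a non-evenly-divisible case: 3 blocks cover 33 elements
```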


257-296: LGTM!

The 2D matmul kernel correctly implements:

  • 2D thread/block coordinate mapping
  • Proper bounds checking with row < @m && col < @n
  • Row-major memory access patterns for A, B, and C matrices
  • Correct dot-product accumulation over the K dimension
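The row-major address math the kernel uses — `a[row*k + t]` and `b[t*n + col]` — can be mirrored by a host-side CPU reference, which is also how a test like matmul_test.exs can validate GPU output. This is an illustrative sketch, not the project's actual reference implementation:

```elixir
defmodule MatMulRef do
  # CPU reference for row-major matmul: a is m x k, b is k x n,
  # both given as flat row-major lists of floats.
  def matmul(a, b, m, k, n) do
    a = List.to_tuple(a)
    b = List.to_tuple(b)

    for row <- 0..(m - 1), col <- 0..(n - 1) do
      # Same address calculations as the kernel: a[row*k + t], b[t*n + col]
      Enum.reduce(0..(k - 1), 0.0, fn t, acc ->
        acc + elem(a, row * k + t) * elem(b, t * n + col)
      end)
    end
  end
end
```

The result comes back as a flat row-major list of length `m * n`, matching the layout the kernel writes into C.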
bench/gpu/kernel_util.ex (2)

6-17: LGTM!

The copy_terms_as_floats function correctly:

  • Traverses an Elixir linked list using NIF semantics
  • Converts each term from f64 to f32 via truncation
  • Tracks array index for sequential writes

The callers (matmul kernels) allocate appropriately-sized buffers before calling this function, so bounds checking is handled at the call site.


14-14: Precision loss from f64→f32 truncation is intentional.

The conversion path Term.to_f64! → arith.truncf → f32 intentionally trades precision for GPU performance. This is appropriate for the matmul use case.

bench/gpu/vec_add.ex (4)

7-13: Vector-add kernel indexing and bounds check look correct

The i = GPU.block_id() * @block_size + GPU.thread_id() pattern combined with if i < @size correctly guards against overrun when @grid_size * @block_size > @size, and the c[i] = a[i] + b[i] write is straightforward.

No issues from a correctness perspective.


24-49: Host–device allocation and copy pipeline is consistent and resource-safe

The flow in main/3—allocate a/b/c plus a host_shared buffer, defer-deallocate all of them, reuse a movable_list_ptr to feed KernelUtil.copy_terms_as_floats/3, and then GPU.memcpy(... ) |> GPU.await() into a and b—is coherent and appears resource-safe.

Reusing the same buffer for both inputs is a nice touch; just ensure upstream callers always provide lists compatible with @size/buffer semantics as expected by KernelUtil.copy_terms_as_floats/3.

Please verify that KernelUtil.copy_terms_as_floats/3 guards against overreading when the input list is shorter than @size and against overwriting when it’s longer.


55-68: Output copy-back and Term/list conversion looks type-correct

  • GPU.memcpy(buffer, c) |> GPU.await() mirrors the earlier host→device copies.
  • for_loop {element, i} <- {buffer, size} followed by arith.extf to f64() and enif_make_double/2 is a sensible f32→f64 path.
  • Using new! Term.t(), size with defer free! arr plus enif_make_list_from_array(env, arr, size) ensures BEAM owns the final list and native memory is freed.

Using arith.trunci(size) :: i32() for the final length is also safe given @size = 200_000.

Looks solid overall.
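The f32-to-f64 copy-back direction has a neat host-side analogue in plain Elixir: a bitstring comprehension that decodes a native-endian f32 buffer into a list of (64-bit) Elixir floats. This is illustrative only — the real path runs inside a Charms defm and builds the list via enif_make_double and enif_make_list_from_array:

```elixir
# Decode a native-endian f32 buffer into a list of Elixir floats,
# widening each 32-bit value to Elixir's 64-bit float on the way out.
decode_f32_buffer = fn buffer when is_binary(buffer) ->
  for <<x::float-32-native <- buffer>>, do: x
end

# Build a sample 3-element f32 buffer and decode it back.
buffer = for x <- [1.0, 2.5, 4.0], into: <<>>, do: <<x::float-32-native>>
decode_f32_buffer.(buffer)
```

The sample values are all exactly representable in f32, so they round-trip without the precision loss discussed for the input path.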


70-72: random_floats/0 helper is simple and appropriate for a bench

Generating @size random floats in [0.0, 10.0) via Enum.map/2 and :rand.uniform/0 is fine for a benchmark helper. No issues here.

coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
bench/sort_util.ex (1)

41-48: Consider unifying zero initialization style in merge/4

Here i_ptr[0] is initialized with literal 0, while j_ptr[0] uses the zero = const 0 :: i32() temp which is now single‑use. For consistency and to drop an extra temp, you could either:

  • Inline 0 for j_ptr as well and remove zero, or
  • Use zero for both i_ptr and j_ptr.

Purely cosmetic; behavior is fine as‑is.

bench/vec_add_int_list.ex (1)

9-9: Minor style inconsistency: mixing literal and typed constant.

Line 9 now uses the literal 0 while lines 12-19 still use the typed zero constant for arithmetic. Consider using zero consistently throughout for clarity, or document why the literal is preferred for initialization.

If you prefer consistency, apply this diff:

-    set! i_ptr[0], 0
+    set! i_ptr[0], zero
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa0eaa2 and 8221370.

📒 Files selected for processing (3)
  • bench/gpu/kernel_util.ex (1 hunks)
  • bench/sort_util.ex (2 hunks)
  • bench/vec_add_int_list.ex (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • bench/gpu/kernel_util.ex
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Build and test (1.18)
  • GitHub Check: Build and test (1.17)
🔇 Additional comments (1)
bench/sort_util.ex (1)

6-23: Literal 0 initialization in copy_terms/3 looks good

Using set! i_ptr[0], 0 directly simplifies the setup without changing behavior; the loop logic remains correct and clear.
