**Walkthrough**

A new `KernelUtil.copy_terms_as_floats` helper was added, and the MatMul kernels were reorganized.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    participant Host as Host/Test
    participant VM as Elixir VM (defm)
    participant Util as KernelUtil
    participant GPU as GPU Runtime
    Host->>VM: call main(env, l_a, l_b)
    activate VM
    VM->>GPU: GPU.alloc (device a,b,c and host_buf)
    GPU-->>VM: device pointers
    VM->>Util: copy_terms_as_floats(env, l_a_tail, a_ptr)
    activate Util
    Util->>Util: traverse list -> Term.to_f64! -> truncate -> f32
    Util->>GPU: write f32 values into a_ptr
    Util-->>VM: done
    deactivate Util
    VM->>Util: copy_terms_as_floats(env, l_b_tail, b_ptr)
    activate Util
    Util->>GPU: write f32 values into b_ptr
    Util-->>VM: done
    deactivate Util
    VM->>GPU: GPU.launch(kernel, normalized_dims)
    activate GPU
    GPU->>GPU: kernel compute (vec_add / matmul)
    GPU-->>VM: kernel complete
    deactivate GPU
    VM->>GPU: GPU.memcpy(device c_ptr -> host_buf)
    GPU-->>VM: host_buf populated
    VM->>VM: build Elixir list from host_buf
    VM-->>Host: return result list
    deactivate VM
```
**Estimated code review effort**: 🎯 4 (Complex) | ⏱️ ~50 minutes
**Pre-merge checks and finishing touches**

- ❌ Failed checks (2 warnings)
- ✅ Passed checks (1 passed)
Actionable comments posted: 0
🧹 Nitpick comments (4)
lib/charms/gpu.ex (1)
41-56: Missing single-element tuple pattern `{x}`.

The helper handles `[x]` but not `{x}`. For consistency with the other tuple patterns (`{x, y}`, `{x, y, z}`), consider adding the single-element tuple case:

```diff
   [x, y] -> {x, y, 1}
   [x] -> {x, 1, 1}
+  {x} -> {x, 1, 1}
   val -> {val, 1, 1}
```

bench/gpu/matmul.ex (1)
116-118: Consider consolidating `random_list/1` into `KernelUtil`.

The `random_list/1` function is duplicated in both `Index1D` and `Index2D`. Since `KernelUtil` already serves as a shared utility module, this helper could be moved there to reduce duplication.

```elixir
# In bench/gpu/kernel_util.ex
def random_list(size) do
  Enum.map(1..size, fn _ -> :rand.uniform() end)
end
```

Also applies to: 359-361
bench/gpu/vec_add.ex (2)
5-6: Prefer integer math for `@grid_size` to avoid float/`ceil/1` subtleties.

`@grid_size ceil(@size / @block_size)` is mathematically fine but goes through float division and depends on `ceil/1` being in scope for module attributes. A purely integer, "GPU-style" formula is more robust and avoids any surprises with numeric types:

```diff
 @size 200_000
 @block_size 1024
@@
-@grid_size ceil(@size / @block_size)
+@grid_size div(@size + @block_size - 1, @block_size)
```

This keeps `@grid_size` as an integer, matches the classic `ceil_div` pattern, and removes the float dependency. Please double-check this against your target Elixir/Charms versions to ensure it matches how other kernels compute grid sizes.

Also applies to: 23-23
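For reference, the integer formula and the float-based `ceil` agree for all positive sizes. A quick Python sketch (function name `ceil_div` is illustrative, not from this repo) demonstrates the pattern:

```python
import math

def ceil_div(size, block):
    # Integer "round up" division: avoids float arithmetic entirely.
    return (size + block - 1) // block

# Matches the float-based ceil for the sizes used in this benchmark.
assert ceil_div(200_000, 1024) == math.ceil(200_000 / 1024) == 196

# Agreement holds for divisible and non-divisible sizes alike.
for s in range(1, 5000):
    assert ceil_div(s, 1024) == math.ceil(s / 1024)
```

The same `(n + b - 1) // b` idiom is what most GPU host code uses to size a grid, which is why it reads as the "classic" pattern here.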
15-22: Clarify the necessity of `noop` and `barrier` kernel launches.

Defining `noop/0` and `barrier/0` as tiny kernels and launching them after `vec_add/3` is harmless, but it's not obvious from this file what behavior they're verifying (host-level sync vs. in-kernel control-flow semantics). If they're only here for experimentation or internal benchmarking, consider adding a short comment, or drop the extra launches if they're not needed to keep the benchmark focused on `vec_add/3`.

Please confirm that these extra launches are intentional for exercising Charms/GPU behavior rather than leftover scaffolding.

Also applies to: 51-53
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- `bench/gpu/kernel_util.ex` (1 hunks)
- `bench/gpu/matmul.ex` (9 hunks)
- `bench/gpu/vec_add.ex` (1 hunks)
- `lib/charms/gpu.ex` (2 hunks)
- `lib/charms/jit.ex` (1 hunks)
- `test/defk_test.exs` (0 hunks)
- `test/matmul_test.exs` (3 hunks)
💤 Files with no reviewable changes (1)
- test/defk_test.exs
🧰 Additional context used
🧬 Code graph analysis (1)
test/matmul_test.exs (2)
bench/gpu/matmul.ex (3)
- `random_matrix` (225-227)
- `random_list` (116-118)
- `random_list` (359-361)

test/support/cuda_test_helper.ex (1)

- `run_cuda_test` (4-41)
🔇 Additional comments (11)
lib/charms/gpu.ex (1)
69-73: LGTM!

Good refactoring. The `normalize_dims` helper centralizes dimension handling, making the launch function cleaner and more maintainable.

lib/charms/jit.ex (1)
73-79: LGTM!

The changes correctly propagate the `gpu.container_module` attribute and expand the operation set to include `gpu.module`. This ensures GPU modules are properly cloned when merging MLIR modules.

test/matmul_test.exs (1)
73-87: LGTM!

The new 2D indexing matmul test follows the same pattern as existing tests and correctly:
- Extracts dimensions at compile-time using module attributes
- Generates appropriately-sized random input matrices
- Validates against the CPU reference implementation with suitable tolerance
bench/gpu/matmul.ex (2)
251-254: Grid dimension calculation may cause incomplete coverage.

The grid dimensions are calculated as:

- `@grid_dim_x = ceil(@m / @block_dim_x)` → maps to rows
- `@grid_dim_y = ceil(@n / @block_dim_y)` → maps to columns

With `@m` = 64, `@block_dim_x` = 16: grid_dim_x = 4 ✓
With `@n` = 32, `@block_dim_y` = 16: grid_dim_y = 2 ✓

The current values work correctly. However, ensure this pattern holds for non-evenly-divisible dimensions in future changes.
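The coverage claim can be checked mechanically. This Python sketch (names are illustrative, not project code) enumerates every `(row, col)` a ceil-sized 2D grid reaches under the kernel's bounds check and verifies the whole matrix is covered:

```python
import math

def covered(m, n, bx, by):
    # Grid sized with ceil-division, as in the benchmark's module attributes.
    gx, gy = math.ceil(m / bx), math.ceil(n / by)
    cells = set()
    for block_x in range(gx):
        for block_y in range(gy):
            for tx in range(bx):
                for ty in range(by):
                    row = block_x * bx + tx
                    col = block_y * by + ty
                    if row < m and col < n:  # the kernel's bounds check
                        cells.add((row, col))
    return len(cells) == m * n

assert covered(64, 32, 16, 16)  # the current, evenly divisible case
assert covered(65, 33, 16, 16)  # a non-divisible case is still fully covered
```

As long as grids are sized with ceil-division and the kernel keeps its `row < m`/`col < n` guard, non-divisible dimensions remain safe.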
257-296: LGTM!

The 2D matmul kernel correctly implements:

- 2D thread/block coordinate mapping
- Proper bounds checking with `row < @m && col < @n`
- Row-major memory access patterns for A, B, and C matrices
- Correct dot-product accumulation over the K dimension
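As a cross-check of the row-major addressing described above, a minimal Python reference (hypothetical helper, not part of the benchmark) computes C = A·B with the same flat-buffer index math the kernel uses:

```python
def matmul_row_major(a, b, m, k, n):
    # a: m*k floats, b: k*n floats, both flat row-major buffers.
    c = [0.0] * (m * n)
    for row in range(m):
        for col in range(n):
            acc = 0.0
            for p in range(k):
                # A[row][p] * B[p][col] via row-major flat indexing.
                acc += a[row * k + p] * b[p * n + col]
            c[row * n + col] = acc  # C[row][col]
    return c

# 2x2 example: [[1,2],[3,4]] @ [[5,6],[7,8]] = [[19,22],[43,50]]
assert matmul_row_major([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2) == [19.0, 22.0, 43.0, 50.0]
```

This is the same `row * k + p` / `p * n + col` / `row * n + col` addressing the kernel performs per thread, just serialized.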
bench/gpu/kernel_util.ex (2)
6-17: LGTM!

The `copy_terms_as_floats` function correctly:
- Traverses an Elixir linked list using NIF semantics
- Converts each term from f64 to f32 via truncation
- Tracks array index for sequential writes
The callers (matmul kernels) allocate appropriately-sized buffers before calling this function, so bounds checking is handled at the call site.
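To make the f64→f32 step concrete, a Python round-trip through IEEE-754 single precision (a stand-in for the actual `arith.truncf` lowering, which may differ in rounding mode) shows the scale of precision loss involved:

```python
import struct

def to_f32(x):
    # Round-trip a Python float (IEEE-754 double) through single precision.
    return struct.unpack("f", struct.pack("f", x))[0]

# f32 keeps a 24-bit significand (~7 decimal digits), so finer digits are lost.
x = 0.1
assert to_f32(x) != x
assert abs(to_f32(x) - x) < 1e-7  # error bounded by f32 resolution near 0.1

# Values exactly representable in 24 bits survive the conversion unchanged.
assert to_f32(0.5) == 0.5
assert to_f32(1.0) == 1.0
```

An error on the order of 1e-7 per element is well within the tolerance the matmul tests use against the CPU reference.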
14-14: Precision loss from f64→f32 truncation is intentional.

The conversion path `Term.to_f64!` → `arith.truncf` → `f32` intentionally trades precision for GPU performance. This is appropriate for the matmul use case.

bench/gpu/vec_add.ex (4)
7-13: Vector-add kernel indexing and bounds check look correct.

The `i = GPU.block_id() * @block_size + GPU.thread_id()` pattern combined with `if i < @size` correctly guards against overrun when `@grid_size * @block_size > @size`, and the `c[i] = a[i] + b[i]` write is straightforward.

No issues from a correctness perspective.
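The guard's effect is easy to simulate: with a ceil-sized grid, only the surplus tail threads of the last block are masked out. A Python sketch (illustrative names, mirroring the kernel's index math):

```python
import math

size, block_size = 200_000, 1024
grid_size = math.ceil(size / block_size)

active = 0
for block_id in range(grid_size):
    for thread_id in range(block_size):
        i = block_id * block_size + thread_id  # global index, as in the kernel
        if i < size:                           # the `if i < @size` guard
            active += 1

assert grid_size * block_size == 200_704  # 704 surplus threads are launched
assert active == size                     # the guard masks exactly the surplus
```

Without the guard, those 704 trailing threads would read and write past the end of the `@size`-element buffers.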
24-49: Host–device allocation and copy pipeline is consistent and resource-safe.

The flow in `main/3` — allocate `a`/`b`/`c` plus a host-shared `buffer`, defer-deallocate all of them, reuse a `movable_list_ptr` to feed `KernelUtil.copy_terms_as_floats/3`, and then `GPU.memcpy(...) |> GPU.await()` into `a` and `b` — is coherent and appears resource-safe.

Reusing the same `buffer` for both inputs is a nice touch; just ensure upstream callers always provide lists compatible with `@size`/`buffer` semantics as expected by `KernelUtil.copy_terms_as_floats/3`.

Please verify that `KernelUtil.copy_terms_as_floats/3` guards against overreading when the input list is shorter than `@size` and against overwriting when it's longer.
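One defensive pattern for the over-/under-run concern raised above — stop at whichever of list end or buffer capacity comes first, and report how many slots were written — sketched in Python (hypothetical helper, not the actual `KernelUtil` code):

```python
def copy_into_buffer(values, buffer):
    # Copy at most len(buffer) items; return the count actually written so
    # the caller can detect a short input and zero-fill or raise as needed.
    n = min(len(values), len(buffer))
    for i in range(n):
        buffer[i] = float(values[i])
    return n

buf = [0.0] * 4
assert copy_into_buffer([1, 2], buf) == 2            # short list: stops early
assert buf == [1.0, 2.0, 0.0, 0.0]                   # tail left untouched
assert copy_into_buffer([1, 2, 3, 4, 5], buf) == 4   # long list: clamped
```

Returning the written count lets the call site decide whether a length mismatch is an error or merely means the rest of the buffer keeps its prior contents.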
55-68: Output copy-back and Term/list conversion looks type-correct.

- `GPU.memcpy(buffer, c) |> GPU.await()` mirrors the earlier host→device copies.
- `for_loop {element, i} <- {buffer, size}` followed by `arith.extf` to `f64()` and `enif_make_double/2` is a sensible f32→f64 path.
- Using `new! Term.t(), size` with `defer free! arr` plus `enif_make_list_from_array(env, arr, size)` ensures BEAM owns the final list and native memory is freed.
- Using `arith.trunci(size) :: i32()` for the final length is also safe given `@size = 200_000`.

Looks solid overall.
70-72: `random_floats/0` helper is simple and appropriate for a bench.

Generating `@size` random floats in `[0.0, 10.0)` via `Enum.map/2` and `:rand.uniform/0` is fine for a benchmark helper. No issues here.
Actionable comments posted: 0
🧹 Nitpick comments (2)
bench/sort_util.ex (1)
41-48: Consider unifying zero initialization style in `merge/4`.

Here `i_ptr[0]` is initialized with the literal `0`, while `j_ptr[0]` uses the `zero = const 0 :: i32()` temp, which is now single-use. For consistency, and to drop an extra temp, you could either:

- Inline `0` for `j_ptr` as well and remove `zero`, or
- Use `zero` for both `i_ptr` and `j_ptr`.

Purely cosmetic; behavior is fine as-is.

bench/vec_add_int_list.ex (1)
bench/vec_add_int_list.ex (1)
9-9: Minor style inconsistency: mixing literal and typed constant.

Line 9 now uses the literal `0` while lines 12-19 still use the typed `zero` constant for arithmetic. Consider using `zero` consistently throughout for clarity, or document why the literal is preferred for initialization. If you prefer consistency, apply this diff:

```diff
-set! i_ptr[0], 0
+set! i_ptr[0], zero
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `bench/gpu/kernel_util.ex` (1 hunks)
- `bench/sort_util.ex` (2 hunks)
- `bench/vec_add_int_list.ex` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- bench/gpu/kernel_util.ex
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Build and test (1.18)
- GitHub Check: Build and test (1.17)
🔇 Additional comments (1)
bench/sort_util.ex (1)
6-23: Literal `0` initialization in `copy_terms/3` looks good.

Using `set! i_ptr[0], 0` directly simplifies the setup without changing behavior; the loop logic remains correct and clear.
**Summary by CodeRabbit**

- New Features
- Refactor
- Tests
- Chores