Basic addition of int4 functionality #343

Open

jrajala6 wants to merge 14 commits into cactus-compute:main from jrajala6:int4_kernel_unpacking

Conversation

@jrajala6 (Contributor)

No description provided.

@ncylich ncylich force-pushed the int4_kernel_unpacking branch 2 times, most recently from 795220b to f574155 on February 13, 2026 05:35
@ncylich ncylich force-pushed the int4_kernel_unpacking branch 3 times, most recently from b6b38e4 to ee85502 on February 18, 2026 21:31
@ncylich ncylich marked this pull request as ready for review February 18, 2026 21:36
Copilot AI review requested due to automatic review settings February 18, 2026 21:36
Copilot AI left a comment (Contributor)

Pull request overview

Adds initial INT4 (4-bit) grouped-weight matmul support end-to-end (Python export → graph loader/executor → NEON kernel), plus a new kernel correctness test.

Changes:

  • Introduces INT4 GEMV/GEMM kernels with packed (nibble) weight format and integrates dispatch in graph matmul.
  • Updates Python tensor export to pack INT4 weights (planar layout) and avoids INT4 for embedding weights.
  • Adjusts graph byte sizing logic to use packed sizes for INT4 across buffer sizing, I/O, and debug capture (see the sizing-helper sketch after this list).
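
For context, a minimal sketch of what a packed-size helper along the lines of the PR's packed_size_of() might look like; the enum, exact signature, and non-INT4 sizes here are assumptions for illustration, not the PR's actual code:

    #include <cstddef>

    // Minimal sketch only: the real PrecisionTraits::packed_size_of() may differ.
    // INT4 packs two values per byte, so its byte size is half the element count
    // (rounded up); the other sizes assume the usual widths.
    enum class Precision { FP32, FP16, INT8, INT4 };

    inline size_t packed_size_of(Precision p, size_t total_elements) {
        switch (p) {
            case Precision::INT4: return (total_elements + 1) / 2;  // two nibbles per byte
            case Precision::INT8: return total_elements;            // 1 byte per value
            case Precision::FP16: return total_elements * 2;        // 2 bytes per value
            case Precision::FP32: return total_elements * 4;        // 4 bytes per value
        }
        return total_elements;
    }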

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.

Summary per file:

  • tests/test_kernel.cpp: Adds INT4 matmul correctness test (GEMV + GEMM).
  • python/src/tensor_io.py: Changes INT4 packing to planar layout (sketched below); forces embedding weights to INT8 when INT4 is requested.
  • cactus/kernel/kernel_quants.cpp: Updates INT4 nibble decode to offset-binary (unpack layout needs follow-up).
  • cactus/kernel/kernel_matmul.cpp: Adds INT4 GEMV/GEMM/matmul implementation (packed weights + optional bias correction).
  • cactus/kernel/kernel.h: Exposes INT4 matmul APIs.
  • cactus/graph/graph_ops_tensor.cpp: Uses packed byte sizing for gather copies (INT4-aware sizing).
  • cactus/graph/graph_ops_nn.cpp: Routes grouped INT4 RHS to cactus_matmul_int4 and updates error messages.
  • cactus/graph/graph_io.cpp: Uses packed byte sizing; adds INT4 handling in mmap paths (INT4 unpack + save scales need follow-up).
  • cactus/graph/graph_execute.cpp: Uses packed byte sizing for debug capture writes.
  • cactus/graph/graph_core.cpp: Computes BufferDesc::byte_size using packed byte sizing.
  • cactus/graph/graph.h: Adds is_grouped_int4 and the packed_size_of() helper used across the graph code.
  • cactus/engine/engine.h: Adds constructors to index structs.
  • .gitignore: Adds *_profile.txt and normalizes the *.bin entry.
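
To make the layout change concrete, here is a minimal C++ sketch of the planar packing scheme described in this PR; pack_int4_planar is a hypothetical name, and the actual Python pack_int4_pairs implementation may differ:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical sketch of planar INT4 packing: for each group of 32 values,
    // the first 16 go into the low nibbles and the next 16 into the high nibbles
    // of 16 consecutive bytes. Values are stored offset-binary (value + 8, in
    // [0, 15]); num_values is assumed to be a multiple of 32.
    inline void pack_int4_planar(const int8_t* values, uint8_t* packed, size_t num_values) {
        for (size_t block = 0; block < num_values / 32; block++) {
            for (size_t j = 0; j < 16; j++) {
                uint8_t lo = static_cast<uint8_t>(values[block * 32 + j] + 8) & 0x0F;
                uint8_t hi = static_cast<uint8_t>(values[block * 32 + j + 16] + 8) & 0x0F;
                packed[block * 16 + j] = static_cast<uint8_t>(lo | (hi << 4));
            }
        }
    }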
Comments suppressed due to low confidence (2)

cactus/graph/graph_io.cpp:33

  • This INT4 unpack helper assumes pairwise nibble packing (alternating low/high outputs). The Python INT4 writer now uses planar packing (first 16 values in low nibbles, next 16 in high nibbles), so this unpacking will reorder values incorrectly when enable_int4_packing is disabled. Update it to expand each byte into two outputs placed 16 apart within each 16-byte block (or otherwise reconstruct low[0..15] then high[0..15]).
    inline void unpack_int4_to_int8(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
        for (size_t i = 0; i < packed_size; i++) {
            uint8_t byte = packed[i];
            int8_t low = static_cast<int8_t>((byte & 0x0F) - 8);
            int8_t high = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);
            unpacked[i * 2] = low;
            unpacked[i * 2 + 1] = high;
        }
    }
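
A minimal sketch of the planar-aware unpack this comment asks for, assuming the 16-byte-block layout described above (an illustration, not necessarily the PR's eventual fix):

    #include <cstddef>
    #include <cstdint>

    // Sketch: each 16-byte block encodes 32 outputs. The low nibbles supply
    // outputs [0..15] of the block and the high nibbles outputs [16..31].
    inline void unpack_int4_planar(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
        for (size_t i = 0; i < packed_size; i++) {
            size_t block = i / 16;  // which 16-byte block this packed byte is in
            size_t pos = i % 16;    // byte position within the block
            uint8_t byte = packed[i];
            unpacked[block * 32 + pos] = static_cast<int8_t>((byte & 0x0F) - 8);
            unpacked[block * 32 + pos + 16] = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);
        }
    }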

cactus/graph/graph_io.cpp:192

  • save_node() only sets FLAG_HAS_SCALES / group_size / num_groups for grouped INT8 buffers. Group-wise INT4 tensors also have scales (and are now supported elsewhere), so saving an INT4 node will drop its scales metadata and produce an invalid/incomplete file. Extend has_scales to include grouped INT4 (and ensure the header encodes the group params/scales bytes for INT4 too).
    size_t byte_size = PrecisionTraits::packed_size_of(precision, total_elements);

    bool has_scales = (precision == Precision::INT8 && buffer.is_grouped_int8() && buffer.scales_data);
    size_t N = shape.size() >= 1 ? shape[0] : 1;
    size_t scales_bytes = has_scales ? (N * buffer.num_groups * sizeof(__fp16)) : 0;
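
A hedged sketch of the extension the comment suggests, reusing the quoted context; it assumes a Precision::INT4 value exists alongside Precision::INT8 and that is_grouped_int4() mirrors is_grouped_int8() (per the graph.h change in this PR):

    // Sketch: also treat grouped INT4 buffers as having scales, so save_node()
    // writes their group params and scale bytes instead of dropping them.
    bool has_scales = buffer.scales_data &&
        ((precision == Precision::INT8 && buffer.is_grouped_int8()) ||
         (precision == Precision::INT4 && buffer.is_grouped_int4()));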


@ncylich ncylich force-pushed the int4_kernel_unpacking branch from 9f8284c to fd9d491 on February 18, 2026 21:47
@ncylich ncylich closed this Feb 18, 2026
@ncylich ncylich reopened this Feb 18, 2026
jrajala6 and others added 12 commits February 18, 2026 14:08
Signed-off-by: Jisha Rajala <jisharajala@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Jisha Rajala <jisharajala@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…zation

Signed-off-by: Jisha Rajala <jisharajala@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…n from it for a colocated kernel though

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@ncylich ncylich force-pushed the int4_kernel_unpacking branch from c41080b to 51f44fb on February 18, 2026 22:11
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@ncylich ncylich force-pushed the int4_kernel_unpacking branch from 383194f to 3ba0048 on February 18, 2026 22:30
@ncylich ncylich requested a review from Copilot February 18, 2026 22:32
Copilot AI left a comment (Contributor)

Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.



Comment on lines 617 to 620
size_t byte_idx = i / 2;
int8_t val = (i % 2 == 0)
? static_cast<int8_t>((packed[byte_idx] & 0x0F) - 8)
: static_cast<int8_t>(((packed[byte_idx] >> 4) & 0x0F) - 8);
Copilot AI Feb 18, 2026

The INT4 unpacking logic here assumes sequential packing where consecutive elements are packed in low/high nibbles of each byte. However, the pack_int4_pairs function in python/src/tensor_io.py (lines 107-111) uses planar packing: for each group of 32 elements, the first 16 are stored in low nibbles and the next 16 in high nibbles of 16 consecutive bytes. To correctly unpack INT4 data here, the logic should account for this planar layout: for element i, determine which 32-element group it belongs to (group = i/32), then compute byte_idx = (i/32)*16 + (i%32)%16, and nibble_select based on whether (i%32) < 16.

Suggested change
size_t byte_idx = i / 2;
int8_t val = (i % 2 == 0)
? static_cast<int8_t>((packed[byte_idx] & 0x0F) - 8)
: static_cast<int8_t>(((packed[byte_idx] >> 4) & 0x0F) - 8);
// Unpack INT4 values using planar packing:
// For each group of 32 elements, the first 16 are stored in the low nibbles
// and the next 16 in the high nibbles of 16 consecutive bytes.
size_t group = i / 32;
size_t offset_in_group = i % 32;
size_t byte_in_group = offset_in_group % 16;
size_t byte_idx = group * 16 + byte_in_group;
uint8_t byte = packed[byte_idx];
uint8_t nibble = (offset_in_group < 16)
? (byte & 0x0F)
: static_cast<uint8_t>((byte >> 4) & 0x0F);
int8_t val = static_cast<int8_t>(nibble) - 8;

Signed-off-by: Noah Cylich <noahcylich@gmail.com>