Basic addition of int4 functionality #343

Open

jrajala6 wants to merge 14 commits into cactus-compute:main from jrajala6:int4_kernel_unpacking

Conversation

@jrajala6 (Contributor)

No description provided.

@ncylich ncylich force-pushed the int4_kernel_unpacking branch 2 times, most recently from 795220b to f574155 on February 13, 2026 05:35
@ncylich ncylich force-pushed the int4_kernel_unpacking branch 3 times, most recently from b6b38e4 to ee85502 on February 18, 2026 21:31
@ncylich ncylich marked this pull request as ready for review February 18, 2026 21:36
Copilot AI review requested due to automatic review settings February 18, 2026 21:36
Copilot AI left a comment (Contributor)

Pull request overview

Adds initial INT4 (4-bit) grouped-weight matmul support end-to-end (Python export → graph loader/executor → NEON kernel), plus a new kernel correctness test.

Changes:

  • Introduces INT4 GEMV/GEMM kernels with packed (nibble) weight format and integrates dispatch in graph matmul.
  • Updates Python tensor export to pack INT4 weights (planar layout) and avoids INT4 for embedding weights.
  • Adjusts graph byte sizing logic to use packed sizes for INT4 across buffer sizing, I/O, and debug capture (see the sizing-helper sketch after this list).
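
For context, a minimal sketch of what a packed-size helper along the lines of the PR's packed_size_of() might look like; the enum, exact signature, and non-INT4 sizes here are assumptions for illustration, not the PR's actual code:

    #include <cstddef>

    // Minimal sketch only: the real PrecisionTraits::packed_size_of() may differ.
    // INT4 packs two values per byte, so its byte size is half the element count
    // (rounded up); the other sizes assume the usual widths.
    enum class Precision { FP32, FP16, INT8, INT4 };

    inline size_t packed_size_of(Precision p, size_t total_elements) {
        switch (p) {
            case Precision::INT4: return (total_elements + 1) / 2;  // two nibbles per byte
            case Precision::INT8: return total_elements;            // 1 byte per value
            case Precision::FP16: return total_elements * 2;        // 2 bytes per value
            case Precision::FP32: return total_elements * 4;        // 4 bytes per value
        }
        return total_elements;
    }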

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.

Summary per file:

  • tests/test_kernel.cpp: Adds INT4 matmul correctness test (GEMV + GEMM).
  • python/src/tensor_io.py: Changes INT4 packing to planar layout (sketched below); forces embedding weights to INT8 when INT4 is requested.
  • cactus/kernel/kernel_quants.cpp: Updates INT4 nibble decode to offset-binary (unpack layout needs follow-up).
  • cactus/kernel/kernel_matmul.cpp: Adds INT4 GEMV/GEMM/matmul implementation (packed weights + optional bias correction).
  • cactus/kernel/kernel.h: Exposes INT4 matmul APIs.
  • cactus/graph/graph_ops_tensor.cpp: Uses packed byte sizing for gather copies (INT4-aware sizing).
  • cactus/graph/graph_ops_nn.cpp: Routes grouped INT4 RHS to cactus_matmul_int4 and updates error messages.
  • cactus/graph/graph_io.cpp: Uses packed byte sizing; adds INT4 handling in mmap paths (INT4 unpack + save scales need follow-up).
  • cactus/graph/graph_execute.cpp: Uses packed byte sizing for debug capture writes.
  • cactus/graph/graph_core.cpp: Computes BufferDesc::byte_size using packed byte sizing.
  • cactus/graph/graph.h: Adds is_grouped_int4 and the packed_size_of() helper used across the graph code.
  • cactus/engine/engine.h: Adds constructors to index structs.
  • .gitignore: Adds *_profile.txt and normalizes the *.bin entry.
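
To make the layout change concrete, here is a minimal C++ sketch of the planar packing scheme described in this PR; pack_int4_planar is a hypothetical name, and the actual Python pack_int4_pairs implementation may differ:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical sketch of planar INT4 packing: for each group of 32 values,
    // the first 16 go into the low nibbles and the next 16 into the high nibbles
    // of 16 consecutive bytes. Values are stored offset-binary (value + 8, in
    // [0, 15]); num_values is assumed to be a multiple of 32.
    inline void pack_int4_planar(const int8_t* values, uint8_t* packed, size_t num_values) {
        for (size_t block = 0; block < num_values / 32; block++) {
            for (size_t j = 0; j < 16; j++) {
                uint8_t lo = static_cast<uint8_t>(values[block * 32 + j] + 8) & 0x0F;
                uint8_t hi = static_cast<uint8_t>(values[block * 32 + j + 16] + 8) & 0x0F;
                packed[block * 16 + j] = static_cast<uint8_t>(lo | (hi << 4));
            }
        }
    }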
Comments suppressed due to low confidence (2)

cactus/graph/graph_io.cpp:33

  • This INT4 unpack helper assumes pairwise nibble packing (alternating low/high outputs). The Python INT4 writer now uses planar packing (first 16 values in low nibbles, next 16 in high nibbles), so this unpacking will reorder values incorrectly when enable_int4_packing is disabled. Update it to expand each byte into two outputs placed 16 apart within each 16-byte block (or otherwise reconstruct low[0..15] then high[0..15]).
    inline void unpack_int4_to_int8(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
        for (size_t i = 0; i < packed_size; i++) {
            uint8_t byte = packed[i];
            int8_t low = static_cast<int8_t>((byte & 0x0F) - 8);
            int8_t high = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);
            unpacked[i * 2] = low;
            unpacked[i * 2 + 1] = high;
        }
    }
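
A minimal sketch of the planar-aware unpack this comment asks for, assuming the 16-byte-block layout described above (an illustration, not necessarily the PR's eventual fix):

    #include <cstddef>
    #include <cstdint>

    // Sketch: each 16-byte block encodes 32 outputs. The low nibbles supply
    // outputs [0..15] of the block and the high nibbles outputs [16..31].
    inline void unpack_int4_planar(const uint8_t* packed, int8_t* unpacked, size_t packed_size) {
        for (size_t i = 0; i < packed_size; i++) {
            size_t block = i / 16;  // which 16-byte block this packed byte is in
            size_t pos = i % 16;    // byte position within the block
            uint8_t byte = packed[i];
            unpacked[block * 32 + pos] = static_cast<int8_t>((byte & 0x0F) - 8);
            unpacked[block * 32 + pos + 16] = static_cast<int8_t>(((byte >> 4) & 0x0F) - 8);
        }
    }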

cactus/graph/graph_io.cpp:192

  • save_node() only sets FLAG_HAS_SCALES / group_size / num_groups for grouped INT8 buffers. Group-wise INT4 tensors also have scales (and are now supported elsewhere), so saving an INT4 node will drop its scales metadata and produce an invalid/incomplete file. Extend has_scales to include grouped INT4 (and ensure the header encodes the group params/scales bytes for INT4 too).
    size_t byte_size = PrecisionTraits::packed_size_of(precision, total_elements);

    bool has_scales = (precision == Precision::INT8 && buffer.is_grouped_int8() && buffer.scales_data);
    size_t N = shape.size() >= 1 ? shape[0] : 1;
    size_t scales_bytes = has_scales ? (N * buffer.num_groups * sizeof(__fp16)) : 0;
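
A hedged sketch of the extension the comment suggests, reusing the quoted context; it assumes a Precision::INT4 value exists alongside Precision::INT8 and that is_grouped_int4() mirrors is_grouped_int8() (per the graph.h change in this PR):

    // Sketch: also treat grouped INT4 buffers as having scales, so save_node()
    // writes their group params and scale bytes instead of dropping them.
    bool has_scales = buffer.scales_data &&
        ((precision == Precision::INT8 && buffer.is_grouped_int8()) ||
         (precision == Precision::INT4 && buffer.is_grouped_int4()));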


@ncylich ncylich force-pushed the int4_kernel_unpacking branch from 9f8284c to fd9d491 on February 18, 2026 21:47
@ncylich ncylich closed this Feb 18, 2026
@ncylich ncylich reopened this Feb 18, 2026
jrajala6 and others added 12 commits February 18, 2026 14:08
Signed-off-by: Jisha Rajala <jisharajala@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Jisha Rajala <jisharajala@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…zation

Signed-off-by: Jisha Rajala <jisharajala@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
…n from it for a colocated kernel though

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@ncylich ncylich force-pushed the int4_kernel_unpacking branch from c41080b to 51f44fb on February 18, 2026 22:11
Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@ncylich ncylich force-pushed the int4_kernel_unpacking branch from 383194f to 3ba0048 on February 18, 2026 22:30
@ncylich ncylich requested a review from Copilot February 18, 2026 22:32
Copilot AI left a comment (Contributor)

Pull request overview

Copilot reviewed 13 out of 15 changed files in this pull request and generated 3 comments.



Comment on lines 617 to 620
size_t byte_idx = i / 2;
int8_t val = (i % 2 == 0)
? static_cast<int8_t>((packed[byte_idx] & 0x0F) - 8)
: static_cast<int8_t>(((packed[byte_idx] >> 4) & 0x0F) - 8);
Copilot AI Feb 18, 2026

The INT4 unpacking logic here assumes sequential packing where consecutive elements are packed in low/high nibbles of each byte. However, the pack_int4_pairs function in python/src/tensor_io.py (lines 107-111) uses planar packing: for each group of 32 elements, the first 16 are stored in low nibbles and the next 16 in high nibbles of 16 consecutive bytes. To correctly unpack INT4 data here, the logic should account for this planar layout: for element i, determine which 32-element group it belongs to (group = i/32), then compute byte_idx = (i/32)*16 + (i%32)%16, and nibble_select based on whether (i%32) < 16.

Suggested change
size_t byte_idx = i / 2;
int8_t val = (i % 2 == 0)
? static_cast<int8_t>((packed[byte_idx] & 0x0F) - 8)
: static_cast<int8_t>(((packed[byte_idx] >> 4) & 0x0F) - 8);
// Unpack INT4 values using planar packing:
// For each group of 32 elements, the first 16 are stored in the low nibbles
// and the next 16 in the high nibbles of 16 consecutive bytes.
size_t group = i / 32;
size_t offset_in_group = i % 32;
size_t byte_in_group = offset_in_group % 16;
size_t byte_idx = group * 16 + byte_in_group;
uint8_t byte = packed[byte_idx];
uint8_t nibble = (offset_in_group < 16)
? (byte & 0x0F)
: static_cast<uint8_t>((byte >> 4) & 0x0F);
int8_t val = static_cast<int8_t>(nibble) - 8;

Signed-off-by: Noah Cylich <noahcylich@gmail.com>