The core computation in the two kernels is nearly identical: `iqp_encode_kernel` (lines 80-92) and `iqp_encode_batch_kernel` (lines 123-134) differ only in the index calculations that the batch version adds. Could you consider extracting the shared logic into a `__device__` function that both kernels call?
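
For illustration, here is a rough sketch of the shape this refactor could take. It is not the code from the PR: the helper names `iqp_phase` and `iqp_amplitude`, the kernel signatures, the `norm` parameter, and the assumed data layout (n single-qubit terms followed by the pairwise terms) are all hypothetical placeholders. The real body of lines 80-92 would go in the helper instead.

```cuda
#include <cstddef>
#include <cuComplex.h>

// Hypothetical shared helper standing in for the duplicated logic.
// Assumed phase: sum_i x_i * data[i] + sum_{i<j} x_i * x_j * data[n + pair],
// where x_i is bit i of the basis-state index x.
__host__ __device__ inline double iqp_phase(const double* data, size_t x, unsigned n) {
    double phase = 0.0;
    unsigned pair = n;  // pairwise terms assumed to follow the n single-qubit terms
    for (unsigned i = 0; i < n; ++i) {
        const bool bit_i = (x >> i) & 1u;
        if (bit_i) phase += data[i];
        for (unsigned j = i + 1; j < n; ++j, ++pair) {
            if (bit_i && ((x >> j) & 1u)) phase += data[pair];
        }
    }
    return phase;
}

// Hypothetical: amplitude for one basis state, norm * exp(i * phase).
__host__ __device__ inline cuDoubleComplex iqp_amplitude(const double* data, size_t x,
                                                         unsigned n, double norm) {
    const double phase = iqp_phase(data, x, n);
    return make_cuDoubleComplex(norm * cos(phase), norm * sin(phase));
}

// Single-sample kernel: one thread per basis state (signature assumed).
__global__ void iqp_encode_kernel(const double* data, cuDoubleComplex* state,
                                  size_t state_len, unsigned n, double norm) {
    const size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (idx >= state_len) return;
    state[idx] = iqp_amplitude(data, idx, n, norm);
}

// Batch kernel: only the index arithmetic differs (signature assumed).
__global__ void iqp_encode_batch_kernel(const double* data_batch, cuDoubleComplex* state_batch,
                                        size_t num_samples, size_t state_len,
                                        size_t data_len, unsigned n, double norm) {
    const size_t gid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (gid >= num_samples * state_len) return;
    const size_t sample = gid / state_len;  // which input vector
    const size_t idx = gid % state_len;     // which basis state within that sample
    state_batch[gid] = iqp_amplitude(data_batch + sample * data_len, idx, n, norm);
}
```

Marking the helper `__host__ __device__` rather than plain `__device__` is optional, but it would let the shared math be unit-tested on the CPU without launching a kernel.
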
Originally posted by @rich7420 in #868 (comment)