What
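The batch amplitude encoding and batched L2 norm CUDA kernels assume that each sample's base address is aligned for double2 / float2 vector loads (16 bytes and 8 bytes respectively). That alignment is not guaranteed when sample_len is odd and sample_idx > 0: the base is misaligned whenever sample_idx * sample_len is odd.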
In those cases:
- input_batch + sample_idx * sample_len is only naturally aligned to the scalar type
- reinterpreting that address as double2* or float2* can produce misaligned accesses
- CUDA may surface this as CUDA_ERROR_MISALIGNED_ADDRESS
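
The alignment arithmetic behind these cases can be checked on the host. The following is a minimal sketch, not the code from this change: the helper names (sample_byte_offset, sample_base_is_vector_aligned, is_aligned) are hypothetical, and it assumes the batch buffer itself starts on a vector-aligned address, as a cudaMalloc allocation does.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the kernels): byte offset of a sample's
// base address from the start of the batch buffer.
constexpr std::size_t sample_byte_offset(std::size_t sample_idx,
                                         std::size_t sample_len,
                                         std::size_t elem_size) {
    return sample_idx * sample_len * elem_size;
}

// Hypothetical helper: true when the sample base can be reinterpreted as a
// vector type requiring `vec_align` bytes of alignment.
// Assumption: the batch buffer itself starts on a `vec_align` boundary.
constexpr bool sample_base_is_vector_aligned(std::size_t sample_idx,
                                             std::size_t sample_len,
                                             std::size_t elem_size,
                                             std::size_t vec_align) {
    return sample_byte_offset(sample_idx, sample_len, elem_size) % vec_align == 0;
}

// Hypothetical runtime check a kernel or launcher could apply to a pointer
// before choosing a vectorized load path.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// double / double2: 8-byte elements, 16-byte vector alignment.
static_assert(sample_base_is_vector_aligned(0, 3, 8, 16), "sample 0 starts at the buffer base");
static_assert(!sample_base_is_vector_aligned(1, 3, 8, 16), "offset 24 is not a multiple of 16");
static_assert(sample_base_is_vector_aligned(2, 3, 8, 16), "offset 48 is a multiple of 16");
static_assert(sample_base_is_vector_aligned(1, 4, 8, 16), "even sample_len keeps every base aligned");

// float / float2: 4-byte elements, 8-byte vector alignment.
static_assert(!sample_base_is_vector_aligned(1, 3, 4, 8), "offset 12 is not a multiple of 8");
```

With sample_len = 3, every odd sample_idx lands on a scalar-aligned but not vector-aligned address, which is the situation the bullets above describe.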
Affected kernels
- amplitude_encode_batch_kernel
- l2_norm_batch_kernel
- l2_norm_batch_kernel_f32