Description
The current batched L2 norm CUDA kernel performs vectorized loads starting from input_batch + base, where:
```cpp
const size_t base = sample_idx * sample_len;
```
When `sample_len` is odd and `sample_idx` is also odd, `base` is odd. Since each element is an 8-byte `double`, `input_batch + base` is then only 8-byte aligned, which violates the alignment requirement of the vectorized loads: `double2` requires 16-byte alignment.
Although this does not cause incorrect results, it introduces an avoidable performance penalty and makes the kernel's memory-access behavior architecture-dependent.
reference: comment