[Bug] Misaligned double2 vector loads in batched L2norm kernel when the length of sample is odd #940

@viiccwen

Description

The current batched L2 norm CUDA kernel performs vectorized loads starting from input_batch + base, where:

const size_t base = sample_idx * sample_len;

When sample_len is odd and sample_idx is also odd, base is odd. Because each double is 8 bytes, input_batch + base is then only 8-byte aligned, even when input_batch itself is 16-byte aligned. This violates the alignment requirement of the vectorized loads: double2 requires 16-byte alignment.

Although this does not cause incorrect results, it introduces avoidable performance penalties and makes the kernel’s memory access behavior architecture-dependent.
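
To make the alignment arithmetic above concrete, here is a minimal host-side C++ sketch. It is not taken from the kernel in question: the helper names is_aligned_16, scalar_prologue_len, and check_alignment_example are hypothetical, and the sketch assumes input_batch itself is 16-byte aligned. It shows that a sample's start pointer is 16-byte aligned exactly when base is even, and that skipping one leading element restores alignment when base is odd.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p meets the 16-byte alignment that a double2 load requires.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: number of leading doubles in a sample that would need
// scalar loads before the pointer reaches a 16-byte boundary.
// Assumption: input_batch itself is 16-byte aligned.
constexpr std::size_t scalar_prologue_len(std::size_t sample_idx,
                                          std::size_t sample_len) {
    const std::size_t base = sample_idx * sample_len;  // offset in doubles
    // Elements are 8 bytes, so an odd offset sits 8 bytes past a 16-byte boundary.
    return base & 1;
}

// Checks the arithmetic on a host buffer standing in for input_batch.
inline void check_alignment_example() {
    alignas(16) static double input_batch[16] = {};
    const std::size_t sample_len = 5;  // odd sample length
    for (std::size_t sample_idx = 0; sample_idx < 3; ++sample_idx) {
        const double* p = input_batch + sample_idx * sample_len;
        const std::size_t peel = scalar_prologue_len(sample_idx, sample_len);
        assert(is_aligned_16(p) == (peel == 0));  // aligned only when base is even
        assert(is_aligned_16(p + peel));          // aligned once the prologue is skipped
    }
}
```

One possible fix along these lines would have the kernel handle the first scalar_prologue_len elements of each sample with scalar loads, run the double2 path from the advanced pointer, and finish any leftover element with a scalar load. Whether that is the right trade-off for this kernel is left open here.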

reference: comment
