Description
The current batched L2 norm CUDA kernel performs vectorized loads starting from input_batch + base, where:
```cpp
const size_t base = sample_idx * sample_len;
```
When `sample_len` is odd and `sample_idx` is also odd, `base` is odd. Since each element is an 8-byte `double`, `input_batch + base` is then only 8-byte aligned, which violates the alignment requirement of the vectorized loads: `double2` requires 16-byte alignment.
Although this does not cause incorrect results, it introduces an avoidable performance penalty and makes the kernel's memory-access behavior architecture-dependent.
reference: comment