[Bug] Fix invalid CUDA kernel launch when num_samples exceeds grid dimension limit #967

### Description

Reference: [comment](https://github.com/apache/mahout/pull/918#discussion_r2724248036)

The `launch_l2_norm_batch` function can attempt an invalid CUDA kernel launch when `num_samples` exceeds `CUDA_MAX_GRID_DIM_1D` (65535).

### Root Cause

When `num_samples > 65535`, even with `blocks_per_sample = 1`, the calculated `gridSize = num_samples * 1 = num_samples` still exceeds the CUDA 1D grid dimension limit (65535), leading to an invalid kernel launch.

The existing code attempts to reduce `blocks_per_sample` when `gridSize > max_grid`:

https://github.com/apache/mahout/blob/ef00f92eb236414d2ae15c01f4a32944f8d4eb2a/qdp/qdp-kernels/src/amplitude.cu#L613-L620

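However, this reduction cannot help when `num_samples > max_grid`: the integer division `max_grid / num_samples` evaluates to 0, `blocks_per_sample` is clamped back to 1, and `gridSize = num_samples` still exceeds the limit, causing a CUDA error.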


	const size_t max_grid = CUDA_MAX_GRID_DIM_1D; // CUDA grid dimension limit for 1D launch
	if (gridSize > max_grid) {
	    blocks_per_sample = max_grid / num_samples;
	    if (blocks_per_sample == 0) {
	        blocks_per_sample = 1;
	    }
	    gridSize = num_samples * blocks_per_sample;
	}
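
To make the arithmetic concrete, here is a minimal host-side sketch that mirrors the clamping logic quoted in this issue. It is not code from the repository: `clamped_grid_size` and `kMaxGrid` are hypothetical names, and the value 65535 for `CUDA_MAX_GRID_DIM_1D` is taken from the description above. It assumes C++14 or later so the checks can run at compile time.

```cpp
#include <cstddef>

// Assumed value of CUDA_MAX_GRID_DIM_1D, as stated in this issue.
constexpr std::size_t kMaxGrid = 65535;

// Hypothetical helper that mirrors the clamping logic quoted in this issue
// and returns the grid size the launch would end up using.
constexpr std::size_t clamped_grid_size(std::size_t num_samples,
                                        std::size_t blocks_per_sample) {
    std::size_t grid_size = num_samples * blocks_per_sample;
    if (grid_size > kMaxGrid) {
        // Integer division: yields 0 whenever num_samples > kMaxGrid.
        blocks_per_sample = kMaxGrid / num_samples;
        if (blocks_per_sample == 0) {
            blocks_per_sample = 1;
        }
        grid_size = num_samples * blocks_per_sample;
    }
    return grid_size;
}

// 1000 samples: blocks_per_sample drops from 128 to 65, so the grid fits.
static_assert(clamped_grid_size(1000, 128) == 65000, "clamp brings grid under the limit");

// 70000 samples: 65535 / 70000 == 0, clamped to 1, grid stays at 70000.
static_assert(clamped_grid_size(70000, 4) == 70000, "grid equals num_samples");
static_assert(clamped_grid_size(70000, 4) > kMaxGrid, "grid still exceeds the limit");
```

The second case is the one reported here: reducing `blocks_per_sample` can never bring the grid below `num_samples`, so any batch larger than `max_grid` passes through the clamp unchanged.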
