[Bug] Fix invalid CUDA kernel launch when num_samples exceeds grid dimension limit #967

@viiccwen

Description

reference: comment

The launch_l2_norm_batch function can attempt an invalid CUDA kernel launch when num_samples exceeds CUDA_MAX_GRID_DIM_1D (65535).

Root Cause

When num_samples > 65535, even with blocks_per_sample = 1, the calculated gridSize = num_samples * 1 = num_samples still exceeds the CUDA 1D grid dimension limit (65535), leading to an invalid kernel launch.

The existing code attempts to reduce blocks_per_sample when gridSize > max_grid:

const size_t max_grid = CUDA_MAX_GRID_DIM_1D; // CUDA grid dimension limit for 1D launch
if (gridSize > max_grid) {
    blocks_per_sample = max_grid / num_samples;
    if (blocks_per_sample == 0) {
        blocks_per_sample = 1;
    }
    gridSize = num_samples * blocks_per_sample;
}

However, when num_samples > max_grid, the integer division max_grid / num_samples yields 0, the fallback resets blocks_per_sample to 1, and gridSize = num_samples still exceeds the limit, causing a CUDA error.
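One possible way to guard against this, sketched below as host-side C++ only: reject the launch up front when num_samples alone exceeds the limit, and otherwise clamp blocks_per_sample. The helper name compute_l2_norm_grid and the LaunchConfig struct are hypothetical and not part of the existing code. The value of CUDA_MAX_GRID_DIM_1D is taken from this issue, and the sketch assumes the kernel maps one or more blocks to each sample.

```cpp
#include <cstddef>
#include <optional>

// Value as stated in this issue; the real constant lives in the project.
constexpr std::size_t CUDA_MAX_GRID_DIM_1D = 65535;

// Hypothetical result type for illustration.
struct LaunchConfig {
    std::size_t blocks_per_sample;
    std::size_t grid_size;
};

// Hypothetical helper: returns std::nullopt when no valid grid exists
// for the requested number of samples, so the caller can report an
// error instead of launching an invalid kernel.
std::optional<LaunchConfig> compute_l2_norm_grid(std::size_t num_samples,
                                                 std::size_t blocks_per_sample) {
    const std::size_t max_grid = CUDA_MAX_GRID_DIM_1D;

    // Even one block per sample would exceed the limit: nothing to clamp.
    if (num_samples == 0 || num_samples > max_grid) {
        return std::nullopt;
    }

    if (blocks_per_sample == 0) {
        blocks_per_sample = 1;
    }

    // num_samples <= max_grid here, so this quotient is at least 1.
    const std::size_t max_blocks_per_sample = max_grid / num_samples;
    if (blocks_per_sample > max_blocks_per_sample) {
        blocks_per_sample = max_blocks_per_sample;
    }

    return LaunchConfig{blocks_per_sample, num_samples * blocks_per_sample};
}
```

With this shape, launch_l2_norm_batch could return an invalid-argument error when the helper yields no configuration. An alternative, if large batches must be supported, would be to split the batch into chunks of at most max_grid samples or to have each block loop over several samples; either would need matching changes in the kernel.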
