fix: add __launch_bounds__ to winograd kernels for Blackwell GPUs#2391

Open
stondo wants to merge 1 commit into LeelaChessZero:master from stondo:fix/blackwell-winograd-launch-bounds

Conversation

@stondo stondo commented Feb 21, 2026

Summary

  • Add __launch_bounds__(1024) to InputTransform_kernel and OutputTransform_kernel in winograd_helper.inc
  • Fixes "CUDA error: too many resources requested for launch" on Blackwell GPUs (sm_120/sm_121)

Details

Blackwell (sm_120/sm_121) has different per-SM resource limits compared to previous architectures. Without explicit __launch_bounds__, the CUDA compiler over-allocates registers for the Winograd transform kernels, causing them to exceed per-block resource limits at launch time.

This is the same class of issue documented in pytorch/pytorch#150266.

Adding __launch_bounds__(1024) constrains the compiler's register allocation without affecting kernel behavior. The value 1024 matches cudaDeviceProp::maxThreadsPerBlock and is the standard approach for this category of Blackwell compatibility fix.

Without this fix, lc0 crashes immediately on any Blackwell GPU (RTX 5090, RTX 5080, RTX 5070 Ti, GB10, GB200, etc.).
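The effect of the attribute can be checked from host code. The following is a hedged sketch (the kernel name and body are placeholders, not the actual lc0 winograd code): `cudaFuncGetAttributes` reports the register count the compiler actually allocated, which should fall within the per-thread budget implied by `__launch_bounds__(1024)`.

```cuda
// Hypothetical verification sketch; TransformSketch_kernel is a placeholder,
// not the real lc0 kernel.
#include <cstdio>
#include <cuda_runtime.h>

// __launch_bounds__(1024) promises nvcc the kernel is never launched with
// more than 1024 threads per block, letting it cap per-thread register
// allocation so every launch fits the per-block resource limits.
__global__ __launch_bounds__(1024) void TransformSketch_kernel(float* p) {
  p[blockIdx.x * blockDim.x + threadIdx.x] = 0.f;  // placeholder body
}

int main() {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, TransformSketch_kernel);
  // On architectures with a 64K-register file per SM, regs/thread should
  // be <= 64 when __launch_bounds__(1024) is in effect.
  printf("regs/thread=%d maxThreadsPerBlock=%d\n",
         attr.numRegs, attr.maxThreadsPerBlock);
  return 0;
}
```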

Benchmarks (NVIDIA GB10, sm_121, CUDA 13.0)

| Network                         | Backend   | Peak NPS | Batch |
|---------------------------------|-----------|----------|-------|
| T78 (320ch)                     | cuda-fp16 | 2,466    | 40    |
| BT4-1740 (1024ch, transformer)  | cuda-fp16 | 2,583    | 24    |

Test plan

  • Verified crash without fix on GB10 Blackwell (sm_121)
  • Verified fix resolves crash on GB10 Blackwell
  • Benchmarked T78 and BT4-1740 networks (cuda-fp16)
  • No regression expected on pre-Blackwell (launch_bounds is a compiler hint, not a hard constraint on older GPUs)

Commit message

InputTransform_kernel and OutputTransform_kernel exceed Blackwell sm_121 per-block resource limits without explicit launch bounds, causing "too many resources requested for launch" at runtime.

Adding __launch_bounds__(1024) constrains register allocation and fixes the crash. No impact on pre-Blackwell architectures.

Same class of fix as PyTorch #150266 for Blackwell compatibility.

Tested on NVIDIA GB10 (sm_121) with T78 (2,466 NPS) and BT4-1740 (2,583 NPS) networks, cuda-fp16 backend.
```diff
 // - producing 4 x 6x6 elements
 template <typename T, bool nhcw>
-__global__ void InputTransform_kernel(int N, int C, const T* input, T* output) {
+__global__ __launch_bounds__(1024) void InputTransform_kernel(int N, int C, const T* input, T* output) {
```
Contributor review comment:

1024 is too high a limit: it caps the kernel at 64 registers per thread. T78 uses only 512 channels, which corresponds to a block size of 512, and the kernels are a little faster when allowed 128 registers.
