fix: add __launch_bounds__ to winograd kernels for Blackwell GPUs#2391

Open
stondo wants to merge 1 commit into LeelaChessZero:master from stondo:fix/blackwell-winograd-launch-bounds

Conversation

@stondo stondo commented Feb 21, 2026

Summary

  • Add __launch_bounds__(1024) to InputTransform_kernel and OutputTransform_kernel in winograd_helper.inc
  • Fixes "CUDA error: too many resources requested for launch" on Blackwell GPUs (sm_120/sm_121)

Details

Blackwell (sm_120/sm_121) has different per-SM resource limits compared to previous architectures. Without explicit __launch_bounds__, the CUDA compiler over-allocates registers for the Winograd transform kernels, causing them to exceed per-block resource limits at launch time.

This is the same class of issue documented in pytorch/pytorch#150266.

Adding __launch_bounds__(1024) constrains the compiler's register allocation without affecting kernel behavior. The value 1024 matches cudaDeviceProp::maxThreadsPerBlock and is the standard approach for this category of Blackwell compatibility fix.

Without this fix, lc0 crashes immediately on any Blackwell GPU (RTX 5090, RTX 5080, RTX 5070 Ti, GB10, GB200, etc.).
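The effect of the attribute can be checked from host code. The following is a hedged sketch (the kernel name and body are placeholders, not the actual lc0 winograd code): `cudaFuncGetAttributes` reports the register count the compiler actually allocated, which should fall within the per-thread budget implied by `__launch_bounds__(1024)`.

```cuda
// Hypothetical verification sketch; TransformSketch_kernel is a placeholder,
// not the real lc0 kernel.
#include <cstdio>
#include <cuda_runtime.h>

// __launch_bounds__(1024) promises nvcc the kernel is never launched with
// more than 1024 threads per block, letting it cap per-thread register
// allocation so every launch fits the per-block resource limits.
__global__ __launch_bounds__(1024) void TransformSketch_kernel(float* p) {
  p[blockIdx.x * blockDim.x + threadIdx.x] = 0.f;  // placeholder body
}

int main() {
  cudaFuncAttributes attr;
  cudaFuncGetAttributes(&attr, TransformSketch_kernel);
  // On architectures with a 64K-register file per SM, regs/thread should
  // be <= 64 when __launch_bounds__(1024) is in effect.
  printf("regs/thread=%d maxThreadsPerBlock=%d\n",
         attr.numRegs, attr.maxThreadsPerBlock);
  return 0;
}
```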

Benchmarks (NVIDIA GB10, sm_121, CUDA 13.0)

| Network                         | Backend   | Peak NPS | Batch |
|---------------------------------|-----------|----------|-------|
| T78 (320ch)                     | cuda-fp16 | 2,466    | 40    |
| BT4-1740 (1024ch, transformer)  | cuda-fp16 | 2,583    | 24    |

Test plan

  • Verified crash without fix on GB10 Blackwell (sm_121)
  • Verified fix resolves crash on GB10 Blackwell
  • Benchmarked T78 and BT4-1740 networks (cuda-fp16)
  • No regression expected on pre-Blackwell (launch_bounds is a compiler hint, not a hard constraint on older GPUs)

Commit message

InputTransform_kernel and OutputTransform_kernel exceed Blackwell sm_121 per-block resource limits without explicit launch bounds, causing "too many resources requested for launch" at runtime.

Adding __launch_bounds__(1024) constrains register allocation and fixes the crash. No impact on pre-Blackwell architectures.

Same class of fix as PyTorch #150266 for Blackwell compatibility.

Tested on NVIDIA GB10 (sm_121) with T78 (2,466 NPS) and BT4-1740 (2,583 NPS) networks, cuda-fp16 backend.
```diff
 // - producing 4 x 6x6 elements
 template <typename T, bool nhcw>
-__global__ void InputTransform_kernel(int N, int C, const T* input, T* output) {
+__global__ __launch_bounds__(1024) void InputTransform_kernel(int N, int C, const T* input, T* output) {
```
Contributor review comment:

1024 is too high a limit: it caps the kernel at 64 registers per thread. T78 uses only 512 channels, which corresponds to a block size of 512, and the kernels are a little faster when allowed 128 registers.
