Description
Symptom / Motivation
Conv2d and Conv3d generic implicit-GEMM kernels can fail to compile under TileLang 0.1.9 for certain fp16/bf16 configurations. The failure happens during CUDA codegen, before any numerical comparison runs:
```
tl::ptx_cp_async requires a final PTX byte width in {4, 8, 16}, but got 2
```
This was listed as a remaining convolution blocker in #1071.
Root Cause Analysis
The generic Conv2d/Conv3d kernels flatten convolution weights into the GEMM K dimension and load each weight tile with T.copy(weight_flat[k_iter * block_k, bx * block_n], weight_shared). When k_total = kernel_size_product * c_in is not divisible by block_k, the final K tile is partial. Under TileLang 0.1.9, that global-to-shared T.copy path can lower the tail copy to ptx_cp_async; for fp16/bf16 the final transfer can be only 2 bytes, which is illegal for cp.async.
The kernel should explicitly guard tail K and output-channel bounds and zero-fill invalid shared-memory elements instead of relying on a full-tile T.copy for the weight tile.
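For context, here is a condensed sketch of the load path described above. The T.copy call mirrors the one quoted in this section; the surrounding loop shape (T.Pipelined, num_stages) is an assumption about the kernel, not a quote from tileops/kernels/convolution.py:

```python
# Sketch of the generic implicit-GEMM weight load (loop shape assumed).
# k_total = kernel_size_product * c_in; when k_total % block_k != 0,
# the last iteration covers only a partial K tile.
for k_iter in T.Pipelined(T.ceildiv(k_total, block_k), num_stages=num_stages):
    # Full-tile copy into shared memory. Under TileLang 0.1.9 the tail
    # rows can lower to ptx_cp_async; with 2-byte fp16/bf16 elements the
    # final transfer can be 2 bytes, which cp.async does not allow.
    T.copy(weight_flat[k_iter * block_k, bx * block_n], weight_shared)
```

A full-tile T.copy gives the lowering no per-element predicate for the tail rows, which is why the plan below replaces it with an explicitly guarded load.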
Related Files
- tileops/kernels/convolution.py
- tests/ops/test_convolution.py
Goal
Fix Conv2d/Conv3d TileLang 0.1.9 CUDA codegen failures for partial tail weight tiles and restore the skipped convolution coverage.
Plan
- Replace the generic Conv2d/Conv3d weight-tile T.copy calls with explicit guarded shared-memory loads (see the sketch after this list).
- Zero-fill out-of-bounds tail K and output-channel elements before the GEMM.
- Remove the temporary Conv2d/Conv3d skips that covered this cp.async failure.
- Verify tests/ops/test_convolution.py on GPU.
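A minimal sketch of the guarded load referenced from the first plan item, assuming TileLang's T.Parallel and T.if_then_else constructs; n_total (number of output channels) and dtype are hypothetical names here, while the other index and buffer names follow the root-cause section:

```python
# Guarded replacement for the full-tile weight T.copy (sketch).
# Out-of-range elements are zero-filled so the GEMM over the shared tile
# stays correct on the partial tail K tile.
for i, j in T.Parallel(block_k, block_n):
    k_idx = k_iter * block_k + i   # position along the flattened K axis
    n_idx = bx * block_n + j       # output-channel position
    weight_shared[i, j] = T.if_then_else(
        (k_idx < k_total) and (n_idx < n_total),
        weight_flat[k_idx, n_idx],
        T.cast(0, dtype),
    )
```

Zero-filling the invalid elements keeps the accumulation over the full block_k extent correct without any extra predication inside the GEMM itself.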
Constraints
Keep the public Conv2d/Conv3d operator API unchanged.
Acceptance Criteria
- tests/ops/test_convolution.py passes on GPU with the temporary Conv2d/Conv3d skips removed.