Description
Symptom / Motivation
Conv2d and Conv3d generic implicit-GEMM kernels can fail to compile under TileLang 0.1.9 for certain fp16/bf16 configurations. The failure happens during CUDA codegen, before any numerical comparison runs:
```
tl::ptx_cp_async requires a final PTX byte width in {4, 8, 16}, but got 2
```
This was listed as a remaining convolution blocker in #1071.
Root Cause Analysis
The generic Conv2d/Conv3d kernels flatten convolution weights into the GEMM K dimension and load each weight tile with T.copy(weight_flat[k_iter * block_k, bx * block_n], weight_shared). When k_total = kernel_size_product * c_in is not divisible by block_k, the final K tile is partial. Under TileLang 0.1.9, that global-to-shared T.copy path can lower the tail copy to ptx_cp_async; for fp16/bf16 the final transfer can be only 2 bytes, which is illegal for cp.async.
The kernel should explicitly guard tail K and output-channel bounds and zero-fill invalid shared-memory elements instead of relying on a full-tile T.copy for the weight tile.
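For context, here is a condensed sketch of the load path described above. The T.copy call mirrors the one quoted in this section; the surrounding loop shape (T.Pipelined, num_stages) is an assumption about the kernel, not a quote from tileops/kernels/convolution.py:

```python
# Sketch of the generic implicit-GEMM weight load (loop shape assumed).
# k_total = kernel_size_product * c_in; when k_total % block_k != 0,
# the last iteration covers only a partial K tile.
for k_iter in T.Pipelined(T.ceildiv(k_total, block_k), num_stages=num_stages):
    # Full-tile copy into shared memory. Under TileLang 0.1.9 the tail
    # rows can lower to ptx_cp_async; with 2-byte fp16/bf16 elements the
    # final transfer can be 2 bytes, which cp.async does not allow.
    T.copy(weight_flat[k_iter * block_k, bx * block_n], weight_shared)
```

A full-tile T.copy gives the lowering no per-element predicate for the tail rows, which is why the plan below replaces it with an explicitly guarded load.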
Related Files
- tileops/kernels/convolution.py
- tests/ops/test_convolution.py
Goal
Fix Conv2d/Conv3d TileLang 0.1.9 CUDA codegen failures for partial tail weight tiles and restore the skipped convolution coverage.
Plan
- Replace the generic Conv2d/Conv3d weight-tile T.copy calls with explicit guarded shared-memory loads (see the sketch after this list).
- Zero-fill out-of-bounds tail K and output-channel elements before the GEMM.
- Remove the temporary Conv2d/Conv3d skips that covered this cp.async failure.
- Verify tests/ops/test_convolution.py on GPU.
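A minimal sketch of the guarded load referenced from the first plan item, assuming TileLang's T.Parallel and T.if_then_else constructs; n_total (number of output channels) and dtype are hypothetical names here, while the other index and buffer names follow the root-cause section:

```python
# Guarded replacement for the full-tile weight T.copy (sketch).
# Out-of-range elements are zero-filled so the GEMM over the shared tile
# stays correct on the partial tail K tile.
for i, j in T.Parallel(block_k, block_n):
    k_idx = k_iter * block_k + i   # position along the flattened K axis
    n_idx = bx * block_n + j       # output-channel position
    weight_shared[i, j] = T.if_then_else(
        (k_idx < k_total) and (n_idx < n_total),
        weight_flat[k_idx, n_idx],
        T.cast(0, dtype),
    )
```

Zero-filling the invalid elements keeps the accumulation over the full block_k extent correct without any extra predication inside the GEMM itself.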
Constraints
Keep the public Conv2d/Conv3d operator API unchanged.
Acceptance Criteria
- tests/ops/test_convolution.py passes on GPU with the temporary Conv2d/Conv3d skips removed.