Commit 4d95e0f

sparse conv doc update, cache inspection doc

1 parent d9082e8

4 files changed: +238 −98 lines changed


README.md

Lines changed: 4 additions & 0 deletions

The install instructions in README.md now include a fallback for failed installs:

```bash
git clone https://github.com/NVlabs/WarpConvNet.git
cd WarpConvNet
git submodule update --init 3rdparty/cutlass
pip install .

# If this fails, please create an issue on https://github.com/NVlabs/WarpConvNet/issues and try running the following commands:
cd WarpConvNet
pip install -e .
```

Available optional dependency groups:
Lines changed: 185 additions & 0 deletions (new file)
## Inspecting the Benchmark Cache

WarpConvNet benchmarks sparse convolution algorithms at runtime and caches the results for fast reuse across sessions. This page documents the `scripts/inspect_benchmark_cache.py` helper script, which pretty-prints the cached results so you can see what was tried and which configurations performed best.

### What it shows

- **Namespaces**: Logical groups of cached results (e.g., `sparse_conv_forward`, `sparse_conv_backward`, low-level kernels).
- **Per-configuration results**: For each input configuration (number of coordinates, channels, kernel volume, dtype), the script lists the algorithms that were benchmarked and their measured times.
- **Ordering**: Within each configuration, results are shown best-first by default.

## Quick start

Run without arguments to see the namespace tree:

```bash
python scripts/inspect_benchmark_cache.py
```

Show details for a specific namespace (e.g., forward sparse conv):

```bash
python scripts/inspect_benchmark_cache.py namespace=sparse_conv_forward
```

Show only the best algorithm per configuration:

```bash
python scripts/inspect_benchmark_cache.py namespace=sparse_conv_forward --best-only
```

Show the top K results per configuration:

```bash
python scripts/inspect_benchmark_cache.py namespace=sparse_conv_forward --top-k 3
```

Search namespaces or keys by passing extra arguments:

```bash
# List namespaces, then search for entries containing "wmma"
python scripts/inspect_benchmark_cache.py wmma

# Search inside a specific namespace
python scripts/inspect_benchmark_cache.py namespace=sparse_conv_forward wmma
```
## Sample output

Below is an excerpt from a real run inspecting the `sparse_conv_forward` namespace. Times are in milliseconds; lower is better.

```text
Loading benchmark cache...
Cache file location: /home/<user>/.cache/warpconvnet/benchmark_cache_generic.pkl
Cache file size: 44,320 bytes
Last modified: 2025-09-08 13:33:35

============================================================
NAMESPACE TREE
============================================================
Total namespaces: 6

- implicit_gemm_AD_gather_scatter: 37 entry(ies)
- implicit_gemm_trAB_gather: 24 entry(ies)
- sparse_conv_backward: 6 entry(ies)
- sparse_conv_forward: 11 entry(ies)
- wmma_implicit_gemm_sm80: 23 entry(ies)
- wmma_split_k_implicit_gemm_sm80: 13 entry(ies)

============================================================
NAMESPACE: SPARSE_CONV_FORWARD
============================================================
Total configurations: 11

----------------------------------------
Configuration 1:
----------------------------------------
Config Parameters:
  log_num_in_coords: 21
  log_num_out_coords: 21
  in_channels: 3
  out_channels: 32
  kernel_volume: 27
  in_dtype: torch.float16

Results:
[
  [
    "implicit_gemm"
    {
      fwd_block_size: 16
    }
    4.149
  ]
  [
    "implicit_gemm"
    {
      fwd_block_size: 32
    }
    7.833
  ]
  ["wmma_implicit_gemm", {}, 10.814]
  ["explicit_gemm", {}, 13.789]
  [
    "implicit_gemm"
    {
      fwd_block_size: 4
    }
    15.120
  ]
]

----------------------------------------
Configuration 2:
----------------------------------------
Config Parameters:
  log_num_in_coords: 21
  log_num_out_coords: 21
  in_channels: 32
  out_channels: 32
  kernel_volume: 27
  in_dtype: torch.float16

Results:
[
  ["cutlass_implicit_gemm", {}, 4.613]
  [
    "implicit_gemm"
    {
      fwd_block_size: 16
    }
    8.107
  ]
  ["wmma_implicit_gemm", {}, 14.126]
  ["explicit_gemm", {}, 19.792]
]
```
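The gap between algorithms can be large even within a single configuration. As a quick sanity check, here is a small sketch that compares the Configuration 1 timings above, assuming the `[algo_name, params, time_ms]` entry layout shown in the excerpt:

```python
# Sample timings copied from Configuration 1 in the excerpt above.
# Entry layout: [algo_name, params, time_ms].
results = [
    ["implicit_gemm", {"fwd_block_size": 16}, 4.149],
    ["implicit_gemm", {"fwd_block_size": 32}, 7.833],
    ["wmma_implicit_gemm", {}, 10.814],
    ["explicit_gemm", {}, 13.789],
    ["implicit_gemm", {"fwd_block_size": 4}, 15.120],
]

# Pick the fastest and slowest entries by measured time.
best = min(results, key=lambda r: r[2])
worst = max(results, key=lambda r: r[2])
speedup = worst[2] / best[2]  # ~3.6x on these sample numbers

print(f"best: {best[0]} {best[1]} at {best[2]:.3f} ms")
print(f"speedup over slowest: {speedup:.1f}x")
```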
## Interpreting results

- **Configuration**: A unique combination of problem shape and dtype: `num_in_coords`, `num_out_coords` (logged as powers of 2 for brevity), `in_channels`, `out_channels`, `kernel_volume`, and `in_dtype`.
- **Algorithms**: Each entry is `[algo_name, params, time_ms]`.
  - `implicit_gemm` may include parameters such as `fwd_block_size`, `gemm_block_size`, `split_k_threads_per_block`, and `split_k_factor`, depending on the forward/backward pass.
  - `cutlass_implicit_gemm` and `wmma_implicit_gemm` typically list `{}` because they auto-tune internally.
  - `explicit_gemm` uses a dense path and lists `{}`.
- **Best-first**: The first result per configuration is the fastest among those benchmarked.
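The `log_num_*` fields compress coordinate counts into a compact key, which is why the sample shows `log_num_in_coords: 21` rather than a raw count of roughly two million. As an illustration only (the helper below is hypothetical, and whether the library rounds or floors the logarithm is an implementation detail):

```python
import math

# Hypothetical sketch of how a problem shape could map to a configuration
# key like the one in the sample output. This rounds the log2; the actual
# key construction lives inside WarpConvNet and may differ.
def config_key(num_in, num_out, in_ch, out_ch, kernel_volume, in_dtype):
    return {
        "log_num_in_coords": round(math.log2(num_in)),
        "log_num_out_coords": round(math.log2(num_out)),
        "in_channels": in_ch,
        "out_channels": out_ch,
        "kernel_volume": kernel_volume,
        "in_dtype": in_dtype,
    }

key = config_key(2_000_000, 2_000_000, 3, 32, 27, "torch.float16")
print(key["log_num_in_coords"])  # ~2M coordinates -> log2 of about 21
```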
## Relationship to environment variables

The cache reflects runs filtered by the environment variable settings in `warpconvnet/constants.py`:

- `WARPCONVNET_FWD_ALGO_MODE`
- `WARPCONVNET_BWD_ALGO_MODE`

These accept a single algorithm (e.g., `implicit_gemm`, `cutlass_implicit_gemm`, `wmma_implicit_gemm`, `explicit_gemm`) or a list such as `[implicit_gemm,wmma_implicit_gemm,cutlass_implicit_gemm]`. The inspector shows what was actually benchmarked within that filtered search space.
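You can set these variables in your shell (`export WARPCONVNET_FWD_ALGO_MODE=...`) or from Python before the library is imported. A minimal sketch using the variable names and value formats documented above (how the library parses the bracketed list is an implementation detail):

```python
import os

# Restrict the forward search space to a bracketed list of algorithms,
# and pin the backward pass to a single algorithm. These must be set
# before WarpConvNet is imported, since it reads them at import time.
os.environ["WARPCONVNET_FWD_ALGO_MODE"] = "[implicit_gemm,wmma_implicit_gemm,cutlass_implicit_gemm]"
os.environ["WARPCONVNET_BWD_ALGO_MODE"] = "explicit_gemm"

# import warpconvnet  # import only after the variables are set
print(os.environ["WARPCONVNET_FWD_ALGO_MODE"])
```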
See the sparse convolutions guide for details on recommended settings and AUTO mode.

## Benchmark cache management

The benchmark cache is managed automatically:

- **Persistent storage**: Results are saved to `~/.cache/warpconvnet/`.
- **Configuration-specific**: Separate cache entries exist for different input sizes, channels, kernel volumes, and dtypes.
- **Background saving**: Cache updates can happen in background threads.
- **Manual reset**: Clear the cache with `rm -rf ~/.cache/warpconvnet/` if needed.
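To check whether a persistent cache exists without running the inspector, you can stat the file directly. A minimal sketch, assuming the cache filename shown in the sample output (`benchmark_cache_generic.pkl`); the pickle's internal layout is an implementation detail, so this reports only file-level metadata:

```python
from pathlib import Path

def describe_cache(path: Path) -> str:
    """Report whether a cache file exists and how large it is."""
    if not path.exists():
        return f"no cache at {path}"
    return f"{path} ({path.stat().st_size:,} bytes)"

# Default location from the sample output above.
cache = Path.home() / ".cache" / "warpconvnet" / "benchmark_cache_generic.pkl"
print(describe_cache(cache))
```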
## Tips and troubleshooting

- **Clear the cache** when switching GPUs or after significant software changes:

  ```bash
  rm -rf ~/.cache/warpconvnet/
  ```

- **Algorithm availability** depends on your GPU and toolchain:
  - CUTLASS requires a compatible compute capability.
  - WMMA requires Tensor Cores and a compatible compute capability.
- **First run is slower**: Benchmarking is performed once per unique configuration; subsequent runs reuse the cached best.
- **Focus the search**: Use environment variable lists to limit benchmarking to known-good algorithms during development.

## Script location

The inspector script lives at:

- `scripts/inspect_benchmark_cache.py`

You can open it to see additional flags and formatting logic, or invoke it directly as shown above.
