Conversation
Force-pushed from 6a367e7 to 4607a15
CodSpeed Performance Report: merging this PR will improve performance by 11.51%.
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1472      +/-   ##
==========================================
- Coverage   88.43%   88.38%   -0.05%
==========================================
  Files          96       97       +1
  Lines       19183    19432     +249
==========================================
+ Hits        16964    17175     +211
- Misses       2219     2257      +38

☔ View full report in Codecov by Sentry.
I have pushed some changes which avoid doing a basic loop. I still need some cleanup and to look at the code with fresher eyes. Some basic support for dask arrays and cupy is also likely needed... Roughly, what has been done is (I will try to write more details in the original post closer to this PR being mergeable):
Some initial benchmarks below. I haven't really reviewed the benchmark code, as it is AI-generated, and I also need to double-check that the tests actually go down the right code path. Benchmark script
"""
Validate that 3D quadmesh produces identical results to running 2D quadmesh
in a loop over each band, and benchmark the performance difference.
Uses the same sizes and setup as datashader/tests/benchmarks/test_quadmesh.py
"""
import numpy as np
import xarray as xr
import datashader as ds
from datashader.reductions import sum as ds_sum, mean, count
import time
# Sizes from benchmarks/test_quadmesh.py
DATA_SIZES = (256, 512, 1024, 2048, 4096, 8192)
CANVAS_SIZE = (1024, 1024)
# Test different quadmesh types and reductions
def test_correctness(mesh_type, reduction, size, nz=3, benchmark_iters=5, canvas_size=CANVAS_SIZE):
"""
Test that 3D quadmesh matches 2D quadmesh run in a loop, and benchmark performance.
Args:
mesh_type: 'raster', 'rectilinear', or 'curvilinear'
reduction: reduction function (e.g., ds_sum, mean)
size: grid size
nz: number of bands
benchmark_iters: number of iterations for benchmarking (after warmup)
Returns:
tuple: (passed, time_3d_ms, time_2d_ms, speedup)
"""
print(f"\n{'='*60}")
print(f"Testing {mesh_type} quadmesh with {reduction.__name__} reduction")
print(f"Data size: {nz} bands × {size}×{size}, Canvas: {canvas_size[0]}×{canvas_size[1]}")
print(f"{'='*60}")
# Use coordinate system from benchmarks/test_quadmesh.py
west = 3125000.0
south = 3250000.0
east = 4250000.0
north = 4375000.0
x_range = (west, east)
y_range = (south, north)
# Create test data with values that make it easy to verify correctness
# Each band has different values: band i has values from i*100 to i*100+size*size
rng = np.random.default_rng(seed=42) # Fixed seed for reproducibility
data_3d = np.zeros((nz, size, size))
for z in range(nz):
data_3d[z] = rng.random((size, size)) * 100 + z * 100
if mesh_type == 'raster':
# Evenly spaced coordinates (matches benchmark setup)
lon_coords = np.linspace(3123580.0, 4250380.0, size)
lat_coords = np.linspace(4376200.0, 3249400.0, size)
data_xr = xr.DataArray(
data_3d,
dims=("band", "y", "x"),
coords={
"lon": ("x", lon_coords),
"lat": ("y", lat_coords),
"band": list(range(nz)),
},
name="test_data",
)
# Swap dims for raster (matches benchmark)
data_xr = data_xr.swap_dims({"y": "lat", "x": "lon"})
x_name, y_name = "lon", "lat"
elif mesh_type == 'rectilinear':
# Non-uniformly spaced 1D coordinates (matches benchmark setup)
lon_coords = np.linspace(3123580.0, 4250380.0, size)
lat_coords = np.linspace(4376200.0, 3249400.0, size)
# Add random deltas to make it non-uniform (matches benchmark)
dy = (y_range[1] - y_range[0]) / size
deltas = rng.uniform(-dy/2, dy/2, size)
lat_coords = lat_coords + deltas
data_xr = xr.DataArray(
data_3d,
dims=("band", "y", "x"),
coords={
"lon": ("x", lon_coords),
"lat": ("y", lat_coords),
"band": list(range(nz)),
},
name="test_data",
)
# Swap dims for rectilinear (matches benchmark)
data_xr = data_xr.swap_dims({"y": "lat", "x": "lon"})
x_name, y_name = "lon", "lat"
elif mesh_type == 'curvilinear':
# 2D coordinate arrays (matches benchmark setup with broadcast)
lon_1d = np.linspace(3123580.0, 4250380.0, size)
lat_1d = np.linspace(4376200.0, 3249400.0, size)
# Create base DataArray with dims (y, x, band) to match test setup
data_base = xr.DataArray(
data_3d.transpose(1, 2, 0), # Transpose from (nz, size, size) to (size, size, nz)
dims=("y", "x", "band"),
coords={
"x": lon_1d,
"y": lat_1d,
"band": list(range(nz)),
},
name="test_data",
)
# Broadcast to create 2D coordinate arrays (matches benchmark)
lon_coord, lat_coord = xr.broadcast(data_base.x, data_base.y)
data_base = data_base.assign_coords({"lon": lon_coord, "lat": lat_coord})
# Transpose to (band, y, x) for 3D processing
data_xr = data_base.transpose(..., "y", "x")
x_name, y_name = "lon", "lat"
else:
raise ValueError(f"Unknown mesh_type: {mesh_type}")
# Setup canvas (use provided canvas_size)
cvs = ds.Canvas(plot_width=canvas_size[0], plot_height=canvas_size[1],
x_range=x_range, y_range=y_range)
# Method 1: Run 3D quadmesh directly
print("\n1. Running 3D quadmesh (optimized)...")
# Warmup run
result_3d = cvs.quadmesh(data_xr, x=x_name, y=y_name, agg=reduction("test_data"))
print(f" Result shape: {result_3d.shape}")
# Benchmark runs
print(f" Benchmarking ({benchmark_iters} iterations)...")
times_3d = []
for _ in range(benchmark_iters):
t0 = time.perf_counter()
_ = cvs.quadmesh(data_xr, x=x_name, y=y_name, agg=reduction("test_data"))
t1 = time.perf_counter()
times_3d.append((t1 - t0) * 1000) # Convert to ms
time_3d_ms = np.mean(times_3d)
print(f" Average time: {time_3d_ms:.3f} ms (±{np.std(times_3d):.3f} ms)")
# Method 2: Run 2D quadmesh in a loop for each band
print("\n2. Running 2D quadmesh in loop (reference)...")
# Function to run 2D loop
def run_2d_loop():
results = []
for z in range(nz):
# Extract single band
# For curvilinear, slice from transposed data to ensure lon/lat coords have consistent dims
data_2d = data_xr.isel(band=z)
result_2d = cvs.quadmesh(data_2d, x=x_name, y=y_name, agg=reduction("test_data"))
results.append(result_2d.values)
return np.stack(results, axis=0)
# Warmup run
result_2d_stacked = run_2d_loop()
print(f" Stacked shape: {result_2d_stacked.shape}")
# Benchmark runs
print(f" Benchmarking ({benchmark_iters} iterations)...")
times_2d = []
for _ in range(benchmark_iters):
t0 = time.perf_counter()
_ = run_2d_loop()
t1 = time.perf_counter()
times_2d.append((t1 - t0) * 1000) # Convert to ms
time_2d_ms = np.mean(times_2d)
print(f" Average time: {time_2d_ms:.3f} ms (±{np.std(times_2d):.3f} ms)")
# Compare results
print("\n3. Comparing results...")
speedup = time_2d_ms / time_3d_ms if time_3d_ms > 0 else 0.0
if result_3d.shape != result_2d_stacked.shape:
print(f" ❌ FAIL: Shape mismatch!")
print(f" 3D: {result_3d.shape}")
print(f" 2D: {result_2d_stacked.shape}")
return False, time_3d_ms, time_2d_ms, speedup
# Compare values (accounting for NaN)
result_3d_vals = result_3d.values
result_2d_vals = result_2d_stacked
# Check NaN locations match
nan_mask_3d = np.isnan(result_3d_vals)
nan_mask_2d = np.isnan(result_2d_vals)
if not np.array_equal(nan_mask_3d, nan_mask_2d):
print(f" ❌ FAIL: NaN locations don't match!")
print(f" 3D NaN count: {nan_mask_3d.sum()}")
print(f" 2D NaN count: {nan_mask_2d.sum()}")
return False, time_3d_ms, time_2d_ms, speedup
# Compare non-NaN values
valid_mask = ~nan_mask_3d
diff = np.abs(result_3d_vals[valid_mask] - result_2d_vals[valid_mask])
max_diff = diff.max() if diff.size > 0 else 0
# Use relative tolerance for floating point comparison
atol = 1e-10
rtol = 1e-10
close = np.allclose(result_3d_vals[valid_mask], result_2d_vals[valid_mask],
atol=atol, rtol=rtol)
if close:
print(f" ✅ PASS: Results match perfectly!")
print(f" Max absolute difference: {max_diff:.2e}")
print(f" Valid pixels: {valid_mask.sum()}/{valid_mask.size}")
print(f"\n4. Performance comparison:")
print(f" 3D optimized: {time_3d_ms:.3f} ms")
print(f" 2D loop: {time_2d_ms:.3f} ms")
print(f" Speedup: {speedup:.2f}x")
return True, time_3d_ms, time_2d_ms, speedup
else:
print(f" ❌ FAIL: Results don't match!")
print(f" Max absolute difference: {max_diff:.2e}")
print(f" Valid pixels: {valid_mask.sum()}/{valid_mask.size}")
# Show some examples of differences
diff_locs = np.where(diff > atol + rtol * np.abs(result_2d_vals[valid_mask]))
if len(diff_locs[0]) > 0:
print(f" First 5 mismatches:")
for idx in range(min(5, len(diff_locs[0]))):
i = diff_locs[0][idx]
print(f" 3D: {result_3d_vals[valid_mask][i]:.6f}, "
f"2D: {result_2d_vals[valid_mask][i]:.6f}, "
f"diff: {diff[i]:.2e}")
return False, time_3d_ms, time_2d_ms, speedup
def main():
"""Run all validation tests."""
print("\n" + "="*80)
print("3D QUADMESH CORRECTNESS VALIDATION & BENCHMARKS")
print("="*80)
print("\nThis validates that the optimized 3D quadmesh (which computes")
print("coordinates once and loops over bands in numba) produces identical")
print("results to running 2D quadmesh separately for each band.")
print(f"\nUsing benchmark sizes from test_quadmesh.py: {DATA_SIZES}")
print(f"Canvas size: {CANVAS_SIZE}")
# Test configurations organized by quadmesh type
# Use 3 bands to simulate RGB data (the main use case)
nz = 3
test_configs = []
# All quadmesh types now support 3D
for mesh_type in ['raster', 'rectilinear', 'curvilinear']:
for size in DATA_SIZES:
# Test with sum (simple reduction)
test_configs.append((mesh_type, ds_sum, size, nz))
test_configs.append((mesh_type, mean, size, nz))
results = []
for mesh_type, reduction, size, nz in test_configs:
passed, time_3d, time_2d, speedup = test_correctness(mesh_type, reduction, size, nz)
results.append((mesh_type, reduction.__name__, size, nz, passed, time_3d, time_2d, speedup))
# Summary - organize by quadmesh type
print("\n" + "="*80)
print("SUMMARY - ORGANIZED BY QUADMESH TYPE")
print("="*80)
total = len(results)
passed_count = sum(1 for r in results if r[4])
# Group results by mesh type
for mesh_type in ['raster', 'rectilinear', 'curvilinear']:
type_results = [r for r in results if r[0] == mesh_type]
if not type_results:
continue
print(f"\n{mesh_type.upper()} QUADMESH:")
print(f"{'Size':<12} {'Reduction':<10} {'Status':<10} {'3D (ms)':<12} {'2D (ms)':<12} {'Speedup':<10}")
print("-" * 80)
for _, red_name, size, nz, status, time_3d, time_2d, speedup in type_results:
status_str = "✅ PASS" if status else "❌ FAIL"
size_str = f"{size}×{size}"
print(f"{size_str:<12} {red_name:<10} {status_str:<10} {time_3d:>10.1f} {time_2d:>10.1f} {speedup:>8.2f}x")
# Calculate average speedup for this type
type_speedups = [speedup for _, _, _, _, status, _, _, speedup in type_results if status]
if type_speedups:
avg_speedup = np.mean(type_speedups)
print(f"\n Average speedup for {mesh_type}: {avg_speedup:.2f}x")
print("\n" + "="*80)
print(f"Total: {passed_count}/{total} tests passed")
# Calculate overall average speedup for passed tests
if passed_count > 0:
avg_speedup = np.mean([speedup for _, _, _, _, status, _, _, speedup in results if status])
print(f"Overall average speedup: {avg_speedup:.2f}x")
if passed_count == total:
print("\n🎉 All tests passed! 3D quadmesh optimization is working correctly.")
return 0
else:
print(f"\n⚠️ {total - passed_count} test(s) failed!")
return 1
if __name__ == "__main__":
import sys
sys.exit(main())

================================================================================
SUMMARY - ORGANIZED BY QUADMESH TYPE
================================================================================
RASTER QUADMESH:
Size Reduction Status 3D (ms) 2D (ms) Speedup
--------------------------------------------------------------------------------
256×256 sum ✅ PASS 7.6 7.1 0.93x
256×256 mean ✅ PASS 6.8 8.3 1.23x
512×512 sum ✅ PASS 6.6 6.8 1.03x
512×512 mean ✅ PASS 6.7 8.5 1.27x
1024×1024 sum ✅ PASS 6.6 7.8 1.17x
1024×1024 mean ✅ PASS 9.2 7.5 0.81x
2048×2048 sum ✅ PASS 11.3 10.4 0.92x
2048×2048 mean ✅ PASS 19.2 20.4 1.07x
4096×4096 sum ✅ PASS 18.8 21.0 1.12x
4096×4096 mean ✅ PASS 27.2 29.0 1.07x
8192×8192 sum ✅ PASS 55.5 59.4 1.07x
8192×8192 mean ✅ PASS 63.1 70.1 1.11x
Average speedup for raster: 1.06x
RECTILINEAR QUADMESH:
Size Reduction Status 3D (ms) 2D (ms) Speedup
--------------------------------------------------------------------------------
256×256 sum ✅ PASS 5.2 13.2 2.53x
256×256 mean ✅ PASS 15.2 21.1 1.39x
512×512 sum ✅ PASS 6.7 20.8 3.12x
512×512 mean ✅ PASS 22.7 33.9 1.50x
1024×1024 sum ✅ PASS 10.6 48.4 4.56x
1024×1024 mean ✅ PASS 29.2 63.2 2.17x
2048×2048 sum ✅ PASS 25.7 159.9 6.22x
2048×2048 mean ✅ PASS 78.8 205.0 2.60x
4096×4096 sum ✅ PASS 124.9 748.5 5.99x
4096×4096 mean ✅ PASS 315.3 923.5 2.93x
8192×8192 sum ✅ PASS 971.0 3377.3 3.48x
8192×8192 mean ✅ PASS 1296.6 3784.4 2.92x
Average speedup for rectilinear: 3.28x
CURVILINEAR QUADMESH:
Size Reduction Status 3D (ms) 2D (ms) Speedup
--------------------------------------------------------------------------------
256×256 sum ✅ PASS 20.8 44.7 2.15x
256×256 mean ✅ PASS 25.4 46.7 1.84x
512×512 sum ✅ PASS 20.2 61.3 3.03x
512×512 mean ✅ PASS 26.8 72.1 2.69x
1024×1024 sum ✅ PASS 39.8 128.9 3.24x
1024×1024 mean ✅ PASS 56.1 147.7 2.63x
2048×2048 sum ✅ PASS 126.8 360.4 2.84x
2048×2048 mean ✅ PASS 147.4 398.4 2.70x
4096×4096 sum ✅ PASS 449.0 1298.1 2.89x
4096×4096 mean ✅ PASS 502.3 1457.6 2.90x
8192×8192 sum ✅ PASS 1714.0 5055.6 2.95x
8192×8192 mean ✅ PASS 1931.7 5653.7 2.93x
Average speedup for curvilinear: 2.73x
================================================================================
Total: 36/36 tests passed
Overall average speedup: 2.36x
🎉 All tests passed! 3D quadmesh optimization is working correctly.
Implement GPU parallelization for 3D quadmesh using CUDA streams, mirroring the CPU factory pattern with prange.

Changes:
- Add _CUDAStreamPool class for managing reusable CUDA streams
- Add _make_3d_from_2d_cuda() factory that returns factory_3d(grid_shape), matching the CPU pattern where runtime parameters (n_arrays for CPU, grid_shape for GPU) are passed at the usage site
- Apply to all three quadmesh types:
  * QuadMeshRaster: upsample_cuda_3d and downsample_cuda_3d
  * QuadMeshRectilinear: extend_cuda_3d
  * QuadMeshCurvilinear: extend_cuda_3d
- Fix CuPy compatibility:
  * Use .data instead of .values to preserve CuPy arrays
  * Use xp.clip() instead of np.clip() for array module compatibility
- Remove NotImplementedError for 3D CUDA support
- Z-slices execute in parallel across CUDA streams (up to 16 concurrent)

Implementation pattern:
  CPU: do_extend = extend_cpu_3d(n_arrays=len(aggs_and_cols))
  GPU: do_extend = extend_cuda_3d(grid_shape=(grid_w, grid_h))

Test changes:
- Fix rectilinear coord generation to use xp.array() for CuPy
- Use close=True for floating-point comparison in downsample cases

Results:
- All GPU tests pass: 6/6 ✅
- All CPU tests pass: 12/12 ✅ (no regression)
- Parallel execution of independent z-slices on GPU

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
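The round-robin reuse idea behind the stream pool described above can be sketched in plain Python. This is a hypothetical illustration (the class name and factory argument are made up for the sketch); the real `_CUDAStreamPool` would hold `numba.cuda` stream objects rather than placeholder instances:

```python
from itertools import cycle

class StreamPool:
    """Round-robin pool of reusable streams.

    Hypothetical sketch of the stream-pool idea: create a fixed number of
    streams up front and hand them out cyclically, so launching one kernel
    per z-slice never creates more than `max_streams` streams.
    """
    def __init__(self, make_stream, max_streams=16):
        # In the real implementation make_stream would be numba.cuda.stream.
        self._streams = [make_stream() for _ in range(max_streams)]
        self._next = cycle(self._streams)

    def get(self):
        # Reuse streams round-robin instead of allocating per z-slice.
        return next(self._next)

# 20 z-slices map onto only 16 reused (dummy) streams:
pool = StreamPool(make_stream=object, max_streams=16)
assigned = [pool.get() for _ in range(20)]
assert len(set(map(id, assigned))) == 16
```

Kernels launched on different streams may overlap on the GPU, which is how independent z-slices run concurrently.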
Force-pushed from c60c257 to 49c9145
Force-pushed from 49c9145 to ccd9319
dask_glyph_dispatch = Dispatcher()

def _flatten_dask_keys(keys_array):
could use or vendor dask.base.flatten (I think that's the import path)
Done in 6ddb6f7.
As I may have overlooked some nuances, you are welcome to review or test the PR.
This is the GPU equivalent of _make_3d_from_2d, creating wrappers that launch
2D CUDA kernels in parallel streams for each z-slice. This achieves true
parallelism on GPU, similar to how prange provides parallelism on CPU.
philippjfr
left a comment
Some clarification questions but as far as I'm able to tell this looks great. Checking the 3D implementation against the 2D implementation is fine as long as the existing test coverage for the 2D case is good (which I didn't confirm).

Resolves #1463
This PR adds support for bandwise 2D data for quadmesh, where the "new" dimension is an independent dimension, i.e., the same as looping over the 2D data: cvs.quadmesh(da, x='x', y='y').isel(band=0) == cvs.quadmesh(da.isel(band=0), x='x', y='y'). This is done in the following steps:

1. Make the glyph_dispatch for data_libraries/xarray.py and data_libraries/dask_xarray.py understand the third dimension, and have the aggregate converted to be a bandwise 2D array.
2. Update glyph/quadmesh.py to convert a 2D calculation to a 3D calculation. This is done in two ways: one for the CPU and one for CUDA.
3. Change d[:2] to d[-2:].

For 2), this is done differently for CPU and CUDA.
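The bandwise-independence contract can be illustrated with a toy stand-in for an aggregation (a binned sum, not datashader's actual kernels): aggregating all bands in one vectorized pass must match running the 2D aggregation per band in a loop. All names here are illustrative:

```python
import numpy as np

def agg2d(data, bins, nbins):
    """Toy 2D aggregation: scatter-add data values into canvas bins."""
    out = np.zeros(nbins)
    np.add.at(out, bins, data)
    return out

rng = np.random.default_rng(0)
nz, n, nbins = 3, 100, 8
data = rng.random((nz, n))
bins = rng.integers(0, nbins, size=n)  # same data->pixel mapping for every band

# Reference path: loop the 2D aggregation over each band and stack.
looped = np.stack([agg2d(data[z], bins, nbins) for z in range(nz)])

# "3D" path: one vectorized scatter-add over all bands at once.
out3d = np.zeros((nz, nbins))
np.add.at(out3d, (np.arange(nz)[:, None], bins[None, :]), data)

assert np.allclose(out3d, looped)
```

The broadcasted index pair `(np.arange(nz)[:, None], bins[None, :])` expands to one (band, bin) target per data value, so each band is aggregated independently, which is exactly the semantics the PR promises for the band dimension.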
For the CPU, we dynamically generate a function that iterates over the bands. This is done so we can handle different sizes of *aggs_and_cols, which can differ based on the reduction type: rd.mean has two arrays (summing and counting), whereas rd.sum has only one (summing). This function is then numba-compiled and cached for future use.

For CUDA, we use cuda.streams(), which was mainly generated by Claude Code. The performance appears to be as expected: 3 bands achieve slightly less than a 3x time reduction compared to 2D+loop.

Performance
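The CPU factory idea can be sketched in pure Python under stated assumptions: the real version generates source code, compiles it with numba (using prange for the band loop), and caches it per number of aggregate arrays; here a plain closure and a toy 2D kernel (`make_3d_from_2d`, `bump_2d` are hypothetical names) stand in:

```python
import numpy as np

def make_3d_from_2d(extend_2d, n_aggs):
    """Wrap a 2D kernel so it loops over the leading band axis of each
    aggregate buffer. In the real pattern, n_aggs keys a cache of generated,
    numba-compiled functions; this sketch just closes over the 2D kernel."""
    def extend_3d(*aggs_3d):
        nz = aggs_3d[0].shape[0]
        for z in range(nz):  # prange in the numba-compiled version
            # Each agg[z] is a view, so the 2D kernel mutates it in place.
            extend_2d(*(agg[z] for agg in aggs_3d))
    return extend_3d

def bump_2d(*aggs_2d):
    """Toy 2D kernel: increment every cell of each aggregate it is given."""
    for a in aggs_2d:
        a += 1

# Two aggregate buffers, as a mean-like reduction (sum + count) would need:
do_extend = make_3d_from_2d(bump_2d, n_aggs=2)
a = np.zeros((3, 4, 4))
b = np.zeros((3, 4, 4))
do_extend(a, b)
assert a.sum() == 48 and b.sum() == 48  # 3 bands x 4 x 4 cells, each +1
```

Accepting `*aggs_and_cols` variadically is what lets one factory serve reductions with different numbers of backing arrays.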
Results from ccd9319 (10 iterations)
Script
GPU Test