Goal
Deliver a fast, pleasant experience for developers and users querying iSamples geospatial data over the web. The current approach (DuckDB-WASM reading remote parquet via HTTP range requests) already works well, but geospatial queries have room for optimization.
User experience we're targeting:
Sub-second map rendering at any zoom level
Responsive bounding box filters
Smooth zoom-dependent clustering
Minimal data transfer for spatial queries
The Problem
Geospatial queries on remote parquet are expensive:
| Query Type | Current Behavior |
| --- | --- |
| Bounding box filter | Scans all rows, evaluates lat/lon conditions |
| "Near this point" | Distance calculation on every row |
| Zoom-level clustering | Full aggregation, no pre-computation |
| Faceted geo query | Combines the above costs |
With 6.7M+ samples (20M rows in wide format), this adds up.
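To put a rough number on that: if latitude and longitude are stored as 8-byte doubles (an assumption; the actual column types and compression are not stated here), a filter that has to read both columns for every row touches about this much uncompressed data. Only the 20M row count comes from the text; the rest is illustrative.

```python
# Back-of-envelope estimate of coordinate data read by an unindexed filter.
ROWS_WIDE = 20_000_000   # row count quoted in the text
COORD_COLUMNS = 2        # latitude and longitude
BYTES_PER_VALUE = 8      # assumption: coordinates stored as DOUBLE


def coordinate_scan_bytes(
    rows: int,
    columns: int = COORD_COLUMNS,
    bytes_per_value: int = BYTES_PER_VALUE,
) -> int:
    """Uncompressed bytes needed to evaluate a lat/lon predicate on every row."""
    return rows * columns * bytes_per_value


scan_mb = coordinate_scan_bytes(ROWS_WIDE) / 1_000_000
print(f"{scan_mb:.0f} MB uncompressed")  # prints "320 MB uncompressed"
```

Compression and row-group statistics will usually reduce what is actually transferred, so treat this as a rough ceiling for the two coordinate columns, not a measurement.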
Optimization Strategies
1. H3 Pre-computation (Recommended Starting Point)
Add H3 hexagonal grid cell IDs at multiple resolutions:
    SELECT *,
        h3_latlng_to_cell(latitude, longitude, 4) AS h3_res4,  -- Continental
        h3_latlng_to_cell(latitude, longitude, 6) AS h3_res6,  -- Regional
        h3_latlng_to_cell(latitude, longitude, 8) AS h3_res8   -- Local
    FROM wide_parquet
    WHERE latitude IS NOT NULL
Benefits:
Filter by cell ID = simple integer match (fast)
Hierarchical: zoom-appropriate aggregation via GROUP BY h3_res6
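A sketch of how a map client might pick the precomputed column for the current zoom and build the clustering query. The zoom thresholds and both function names are assumptions for illustration; only the h3_res4 / h3_res6 / h3_res8 columns and the wide_parquet source come from the query above.

```python
def h3_column_for_zoom(zoom: int) -> str:
    """Map a web-map zoom level to a precomputed H3 column.

    The cut-offs (4 and 8) are hypothetical and would need tuning
    against real map behavior.
    """
    if zoom <= 4:
        return "h3_res4"   # continental
    if zoom <= 8:
        return "h3_res6"   # regional
    return "h3_res8"       # local


def cluster_query(zoom: int, source: str = "wide_parquet") -> str:
    """Build a GROUP BY query that clusters samples by H3 cell.

    AVG(longitude) is a simplification: it misplaces clusters whose
    cell straddles the antimeridian.
    """
    cell = h3_column_for_zoom(zoom)
    return (
        f"SELECT {cell} AS cell, COUNT(*) AS n, "
        f"AVG(latitude) AS lat, AVG(longitude) AS lon "
        f"FROM {source} WHERE {cell} IS NOT NULL GROUP BY {cell}"
    )


print(cluster_query(5))
```

Because the cell IDs are already stored, the query groups on an integer column instead of computing anything from coordinates at request time.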
2. Partitioned Parquet by H3
Split into Hive-style partitions:
Benefits: Range requests skip irrelevant partitions entirely
Trade-off: More complex file management, needs manifest
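To make the manifest trade-off concrete, here is a minimal sketch of a Hive-style layout (key=value directories) and a manifest mapping each partition file to its row count. Partitioning on h3_res4, the root name isamples_h3, and the file name data.parquet are all hypothetical choices, and the cell IDs in the usage lines are made-up placeholders, not real H3 indexes.

```python
from collections import defaultdict


def partition_path(root: str, cell_id: int, column: str = "h3_res4") -> str:
    """Return the Hive-style path for one partition (column=value directory)."""
    return f"{root}/{column}={cell_id}/data.parquet"


def build_manifest(rows, root: str = "isamples_h3", column: str = "h3_res4") -> dict:
    """Count rows per partition so a client can choose files before fetching any."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[column]] += 1
    return {
        partition_path(root, cell, column): n
        for cell, n in sorted(counts.items())
    }


# Placeholder cell IDs, for illustration only.
sample_rows = [{"h3_res4": 101}, {"h3_res4": 202}, {"h3_res4": 101}]
for path, n in build_manifest(sample_rows).items():
    print(path, n)
```

A client that knows which cells cover the viewport can then request only the matching paths, which is where the "skip irrelevant partitions" benefit comes from.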
3. Pre-aggregated Tiles
Create summary parquets at each zoom level:
Benefits: Appropriate data volume per zoom
Trade-off: Storage overhead, staleness concerns
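The shape of one such summary can be sketched in plain Python: one output row per cell, holding a count and a mean position. In practice this would more likely be a GROUP BY run in DuckDB; the function name and output field names here are assumptions.

```python
from collections import defaultdict


def aggregate_by_cell(rows, column: str):
    """Collapse sample rows into one summary row per H3 cell.

    Each input row needs `column`, "latitude" and "longitude".
    The mean longitude is a simplification that is wrong for cells
    straddling the antimeridian.
    """
    acc = defaultdict(lambda: [0, 0.0, 0.0])  # count, lat sum, lon sum
    for row in rows:
        slot = acc[row[column]]
        slot[0] += 1
        slot[1] += row["latitude"]
        slot[2] += row["longitude"]
    return [
        {"cell": cell, "n": n, "lat": lat_sum / n, "lon": lon_sum / n}
        for cell, (n, lat_sum, lon_sum) in sorted(acc.items())
    ]


# Placeholder cell ID and coordinates, for illustration only.
rows = [
    {"h3_res4": 7, "latitude": 10.0, "longitude": 20.0},
    {"h3_res4": 7, "latitude": 20.0, "longitude": 40.0},
]
print(aggregate_by_cell(rows, "h3_res4"))
```

Running this once per resolution column gives one summary per zoom band. The staleness concern is that every summary has to be rebuilt whenever the source parquet changes.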
4. GeoParquet Bounding Box Metadata
Ensure the parquet footer carries row-group bounding boxes, so predicate pushdown can skip row groups that fall outside the query area.
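The pruning idea can be sketched as follows: given one bounding box per row group, only the groups whose box intersects the query box need to be fetched. This is a simplified model, not how any particular reader implements it; the BBox class and function name are hypothetical, and boxes crossing the antimeridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BBox") -> bool:
        """True if the two boxes overlap or touch (no antimeridian handling)."""
        return (
            self.west <= other.east
            and other.west <= self.east
            and self.south <= other.north
            and other.south <= self.north
        )


def row_groups_to_read(row_group_boxes, query: BBox):
    """Return indexes of row groups whose bounding box intersects the query."""
    return [i for i, box in enumerate(row_group_boxes) if box.intersects(query)]


# Illustrative boxes, not taken from the real file.
groups = [
    BBox(-120, 30, -100, 45),
    BBox(5, 40, 15, 50),
    BBox(100, -10, 120, 5),
]
print(row_groups_to_read(groups, BBox(0, 35, 20, 55)))  # prints [1]
```

This only pays off if rows are written in a spatially coherent order. With rows in arbitrary order, every row group's box spans most of the globe and nothing gets skipped.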
5. Edge Compute (Future)
Cloudflare Workers + DuckDB for server-side filtering closer to users.
Interaction with PQG Schema
Key principle: Core spec stays clean, enhancements are optional
Parquet is additive-friendly. We can add columns without breaking existing queries:
Conventions to consider:
Column-name prefixes for optional columns (e.g. _idx_h3_res4, _geo_*)
Separate files: isamples_wide.parquet (core only, smaller) and isamples_wide_geo.parquet (with H3, geometry)
Exploration Strategy
Phase 1: Local Python Baseline
Phase 2: File Size Impact
Phase 3: Remote (R2) Performance
Phase 4: Browser (DuckDB-WASM) Validation
Phase 5: Document & Decide
Benchmark Matrix (To Fill In)
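One possible way to collect comparable numbers for the matrix is a small timing helper that runs a query callable several times and reports the median. This is only a sketch: time_query and its result keys are hypothetical names, and the callable would wrap whatever actually executes the query in each phase.

```python
import statistics
import time


def time_query(run, repeats: int = 5) -> dict:
    """Time `run()` `repeats` times and summarize wall-clock seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "runs": repeats,
    }


# Stand-in workload; replace with a function that executes a real query.
print(time_query(lambda: sum(range(100_000)), repeats=3))
```

Reporting the median rather than a single run reduces the effect of cold caches and network jitter, which matters most for the remote and browser phases.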
Success Criteria
Related
cc @smrgeoinfo @datadavev