
Explore geospatial query optimizations via H3 pre-computation #17

@rdhyee

Description


Goal

Deliver a fast, pleasant experience for developers and users querying iSamples geospatial data over the web. The current approach (DuckDB-WASM + remote Parquet via HTTP range requests) already works well, but geospatial queries leave room for optimization.

User experience we're targeting:

  • Sub-second map rendering at any zoom level
  • Responsive bounding box filters
  • Smooth zoom-dependent clustering
  • Minimal data transfer for spatial queries

The Problem

Geospatial queries on remote parquet are expensive:

| Query type | Current behavior |
| --- | --- |
| Bounding box filter | Scans all rows, evaluates lat/lon conditions |
| "Near this point" | Distance calculation on every row |
| Zoom-level clustering | Full aggregation, no pre-computation |
| Faceted geo query | Combines the above costs |

With 6.7M+ samples (20M rows in wide format), this adds up.


Optimization Strategies

1. H3 Pre-computation (Recommended Starting Point)

Add H3 hexagonal grid cell IDs at multiple resolutions:

```sql
SELECT *,
       h3_latlng_to_cell(latitude, longitude, 4) AS h3_res4,  -- Continental
       h3_latlng_to_cell(latitude, longitude, 6) AS h3_res6,  -- Regional
       h3_latlng_to_cell(latitude, longitude, 8) AS h3_res8   -- Local
FROM wide_parquet
WHERE latitude IS NOT NULL;
```

Benefits:

  • Filter by cell ID = simple integer match (fast)
  • Hierarchical: zoom-appropriate aggregation via GROUP BY h3_res6
  • DuckDB has native H3 extension
  • Minimal infrastructure change (same single-file approach)
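To make the first benefit concrete, a bounding-box filter can be rewritten as a cell-ID membership test. A sketch, assuming the DuckDB community `h3` extension and the `h3_res6` column added above; the cell IDs below are hypothetical placeholders (in practice the viewport's cells would be computed client-side, e.g. with h3-js, and inlined into the query):

```sql
-- Assumes the community h3 extension and a precomputed h3_res6 column.
INSTALL h3 FROM community;
LOAD h3;

-- Integer equality on cell IDs replaces lat/lon range comparisons on every row.
SELECT *
FROM read_parquet('isamples_wide_geo.parquet')   -- hypothetical filename
WHERE h3_res6 IN (604189641121398783, 604189642195140607);  -- hypothetical IDs
```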

2. Partitioned Parquet by H3

Split into Hive-style partitions:

```
data/h3_res4=841f9ffffffffff/part-0.parquet
data/h3_res4=841fbffffffffff/part-0.parquet
```

Benefits: Range requests skip irrelevant partitions entirely
Trade-off: More complex file management, needs manifest
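DuckDB can exploit such a layout directly via its `hive_partitioning` option. A sketch, assuming the partition scheme shown above:

```sql
-- hive_partitioning exposes h3_res4 as a virtual column; filtering on it
-- prunes whole partitions, so their files are never requested over HTTP.
SELECT count(*)
FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
WHERE h3_res4 = '841f9ffffffffff';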

3. Pre-aggregated Tiles

Create summary parquets at each zoom level:

```
zoom_4_summaries.parquet   # ~1K rows
zoom_8_summaries.parquet   # ~100K rows
full_data.parquet          # 20M rows
```

Benefits: Appropriate data volume per zoom
Trade-off: Storage overhead, staleness concerns

4. GeoParquet Bounding Box Metadata

Ensure row group bounding boxes in parquet footer for predicate pushdown.
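Parquet footers already carry per-row-group min/max statistics; the catch is that those statistics are only selective if nearby rows land in the same row groups. Sorting spatially (e.g. by an H3 cell) before writing makes the lat/lon min/max ranges tight. A sketch, with hypothetical filenames:

```sql
-- Spatially clustered row groups let DuckDB skip groups whose lat/lon
-- min/max ranges fall outside the query's bounding box.
COPY (
    SELECT *
    FROM read_parquet('full_data.parquet')
    ORDER BY h3_res8
) TO 'full_data_sorted.parquet' (FORMAT parquet, ROW_GROUP_SIZE 122880);
```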

5. Edge Compute (Future)

Cloudflare Workers + DuckDB for server-side filtering closer to users.


Interaction with PQG Schema

Key principle: Core spec stays clean, enhancements are optional

Parquet is additive-friendly. We can add columns without breaking existing queries:

| Column type | Examples | Required? |
| --- | --- | --- |
| Core PQG | row_id, otype, s, p, o, n, label, pid, latitude, longitude, p__* | Yes |
| Enhancement | h3_res4, h3_res6, h3_res8, geometry | No |

Conventions to consider:

  • Prefix enhancement columns? (_idx_h3_res4, _geo_*)
  • Document in PQG spec (see Define formal PQG Parquet Schema specification #16)
  • Possibly publish multiple variants:
    • isamples_wide.parquet (core only, smaller)
    • isamples_wide_geo.parquet (with H3, geometry)

Exploration Strategy

Phase 1: Local Python Baseline

  • Benchmark current query speeds (bbox, point-radius, aggregation)
  • Add H3 columns to local parquet
  • Measure improvement

Phase 2: File Size Impact

  • Measure size increase per H3 resolution
  • Find sweet spot (e.g., res4 + res7 only?)
  • Test compression ratios

Phase 3: Remote (R2) Performance

  • Upload test files to R2
  • Measure HTTP range request behavior
  • Compare bytes transferred: H3 filter vs lat/lon filter
  • Test different row group sizes
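The bytes-transferred comparison can be run from local DuckDB against the remote file via the `httpfs` extension. A sketch, with a hypothetical R2 URL and cell ID:

```sql
-- httpfs issues HTTP range requests, fetching only the footer plus the
-- row groups that survive predicate pushdown.
INSTALL httpfs;
LOAD httpfs;

SELECT count(*)
FROM read_parquet('https://example.r2.dev/isamples_wide_geo.parquet')
WHERE h3_res6 = 604189641121398783;  -- hypothetical cell ID
```

Running the same query with a lat/lon `BETWEEN` predicate instead of the H3 filter gives the baseline column of the benchmark matrix below.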

Phase 4: Browser (DuckDB-WASM) Validation

  • Create test harness in Quarto
  • Confirm optimizations work in browser
  • Measure memory usage

Phase 5: Document & Decide

  • Benchmark report with numbers
  • Recommendation on which resolutions
  • Update PQG spec with enhancement columns
  • Push optimized parquet to R2 if results are good

Benchmark Matrix (To Fill In)

| Query type | Local | Remote Python | Remote WASM | Notes |
| --- | --- | --- | --- | --- |
| Bbox (baseline) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Bbox (H3) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Aggregation (baseline) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Aggregation (H3 GROUP BY) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Point radius 50km | ? ms | ? ms / ? MB | ? ms / ? MB | |

Success Criteria

  • Bbox queries 5x+ faster with H3
  • Aggregation queries support zoom-appropriate clustering
  • File size increase < 20% in exchange for the performance gains
  • Works in DuckDB-WASM (browser) without special handling
  • Enhancement columns documented in PQG spec

Related

cc @smrgeoinfo @datadavev
