
Explore geospatial query optimizations via H3 pre-computation #17

@rdhyee

Description


Goal

Deliver a fast, pleasant experience for developers and users querying iSamples geospatial data over the web. The current approach (DuckDB-WASM + remote Parquet via HTTP range requests) already works well, but geospatial queries leave room for optimization.

User experience we're targeting:

  • Sub-second map rendering at any zoom level
  • Responsive bounding box filters
  • Smooth zoom-dependent clustering
  • Minimal data transfer for spatial queries

The Problem

Geospatial queries on remote parquet are expensive:

| Query type | Current behavior |
| --- | --- |
| Bounding box filter | Scans all rows, evaluates lat/lon conditions |
| "Near this point" | Distance calculation on every row |
| Zoom-level clustering | Full aggregation, no pre-computation |
| Faceted geo query | Combines the above costs |

With 6.7M+ samples (20M rows in wide format), this adds up.


Optimization Strategies

1. H3 Pre-computation (Recommended Starting Point)

Add H3 hexagonal grid cell IDs at multiple resolutions:

```sql
SELECT *,
       h3_latlng_to_cell(latitude, longitude, 4) AS h3_res4,  -- Continental
       h3_latlng_to_cell(latitude, longitude, 6) AS h3_res6,  -- Regional
       h3_latlng_to_cell(latitude, longitude, 8) AS h3_res8   -- Local
FROM wide_parquet
WHERE latitude IS NOT NULL;
```

Benefits:

  • Filter by cell ID = simple integer match (fast)
  • Hierarchical: zoom-appropriate aggregation via GROUP BY h3_res6
  • DuckDB has native H3 extension
  • Minimal infrastructure change (same single-file approach)
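To make the first benefit concrete, a bounding-box filter can be rewritten as a cell-ID membership test. A sketch, assuming the DuckDB community `h3` extension and the `h3_res6` column added above; the cell IDs below are hypothetical placeholders (in practice the viewport's cells would be computed client-side, e.g. with h3-js, and inlined into the query):

```sql
-- Assumes the community h3 extension and a precomputed h3_res6 column.
INSTALL h3 FROM community;
LOAD h3;

-- Integer equality on cell IDs replaces lat/lon range comparisons on every row.
SELECT *
FROM read_parquet('isamples_wide_geo.parquet')   -- hypothetical filename
WHERE h3_res6 IN (604189641121398783, 604189642195140607);  -- hypothetical IDs
```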

2. Partitioned Parquet by H3

Split into Hive-style partitions:

```
data/h3_res4=841f9ffffffffff/part-0.parquet
data/h3_res4=841fbffffffffff/part-0.parquet
```

Benefits: Range requests skip irrelevant partitions entirely
Trade-off: More complex file management, needs manifest
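DuckDB can exploit such a layout directly via its `hive_partitioning` option. A sketch, assuming the partition scheme shown above:

```sql
-- hive_partitioning exposes h3_res4 as a virtual column; filtering on it
-- prunes whole partitions, so their files are never requested over HTTP.
SELECT count(*)
FROM read_parquet('data/*/*.parquet', hive_partitioning = true)
WHERE h3_res4 = '841f9ffffffffff';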

3. Pre-aggregated Tiles

Create summary parquets at each zoom level:

```
zoom_4_summaries.parquet   # ~1K rows
zoom_8_summaries.parquet   # ~100K rows
full_data.parquet          # 20M rows
```

Benefits: Appropriate data volume per zoom
Trade-off: Storage overhead, staleness concerns

4. GeoParquet Bounding Box Metadata

Ensure row group bounding boxes in parquet footer for predicate pushdown.
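Parquet footers already carry per-row-group min/max statistics; the catch is that those statistics are only selective if nearby rows land in the same row groups. Sorting spatially (e.g. by an H3 cell) before writing makes the lat/lon min/max ranges tight. A sketch, with hypothetical filenames:

```sql
-- Spatially clustered row groups let DuckDB skip groups whose lat/lon
-- min/max ranges fall outside the query's bounding box.
COPY (
    SELECT *
    FROM read_parquet('full_data.parquet')
    ORDER BY h3_res8
) TO 'full_data_sorted.parquet' (FORMAT parquet, ROW_GROUP_SIZE 122880);
```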

5. Edge Compute (Future)

Cloudflare Workers + DuckDB for server-side filtering closer to users.


Interaction with PQG Schema

Key principle: Core spec stays clean, enhancements are optional

Parquet is additive-friendly. We can add columns without breaking existing queries:

| Column type | Examples | Required? |
| --- | --- | --- |
| Core PQG | row_id, otype, s, p, o, n, label, pid, latitude, longitude, p__* | Yes |
| Enhancement | h3_res4, h3_res6, h3_res8, geometry | No |

Conventions to consider:

  • Prefix enhancement columns? (_idx_h3_res4, _geo_*)
  • Document in PQG spec (see Define formal PQG Parquet Schema specification #16)
  • Possibly publish multiple variants:
    • isamples_wide.parquet (core only, smaller)
    • isamples_wide_geo.parquet (with H3, geometry)

Exploration Strategy

Phase 1: Local Python Baseline

  • Benchmark current query speeds (bbox, point-radius, aggregation)
  • Add H3 columns to local parquet
  • Measure improvement

Phase 2: File Size Impact

  • Measure size increase per H3 resolution
  • Find sweet spot (e.g., res4 + res7 only?)
  • Test compression ratios

Phase 3: Remote (R2) Performance

  • Upload test files to R2
  • Measure HTTP range request behavior
  • Compare bytes transferred: H3 filter vs lat/lon filter
  • Test different row group sizes
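The bytes-transferred comparison can be run from local DuckDB against the remote file via the `httpfs` extension. A sketch, with a hypothetical R2 URL and cell ID:

```sql
-- httpfs issues HTTP range requests, fetching only the footer plus the
-- row groups that survive predicate pushdown.
INSTALL httpfs;
LOAD httpfs;

SELECT count(*)
FROM read_parquet('https://example.r2.dev/isamples_wide_geo.parquet')
WHERE h3_res6 = 604189641121398783;  -- hypothetical cell ID
```

Running the same query with a lat/lon `BETWEEN` predicate instead of the H3 filter gives the baseline column of the benchmark matrix below.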

Phase 4: Browser (DuckDB-WASM) Validation

  • Create test harness in Quarto
  • Confirm optimizations work in browser
  • Measure memory usage

Phase 5: Document & Decide

  • Benchmark report with numbers
  • Recommendation on which resolutions
  • Update PQG spec with enhancement columns
  • Push optimized parquet to R2 if results are good

Benchmark Matrix (To Fill In)

| Query type | Local | Remote Python | Remote WASM | Notes |
| --- | --- | --- | --- | --- |
| Bbox (baseline) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Bbox (H3) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Aggregation (baseline) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Aggregation (H3 GROUP BY) | ? ms | ? ms / ? MB | ? ms / ? MB | |
| Point radius 50km | ? ms | ? ms / ? MB | ? ms / ? MB | |

Success Criteria

  • Bbox queries 5x+ faster with H3
  • Aggregation queries support zoom-appropriate clustering
  • File size increase < 20% in exchange for the performance gains
  • Works in DuckDB-WASM (browser) without special handling
  • Enhancement columns documented in PQG spec

Related

cc @smrgeoinfo @datadavev
