## Goal

Complement geospatial optimizations (#17) with fast faceted metadata queries - the kind that power dashboards, filter dropdowns, and exploratory UIs.

**User experience we're targeting:**
- Instant facet counts on page load ("1.2M samples from OPENCONTEXT, 3.1M from SESAR...")
- Responsive filter refinement (select a material type → see updated counts)
- Fast cross-entity label lookups (show material names without slow joins)
## The Problem

Even though parquet has good native optimizations (dictionary encoding, column stats), some queries still require full scans:

| Query Type | Current Behavior | Pain Point |
|------------|------------------|------------|
| Facet counts | `GROUP BY n` scans all rows | Slow on 20M rows |
| Filter + count | `WHERE material='Rock' GROUP BY source` | Compounds the cost |
| Label lookups | Join through `p__*` arrays | Array membership is slow |
| Dashboard totals | Multiple aggregations | Repeated scans |
## What Parquet Gives Us (Free)

| Feature | Benefit |
|---------|---------|
| Dictionary encoding | `n`, `otype` filters are fast (int comparison) |
| Min/max stats | Range predicates skip row groups |
| Column pruning | Only reads columns in `SELECT` |

**Already fast:** `WHERE n = 'OPENCONTEXT'` (equality on a dictionary column)

**Still slow:** `SELECT n, COUNT(*) GROUP BY n` (must scan to count)
## Pre-computation Strategies

### 1. Facet Count Summary Table (Recommended First)

Pre-compute common aggregations as a tiny parquet, `facet_summaries.parquet` (~1KB):

| facet_type | facet_value | count |
|------------|-------------|-----------|
| source | OPENCONTEXT | 1,200,000 |
| source | SESAR | 3,100,000 |
| source | GEOME | 1,500,000 |
| material | Rock | 2,500,000 |
| material | Ite | 800,000 |
| material | Ite | 600,000 |
| object_type | Specimen | 4,000,000 |
| ... | ... | ... |
**Benefits:**
- Instant dashboard load (fetch 1KB instead of scanning 280MB)
- Can include combinations: `source + material` cross-tab
- Easy to regenerate when data updates

**Query pattern:**

```sql
-- Instead of slow:
SELECT n, COUNT(*) FROM wide WHERE otype='MaterialSampleRecord' GROUP BY n

-- Fast lookup:
SELECT * FROM facet_summaries WHERE facet_type = 'source'
```
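The precompute itself is a single pass over the wide table. A minimal pure-Python sketch of the precompute-then-lookup pattern (the in-memory `rows` list is a stand-in for reading the real parquet; column names follow the examples above):

```python
from collections import Counter

# Stand-in for the wide table: one (source, material) pair per sample row.
# In production these would come from isamples_wide.parquet.
rows = [
    ("OPENCONTEXT", "Rock"),
    ("OPENCONTEXT", "Rock"),
    ("SESAR", "Rock"),
    ("GEOME", "Soil"),
]

# One full pass at build time produces the tiny summary...
facet_summaries = [
    {"facet_type": "source", "facet_value": v, "count": c}
    for v, c in Counter(src for src, _ in rows).items()
] + [
    {"facet_type": "material", "facet_value": v, "count": c}
    for v, c in Counter(mat for _, mat in rows).items()
]

# ...and the dashboard query becomes a tiny filter, not a scan.
source_counts = [r for r in facet_summaries if r["facet_type"] == "source"]
print(source_counts)
```

The same shape works regardless of how the summary is materialized (parquet, JSON, or an in-memory cache); the point is that the expensive `GROUP BY` runs once at build time, not per page load.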
### 2. Cross-Facet Intersection Counts

For "how many Rock samples from OPENCONTEXT?", pre-compute `facet_intersections.parquet`:

| source | material | count |
|-------------|----------|---------|
| OPENCONTEXT | Rock | 50,000 |
| OPENCONTEXT | Ite | 30,000 |
| SESAR | Rock | 800,000 |

**Trade-off:** Combinatorial explosion if too many facets. Limit to top N values per facet.
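The top-N cap is what keeps the cross-tab bounded: rank values per facet first, then only materialize intersections among the winners. A sketch (pure Python; the hypothetical in-memory `rows` stand in for the parquet):

```python
from collections import Counter

# Stand-in sample rows: (source, material) per sample.
rows = ([("OPENCONTEXT", "Rock")] * 5 + [("OPENCONTEXT", "Soil")] * 1
        + [("SESAR", "Rock")] * 8 + [("SESAR", "Sediment")] * 2)

TOP_N = 1  # keep only the most common value per facet

# Rank each facet's values independently...
top_sources = {v for v, _ in Counter(s for s, _ in rows).most_common(TOP_N)}
top_materials = {v for v, _ in Counter(m for _, m in rows).most_common(TOP_N)}

# ...then count only intersections whose members survived the cut,
# so the output has at most TOP_N * TOP_N rows instead of |S| * |M|.
intersections = Counter(
    (s, m) for s, m in rows if s in top_sources and m in top_materials
)
print(intersections)
```

Queries for rarer combinations fall back to the full scan, which is an acceptable trade for keeping the intersection file small.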
### 3. Denormalized Label Columns

Avoid joins for display by flattening lookups:

```sql
-- Current: requires a join through the p__has_material_category array
SELECT msr.label, ic.label AS material
FROM wide msr, wide ic
WHERE ic.row_id = ANY(msr.p__has_material_category)
```

Pre-computed: labels already present in the row:

| row_id | label | material_labels | context_labels |
|--------|-------|-----------------|----------------|
| 123 | "Pottery sherd" | ["Rock", "ite"] | ["Earth surface"] |

**Trade-off:** Increases file size, duplicates data. Best for frequently-displayed fields only.
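The flattening is a build-time id→label resolution. A minimal sketch with hypothetical in-memory rows (in production both sides would be read from the wide parquet and the result written back out as the denormalized file):

```python
# Hypothetical vocabulary side: row_id -> display label.
labels = {7: "Rock", 9: "Earth surface"}

# Sample rows carrying id arrays (stand-ins for p__has_material_category).
samples = [
    {"row_id": 123, "label": "Pottery sherd",
     "p__has_material_category": [7]},
]

# Build time: resolve each id array into a parallel label array,
# so display queries never need the array-membership join.
for s in samples:
    s["material_labels"] = [labels[i] for i in s["p__has_material_category"]]

print(samples[0]["material_labels"])
```

The id arrays can be kept alongside the new label columns, so graph-style queries still work against the same file.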
### 4. Bloom Filters on High-Cardinality Columns

For "does this parquet contain samples from project X?":

```python
# Sketch only: bloom-filter writing is a parquet-writer feature and the
# exact option varies by library and version (it is not a valid
# `column_encoding` value in pyarrow); check your writer's docs for the
# precise knob before relying on it.
import pyarrow.parquet as pq

pq.write_table(table, 'file.parquet', write_statistics=True)
```

**Benefits:** Fast negative lookups ("this row group definitely doesn't have X")
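The mechanism behind those negative lookups can be shown in a few lines. This is a toy bloom filter, not parquet's actual (split-block) implementation, but the guarantee is the same: a miss means "definitely absent", a hit only means "maybe present":

```python
import hashlib

class TinyBloom:
    """Toy bloom filter: k hash positions over an m-bit field."""

    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, value: str):
        # Derive k positions by salting the hash with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, value: str) -> None:
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value: str) -> bool:
        # False => definitely absent; True => maybe present.
        return all(self.bits >> p & 1 for p in self._positions(value))

bf = TinyBloom()
for project in ["PROJECT_A", "PROJECT_B"]:
    bf.add(project)

print(bf.might_contain("PROJECT_A"))
```

A reader that checks the filter first can skip an entire row group without touching its pages whenever the answer is "definitely absent", which is exactly the win for high-cardinality columns where min/max stats are too loose to prune.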
## Interaction with PQG Schema

Same principle as #17: the core spec stays clean, enhancements are optional.

| File | Purpose | Size |
|------|---------|------|
| `isamples_wide.parquet` | Core data | 280MB |
| `isamples_facet_summaries.parquet` | Pre-computed counts | ~10KB |
| `isamples_facet_intersections.parquet` | Cross-tab counts | ~100KB |
| `isamples_wide_denormalized.parquet` | With label columns | ~350MB |
Clients choose which files to use based on their needs.
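That choice can be a simple capability probe on the client: use the tiny summary when it has been published, otherwise fall back to the full scan. A hedged sketch (file name from the table above; the query callables are hypothetical stand-ins for real parquet reads):

```python
import os

def facet_counts(fast_lookup, slow_scan,
                 summary_path="isamples_facet_summaries.parquet"):
    """Prefer the pre-computed summary file when it exists; otherwise
    fall back to the full GROUP BY scan over the core parquet."""
    if os.path.exists(summary_path):
        return fast_lookup(summary_path)
    return slow_scan()

# Demo with stand-in callables and a deliberately missing path,
# so the slow path runs.
result = facet_counts(
    fast_lookup=lambda p: {"source": "summary file"},
    slow_scan=lambda: {"source": "full scan"},
    summary_path="does_not_exist/isamples_facet_summaries.parquet",
)
print(result)
```

Because the enhancement files are strictly additive, a client that never probes for them still works against the core parquet alone.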
## Exploration Strategy

### Phase 1: Identify Hot Queries
- What facets does the UI actually need?
- What's the current query time for each?
- Which cause the most pain?

### Phase 2: Build Facet Summary Table
- Generate `facet_summaries.parquet`
- Measure the dashboard load improvement
- Decide which cross-facet intersections matter

### Phase 3: Evaluate Denormalization
- Prototype a `material_labels` column
- Measure size increase vs. query speedup
- Decide if it's worth the trade-off

### Phase 4: Document & Ship
- Add to the PQG spec as optional enhancement files
- Update the data pipeline to regenerate on refresh
- Publish alongside the core parquet
## Benchmark Matrix (To Fill In)

| Query | Current (scan) | With Summary Table | Notes |
|-------|----------------|--------------------|-------|
| Source facet counts | ? ms | ? ms | |
| Material facet counts | ? ms | ? ms | |
| Source × Material cross-tab | ? ms | ? ms | |
| Sample with material labels | ? ms | ? ms | With denorm |
## Success Criteria

## Related
cc @smrgeoinfo @datadavev