Conversation
The Databricks runner now resolves dataset_size via the shared _factory helper instead of taking it as a parameter, removing the redundant DatasetSize.SMALL default that node entrypoints had to forward.
Convert point-in-polygon-lookup and attribute-spatial-compound-filter
(duckdb + postgis) to the factory pattern. The outer entrypoint reads
dataset_size from DI; the builder closes over it and bakes the resolved
query_id into the @monitor-decorated inner. DuckDB modules thread
dataset_size into the parquet path; PostGIS modules switch from the
hardcoded buildings_small table to buildings_{size}.
…me counties to municipalities
Convert national_scale_spatial_join_{duckdb,postgis} to the factory
pattern with dataset_size resolved from DI. Phase 1 swapped the boundary
file to municipalities.parquet, so the join SQL identifiers now match:
DuckDB CTE counties -> municipalities, column alias county_name ->
municipality_name; PostGIS seed helper _seed_counties -> _seed_municipalities,
target table counties -> municipalities. The PostGIS benchmark SQL also
parameterises the buildings table by dataset_size.
Per Table 4.2.1 the neighborhood-scale bbox row is THE bbox row, so the
neighborhood entrypoints become the canonical bbox entrypoints:
bbox_filtering_result_set_sizes_neighborhood_{duckdb,postgis,local}.py
-> bbox_filtering_{duckdb,postgis,local}.py. All three switch to the
factory pattern. The local variant raises ValueError when dataset_size
is not SMALL, since the Shapefile target is small-only per Table 4.2.1.
The __init__ re-exports, app_config wiring, benchmark_runner dispatch,
and benchmarks.yml entries are updated to match the new names.
There was a problem hiding this comment.
Pull request overview
Refactors several "surviving" benchmark entrypoints (point-in-polygon, national-scale spatial join, attribute+spatial compound filter, bbox filtering at neighborhood scale) to a factory pattern that pulls the dataset size from the DI container and parameterises the buildings table / parquet path accordingly. Renames the three bbox_filtering_result_set_sizes_neighborhood_* entrypoints to plain bbox_filtering_*, renames counties to municipalities in the non-Databricks national-scale join scripts (matching the metadata file), and drops the explicit dataset_size parameter from the Databricks node entrypoints. A new _factory module centralises _get_dataset_size / _build_query_id.
Changes:
- New
_factory._get_dataset_size+_build_query_idhelpers; all surviving entrypoints now build their@monitorquery_id and table/path from the DI-injected size. - Consolidated three neighborhood bbox entrypoints (
*_duckdb/*_postgis/*_local) intobbox_filtering_{duckdb,postgis,local}, withbbox_filtering_localrejecting non-SMALL sizes. - Renamed
counties→municipalitiesinnational_scale_spatial_join_{duckdb,postgis}.py; Databricks node entrypoints now resolvedataset_sizevia DI instead of taking it as a parameter.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/presentation/entrypoints/_factory.py | New helpers _get_dataset_size/_build_query_id. |
| src/presentation/entrypoints/_databricks_benchmark_runner.py | Drops dataset_size param; resolves it via _get_dataset_size; uses _build_query_id. |
| src/presentation/entrypoints/point_in_polygon_lookup_{duckdb,postgis}.py | Builds _benchmark inside _build_benchmark_fn parameterised by size. |
| src/presentation/entrypoints/national_scale_spatial_join_{duckdb,postgis}.py | Same factory pattern; renames counties → municipalities (table, column, vars, log strings). |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_{2,4,8}_nodes.py | Removes dataset_size parameter and import. |
| src/presentation/entrypoints/attribute_spatial_compound_filter_{duckdb,postgis}.py | Factory pattern with size-parameterised buildings table/path. |
| src/presentation/entrypoints/bbox_filtering_{duckdb,postgis,local}.py | New consolidated entrypoints; _local validates SMALL-only. |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_neighborhood_*.py | Deleted (replaced by the consolidated entrypoints). |
| src/presentation/entrypoints/init.py | Re-exports updated to new module names. |
| src/presentation/configuration/app_config.py | Adds _factory and new bbox modules to DI wiring. |
| benchmarks.yml | Renames the three neighborhood experiments and their related_script_ids cross-references. |
| benchmark_runner.py | CLI cases updated to new IDs; drops explicit dataset_size=DatasetSize.SMALL for Databricks calls. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The helper in monitor_utils.py had zero callers; entrypoints resolve dataset_size via _factory._get_dataset_size. Drop the dead duplicate plus its now-unused DatasetSize import so there is one source of truth.
_build_query_id now appends -{size} for every dataset size, including
SMALL. Every run is self-describing about the tier it ran against, at
the cost of detaching new SMALL run keys from historical SMALL keys
written under the bare base. Refactor db_scan_blob_storage and
db_scan_postgis to the factory pattern so --dataset-size actually
flows into their parquet path / buildings table reference and produces
the size-suffixed query_id.
This pull request refactors several benchmark entrypoints to generalize and simplify the handling of dataset sizes, and consolidates multiple "neighborhood" benchmark scripts into single parameterized entrypoints. The changes improve maintainability by reducing code duplication and centralizing dataset size logic, while updating configuration and dependency injection wiring accordingly.
Refactoring and consolidation of benchmark entrypoints:
duckdb,postgis,local) into new, generalized entrypoints:bbox_filtering_duckdb,bbox_filtering_postgis, andbbox_filtering_local. These now accept the dataset size as a parameter, reducing code duplication and simplifying the codebase. [1] [2] [3] [4] [5] [6] [7] [8]Generalization of dataset size handling:
_get_dataset_sizeand_build_query_idhelper functions insrc/presentation/entrypoints/_factory.pyto resolve the current dataset size from the DI container and compose query IDs accordingly. This allows all relevant benchmarks to dynamically use the correct dataset size and produce consistent query IDs. [1] [2] [3] [4] [5] [6] [7] [8]Enhancements to attribute and bbox filtering benchmarks:
attribute_spatial_compound_filter_duckdbandattribute_spatial_compound_filter_postgisto parameterize the dataset size, allowing them to operate on any configured dataset size rather than just "small". [1] [2]bbox_filtering_duckdbentrypoint that is parameterized by dataset size, replacing the previous neighborhood-specific version.Configuration updates:
benchmarks.ymlto use the new generalized entrypoint IDs and Docker images, and to reflect the new relationships between benchmark scripts. [1] [2] [3]_factorymodule to the dependency injection initialization.These changes make the benchmarking framework more flexible, easier to extend, and less error-prone when adding new dataset sizes or benchmark types.