Skip to content

Feature/272 refactor surviving query entrypoints to factory pattern and consume dataset size from di#277

Merged
jathavaan merged 8 commits into
mainfrom
feature/272-refactor-surviving-query-entrypoints-to-factory-pattern-and-consume-dataset_size-from-di
May 19, 2026
Merged

Feature/272 refactor surviving query entrypoints to factory pattern and consume dataset size from di#277
jathavaan merged 8 commits into
mainfrom
feature/272-refactor-surviving-query-entrypoints-to-factory-pattern-and-consume-dataset_size-from-di

Conversation

@jathavaan
Copy link
Copy Markdown
Collaborator

This pull request refactors several benchmark entrypoints to generalize and simplify the handling of dataset sizes, and consolidates multiple "neighborhood" benchmark scripts into single parameterized entrypoints. The changes improve maintainability by reducing code duplication and centralizing dataset size logic, while updating configuration and dependency injection wiring accordingly.

Refactoring and consolidation of benchmark entrypoints:

  • Consolidated the three "bbox-filtering-result-set-sizes-neighborhood" entrypoints (duckdb, postgis, local) into new, generalized entrypoints: bbox_filtering_duckdb, bbox_filtering_postgis, and bbox_filtering_local. These now accept the dataset size as a parameter, reducing code duplication and simplifying the codebase. [1] [2] [3] [4] [5] [6] [7] [8]
  • Updated imports and dependency injection wiring throughout the codebase to use the new entrypoint names and locations. [1] [2]

Generalization of dataset size handling:

  • Introduced _get_dataset_size and _build_query_id helper functions in src/presentation/entrypoints/_factory.py to resolve the current dataset size from the DI container and compose query IDs accordingly. This allows all relevant benchmarks to dynamically use the correct dataset size and produce consistent query IDs. [1] [2] [3] [4] [5] [6] [7] [8]
  • Refactored the Databricks national-scale spatial join benchmarks to use the new dataset size resolution logic, removing the need to pass the dataset size as an explicit parameter. [1] [2] [3] [4]

Enhancements to attribute and bbox filtering benchmarks:

  • Updated attribute_spatial_compound_filter_duckdb and attribute_spatial_compound_filter_postgis to parameterize the dataset size, allowing them to operate on any configured dataset size rather than just "small". [1] [2]
  • Added a new bbox_filtering_duckdb entrypoint that is parameterized by dataset size, replacing the previous neighborhood-specific version.

Configuration updates:

  • Updated benchmarks.yml to use the new generalized entrypoint IDs and Docker images, and to reflect the new relationships between benchmark scripts. [1] [2] [3]
  • Added the new _factory module to the dependency injection initialization.

These changes make the benchmarking framework more flexible, easier to extend, and less error-prone when adding new dataset sizes or benchmark types.

jathavaan added 5 commits May 19, 2026 08:58
The Databricks runner now resolves dataset_size via the shared _factory
helper instead of taking it as a parameter, removing the redundant
DatasetSize.SMALL default that node entrypoints had to forward.
Convert point-in-polygon-lookup and attribute-spatial-compound-filter
(duckdb + postgis) to the factory pattern. The outer entrypoint reads
dataset_size from DI; the builder closes over it and bakes the resolved
query_id into the @monitor-decorated inner. DuckDB modules thread
dataset_size into the parquet path; PostGIS modules switch from the
hardcoded buildings_small table to buildings_{size}.
…me counties to municipalities

Convert national_scale_spatial_join_{duckdb,postgis} to the factory
pattern with dataset_size resolved from DI. Phase 1 swapped the boundary
file to municipalities.parquet, so the join SQL identifiers now match:
DuckDB CTE counties -> municipalities, column alias county_name ->
municipality_name; PostGIS seed helper _seed_counties -> _seed_municipalities,
target table counties -> municipalities. The PostGIS benchmark SQL also
parameterises the buildings table by dataset_size.
Per Table 4.2.1 the neighborhood-scale bbox row is THE bbox row, so the
neighborhood entrypoints become the canonical bbox entrypoints:
bbox_filtering_result_set_sizes_neighborhood_{duckdb,postgis,local}.py
-> bbox_filtering_{duckdb,postgis,local}.py. All three switch to the
factory pattern. The local variant raises ValueError when dataset_size
is not SMALL, since the Shapefile target is small-only per Table 4.2.1.
The __init__ re-exports, app_config wiring, benchmark_runner dispatch,
and benchmarks.yml entries are updated to match the new names.
@jathavaan jathavaan self-assigned this May 19, 2026
Copilot AI review requested due to automatic review settings May 19, 2026 07:03
@jathavaan jathavaan enabled auto-merge May 19, 2026 07:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors several "surviving" benchmark entrypoints (point-in-polygon, national-scale spatial join, attribute+spatial compound filter, bbox filtering at neighborhood scale) to a factory pattern that pulls the dataset size from the DI container and parameterises the buildings table / parquet path accordingly. Renames the three bbox_filtering_result_set_sizes_neighborhood_* entrypoints to plain bbox_filtering_*, renames counties to municipalities in the non-Databricks national-scale join scripts (matching the metadata file), and drops the explicit dataset_size parameter from the Databricks node entrypoints. A new _factory module centralises _get_dataset_size / _build_query_id.

Changes:

  • New _factory._get_dataset_size + _build_query_id helpers; all surviving entrypoints now build their @monitor query_id and table/path from the DI-injected size.
  • Consolidated three neighborhood bbox entrypoints (*_duckdb/*_postgis/*_local) into bbox_filtering_{duckdb,postgis,local}, with bbox_filtering_local rejecting non-SMALL sizes.
  • Renamed countiesmunicipalities in national_scale_spatial_join_{duckdb,postgis}.py; Databricks node entrypoints now resolve dataset_size via DI instead of taking it as a parameter.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/presentation/entrypoints/_factory.py New helpers _get_dataset_size/_build_query_id.
src/presentation/entrypoints/_databricks_benchmark_runner.py Drops dataset_size param; resolves it via _get_dataset_size; uses _build_query_id.
src/presentation/entrypoints/point_in_polygon_lookup_{duckdb,postgis}.py Builds _benchmark inside _build_benchmark_fn parameterised by size.
src/presentation/entrypoints/national_scale_spatial_join_{duckdb,postgis}.py Same factory pattern; renames countiesmunicipalities (table, column, vars, log strings).
src/presentation/entrypoints/national_scale_spatial_join_databricks_{2,4,8}_nodes.py Removes dataset_size parameter and import.
src/presentation/entrypoints/attribute_spatial_compound_filter_{duckdb,postgis}.py Factory pattern with size-parameterised buildings table/path.
src/presentation/entrypoints/bbox_filtering_{duckdb,postgis,local}.py New consolidated entrypoints; _local validates SMALL-only.
src/presentation/entrypoints/bbox_filtering_result_set_sizes_neighborhood_*.py Deleted (replaced by the consolidated entrypoints).
src/presentation/entrypoints/init.py Re-exports updated to new module names.
src/presentation/configuration/app_config.py Adds _factory and new bbox modules to DI wiring.
benchmarks.yml Renames the three neighborhood experiments and their related_script_ids cross-references.
benchmark_runner.py CLI cases updated to new IDs; drops explicit dataset_size=DatasetSize.SMALL for Databricks calls.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/presentation/entrypoints/_factory.py
jathavaan added 2 commits May 19, 2026 09:10
The helper in monitor_utils.py had zero callers; entrypoints resolve
dataset_size via _factory._get_dataset_size. Drop the dead duplicate
plus its now-unused DatasetSize import so there is one source of truth.
_build_query_id now appends -{size} for every dataset size, including
SMALL. Every run is self-describing about the tier it ran against, at
the cost of detaching new SMALL run keys from historical SMALL keys
written under the bare base. Refactor db_scan_blob_storage and
db_scan_postgis to the factory pattern so --dataset-size actually
flows into their parquet path / buildings table reference and produces
the size-suffixed query_id.
Copilot AI review requested due to automatic review settings May 19, 2026 07:13
db-scan-blob-storage and db-scan-postgis are out of #272 scope; their
refactor will be applied alongside the Issue 5 cleanup. Restore the
original SMALL=bare semantics of _build_query_id so historical SMALL
run keys remain accessible per #272.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

Comment thread src/presentation/entrypoints/_databricks_benchmark_runner.py
@jathavaan jathavaan merged commit 20fa155 into main May 19, 2026
36 checks passed
@jathavaan jathavaan deleted the feature/272-refactor-surviving-query-entrypoints-to-factory-pattern-and-consume-dataset_size-from-di branch May 19, 2026 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor surviving query entrypoints to factory pattern and consume dataset_size from DI

2 participants