Feature/272 refactor surviving query entrypoints to factory pattern and consume dataset size from di by jathavaan · Pull Request #277 · kartAI/doppa

jathavaan · 2026-05-19T07:03:53Z

This pull request refactors several benchmark entrypoints to generalize and simplify the handling of dataset sizes, and consolidates multiple "neighborhood" benchmark scripts into single parameterized entrypoints. The changes improve maintainability by reducing code duplication and centralizing dataset size logic, while updating configuration and dependency injection wiring accordingly.

Refactoring and consolidation of benchmark entrypoints:

Consolidated the three "bbox-filtering-result-set-sizes-neighborhood" entrypoints (duckdb, postgis, local) into new, generalized entrypoints: bbox_filtering_duckdb, bbox_filtering_postgis, and bbox_filtering_local. These now accept the dataset size as a parameter, reducing code duplication and simplifying the codebase. [1] [2] [3] [4] [5] [6] [7] [8]
Updated imports and dependency injection wiring throughout the codebase to use the new entrypoint names and locations. [1] [2]

Generalization of dataset size handling:

Introduced _get_dataset_size and _build_query_id helper functions in src/presentation/entrypoints/_factory.py to resolve the current dataset size from the DI container and compose query IDs accordingly. This allows all relevant benchmarks to dynamically use the correct dataset size and produce consistent query IDs. [1] [2] [3] [4] [5] [6] [7] [8]
Refactored the Databricks national-scale spatial join benchmarks to use the new dataset size resolution logic, removing the need to pass the dataset size as an explicit parameter. [1] [2] [3] [4]

Enhancements to attribute and bbox filtering benchmarks:

Updated attribute_spatial_compound_filter_duckdb and attribute_spatial_compound_filter_postgis to parameterize the dataset size, allowing them to operate on any configured dataset size rather than just "small". [1] [2]
Added a new bbox_filtering_duckdb entrypoint that is parameterized by dataset size, replacing the previous neighborhood-specific version.

Configuration updates:

Updated benchmarks.yml to use the new generalized entrypoint IDs and Docker images, and to reflect the new relationships between benchmark scripts. [1] [2] [3]
Added the new _factory module to the dependency injection initialization.

These changes make the benchmarking framework more flexible, easier to extend, and less error-prone when adding new dataset sizes or benchmark types.

The Databricks runner now resolves dataset_size via the shared _factory helper instead of taking it as a parameter, removing the redundant DatasetSize.SMALL default that node entrypoints had to forward.

Convert point-in-polygon-lookup and attribute-spatial-compound-filter (duckdb + postgis) to the factory pattern. The outer entrypoint reads dataset_size from DI; the builder closes over it and bakes the resolved query_id into the @monitor-decorated inner. DuckDB modules thread dataset_size into the parquet path; PostGIS modules switch from the hardcoded buildings_small table to buildings_{size}.

…me counties to municipalities Convert national_scale_spatial_join_{duckdb,postgis} to the factory pattern with dataset_size resolved from DI. Phase 1 swapped the boundary file to municipalities.parquet, so the join SQL identifiers now match: DuckDB CTE counties -> municipalities, column alias county_name -> municipality_name; PostGIS seed helper _seed_counties -> _seed_municipalities, target table counties -> municipalities. The PostGIS benchmark SQL also parameterises the buildings table by dataset_size.

Per Table 4.2.1 the neighborhood-scale bbox row is THE bbox row, so the neighborhood entrypoints become the canonical bbox entrypoints: bbox_filtering_result_set_sizes_neighborhood_{duckdb,postgis,local}.py -> bbox_filtering_{duckdb,postgis,local}.py. All three switch to the factory pattern. The local variant raises ValueError when dataset_size is not SMALL, since the Shapefile target is small-only per Table 4.2.1. The __init__ re-exports, app_config wiring, benchmark_runner dispatch, and benchmarks.yml entries are updated to match the new names.

Copilot

Pull request overview

Refactors several "surviving" benchmark entrypoints (point-in-polygon, national-scale spatial join, attribute+spatial compound filter, bbox filtering at neighborhood scale) to a factory pattern that pulls the dataset size from the DI container and parameterises the buildings table / parquet path accordingly. Renames the three bbox_filtering_result_set_sizes_neighborhood_* entrypoints to plain bbox_filtering_*, renames counties to municipalities in the non-Databricks national-scale join scripts (matching the metadata file), and drops the explicit dataset_size parameter from the Databricks node entrypoints. A new _factory module centralises _get_dataset_size / _build_query_id.

Changes:

New _factory._get_dataset_size + _build_query_id helpers; all surviving entrypoints now build their @monitor query_id and table/path from the DI-injected size.
Consolidated three neighborhood bbox entrypoints (*_duckdb/*_postgis/*_local) into bbox_filtering_{duckdb,postgis,local}, with bbox_filtering_local rejecting non-SMALL sizes.
Renamed counties → municipalities in national_scale_spatial_join_{duckdb,postgis}.py; Databricks node entrypoints now resolve dataset_size via DI instead of taking it as a parameter.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
src/presentation/entrypoints/_factory.py	New helpers `_get_dataset_size`/`_build_query_id`.
src/presentation/entrypoints/_databricks_benchmark_runner.py	Drops `dataset_size` param; resolves it via `_get_dataset_size`; uses `_build_query_id`.
src/presentation/entrypoints/point_in_polygon_lookup_{duckdb,postgis}.py	Builds `_benchmark` inside `_build_benchmark_fn` parameterised by size.
src/presentation/entrypoints/national_scale_spatial_join_{duckdb,postgis}.py	Same factory pattern; renames `counties` → `municipalities` (table, column, vars, log strings).
src/presentation/entrypoints/national_scale_spatial_join_databricks_{2,4,8}_nodes.py	Removes `dataset_size` parameter and import.
src/presentation/entrypoints/attribute_spatial_compound_filter_{duckdb,postgis}.py	Factory pattern with size-parameterised buildings table/path.
src/presentation/entrypoints/bbox_filtering_{duckdb,postgis,local}.py	New consolidated entrypoints; `_local` validates SMALL-only.
src/presentation/entrypoints/bbox_filtering_result_set_sizes_neighborhood_*.py	Deleted (replaced by the consolidated entrypoints).
src/presentation/entrypoints/init.py	Re-exports updated to new module names.
src/presentation/configuration/app_config.py	Adds `_factory` and new bbox modules to DI wiring.
benchmarks.yml	Renames the three neighborhood experiments and their `related_script_ids` cross-references.
benchmark_runner.py	CLI cases updated to new IDs; drops explicit `dataset_size=DatasetSize.SMALL` for Databricks calls.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The helper in monitor_utils.py had zero callers; entrypoints resolve dataset_size via _factory._get_dataset_size. Drop the dead duplicate plus its now-unused DatasetSize import so there is one source of truth.

_build_query_id now appends -{size} for every dataset size, including SMALL. Every run is self-describing about the tier it ran against, at the cost of detaching new SMALL run keys from historical SMALL keys written under the bare base. Refactor db_scan_blob_storage and db_scan_postgis to the factory pattern so --dataset-size actually flows into their parquet path / buildings table reference and produces the size-suffixed query_id.

db-scan-blob-storage and db-scan-postgis are out of #272 scope; their refactor will be applied alongside the Issue 5 cleanup. Restore the original SMALL=bare semantics of _build_query_id so historical SMALL run keys remain accessible per #272.

Copilot

Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 1 comment.

jathavaan added 5 commits May 19, 2026 08:58

#272 Add factory helpers for dataset_size and query_id

8eb6c5e

#272 Pull dataset_size from DI in Databricks runner

4cf4e9a

The Databricks runner now resolves dataset_size via the shared _factory helper instead of taking it as a parameter, removing the redundant DatasetSize.SMALL default that node entrypoints had to forward.

jathavaan self-assigned this May 19, 2026

Copilot AI review requested due to automatic review settings May 19, 2026 07:03

jathavaan linked an issue May 19, 2026 that may be closed by this pull request

Refactor surviving query entrypoints to factory pattern and consume dataset_size from DI #272

Closed

10 tasks

Copilot started reviewing on behalf of jathavaan May 19, 2026 07:04 View session

jathavaan enabled auto-merge May 19, 2026 07:05

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/presentation/entrypoints/_factory.py

jathavaan added 2 commits May 19, 2026 09:10

#272 Drop unused _get_dataset_size duplicate from monitor_utils

eebfa8b

The helper in monitor_utils.py had zero callers; entrypoints resolve dataset_size via _factory._get_dataset_size. Drop the dead duplicate plus its now-unused DatasetSize import so there is one source of truth.

Copilot AI review requested due to automatic review settings May 19, 2026 07:13

Copilot started reviewing on behalf of jathavaan May 19, 2026 07:14 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Comment thread src/presentation/entrypoints/_databricks_benchmark_runner.py

jathavaan merged commit 20fa155 into main May 19, 2026
36 checks passed

jathavaan deleted the feature/272-refactor-surviving-query-entrypoints-to-factory-pattern-and-consume-dataset_size-from-di branch May 19, 2026 07:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/272 refactor surviving query entrypoints to factory pattern and consume dataset size from di#277

Feature/272 refactor surviving query entrypoints to factory pattern and consume dataset size from di#277
jathavaan merged 8 commits into
mainfrom
feature/272-refactor-surviving-query-entrypoints-to-factory-pattern-and-consume-dataset_size-from-di

jathavaan commented May 19, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jathavaan commented May 19, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants