Feature/281 refactor test matrix #282
Merged
Add the apples-to-apples Sedona `default` join-strategy variant (no
broadcast hint, no Sedona partitioner config; Spark CBO picks the plan)
and extend the existing broadcast and partitioned variants to 12- and
16-node clusters.
- src/config.py: add DATABRICKS_{LOCAL_SCRIPT,WORKSPACE_NOTEBOOK}_PATH_DEFAULT
- IDatabricksService Literal["broadcast","partitioned"] -> include "default"
- DatabricksService NotebookVariant + dispatcher helpers extended
- _databricks_benchmark_runner.NotebookVariant extended
- new notebook src/presentation/databricks/national_scale_spatial_join_default.py
- new entrypoints:
  - national_scale_spatial_join_databricks_default_{2,4,8,12,16}_nodes
  - national_scale_spatial_join_databricks_broadcast_{12,16}_nodes
  - national_scale_spatial_join_databricks_partitioned_{12,16}_nodes
- wiring updates in entrypoints/__init__.py, app_config.py, benchmark_runner.py
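The typing and dispatch change listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `NotebookVariant` and the `default` notebook path come from the commit message, while the `resolve_local_script_path` helper, the lookup table, and the `broadcast`/`partitioned` file names are assumptions made for the example.

```python
from typing import Literal, get_args

# Variant names from the commit message; "default" is the newly added one.
NotebookVariant = Literal["broadcast", "partitioned", "default"]

# Hypothetical lookup table. Only the "default" path appears in the PR;
# the other two file names are assumed for illustration.
_LOCAL_SCRIPT_PATHS = {
    "broadcast": "src/presentation/databricks/national_scale_spatial_join_broadcast.py",
    "partitioned": "src/presentation/databricks/national_scale_spatial_join_partitioned.py",
    "default": "src/presentation/databricks/national_scale_spatial_join_default.py",
}


def resolve_local_script_path(variant: NotebookVariant) -> str:
    """Return the local notebook script for a variant (hypothetical helper)."""
    # Literal types are not enforced at runtime, so validate explicitly.
    if variant not in get_args(NotebookVariant):
        raise ValueError(f"unknown notebook variant: {variant!r}")
    return _LOCAL_SCRIPT_PATHS[variant]
```

Keeping the variant set in a single `Literal` means a type checker flags any caller that passes a name the dispatcher does not handle, and the runtime check covers untyped callers.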
Activate the consolidated experiment matrix: drop attribute-spatial compound filter and the medium tier from RQ1, omit the 2-node row at small for the broadcast and partitioned strategies in RQ2, and add 11 default-strategy rows plus 4 large-tier extension rows. Total stays at 52 experiments (15 RQ1 + 37 RQ2); 6 RQ1 + 34 RQ2 = 40 pair groups.
- delete 3 attribute_spatial_compound_filter_* entrypoints + wiring
- benchmarks.yml: -7 compound, -6 medium, -2 small-2-node, +11 default, +4 large
- docker-compose.yml: -3 compound services, +9 sedona services
- pull-request-tests.yml + push-containers-to-acr.yml matrices updated
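The counts in this commit message can be checked with a few lines of arithmetic. The sketch below only restates the numbers quoted above; the dictionary keys are labels chosen for the example, not identifiers from `benchmarks.yml`.

```python
# Row deltas quoted for benchmarks.yml in the commit message.
row_deltas = {
    "compound": -7,
    "medium": -6,
    "small_2_node": -2,
    "default": +11,
    "large": +4,
}

# Experiment and pair-group counts per research question, as quoted.
experiments = {"RQ1": 15, "RQ2": 37}
pair_groups = {"RQ1": 6, "RQ2": 34}

net_change = sum(row_deltas.values())
total_experiments = sum(experiments.values())
total_pair_groups = sum(pair_groups.values())

print(net_change, total_experiments, total_pair_groups)  # 0 52 40
```

A net change of zero is what lets the total stay at 52 even though 15 rows are removed and 15 are added.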
Drop 20 entrypoints that were carried only as commented-out stubs in
benchmarks.yml, docker-compose.yml and the CI workflow matrices. None
were active in the current matrix, and now that the test design has
stabilised they no longer serve as an escape hatch.
- db-scan-{blob-storage,postgis}
- bbox-filtering-{simple-local,simple-blob-storage,advanced-duckdb,advanced-postgis}
- bbox-filtering-result-set-sizes-{municipality,county}-{duckdb,local,postgis}
- vector-tiles-{single-tile,100k}-{pmtiles,vmt}
- spatial-aggregation-grid-{duckdb,postgis}
- ordered-range-query-{duckdb,postgis}
Touches: entrypoint files (deleted), entrypoints/__init__.py, app_config.py
wiring list, benchmark_runner.py imports + cases, benchmarks.yml +
docker-compose.yml + both CI workflow files (commented-out blocks removed).
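The brace patterns listed above expand to exactly 20 names. A small sketch of that expansion follows; the `expand` helper and the tuple layout are assumptions made for the example, and only the name fragments come from the commit message.

```python
from itertools import product

# Each entry: (prefix, list of alternative groups), mirroring the brace patterns.
patterns = [
    ("db-scan", [["blob-storage", "postgis"]]),
    ("bbox-filtering", [["simple-local", "simple-blob-storage",
                         "advanced-duckdb", "advanced-postgis"]]),
    ("bbox-filtering-result-set-sizes", [["municipality", "county"],
                                         ["duckdb", "local", "postgis"]]),
    ("vector-tiles", [["single-tile", "100k"], ["pmtiles", "vmt"]]),
    ("spatial-aggregation-grid", [["duckdb", "postgis"]]),
    ("ordered-range-query", [["duckdb", "postgis"]]),
]


def expand(prefix, groups):
    """Expand one brace pattern into its concrete hyphenated names."""
    return ["-".join((prefix, *combo)) for combo in product(*groups)]


removed = [name for prefix, groups in patterns for name in expand(prefix, groups)]
print(len(removed))  # 20
```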
- RQ1 matrix: drop compound-filter row and medium-tier columns
- RQ2 matrix: add Sedona default-strategy row, extend broadcast/partitioned to 12/16 nodes, note the small/2-node omission
- Pair-group count: 33 -> 40 (Sedona singletons inclusive)
- Engine table: list 5 Databricks cluster sizes
- Quota note: target >=72 vCPU in Sweden Central; mention 12-node hedge
- Databricks lifecycle section: list all three notebook variants
- Research-gaps text: drop removed query patterns from the catalog
- script-id flag example: refresh removed db-scan reference
Closed
39 tasks
Pull request overview
This PR refactors the benchmark/test matrix to match the currently active experiment suite by removing deprecated/out-of-scope query entrypoints and CI matrix entries, and expanding the Databricks/Sedona national-scale spatial-join benchmarks to include a new default strategy plus additional cluster sizes.
Changes:
- Removed multiple deprecated benchmark entrypoints (db scans, vector tiles, spatial aggregation, ordered range query, attribute+spatial compound filter, and various bbox variants) and their dispatch/import wiring.
- Added Databricks/Sedona national-scale spatial-join entrypoints for the `default` strategy and extended `broadcast`/`partitioned` to 12/16-node clusters, plus a new Databricks notebook variant for `default`.
- Updated `benchmarks.yml`, `docker-compose.yml`, GitHub Actions workflows, and README documentation to reflect the new/trimmed matrix.
Reviewed changes
Copilot reviewed 45 out of 45 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/presentation/entrypoints/vector_tiles_single_tile_vmt.py | Removed deprecated vector-tile benchmark entrypoint |
| src/presentation/entrypoints/vector_tiles_single_tile_pmtiles.py | Removed deprecated vector-tile benchmark entrypoint |
| src/presentation/entrypoints/vector_tiles_100k_vmt.py | Removed deprecated vector-tile benchmark entrypoint |
| src/presentation/entrypoints/vector_tiles_100k_pmtiles.py | Removed deprecated vector-tile benchmark entrypoint |
| src/presentation/entrypoints/spatial_aggregation_grid_postgis.py | Removed deprecated spatial-aggregation entrypoint |
| src/presentation/entrypoints/spatial_aggregation_grid_duckdb.py | Removed deprecated spatial-aggregation entrypoint |
| src/presentation/entrypoints/ordered_range_query_postgis.py | Removed deprecated ordered-range-query entrypoint |
| src/presentation/entrypoints/ordered_range_query_duckdb.py | Removed deprecated ordered-range-query entrypoint |
| src/presentation/entrypoints/db_scan_postgis.py | Removed deprecated db-scan entrypoint |
| src/presentation/entrypoints/db_scan_blob_storage.py | Removed deprecated db-scan entrypoint |
| src/presentation/entrypoints/bbox_filtering_simple_local.py | Removed deprecated bbox variant entrypoint |
| src/presentation/entrypoints/bbox_filtering_simple_blob_storage.py | Removed deprecated bbox variant entrypoint |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_municipality_postgis.py | Removed deprecated bbox result-set-size entrypoint |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_municipality_local.py | Removed deprecated bbox result-set-size entrypoint |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_municipality_duckdb.py | Removed deprecated bbox result-set-size entrypoint |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_county_postgis.py | Removed deprecated bbox result-set-size entrypoint |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_county_local.py | Removed deprecated bbox result-set-size entrypoint |
| src/presentation/entrypoints/bbox_filtering_result_set_sizes_county_duckdb.py | Removed deprecated bbox result-set-size entrypoint |
| src/presentation/entrypoints/bbox_filtering_advanced_postgis.py | Removed deprecated bbox advanced entrypoint |
| src/presentation/entrypoints/bbox_filtering_advanced_duckdb.py | Removed deprecated bbox advanced entrypoint |
| src/presentation/entrypoints/attribute_spatial_compound_filter_postgis.py | Removed deprecated attribute+spatial compound filter entrypoint |
| src/presentation/entrypoints/attribute_spatial_compound_filter_local.py | Removed deprecated attribute+spatial compound filter entrypoint |
| src/presentation/entrypoints/attribute_spatial_compound_filter_duckdb.py | Removed deprecated attribute+spatial compound filter entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_partitioned_16_nodes.py | Added Databricks partitioned 16-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_partitioned_12_nodes.py | Added Databricks partitioned 12-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_default_8_nodes.py | Added Databricks default 8-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_default_4_nodes.py | Added Databricks default 4-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_default_2_nodes.py | Added Databricks default 2-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_default_16_nodes.py | Added Databricks default 16-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_default_12_nodes.py | Added Databricks default 12-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_broadcast_16_nodes.py | Added Databricks broadcast 16-worker entrypoint |
| src/presentation/entrypoints/national_scale_spatial_join_databricks_broadcast_12_nodes.py | Added Databricks broadcast 12-worker entrypoint |
| src/presentation/entrypoints/_databricks_benchmark_runner.py | Extended notebook-variant typing to include default |
| src/presentation/entrypoints/__init__.py | Updated exported entrypoints (remove deprecated, add new Databricks variants) |
| src/presentation/databricks/national_scale_spatial_join_default.py | Added new Databricks notebook variant implementing “default” strategy |
| src/presentation/configuration/app_config.py | Updated DI wiring module list to match active entrypoints |
| src/infra/infrastructure/services/databricks_service.py | Added default notebook-variant routing to script/workspace paths |
| src/config.py | Added Config paths for the default Databricks notebook |
| src/application/contracts/databricks_service_interface.py | Updated interface typing/docs to include default notebook variant |
| README.md | Updated supported patterns and matrix/Databricks strategy documentation |
| docker-compose.yml | Removed deprecated benchmark services; added new Databricks variants |
| benchmarks.yml | Updated experiment definitions to match refactored matrix |
| benchmark_runner.py | Removed dispatch cases/imports for deprecated benchmarks; added new Databricks variants |
| .github/workflows/push-containers-to-acr.yml | Updated build/push matrix to match active services |
| .github/workflows/pull-request-tests.yml | Updated PR build matrix to match active services |
This pull request updates the README and CI/CD workflow files to reflect the current set of supported query patterns and Databricks cluster configurations, aligning documentation and automation with the active benchmark suite. The main changes involve removing references to deprecated or out-of-scope query types, updating the cluster sizes and join strategies for Apache Sedona on Databricks, and ensuring the workflows only include the relevant services.
Workflow and Service Updates:
- Removed the `attribute-spatial-compound-filter` services (DuckDB, PostGIS, and Shapefile variants) from both the pull request test workflow (`pull-request-tests.yml`) and the container push workflow (`push-containers-to-acr.yml`), as these query patterns are no longer supported.
- Added the Apache Sedona on Databricks services for three join strategies (`default`, `broadcast`, `partitioned`) at five cluster sizes (2, 4, 8, 12, and 16 nodes). These are now included in both workflows.
Documentation Updates (`README.md`):
- Documented the three join strategies (`default`, `broadcast`, `partitioned`) and their roles in the benchmarks.
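The strategy-by-size grid described here can be enumerated as a quick sanity check. The sketch below uses the entrypoint naming pattern from the commit messages; the `entrypoint_name` helper is hypothetical, and the assumption that `broadcast` and `partitioned` already existed at 2, 4 and 8 nodes is inferred from the phrase "extend the existing broadcast and partitioned variants".

```python
STRATEGIES = ("default", "broadcast", "partitioned")
CLUSTER_SIZES = (2, 4, 8, 12, 16)


def entrypoint_name(strategy: str, nodes: int) -> str:
    """Build an entrypoint name following the pattern used in the PR."""
    return f"national_scale_spatial_join_databricks_{strategy}_{nodes}_nodes"


# Full grid: three strategies at five cluster sizes.
all_entrypoints = [entrypoint_name(s, n) for s in STRATEGIES for n in CLUSTER_SIZES]

# Entrypoints new in this PR: every "default" size, plus 12/16 nodes for the
# two strategies assumed to exist already at 2, 4 and 8 nodes.
new_entrypoints = [
    entrypoint_name(s, n)
    for s in STRATEGIES
    for n in CLUSTER_SIZES
    if s == "default" or n in (12, 16)
]

print(len(all_entrypoints), len(new_entrypoints))  # 15 9
```

The nine new names line up with the "+9 sedona services" figure quoted for docker-compose.yml.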