chore: Consolidate TPC benchmark scripts #3538

Open
andygrove wants to merge 6 commits into apache:main from andygrove:consolidate-benchmark-scripts

Conversation

@andygrove
Member

Summary

  • Consolidate individual per-engine shell scripts (spark-tpch.sh, comet-tpcds.sh, etc.) into a single Python runner (benchmarks/tpc/run.py) driven by TOML engine configs in engines/
  • Rename create-iceberg-tpch.py to create-iceberg-tables.py with a --benchmark {tpch,tpcds} flag to support converting both TPC-H and TPC-DS Parquet data to Iceberg tables
  • Add check_benchmark_env() in the runner to validate benchmark-specific env vars (TPCH_QUERIES / TPCDS_QUERIES, etc.) and default ICEBERG_DATABASE to the benchmark name
  • Remove hardcoded TPC-H assumptions from comet-iceberg.toml so it works for both benchmarks
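The env-var check described in the summary might look roughly like the following minimal sketch. The names `check_benchmark_env`, `TPCH_QUERIES`/`TPCDS_QUERIES`, and `ICEBERG_DATABASE` come from the PR description; the actual logic in `run.py` may differ.

```python
import os
import sys

def check_benchmark_env(benchmark: str) -> None:
    """Validate benchmark-specific env vars (illustrative sketch only).

    Per the PR summary: require the query-directory variable for the
    chosen benchmark and default ICEBERG_DATABASE to the benchmark name.
    """
    queries_var = f"{benchmark.upper()}_QUERIES"  # e.g. TPCH_QUERIES
    if not os.environ.get(queries_var):
        sys.exit(f"error: {queries_var} must be set for --benchmark {benchmark}")
    # Default the Iceberg database to the benchmark name if unset.
    os.environ.setdefault("ICEBERG_DATABASE", benchmark)
```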

Test plan

  • python3 run.py --engine comet-iceberg --benchmark tpch --dry-run produces correct command
  • python3 run.py --engine comet-iceberg --benchmark tpcds --dry-run produces correct command with --database tpcds and TPC-DS executor settings
  • python3 create-iceberg-tables.py --help shows both tpch and tpcds choices
  • Other engines (spark, comet, gluten, blaze) still work for both benchmarks

🤖 Generated with Claude Code

andygrove and others added 2 commits February 16, 2026 14:58
Replace 9 per-engine shell scripts with a single `run.py` that loads
per-engine TOML config files. This eliminates duplicated Spark conf
boilerplate and makes it easier to add new engines or modify shared
settings.

Usage: `python3 run.py --engine comet --benchmark tpch [--dry-run]`

Also moves benchmarks from `dev/benchmarks/` to `benchmarks/tpc/` and
updates all documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename create-iceberg-tpch.py to create-iceberg-tables.py with --benchmark flag
  supporting both tpch and tpcds table sets
- Remove hardcoded TPCH_QUERIES from comet-iceberg.toml required env vars
- Remove hardcoded ICEBERG_DATABASE default of "tpch" from comet-iceberg.toml
- Add check_benchmark_env() in run.py to validate benchmark-specific env vars
  and default ICEBERG_DATABASE to the benchmark name
- Update README with TPC-DS Iceberg table creation examples
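The `--benchmark` flag on the renamed script could be declared along these lines. Only the flag name and its `tpch`/`tpcds` choices are taken from the PR; the description text is a placeholder.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI sketch for create-iceberg-tables.py (hypothetical wording)."""
    parser = argparse.ArgumentParser(
        description="Convert TPC Parquet data to Iceberg tables")
    parser.add_argument("--benchmark", choices=["tpch", "tpcds"], required=True,
                        help="which TPC table set to create")
    return parser
```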

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove changed the title from "Consolidate TPC benchmark scripts and add TPC-DS Iceberg support" to "Consolidate TPC benchmark scripts [WIP]" on Feb 16, 2026
@andygrove andygrove changed the title from "Consolidate TPC benchmark scripts [WIP]" to "chore: Consolidate TPC benchmark scripts [WIP]" on Feb 16, 2026
@andygrove andygrove changed the title from "chore: Consolidate TPC benchmark scripts [WIP]" to "chore: Consolidate TPC benchmark scripts" on Feb 17, 2026
@andygrove andygrove marked this pull request as ready for review February 17, 2026 02:16
@andygrove andygrove marked this pull request as draft February 17, 2026 02:27
andygrove and others added 2 commits February 17, 2026 06:22
The script now configures the Iceberg catalog via SparkSession.builder
instead of requiring --conf flags on the spark-submit command line. This
adds --warehouse as a required CLI arg, makes --catalog optional
(default: local), and validates paths with clear error messages before
starting Spark.
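Moving the catalog setup into `SparkSession.builder` amounts to supplying the Iceberg catalog conf entries in code. A sketch of what those entries might be, using the standard Iceberg-on-Spark keys (the `local` default and `--warehouse` argument are from the commit message; the exact values in the script may differ):

```python
def iceberg_catalog_conf(warehouse: str, catalog: str = "local") -> dict:
    """Spark conf entries for a Hadoop-type Iceberg catalog (illustrative).

    These keys are the documented Iceberg-on-Spark settings; in the script
    they would be applied via SparkSession.builder.config(k, v).
    """
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog}":
            "org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog}.type": "hadoop",
        f"spark.sql.catalog.{catalog}.warehouse": warehouse,
    }
```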

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove
Member Author

@mbutrovich @comphead I have finished testing this PR and it is now ready for review.

@andygrove andygrove marked this pull request as ready for review February 17, 2026 13:34