chore: Consolidate TPC benchmark scripts#3538
Open
andygrove wants to merge 6 commits into apache:main
Conversation
Replace 9 per-engine shell scripts with a single `run.py` that loads per-engine TOML config files. This eliminates duplicated Spark conf boilerplate and makes it easier to add new engines or modify shared settings.

Usage: `python3 run.py --engine comet --benchmark tpch [--dry-run]`

Also moves benchmarks from `dev/benchmarks/` to `benchmarks/tpc/` and updates all documentation references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Rename create-iceberg-tpch.py to create-iceberg-tables.py with a --benchmark flag supporting both tpch and tpcds table sets
- Remove hardcoded TPCH_QUERIES from comet-iceberg.toml required env vars
- Remove hardcoded ICEBERG_DATABASE default of "tpch" from comet-iceberg.toml
- Add check_benchmark_env() in run.py to validate benchmark-specific env vars and default ICEBERG_DATABASE to the benchmark name
- Update README with TPC-DS Iceberg table creation examples

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
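A minimal sketch of what `check_benchmark_env()` might look like. Only the behavior comes from the commit message (validate benchmark-specific env vars, default `ICEBERG_DATABASE` to the benchmark name); the `REQUIRED` mapping and the error message are hypothetical.

```python
# Illustrative sketch, not the PR's exact code.
REQUIRED = {
    "tpch": ["TPCH_QUERIES"],
    "tpcds": ["TPCDS_QUERIES"],
}


def check_benchmark_env(benchmark: str, env: dict) -> dict:
    # Fail fast if a benchmark-specific variable is missing.
    missing = [v for v in REQUIRED[benchmark] if v not in env]
    if missing:
        raise SystemExit(f"missing required env vars: {', '.join(missing)}")
    # Default the Iceberg database name to the benchmark name
    # (tpch or tpcds) when the user has not set it explicitly.
    env.setdefault("ICEBERG_DATABASE", benchmark)
    return env
```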
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The script now configures the Iceberg catalog via SparkSession.builder instead of requiring --conf flags on the spark-submit command line. This adds --warehouse as a required CLI arg, makes --catalog optional (default: local), and validates paths with clear error messages before starting Spark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
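The up-front validation described here could look roughly like this. The `--warehouse` and `--catalog` flags come from the commit message; everything else is an assumption, and the `SparkSession.builder` calls are shown only as comments since they require a Spark installation.

```python
# Hedged sketch of CLI parsing and path validation before Spark starts.
import argparse
import os
import sys


def parse_args(argv):
    p = argparse.ArgumentParser()
    p.add_argument("--warehouse", required=True,
                   help="Iceberg warehouse directory (required)")
    p.add_argument("--catalog", default="local",
                   help="catalog name (default: local)")
    args = p.parse_args(argv)
    # Validate with a clear error message instead of letting Spark
    # fail later with an opaque stack trace.
    if not os.path.isdir(args.warehouse):
        sys.exit(f"error: warehouse path does not exist: {args.warehouse}")
    return args


# The validated values would then feed SparkSession.builder, e.g.:
# spark = (SparkSession.builder
#     .config(f"spark.sql.catalog.{args.catalog}",
#             "org.apache.iceberg.spark.SparkCatalog")
#     .config(f"spark.sql.catalog.{args.catalog}.warehouse", args.warehouse)
#     .getOrCreate())
```

This keeps the spark-submit command line free of `--conf` flags, as the commit message describes.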
Member
Author
@mbutrovich @comphead I have finished testing this PR and it is now ready for review
Summary
- Consolidate 9 per-engine shell scripts (spark-tpch.sh, comet-tpcds.sh, etc.) into a single Python runner (benchmarks/tpc/run.py) driven by TOML engine configs in engines/
- Rename create-iceberg-tpch.py to create-iceberg-tables.py with a --benchmark {tpch,tpcds} flag to support converting both TPC-H and TPC-DS Parquet data to Iceberg tables
- Add check_benchmark_env() in the runner to validate benchmark-specific env vars (TPCH_QUERIES/TPCDS_QUERIES, etc.) and default ICEBERG_DATABASE to the benchmark name
- Generalize comet-iceberg.toml so it works for both benchmarks

Test plan
- python3 run.py --engine comet-iceberg --benchmark tpch --dry-run produces the correct command
- python3 run.py --engine comet-iceberg --benchmark tpcds --dry-run produces the correct command with --database tpcds and TPC-DS executor settings
- python3 create-iceberg-tables.py --help shows both tpch and tpcds choices
- Other engines (spark, comet, gluten, blaze) still work for both benchmarks

🤖 Generated with Claude Code