feat: default to sort-merge join by andygrove · Pull Request #1651 · apache/datafusion-ballista

andygrove · 2026-05-03T16:00:01Z

Which issue does this PR close?

Closes #1648.

Rationale for this change

DataFusion's hash join has no spill support, so the build side must fit in memory for each task. Ballista executors run multiple tasks per host, so the per-task build sides add up and can OOM the executor under realistic load. For example, the integration test suite (./dev/integration-tests.sh) does not complete on a typical machine without disabling hash joins.

DataFusion defaults to datafusion.optimizer.prefer_hash_join = true, which is fine for a single-node engine but a poor fit for a distributed engine that runs many tasks per executor. This PR flips the Ballista default to sort-merge join, which spills, and leaves users a session-level knob to opt back into hash join when they know the build side fits in memory.
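The memory argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not Ballista code: the function names and the simplifying assumption that every concurrent task holds a build side of the same size are illustrative only.

```rust
/// Worst-case memory held by hash-join build sides on one executor,
/// assuming (for illustration) that every concurrent task holds a
/// build side of the same size and none of them can spill.
fn peak_build_memory_bytes(concurrent_tasks: u64, build_side_bytes: u64) -> u64 {
    concurrent_tasks.saturating_mul(build_side_bytes)
}

/// True when the combined build sides would exceed the executor's
/// memory budget. The budget is a hypothetical input, not a Ballista setting.
fn exceeds_budget(
    concurrent_tasks: u64,
    build_side_bytes: u64,
    executor_budget_bytes: u64,
) -> bool {
    peak_build_memory_bytes(concurrent_tasks, build_side_bytes) > executor_budget_bytes
}
```

Under these assumptions, a build side that fits comfortably for one task (say 2 GiB against a 12 GiB budget) no longer fits once eight tasks run concurrently on the same executor, since the combined requirement grows to 16 GiB.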

What changes are included in this PR?

- Set `datafusion.optimizer.prefer_hash_join = false` inside `ballista_restricted_configuration` (ballista/core/src/extension.rs), alongside the other soft DataFusion-default overrides. This is a soft default: users can override it per session.
- Update the existing `should_support_sort_merge_join` integration test to assert via EXPLAIN that the default picks `SortMergeJoinExec`, instead of relying on a result-row check that would pass under either join algorithm.
- Add `should_support_hash_join_when_opted_in` to verify users can still get `HashJoinExec` after `SET datafusion.optimizer.prefer_hash_join = true`.
- Document the new default and the opt-in path in docs/source/user-guide/tuning-guide.md under a new "Join Strategy" subsection.
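The reason for asserting on EXPLAIN output rather than result rows can be sketched as follows. This is a stand-alone illustration, not the test code from this PR: the `JoinStrategy` enum and `join_strategy` function are hypothetical names, and the only assumption taken from the PR is that the operator names `SortMergeJoinExec` and `HashJoinExec` appear in the plan text.

```rust
/// Which join operator(s) a physical plan's text mentions.
#[derive(Debug, PartialEq, Eq)]
enum JoinStrategy {
    SortMerge,
    Hash,
    Mixed,
    NoJoin,
}

/// Classify a plan by searching its EXPLAIN text for operator names.
/// "SortMergeJoinExec" does not contain "HashJoinExec" as a substring,
/// so the two checks are independent.
fn join_strategy(plan_text: &str) -> JoinStrategy {
    let has_sort_merge = plan_text.contains("SortMergeJoinExec");
    let has_hash = plan_text.contains("HashJoinExec");
    match (has_sort_merge, has_hash) {
        (true, false) => JoinStrategy::SortMerge,
        (false, true) => JoinStrategy::Hash,
        (true, true) => JoinStrategy::Mixed,
        (false, false) => JoinStrategy::NoJoin,
    }
}
```

A check of this shape fails if the planner silently picks the other algorithm, whereas a comparison of result rows passes under either one. The opt-in test would additionally expect that `SortMergeJoinExec` is absent, which here corresponds to `JoinStrategy::Hash` rather than `JoinStrategy::Mixed`.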

Are there any user-facing changes?

Yes, this is a behaviour change. Queries that previously planned a HashJoinExec will now plan a SortMergeJoinExec by default. The old behaviour is one statement away: `SET datafusion.optimizer.prefer_hash_join = true`. The change is documented in the tuning guide. There is no public Rust API change, so the PR carries no api change label.

DataFusion's hash join has no spill support, so each parallel task on an executor must hold the full build side in memory. Default Ballista sessions to sort-merge join (which spills) until DataFusion gains a spilling hash join. Users can opt back in with `SET datafusion.optimizer.prefer_hash_join = true`. See apache#1648

andygrove added 6 commits May 2, 2026 17:32

test(client): assert sort-merge join is the default via EXPLAIN

e42f121

test(client): look up EXPLAIN 'plan' column by name in helper

8fd2dfb

Replace fragile positional `.column(1)` with schema-based `index_of("plan")` and use `col.iter().flatten()` instead of an index loop over non-null values. Drop the now-unneeded `Array` trait import.
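The refactor described in this commit can be illustrated with a stand-alone sketch. The types below are hypothetical stand-ins written for this example, not Arrow's `RecordBatch` or the helper from the PR. They only model the two ideas in the commit message: looking a column up by name instead of by position, and using `iter().flatten()` to visit non-null values.

```rust
/// Stand-in for one EXPLAIN result batch: column names plus
/// nullable string columns (hypothetical, not an Arrow type).
struct ExplainBatch {
    column_names: Vec<String>,
    columns: Vec<Vec<Option<String>>>,
}

impl ExplainBatch {
    /// Name-based lookup, similar in spirit to a schema `index_of` call.
    fn index_of(&self, name: &str) -> Option<usize> {
        self.column_names.iter().position(|n| n == name)
    }
}

/// Concatenate every non-null value of the "plan" column across batches.
fn collect_plan_text(batches: &[ExplainBatch]) -> String {
    let mut out = String::new();
    for batch in batches {
        let idx = batch
            .index_of("plan")
            .expect("EXPLAIN output should have a 'plan' column");
        // `&Option<T>` iterates over zero or one item, so `flatten`
        // skips nulls without an explicit index loop or null check.
        for value in batch.columns[idx].iter().flatten() {
            out.push_str(value);
            out.push('\n');
        }
    }
    out
}
```

Looking the column up by name keeps the helper working if the column order of the EXPLAIN output ever changes, which a positional `.column(1)` would not.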

test(client): cover hash-join opt-in via prefer_hash_join=true

6dce482

test(client): assert SortMergeJoinExec is absent after hash-join opt-in

112eb9e

docs(tuning): document sort-merge join as the new default

3038b23

github-actions bot added the "documentation" label (Improvements or additions to documentation) on May 3, 2026

andygrove wants to merge 6 commits into apache:main from andygrove:feat/default-sort-merge-join
