
Split DistributedPhysicalOptimizerRule into multiple, more scoped ones.#357

Open
gabotechs wants to merge 4 commits into main from gabrielmusat/multiple-rule-split

Conversation

@gabotechs
Collaborator

@gabotechs gabotechs commented Feb 24, 2026

Closes #347

This PR ships two main things:

Split the DistributedPhysicalOptimizerRule into multiple, more scoped ones

Previously, all the planning in the project was done in one rule. Now, it's split into multiple, more composable ones, so that users are free to insert their own rules in the middle:

StartDistributedContext
AddCoalesceOnTop
InsertBroadcast
InjectNetworkBoundaryPlaceholders
ApplyNetworkBoundaries
BatchCoalesceBelowBoundaries
EndDistributedContext

The big bulk of the work happens in:

  • InjectNetworkBoundaryPlaceholders, which injects network boundary placeholders in the middle of the plan; these just contain information about the network boundary type and the number of input tasks.
  • ApplyNetworkBoundaries, which reads the network boundary placeholders and injects the actual network boundaries.
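To make the two-phase flow concrete, here is a minimal stand-in sketch. Every name here (`Node`, `Kind`, `inject`, `apply`) is hypothetical; the real rules operate on DataFusion `ExecutionPlan` trees, not this toy enum:

```rust
// Toy model of the two-phase planning: phase 1 injects placeholders that
// only record the boundary kind and input task count; phase 2 resolves
// them into concrete network boundary nodes.

#[derive(Debug, Clone, PartialEq)]
enum Node {
    Scan,
    Repartition(Box<Node>),
    // Planning-time placeholder: just a kind and an input task count.
    Placeholder { kind: Kind, input_task_count: usize, input: Box<Node> },
    // What an ApplyNetworkBoundaries-style pass would produce.
    NetworkShuffle { input_task_count: usize, input: Box<Node> },
}

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Coalesce, Shuffle, Broadcast }

// Phase 1: place a Shuffle placeholder right above every Repartition node.
fn inject(node: Node, tasks: usize) -> Node {
    match node {
        Node::Repartition(input) => Node::Placeholder {
            kind: Kind::Shuffle,
            input_task_count: tasks,
            input: Box::new(Node::Repartition(Box::new(inject(*input, tasks)))),
        },
        other => other, // toy: no recursion into other node kinds
    }
}

// Phase 2: resolve Shuffle placeholders into concrete boundaries.
fn apply(node: Node) -> Node {
    match node {
        Node::Placeholder { kind: Kind::Shuffle, input_task_count, input } =>
            Node::NetworkShuffle { input_task_count, input: Box::new(apply(*input)) },
        other => other, // toy: other kinds and node types left untouched
    }
}

fn main() {
    let plan = Node::Repartition(Box::new(Node::Scan));
    let resolved = apply(inject(plan, 4));
    assert!(matches!(resolved, Node::NetworkShuffle { input_task_count: 4, .. }));
}
```

The point of the split is visible in the seam between the two functions: anything that runs between `inject` and `apply` sees only placeholders, not concrete boundaries.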

Expose the NetworkBoundaryPlaceholder as part of the public API

In #347, we discussed exposing the annotations system as part of the public API, and apache/datafusion#20396 upstream aims to provide such a public API in vanilla DataFusion.

While that's powerful, after some consideration, this project would have a hard time exposing a stable API if the full power of the annotations system were exposed, so another approach is attempted instead:

  • The whole annotation system (the AnnotatedPlan struct, etc.) is made private; it is not even exposed outside the rule_4_inject_network_boundary_placeholders.rs file, so it's private even within this project itself.
  • The new API for customizing distributed plans is given in the form of NetworkBoundaryPlaceholders, which are planning-time ExecutionPlan implementations with a very narrow API:
    • The NetworkBoundary kind to inject (coalesce, shuffle, or broadcast)
    • The input task count of the stage below

The chain of optimizer rules uses this same NetworkBoundaryPlaceholder API for injecting its network boundaries, which is what allows making AnnotatedPlan private even within the project, and at the same time allows users to inject their own rules between InjectNetworkBoundaryPlaceholders and ApplyNetworkBoundaries, or even replace InjectNetworkBoundaryPlaceholders altogether with their own custom implementation.

This test demonstrates how to use the NetworkBoundaryPlaceholder API for customizing a distributed plan:
https://github.com/datafusion-contrib/datafusion-distributed/blob/2dd97539bf736e83c1697b763c5ba1c724a1e69d/tests/custom_network_boundary_placeholders.rs

@gabrielkerr
Contributor

Thank you for making this! Looking to review this over the next day.

@gabotechs gabotechs force-pushed the gabrielmusat/multiple-rule-split branch from 070287d to 4512cb1 Compare February 25, 2026 06:31
Contributor

@gabrielkerr gabrielkerr left a comment


I like where this is headed! Looking forward to your thoughts on the feedback.

/// Note that there are restrictions around where can the [NetworkBoundaryPlaceholder]s be placed,
/// for example:
/// - A [NetworkBoundaryKind::Broadcast] needs to be placed right above a `BroadcastExec` node.
/// - A [NetworkBoundaryKind::Shuffle] needs to be placed right about a `RepartitionExec` node.
Contributor

Suggested change
/// - A [NetworkBoundaryKind::Shuffle] needs to be placed right about a `RepartitionExec` node.
/// - A [NetworkBoundaryKind::Shuffle] needs to be placed right above a `RepartitionExec` node.

/// where should the network boundaries be placed, and what task count should be used for the
/// stage below.
///
/// Note that there are restrictions around where can the [NetworkBoundaryPlaceholder]s be placed,
Contributor

Suggested change
/// Note that there are restrictions around where can the [NetworkBoundaryPlaceholder]s be placed,
/// Note that there are restrictions around where the [NetworkBoundaryPlaceholder]s can be placed,

pub kind: NetworkBoundaryKind,
/// The task count for the input stage of this network boundary.
///
/// Note that the task count for this network boundary is decided by the other network boundary
Contributor

If this placeholder sits between Stage N (below) and Stage N+1 (above):

Field                                      Refers to
input                                      The ExecutionPlan for Stage N
input_task_count                           Task count for Stage N
"task count for this network boundary"     Task count for Stage N+1 (output side)
"boundary immediately above"               The boundary between Stage N+1 and N+2

Could we clarify the comments with explicit stage references?

  • input_task_count: The number of tasks in Stage N (the stage below this boundary that executes input)
  • The task count for Stage N+1 (this boundary's output/consumer stage) is determined by the input_task_count of the
    NetworkBoundaryPlaceholder above this one, if any.

Collaborator Author

In this case input_task_count is the task count for Stage N+1, not Stage N. Network boundaries have no say in the amount of tasks they run on, they just control the amount of tasks below them.
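A toy restatement of these semantics (`stage_task_counts` is a hypothetical helper for illustration, not crate code):

```rust
// Toy restatement: boundaries are listed top-to-bottom, and each one's
// input_task_count fixes the task count of the stage immediately *below*
// it. A boundary never decides the task count of the stage it feeds into:
// that is set by the next boundary up (defaulting to 1 at the very top).
fn stage_task_counts(input_task_counts_top_down: &[usize]) -> Vec<usize> {
    let mut counts = vec![1]; // topmost stage, above all boundaries
    counts.extend_from_slice(input_task_counts_top_down);
    counts
}

fn main() {
    // Upper boundary reads from 4 tasks, lower boundary from 8:
    // stages top-down run with 1, 4, and 8 tasks respectively.
    assert_eq!(stage_task_counts(&[4, 8]), vec![1, 4, 8]);
}
```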

{
return Ok(Transformed::yes(Arc::new(NetworkBoundaryPlaceholder {
kind: NetworkBoundaryKind::Coalesce,
input_task_count: nb.input_task_count.div_ceil(2),
Contributor

Related to my comment on NetworkBoundaryPlaceholder above, task_count/2 is the number of tasks the Arc::new(NetworkBoundaryPlaceholder {...}) is expecting to receive from the (now) child coalesce boundary? If so, maybe it makes more sense to rename input_task_count to output_task_count?

Please correct me if I misunderstand 😄

Collaborator Author

🤔 I think input_task_count really is the name we are looking for. It's the amount of tasks that will input data to the network boundary.

Collaborator Author

Maybe with the comment above it's clearer?

use std::sync::Arc;

#[tokio::test]
async fn custom_network_boundary_placeholder() -> Result<(), Box<dyn Error>> {
Contributor

This is very helpful, thank you for adding this!

self
}

fn set_distributed_physical_optimizer_rules(&mut self) {
Contributor

It would be nice to have a method for inserting custom rules similarly to how you've done insert_before in the test. Could we have something like below? I imagine that it would automatically add rules between rules 4 and 5 since that is the intended order described in your comments.

fn with_custom_network_boundary_rules(
      self,
      rules: impl IntoIterator<Item = Arc<dyn PhysicalOptimizerRule + Send + Sync>>
  ) -> Self;

Used like

  let builder = SessionStateBuilder::new()
      .with_distributed_physical_optimizer_rules()
      .with_custom_network_boundary_rules([Arc::new(TreeReductionRule::new(type_checker))]);

Thoughts?

Collaborator Author

Yeah, I think this might be worth it

plan: Arc<dyn ExecutionPlan>,
_: &ConfigOptions,
) -> Result<Arc<dyn ExecutionPlan>> {
plan.transform_up(|plan| {
Contributor

nit: If this is in the context of custom distributed rules, should this call Distributed::ensure? Even a comment describing that it should be called here might be enough to show the intended pattern if that is difficult to do in this test.

Collaborator Author

🤔 let's discuss the approach first and see if it makes sense to have a DistributedContext at all.

/// free to place their own, either by having more rules in between, or straight away replacing the
/// `InjectNetworkBoundaryPlaceholders` with a custom one.
#[derive(Debug)]
pub struct ApplyNetworkBoundaries;
Contributor

Checking my understanding here:
Before rule_5, the user can insert custom PhysicalOptimizerRules. These custom rules can insert NetworkBoundaryPlaceholders and any ExecutionPlan nodes it likes. Then, rule_5 will convert the network boundary placeholders to the appropriate network boundary exec and leave all other ExecutionPlan nodes untouched. Do I have that right?

Collaborator Author

Do I have that right?

Mostly yes, but converting to the appropriate network boundary exec does perform further modifications to the plans.

For example, NetworkShuffleExec will scale up the output partitions of its RepartitionExec below.
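For illustration only, one plausible reading of "scaling up the output partitions" in numbers. `scaled_partitions` and the exact formula are assumptions for the sketch, not the crate's actual implementation:

```rust
// Hypothetical illustration: when a shuffle boundary is materialized, the
// RepartitionExec below it may need its partition count scaled so that
// every consumer task gets its own share of the partitions.
fn scaled_partitions(partitions_per_task: usize, consumer_tasks: usize) -> usize {
    partitions_per_task * consumer_tasks
}

fn main() {
    // 3 partitions per consumer task, 4 consumer tasks -> 12 output partitions.
    assert_eq!(scaled_partitions(3, 4), 12);
}
```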

mod rule_6_batch_coalesce_below_boundaries;
mod rule_7_end_distributed_context;

pub use rule_1_start_distributed_context::StartDistributedContext;
Contributor

I like that you've split up the rule into smaller rules with clear responsibilities. However, I feel like splitting this up into 7 different rules has introduced a lot of complexity when there is really a single extension point between rules 4 and 5. I think swinging back towards a single optimizer rule makes more sense. I'll try to explain my thoughts below.

The current interface places the responsibility on the user to add custom physical optimizer rules between rules 4 and 5. Users should only worry about creating their custom rule, with the library inserting it in the correct position between rules 4 and 5.

I think we can keep the seven rules, but hide them (make them private to the crate) from the user. I think a cleaner interface would be to wrap these 7 rules in a single DistributedPhysicalOptimizerRule and add a single method for adding custom rules. This abstracts away the non-customizable details from the user and delegates inserting the custom rules in the correct place to the library.

Maybe something like this:

pub struct DistributedPhysicalOptimizerRule {
    custom_boundary_rules: Vec<Arc<dyn PhysicalOptimizerRule + Send + Sync>>,
}

impl DistributedPhysicalOptimizerRule {
    pub fn new() -> Self {
        Self { custom_boundary_rules: vec![] }
    }

    pub fn with_custom_boundary_rule(
        mut self,
        rule: impl PhysicalOptimizerRule + Send + Sync + 'static
    ) -> Self {
        self.custom_boundary_rules.push(Arc::new(rule));
        self
    }
}

The internal optimize() would then:

  1. Wrap in DistributedContext (current rule 1)
  2. Add coalesce on top (current rule 2)
  3. Insert broadcast (current rule 3)
  4. Run each custom boundary rule in order
  5. Apply network boundaries (current rule 5)
  6. Batch coalesce below boundaries (current rule 6)
  7. Unwrap DistributedContext (current rule 7)

You'll notice I've replaced rule 4 with the custom rules. The current rule 4 adds some placeholders that the user may not want, introducing complexity for removing these boundaries if the user wants to completely customize the network boundary placeholders from the start. In this case I think rule 4 is a reasonable default, but should be able to be replaced (or added to I suppose) by the users custom rules.

Collaborator Author

However, I feel like splitting this up into 7 different rules has introduced a lot of complexity when there is really a single extension point between rules 4 and 5

Yes... realistically speaking, I'd only expect users to either replace rule 4 with their own, or inject an additional one between 4 and 5.

I think we can keep the seven rules, but hide them (make them private to the crate) from the user. I think a cleaner interface would be to wrap these 7 rules in a single DistributedPhysicalOptimizerRule and add a single method for adding custom rules

👍 this sounds reasonable. It would significantly simplify the code, as we wouldn't need to do that DistributedContext workaround for propagating the original single-node plan.

Contributor

@kurtvolmar kurtvolmar left a comment

Thank you for tackling this. A concern I have is that a custom PhysicalOptimizerRule using the NetworkBoundaryPlaceholder paradigm will likely collide with InjectNetworkBoundaryPlaceholders. The way this is set up now, we would need to decide at plan-time whether to use the custom rule or the default InjectNetworkBoundaryPlaceholders (this would probably be yet another PhysicalOptimizerRule we need to implement). I would love to explore either further extensibility into rule 4, or some checks for whether a previous rule has already inserted NetworkBoundaryPlaceholders around a node before adding more.

Comment on lines +31 to +32
/// The input [ExecutionPlan] that will run remotely on the stage below.
pub input: Arc<dyn ExecutionPlan>,
Contributor

If I understand correctly, will this allow a custom PhysicalOptimizerRule using the NetworkBoundaryPlaceholder to rewrite its children?

For our extensibility use case, we are looking to replace the shuffle with a hierarchical aggregation, which will require us to rewrite the children below the placeholder. I'm not sure if this is what you're intending here, but we will likely use this field to do this.

Collaborator Author

@gabotechs gabotechs Mar 3, 2026

If I understand correctly, will this allow a custom PhysicalOptimizerRule using the NetworkBoundaryPlaceholder to rewrite its children?

The scope of the NetworkBoundaryPlaceholder is just to inject network boundaries in the plan so that the ApplyNetworkBoundaries rule can then transform them to the equivalent Network*Exec, but nothing else.

Taking execution plans unrelated to Distributed DataFusion, like the classical Aggregate(partial) + RepartitionExec + Aggregate(final) and replacing them with a hierarchical pattern sounds like something that is better done in a previous rule, and at that point it should be a matter of just using the existing DataFusion tools for rewriting plans. I don't think this project has a say about that.

#[derive(Debug)]
pub struct InjectNetworkBoundaryPlaceholders;

impl PhysicalOptimizerRule for InjectNetworkBoundaryPlaceholders {
Contributor

With this model, if I create a PhysicalOptimizerRule with a NetworkBoundaryPlaceholder targeting a hash RepartitionExec and insert it before InjectNetworkBoundaryPlaceholders, this will cause a "collision" between the custom NetworkBoundaryPlaceholder and the Shuffle logic in this rule. Both rules will be targeting the same node and attempting to inject NetworkBoundaryPlaceholders.

I'd be interested in seeing some detection whether a NetworkBoundaryPlaceholder already exists for a node, so that custom rules can play nicely with InjectNetworkBoundaryPlaceholders.

Collaborator Author

Yep, this would be a nice addition, although I'd argue that if you have a custom rule that is injecting its own shuffles in the plan... you probably don't want the default InjectNetworkBoundaryPlaceholders to kick in at all.

A valid stance that comes to mind is that it's the user's responsibility to ensure the plan is correct once it reaches ApplyNetworkBoundaries.

@gabotechs
Collaborator Author

Thanks for the feedback guys! I'll get back to this one shortly.

@gabotechs
Collaborator Author

Bringing back the discussion to #347

pub(super) children: Vec<AnnotatedPlan>,
children: Vec<AnnotatedPlan>,

// annotation fields
Collaborator

leftover comment I believe



Development

Successfully merging this pull request may close these issues.

Need for Customizable Plan Annotation and Network Boundary Logic in DFD

4 participants