feat(tesseract): Support separate pre-aggregations for different multi-fact subqueries #10718

Open
waralexrom wants to merge 41 commits into tesseract-multi-stage-pre-aggregations-full-support from tesseract-multi-fact-separate-pre-aggregations

@waralexrom
Member

Check List

  • Tests have been run in packages where changes have been made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

@waralexrom waralexrom requested a review from a team as a code owner April 20, 2026 12:38
@github-actions github-actions Bot added the rust (Pull requests that update Rust code) and javascript (Pull requests that update Javascript code) labels Apr 20, 2026
@claude
Contributor

claude Bot commented Apr 20, 2026

Claude finished @waralexrom's task in 2m 37s


Code Review: Separate Pre-aggregations for Multi-fact Subqueries

  • Gather context and understand PR changes
  • Review code quality and architecture
  • Review security considerations
  • Review performance implications
  • Review test coverage
  • Post comprehensive review

Summary

This PR refactors how multiplied (multi-fact) measures are planned and optimized in Tesseract. The key architectural change is unifying the CTE management for both multi-stage members and multiplied measures into a single CteState, eliminating the separate ResolveMultipliedMeasures logical node. This enables pre-aggregation matching to work independently on each multi-fact subquery (regular or multiplied) rather than requiring the entire multiplied-measures subtree to match or fail as a unit.

What changed:

  1. Deleted ResolveMultipliedMeasures (logical + physical) — multiplied measures are now planned as individual LogicalMultiStageMember CTEs
  2. New CteState — shared counter/accumulator for both MultiStageQueryPlanner and MultipliedMeasuresQueryPlanner
  3. MultiStageLeafMeasure.measure.measures (Vec) — supports multi-measure CTEs from regular measure subqueries
  4. pre_aggregation_override on AggregateMultipliedSubquery — allows the pre-agg optimizer to short-circuit individual multiplied subqueries
  5. Pre-agg optimizer refactored: try_rewrite_query now handles simple rewrite + multi-stage fallback in one place; new try_rewrite_schema_and_filter for AggregateMultipliedSubquery matching
  6. New bucketing integration tests — 8 tests covering multi-stage dimensions with bucketing logic

The approach is sound and the refactoring materially simplifies the codebase (~200 net lines removed despite adding tests and new functionality).
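
To make the shared counter/accumulator idea from item 2 concrete, here is a minimal sketch of one state object handing out CTE names to two planners. The `CteState` name and the `cte_N` naming come from this review; the fields, the `add_cte` method, the `PlannedCte` type, and both planner structs are assumptions for illustration and are not taken from the PR's code.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical stand-in for a planned CTE; the real planner stores logical plan nodes.
#[derive(Debug, Clone, PartialEq)]
pub struct PlannedCte {
    pub name: String,
    pub origin: String,
}

/// Hypothetical shared state: one counter and one accumulator for all planners.
#[derive(Debug, Default)]
pub struct CteState {
    next_id: usize,
    ctes: Vec<PlannedCte>,
}

impl CteState {
    /// Registers a CTE and returns its globally sequential name (cte_0, cte_1, ...).
    pub fn add_cte(&mut self, origin: &str) -> String {
        let name = format!("cte_{}", self.next_id);
        self.next_id += 1;
        self.ctes.push(PlannedCte {
            name: name.clone(),
            origin: origin.to_string(),
        });
        name
    }

    /// All CTEs registered so far, in planning order.
    pub fn ctes(&self) -> &[PlannedCte] {
        &self.ctes
    }
}

/// Hypothetical planner for multi-stage members.
pub struct MultiStagePlanner {
    state: Rc<RefCell<CteState>>,
}

/// Hypothetical planner for multiplied measures.
pub struct MultipliedMeasuresPlanner {
    state: Rc<RefCell<CteState>>,
}

impl MultiStagePlanner {
    pub fn new(state: Rc<RefCell<CteState>>) -> Self {
        Self { state }
    }

    /// Plans one multi-stage member as a CTE and returns the CTE name.
    pub fn plan(&self, member: &str) -> String {
        self.state
            .borrow_mut()
            .add_cte(&format!("multi-stage:{}", member))
    }
}

impl MultipliedMeasuresPlanner {
    pub fn new(state: Rc<RefCell<CteState>>) -> Self {
        Self { state }
    }

    /// Plans one multiplied measure as a CTE and returns the CTE name.
    pub fn plan(&self, measure: &str) -> String {
        self.state
            .borrow_mut()
            .add_cte(&format!("multiplied:{}", measure))
    }
}
```

Because both planners draw from the same counter, the name a CTE receives depends only on the order in which `plan` is called, not on which planner called it.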


Code Quality

Positive:

  • Good separation of concerns: CteState is a clean, focused abstraction
  • The ResolveMultipliedMeasures + ResolvedMultipliedMeasures enum removal significantly simplifies FullKeyAggregate and all three aggregate strategy processors
  • The pre-aggregation optimizer flows more linearly now

Issues:

  1. try_rewrite_schema_and_filter always uses TimeShiftState::default() (rust/cubesqlplanner/cubesqlplanner/src/logical_plan/optimizers/pre_aggregation/optimizer.rs:139-143)

    When matching an AggregateMultipliedSubquery against pre-aggregations, try_rewrite_schema_and_filter hardcodes TimeShiftState::default() for date range extraction. This is likely correct since multiplied measure subqueries don't have time shifts, but the same hardcoding also appears in the PlanNode::AggregateMultipliedSubquery arm at line 218. Consider whether the time shifts from the enclosing context should be threaded through here.

  2. root_filter used for all multiplied subquery matching (optimizer.rs:189-191)

    The comment says "multiplied-measure CTEs don't carry their own filter — logically they apply the same filter as the root query." This is a key semantic assumption. If a future change adds filter pushdown into multiplied measure subqueries, this would silently become wrong. The comment is helpful but consider adding a debug assertion or at minimum ensuring this invariant is documented at the AggregateMultipliedSubquery struct level.

  3. Duplicate code in try_rewrite_simple_query vs try_rewrite_schema_and_filter (optimizer.rs:117-131 vs optimizer.rs:139-170)

    These two methods share nearly identical logic (schema/filter matching → make source → build Query). The only differences are: (a) try_rewrite_simple_query copies the existing query's modifiers while try_rewrite_schema_and_filter creates empty modifiers, and (b) the date range extraction. Consider extracting a shared helper.

  4. Empty data_queries left behind in strategy files — The full_join_aggregate_strategy.rs:103, inner_join_aggregate_strategy.rs:29, and keys_aggregate_strategy.rs:33 files still declare let mut data_queries = vec![]; or let mut keys_queries = vec![]; immediately before the remaining loop. Not a bug, but now the code looks like it was hastily trimmed rather than cleanly restructured. No need to fix now, but worth noting.
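
For issue 3, one possible shape of the shared helper is sketched below. Only the two public function names come from the PR; their signatures, the `rewrite_with_modifiers` helper, and the `Modifiers`, `PreAggregation`, and `RewrittenQuery` types are simplified assumptions, and filter matching and date range extraction are left out entirely.

```rust
/// Hypothetical, simplified query modifiers.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Modifiers {
    pub limit: Option<usize>,
    pub ungrouped: bool,
}

/// Hypothetical pre-aggregation: a name plus the members it materializes.
#[derive(Debug, Clone, PartialEq)]
pub struct PreAggregation {
    pub name: String,
    pub members: Vec<String>,
}

/// Hypothetical result of a successful rewrite.
#[derive(Debug, Clone, PartialEq)]
pub struct RewrittenQuery {
    pub source: String,
    pub members: Vec<String>,
    pub modifiers: Modifiers,
}

/// Shared core (assumed): match the requested members, then build the query.
/// The caller decides which modifiers the rewritten query carries.
fn rewrite_with_modifiers(
    members: &[String],
    pre_agg: &PreAggregation,
    modifiers: Modifiers,
) -> Option<RewrittenQuery> {
    let covered = members.iter().all(|m| pre_agg.members.contains(m));
    if !covered {
        return None;
    }
    Some(RewrittenQuery {
        source: pre_agg.name.clone(),
        members: members.to_vec(),
        modifiers,
    })
}

/// Whole-query rewrite: keeps the modifiers of the existing query.
pub fn try_rewrite_simple_query(
    members: &[String],
    existing_modifiers: &Modifiers,
    pre_agg: &PreAggregation,
) -> Option<RewrittenQuery> {
    rewrite_with_modifiers(members, pre_agg, existing_modifiers.clone())
}

/// Subquery rewrite: starts from empty modifiers.
pub fn try_rewrite_schema_and_filter(
    members: &[String],
    pre_agg: &PreAggregation,
) -> Option<RewrittenQuery> {
    rewrite_with_modifiers(members, pre_agg, Modifiers::default())
}
```

With this split, each public method only states how it differs (which modifiers it passes), and the matching logic lives in one place.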


Security

No security concerns. This is internal query planning logic with no user-facing input parsing, no SQL injection vectors (queries are built via typed builders), and no auth/authz changes.


Performance

  1. CTE naming is globally sequential now — With CteState shared across multi-stage and multiplied planners, CTE names are cte_0, cte_1, ... in order of planning, not grouped by type. This is fine functionally but means CTE numbering is now sensitive to planning order. If the order of multi_stage_query_planner.plan_queries() vs multiplied_measures_query_planner.plan_queries() ever changes, CTE names will shift, potentially breaking snapshot tests. This is acceptable but worth being aware of.

  2. No performance regression expected — The refactoring doesn't add extra passes or copies. The pre_aggregation_override field adds one Option<Rc<Query>> per AggregateMultipliedSubquery which is negligible.

  3. Pre-aggregation matching improvement — The new approach tries simple rewrite first, then falls back to multi-stage rewriting (line 95-100). This is more efficient than the old code which checked !plan.multistage_members().is_empty() upfront and went straight to the multi-stage path, potentially missing a simpler match.
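
The "simple rewrite first, then multi-stage fallback" ordering from point 3 can be sketched with `Option::or_else`. The `try_rewrite_query` name comes from the PR; everything else here (the `Plan`, `PreAggregation`, and `Rewrite` types, the two helper functions, and the member-coverage rule used for matching) is a toy model assumed for illustration.

```rust
/// Hypothetical outcome of a rewrite attempt.
#[derive(Debug, Clone, PartialEq)]
pub enum Rewrite {
    Simple { pre_aggregation: String },
    MultiStage { pre_aggregation: String },
}

/// Hypothetical pre-aggregation: a name plus the members it materializes.
pub struct PreAggregation {
    pub name: String,
    pub members: Vec<String>,
}

/// Hypothetical plan summary used only for matching.
pub struct Plan {
    /// Members requested by the outer query.
    pub members: Vec<String>,
    /// Members read by the leaf CTEs of multi-stage members (empty if there are none).
    pub leaf_members: Vec<String>,
}

/// Toy matching rule (assumed): the pre-aggregation must contain every member.
fn covers(pre_agg: &PreAggregation, members: &[String]) -> bool {
    members.iter().all(|m| pre_agg.members.contains(m))
}

/// Tries to answer the whole query directly from the pre-aggregation.
fn try_rewrite_simple(plan: &Plan, pre_agg: &PreAggregation) -> Option<Rewrite> {
    if covers(pre_agg, &plan.members) {
        Some(Rewrite::Simple {
            pre_aggregation: pre_agg.name.clone(),
        })
    } else {
        None
    }
}

/// Tries to feed only the leaf CTEs from the pre-aggregation.
fn try_rewrite_multi_stage(plan: &Plan, pre_agg: &PreAggregation) -> Option<Rewrite> {
    if !plan.leaf_members.is_empty() && covers(pre_agg, &plan.leaf_members) {
        Some(Rewrite::MultiStage {
            pre_aggregation: pre_agg.name.clone(),
        })
    } else {
        None
    }
}

/// Simple rewrite first; the multi-stage path runs only if the simple one fails.
pub fn try_rewrite_query(plan: &Plan, pre_agg: &PreAggregation) -> Option<Rewrite> {
    try_rewrite_simple(plan, pre_agg).or_else(|| try_rewrite_multi_stage(plan, pre_agg))
}
```

In this model a plan that has multi-stage members still gets the simple match when the pre-aggregation already contains the requested members, which is the case the old upfront check would have skipped.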


Testing

Strengths:

  • 8 new integration tests covering bucketing scenarios: simple, multistage measures, complex dimensions, cross-cube references, joins, multi-dimension, and concatenated dimensions
  • Tests exercise both SQL generation (build_sql) and actual execution against Postgres (try_execute_pg)
  • Good seed data with diverse customer revenue patterns enabling meaningful assertions

Concerns:

  1. No test for the pre-aggregation optimization path: the new PlanNode::AggregateMultipliedSubquery arm in the optimizer (lines 216-233) and try_rewrite_schema_and_filter are not directly tested. The bucketing tests validate the planning pipeline but don't exercise the pre-aggregation rewrite of multiplied subqueries. Consider adding a test that provides a pre-aggregation definition matching a multiplied measure subquery.

  2. No negative/error tests: all tests are happy-path. Consider adding a test where pre-aggregation matching partially fails (some subqueries match, some don't) to verify that the rollback behavior (lines 243-250) works correctly.

  3. Cumulative measure rejection: the new guard in measure_matcher.rs:34-38 that rejects cumulative measures from pre-aggregation matching is important but not directly tested.
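
A direct test for the guard in concern 3 could look roughly like the sketch below. The `Measure` type, its `is_cumulative` flag, and both functions are hypothetical; the real guard in measure_matcher.rs may inspect a different property and have a different signature.

```rust
/// Hypothetical, simplified measure description.
#[derive(Debug, Clone, PartialEq)]
pub struct Measure {
    pub name: String,
    pub is_cumulative: bool,
}

/// Assumed guard: cumulative measures never match a pre-aggregation,
/// even when a measure with the same name is materialized in it.
pub fn measure_matches(measure: &Measure, pre_agg_measures: &[String]) -> bool {
    if measure.is_cumulative {
        return false;
    }
    pre_agg_measures.iter().any(|m| m == &measure.name)
}

/// Assertions a #[test] function could call to pin the guard down.
pub fn check_cumulative_measures_are_rejected() {
    let pre_agg_measures = vec!["orders.revenue".to_string()];
    let regular = Measure {
        name: "orders.revenue".to_string(),
        is_cumulative: false,
    };
    let cumulative = Measure {
        name: "orders.revenue".to_string(),
        is_cumulative: true,
    };
    // Same name, different outcome: only the non-cumulative measure matches.
    assert!(measure_matches(&regular, &pre_agg_measures));
    assert!(!measure_matches(&cumulative, &pre_agg_measures));
}
```

Using the same measure name for both cases keeps the test focused on the guard itself rather than on name lookup.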


Minor Notes

  • Pre-existing typo: only_addictive should be only_additive in measure_matcher.rs — not introduced by this PR.
  • modifers typo in Query builder (line 125, 161) — also pre-existing, not from this PR.
  • The regex update in pre-aggregation-utils.ts to handle WITH in addition to SELECT is correct and necessary since multiplied measure CTEs may now produce WITH ... SELECT patterns.

Verdict

This is a well-designed architectural refactoring that simplifies the codebase while enabling more granular pre-aggregation matching. The main actionable feedback is around code duplication between the two rewrite methods and the missing test coverage for the pre-aggregation optimization of multiplied subqueries. The rest is solid.

@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.76%. Comparing base (a2e8f6a) to head (8a7bcd9).

Additional details and impacted files
@@                                   Coverage Diff                                   @@
##           tesseract-multi-stage-pre-aggregations-full-support   #10718      +/-   ##
=======================================================================================
- Coverage                                                78.79%   78.76%   -0.03%     
=======================================================================================
  Files                                                      465      465              
  Lines                                                    91999    91925      -74     
  Branches                                                  3374     3353      -21     
=======================================================================================
- Hits                                                     72488    72408      -80     
- Misses                                                   19020    19028       +8     
+ Partials                                                   491      489       -2     
Flag Coverage Δ
cube-backend 57.80% <ø> (-0.20%) ⬇️
cubesql 83.40% <ø> (-0.01%) ⬇️

@waralexrom waralexrom force-pushed the tesseract-multi-stage-pre-aggregations-full-support branch from a884e18 to 64242fe Compare April 24, 2026 10:53
@waralexrom waralexrom requested review from a team as code owners April 24, 2026 10:53