feat(scoped-sd): scoped summary stats (raw + filtered) in merged_sd by paddymul · Pull Request #778 · buckaroo-data/buckaroo

paddymul · 2026-05-20T19:09:40Z

Summary

V1 of the scoped summary stats described in docs/plans/async-stats.md (forthcoming): merged_sd now carries both the pre-filter baseline (bare keys) and, when a search filter is active, the filtered view (filtered_* keys), so the frontend can render both side-by-side via the ? optional-pinned-row mechanism in #777.

- New trait summary_sd_raw holds the bare-key sd. With no filter it aliases summary_sd (no extra compute). With a filter active it re-runs autocleaning with empty quick_command_args to materialize the unfiltered df, then computes stats on that.
- _merged_sd wires summary_sd_raw into the bare-key layer and layers filtered_* keys from summary_sd on top when the filter is on.
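
The layering described above can be sketched with plain dicts. This is a minimal illustration, not the PR's code: the function name `merge_scoped_sd`, the `filter_active` flag, and the sample numbers are assumptions made for the example; only the bare-key and `filtered_*` key shapes come from the description.

```python
from typing import Any, Dict

# Assumed shape for illustration: column name -> {stat name -> value}
SD = Dict[str, Dict[str, Any]]


def merge_scoped_sd(raw_sd: SD, filtered_sd: SD, filter_active: bool) -> SD:
    """Hypothetical helper: bare keys come from the pre-filter sd,
    filtered_* keys are layered on top only when a filter is active."""
    # Copy each column's stats so the inputs are never mutated.
    merged: SD = {col: dict(stats) for col, stats in raw_sd.items()}
    if not filter_active:
        return merged
    for col, stats in filtered_sd.items():
        target = merged.setdefault(col, {})
        for stat, value in stats.items():
            target[f"filtered_{stat}"] = value
    return merged


# Made-up sample stats, not taken from the PR.
raw = {"age": {"mean": 40.0, "length": 100}}
filtered = {"age": {"mean": 31.5, "length": 12}}

no_filter = merge_scoped_sd(raw, raw, filter_active=False)
with_filter = merge_scoped_sd(raw, filtered, filter_active=True)
```

With no filter the result holds only bare keys; with a filter, each column carries both `mean` (pre-filter) and `filtered_mean` (post-filter).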

Breaking change

merged_sd[col]["mean"] (and the other bare stat names) now refer to the pre-filter dataset. Callers needing the post-filter values should read filtered_mean etc. when a filter is active. No in-repo callers needed updating — existing tests construct dataflows without filters, where pre- and post-filter values coincide.
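
For callers affected by this change, one possible migration is a small accessor that prefers the `filtered_*` key and falls back to the bare key when no filter is active. The helper name `post_filter_stat` is hypothetical and not part of buckaroo; the fallback relies on the statement above that pre- and post-filter values coincide when there is no filter.

```python
from typing import Any, Dict


def post_filter_stat(merged_sd: Dict[str, Dict[str, Any]], col: str, stat: str) -> Any:
    """Hypothetical accessor: return the post-filter value of a stat.

    Reads filtered_<stat> when present (filter active); otherwise falls
    back to the bare key, which then describes the same dataset.
    """
    col_stats = merged_sd[col]
    filtered_key = f"filtered_{stat}"
    if filtered_key in col_stats:
        return col_stats[filtered_key]
    return col_stats[stat]


# Made-up sample data for the two cases.
unfiltered_view = {"age": {"mean": 40.0}}
filtered_view = {"age": {"mean": 40.0, "filtered_mean": 31.5}}

mean_without_filter = post_filter_stat(unfiltered_view, "age", "mean")
mean_with_filter = post_filter_stat(filtered_view, "age", "mean")
```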

What's out of V1 (separate follow-ups)

- cleaned_* scope (third scope, only meaningful when cleaning_method != ""). The current shape proves out raw + filtered first; cleaned slots in via a third sd on the cleaned-but-unfiltered df.
- Histogram bin-edge sharing across scopes; categorical↔numerical column-type handling. Tracked in the plan doc.
- Async (asyncio.to_thread) compute and two-message WS protocol. V1 is sync.
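
As a rough sketch of how a deferred `cleaned_*` scope could reuse the same prefixing pattern, the merge can be made generic over scope prefixes. Everything here is an assumption for illustration (the function name `merge_prefixed_scopes`, the mapping of prefix to sd, and the sample values); the follow-up may take a different shape.

```python
from typing import Any, Dict, Mapping

# Assumed shape for illustration: column name -> {stat name -> value}
SD = Dict[str, Dict[str, Any]]


def merge_prefixed_scopes(raw_sd: SD, scoped_sds: Mapping[str, SD]) -> SD:
    """Hypothetical generalisation: bare keys from raw_sd, plus
    <prefix>_<stat> keys for every additional scope supplied."""
    merged: SD = {col: dict(stats) for col, stats in raw_sd.items()}
    for prefix, sd in scoped_sds.items():
        for col, stats in sd.items():
            target = merged.setdefault(col, {})
            for stat, value in stats.items():
                target[f"{prefix}_{stat}"] = value
    return merged


# Made-up sample stats for three scopes.
raw = {"price": {"mean": 10.0}}
merged = merge_prefixed_scopes(
    raw,
    {
        "cleaned": {"price": {"mean": 9.5}},
        "filtered": {"price": {"mean": 7.0}},
    },
)
```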

Depends on

feat(pinned-rows): ?key prefix marks an optional pinned row #777 (? prefix on PinnedRowConfig) for the frontend half. Either can land first; combined, the feature lights up.

Test plan

- 3 new failing-first tests in tests/unit/dataflow/scoped_summary_stats_test.py (commit cd2ee5b): no-cleaning-no-filter baseline, filter activates filtered_* keys, bare keys reflect raw (pre-filter) dataset.
- All 939 Python tests pass locally after the fix commit.
- CI: verify the failing-tests commit fails and the fix commit passes.

🤖 Generated with Claude Code

… merged_sd

Three integration tests covering the dataflow-level shape of the new scoped summary stats:

- no cleaning, no filter → only bare-key raw scope (baseline; passes today)
- filter active → bare-key raw + `filtered_*` scope
- bare `length` reflects the raw (pre-filter) dataset, not the post-everything view (a deliberate breaking change to today's bare-name semantics)

The second and third currently fail because today's dataflow runs the stats pipeline against the post-everything `processed_df` and overwrites the unfiltered baseline on each state change. See docs/plans/async-stats.md (forthcoming) for the design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… is active

Introduces the V1 scoped summary stats: when a search filter is active, ``merged_sd`` carries both the unfiltered baseline (bare keys) and the filtered view (``filtered_*`` keys), so a frontend can render both side by side via the ``?`` optional-pinned-row mechanism shipping in #777.

Mechanism:

- ``summary_sd_raw`` holds the bare-key (pre-filter) sd. When no filter is active it's an alias for ``summary_sd`` (no extra compute). When a filter is active it re-runs autocleaning with empty ``quick_command_args`` to materialize the unfiltered cleaned df and computes stats on that.
- ``_merged_sd`` now wires ``summary_sd_raw`` into the bare-key layer (replacing today's ``summary_sd``) and layers ``filtered_*`` keys on top from ``summary_sd`` when ``quick_command_args`` is non-empty.

Deliberate breaking change: ``merged_sd[col]["mean"]`` (and other bare stat names) now refer to the pre-filter dataset, not the post-everything view. Callers that need the post-filter values should read ``filtered_mean`` etc. when a filter is active. Documented in docs/plans/async-stats.md; no in-repo callers needed updating because existing tests construct dataflows without filters (where pre- and post-filter values coincide).

Cleaned-scope (``cleaned_*`` keys) is deferred to a follow-up: the current shape proves out raw + filtered first, and the same pattern extends to cleaned via a third sd computed on the cleaned-but-unfiltered df.

Full Python suite passes (939 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tripped ruff F401 on the pre-push full-tree check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T19:12:54Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26186264748

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26186264748

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26186264748" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

… (one call per state change)

The prior shape of this PR added a `_raw_summary_sd` observer that called `handle_ops_and_clean` a second time with empty `quick_command_args` to materialize the unfiltered df. Combined with #780's "one call per state change" invariant, this caused `handle_ops_and_clean` to fire 2–3× per filter change (once from `_operation_result` plus 1–2× from the observer cascade; the observer watched both `summary_sd` and `quick_command_args`).

This commit folds the unfiltered-df materialization into `PandasAutocleaning.handle_ops_and_clean` itself. The method now runs the interpreter once with the full ops and once with cleaning-only ops (when they differ), returning a 5-tuple:

    [cleaned_df, cleaning_sd, generated_code, final_ops, cleaned_df_unfiltered]

When no quick ops are present, `cleaned_df_unfiltered is cleaned_df` (no second interpreter run, preserves the no-op short-circuit identity invariant; see feedback_short_circuit_identity).

Downstream:

- `cleaned_df_unfiltered` exposed as a property on the dataflow.
- `_raw_summary_sd` observer removed; raw-scope sd is now computed inside `_summary_sd`, reading from `cleaned[4]`. Post-processing is applied to the unfiltered cleaned df so the raw scope reflects the visible view, not just the cleaning output.
- Test unpacks of `handle_ops_and_clean`'s return updated to swallow the new element with `*_` (forward-compatible across any future trailing additions).

Net behaviour: `handle_ops_and_clean` fires exactly once per state change, restoring #780's invariant. The xorq cost benefit is real: filter changes now pay one round-trip instead of two.

Full Python suite: 939 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
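
The 5-element return, the identity short-circuit, and the `*_` unpacking described in this commit message can be sketched as follows. This is a toy stand-in under stated assumptions: lists of ints replace DataFrames, ops are plain callables, and `run_ops` and `handle_ops_and_clean_sketch` are hypothetical names rather than buckaroo's implementation.

```python
from typing import Any, Callable, Dict, List

Rows = List[int]                 # toy stand-in for a DataFrame
Op = Callable[[Rows], Rows]      # toy stand-in for an interpreter op


def run_ops(rows: Rows, ops: List[Op]) -> Rows:
    """Apply each op in order (stand-in for one interpreter run)."""
    for op in ops:
        rows = op(rows)
    return rows


def handle_ops_and_clean_sketch(rows: Rows, cleaning_ops: List[Op], quick_ops: List[Op]) -> List[Any]:
    final_ops = list(cleaning_ops) + list(quick_ops)
    cleaned = run_ops(rows, final_ops)
    if quick_ops:
        # Full ops and cleaning-only ops differ: run cleaning-only once more.
        cleaned_unfiltered = run_ops(rows, list(cleaning_ops))
    else:
        # No quick ops: reuse the same object, so identity is preserved.
        cleaned_unfiltered = cleaned
    cleaning_sd: Dict[str, Any] = {}   # placeholder for this sketch
    generated_code = ""                # placeholder for this sketch
    return [cleaned, cleaning_sd, generated_code, final_ops, cleaned_unfiltered]


def drop_negative(rows: Rows) -> Rows:
    return [r for r in rows if r >= 0]


def keep_even(rows: Rows) -> Rows:
    return [r for r in rows if r % 2 == 0]


data = [-1, 2, 3, 4]
no_quick = handle_ops_and_clean_sketch(data, [drop_negative], [])
with_quick = handle_ops_and_clean_sketch(data, [drop_negative], [keep_even])

# Callers that only need the leading elements can swallow the rest with *_.
cleaned_df, cleaning_sd, *_ = with_quick
```

Unpacking with `*_` keeps callers working if further trailing elements are appended later, which is the forward-compatibility point made above.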

paddymul · 2026-05-20T22:21:25Z

Superseded by #785, which uses #783's keyed SD cache as the substrate instead of the parallel 5-tuple handle_ops_and_clean shape this PR took. Closing.

paddymul and others added 3 commits May 20, 2026 14:35

chore(scoped-sd): drop unused pytest import in scoped_summary_stats_test

7f84b51

Tripped ruff F401 on the pre-push full-tree check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

paddymul mentioned this pull request May 20, 2026

docs(plans): consolidated scoped-summary-stats V1 plan + V2 async outline #779

Open

2 tasks

paddymul temporarily deployed to testpypi May 20, 2026 19:11 — with GitHub Actions Inactive

paddymul temporarily deployed to testpypi May 20, 2026 19:53 — with GitHub Actions Inactive

This was referenced May 20, 2026

feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783

Merged

feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd #785

Open

paddymul closed this May 20, 2026

paddymul wants to merge 4 commits into main from feat/scoped-summary-stats
