feat(scoped-sd): scoped summary stats (raw + filtered) in merged_sd #778
Closed
paddymul wants to merge 4 commits into
… merged_sd

Three integration tests covering the dataflow-level shape of the new scoped summary stats:

- no cleaning, no filter → only bare-key raw scope (baseline; passes today)
- filter active → bare-key raw + `filtered_*` scope
- bare `length` reflects the raw (pre-filter) dataset, not the post-everything view; this is a deliberate breaking change to today's bare-name semantics

The second and third currently fail because today's dataflow runs the stats pipeline against the post-everything `processed_df` and overwrites the unfiltered baseline on each state change.

See docs/plans/async-stats.md (forthcoming) for the design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
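To make the three shapes concrete, here is a minimal sketch using hand-written dictionaries. The stat values and the helper `scopes_present` are illustrative assumptions, not output from or part of buckaroo; only the bare-key versus `filtered_*` naming comes from the commit message.

```python
def scopes_present(col_sd):
    """Return the set of scopes a per-column stats dict carries.

    Keys starting with ``filtered_`` count as the filtered scope; every
    other key is treated as the bare-key (raw) scope.
    """
    return {"filtered" if key.startswith("filtered_") else "raw" for key in col_sd}


# Hypothetical stats for a 5-row column where a filter keeps 2 rows.
no_filter_sd = {"length": 5, "mean": 3.0}
filter_active_sd = {
    "length": 5,           # bare key: still the raw, pre-filter row count
    "mean": 3.0,
    "filtered_length": 2,  # filtered scope: rows surviving the filter
    "filtered_mean": 4.5,
}

assert scopes_present(no_filter_sd) == {"raw"}
assert scopes_present(filter_active_sd) == {"raw", "filtered"}
# The bare `length` does not change when the filter turns on.
assert filter_active_sd["length"] == no_filter_sd["length"]
```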
… is active

Introduces the V1 scoped summary stats: when a search filter is active, ``merged_sd`` carries both the unfiltered baseline (bare keys) and the filtered view (``filtered_*`` keys), so a frontend can render both side by side via the ``?`` optional-pinned-row mechanism shipping in #777.

Mechanism:

- ``summary_sd_raw`` holds the bare-key (pre-filter) sd. When no filter is active it's an alias for ``summary_sd`` (no extra compute). When a filter is active it re-runs autocleaning with empty ``quick_command_args`` to materialize the unfiltered cleaned df and computes stats on that.
- ``_merged_sd`` now wires ``summary_sd_raw`` into the bare-key layer (replacing today's ``summary_sd``) and layers ``filtered_*`` keys on top from ``summary_sd`` when ``quick_command_args`` is non-empty.

Deliberate breaking change: ``merged_sd[col]["mean"]`` (and other bare stat names) now refer to the pre-filter dataset, not the post-everything view. Callers that need the post-filter values should read ``filtered_mean`` etc. when a filter is active. Documented in docs/plans/async-stats.md; no in-repo callers needed updating because existing tests construct dataflows without filters (where pre- and post-filter values coincide).

Cleaned-scope (``cleaned_*`` keys) is deferred to a follow-up: the current shape proves out raw + filtered first, and the same pattern extends to cleaned via a third sd computed on the cleaned-but-unfiltered df.

Full Python suite passes (939 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
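The layering described above can be sketched in plain Python. This is not the buckaroo implementation: `merge_scoped_sd` and `read_stat` are hypothetical names, the stats dicts are hand-written, and the shape of `quick_command_args` (a dict that is empty when no filter is active) is an assumption made for illustration.

```python
def merge_scoped_sd(summary_sd_raw, summary_sd, quick_command_args):
    """Layer filtered_* keys on top of the bare-key (pre-filter) stats."""
    # Bare keys always come from the raw (pre-filter) summary.
    merged = {col: dict(stats) for col, stats in summary_sd_raw.items()}
    if quick_command_args:  # assumed: non-empty means a filter is active
        for col, stats in summary_sd.items():
            col_sd = merged.setdefault(col, {})
            for name, value in stats.items():
                col_sd[f"filtered_{name}"] = value
    return merged


def read_stat(col_sd, name):
    """Caller-side helper: post-filter value if present, else the bare one."""
    return col_sd.get(f"filtered_{name}", col_sd[name])


raw_sd = {"a": {"length": 5, "mean": 3.0}}       # illustrative values
filtered_sd = {"a": {"length": 2, "mean": 4.5}}  # illustrative values

merged = merge_scoped_sd(raw_sd, filtered_sd, {"search": ["needle"]})
assert merged["a"]["mean"] == 3.0            # bare key: pre-filter
assert merged["a"]["filtered_mean"] == 4.5   # filtered scope
assert read_stat(merged["a"], "mean") == 4.5

# With no filter, only the bare-key layer is present.
assert merge_scoped_sd(raw_sd, raw_sd, {}) == raw_sd
```

A helper like `read_stat` is one way a caller that relied on the old post-filter meaning of the bare names could adapt; whether that fallback is desirable depends on the caller.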
Tripped ruff F401 on the pre-push full-tree check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26186264748

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26186264748

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26186264748" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview
🎨 Storybook preview
… (one call per state change)

The prior shape of this PR added a `_raw_summary_sd` observer that called `handle_ops_and_clean` a second time with empty `quick_command_args` to materialize the unfiltered df. Combined with #780's "one call per state change" invariant, this caused `handle_ops_and_clean` to fire 2–3× per filter change (once from `_operation_result` plus 1–2× from the observer cascade; the observer watched both `summary_sd` and `quick_command_args`).

This commit folds the unfiltered-df materialization into `PandasAutocleaning.handle_ops_and_clean` itself. The method now runs the interpreter once with the full ops and once with cleaning-only ops (when they differ), returning a 5-tuple:

    [cleaned_df, cleaning_sd, generated_code, final_ops, cleaned_df_unfiltered]

When no quick ops are present, `cleaned_df_unfiltered is cleaned_df` (no second interpreter run, preserves the no-op short-circuit identity invariant; see feedback_short_circuit_identity).

Downstream:

- `cleaned_df_unfiltered` exposed as a property on the dataflow.
- `_raw_summary_sd` observer removed; raw-scope sd is now computed inside `_summary_sd`, reading from `cleaned[4]`. Post-processing is applied to the unfiltered cleaned df so the raw scope reflects the visible view, not just the cleaning output.
- Test unpacks of `handle_ops_and_clean`'s return updated to swallow the new element with `*_` (forward-compatible across any future trailing additions).

Net behaviour: `handle_ops_and_clean` fires exactly once per state change, restoring #780's invariant. The xorq cost benefit is real: filter changes now pay one round-trip instead of two.

Full Python suite: 939 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
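The 5-element return and the identity short-circuit can be illustrated with a toy stand-in. This is a sketch under assumptions, not `PandasAutocleaning.handle_ops_and_clean`: the function name, the split of ops into `cleaning_ops` and `quick_ops`, the callable-based `apply_ops` interpreter, and the list-of-values "dataframe" are all hypothetical. It also uses "quick ops are non-empty" as a proxy for "the two op lists differ".

```python
def apply_ops(rows, ops):
    """Toy interpreter: apply each op (a callable) in order."""
    for op in ops:
        rows = op(rows)
    return rows


def handle_ops_and_clean_sketch(rows, cleaning_ops, quick_ops):
    final_ops = list(cleaning_ops) + list(quick_ops)
    cleaned = apply_ops(rows, final_ops)
    if quick_ops:
        # Second interpreter run with cleaning-only ops: the unfiltered view.
        cleaned_unfiltered = apply_ops(rows, cleaning_ops)
    else:
        # No quick ops: reuse the same object, no second run.
        cleaned_unfiltered = cleaned
    cleaning_sd, generated_code = {}, ""  # placeholders in this sketch
    return [cleaned, cleaning_sd, generated_code, final_ops, cleaned_unfiltered]


def drop_none(rows):
    return [r for r in rows if r is not None]


def keep_big(rows):
    return [r for r in rows if r > 2]


data = [1, None, 3, 4]

result = handle_ops_and_clean_sketch(data, [drop_none], [keep_big])
assert result[0] == [3, 4]     # cleaned and filtered
assert result[4] == [1, 3, 4]  # cleaned but unfiltered

no_filter = handle_ops_and_clean_sketch(data, [drop_none], [])
assert no_filter[4] is no_filter[0]  # identity preserved when nothing to filter

# Callers that only need the leading elements can swallow the rest.
cleaned_df, cleaning_sd, *_ = result
assert cleaned_df == [3, 4]
```

Unpacking with `*_` keeps such callers working if further trailing elements are appended later, which is the forward-compatibility point made in the commit message.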
This was referenced May 20, 2026
Summary
V1 of the scoped summary stats described in `docs/plans/async-stats.md` (forthcoming): `merged_sd` now carries both the pre-filter baseline (bare keys) and, when a search filter is active, the filtered view (`filtered_*` keys), so the frontend can render both side-by-side via the `?` optional-pinned-row mechanism in #777.

- `summary_sd_raw` holds the bare-key sd. With no filter it aliases `summary_sd` (no extra compute). With a filter active it re-runs autocleaning with empty `quick_command_args` to materialize the unfiltered df, then computes stats on that.
- `_merged_sd` wires `summary_sd_raw` into the bare-key layer and layers `filtered_*` keys from `summary_sd` on top when the filter is on.

Breaking change
`merged_sd[col]["mean"]` (and the other bare stat names) now refer to the pre-filter dataset. Callers needing the post-filter values should read `filtered_mean` etc. when a filter is active. No in-repo callers needed updating: existing tests construct dataflows without filters, where pre- and post-filter values coincide.

What's out of V1 (separate follow-ups)
- `cleaned_*` scope (third scope, only meaningful when `cleaning_method != ""`). The current shape proves out raw + filtered first; `cleaned` slots in via a third sd on the cleaned-but-unfiltered df.
- Async (`asyncio.to_thread`) compute and two-message WS protocol. V1 is sync.

Depends on
- `?key` prefix marks an optional pinned row #777 (`?` prefix on `PinnedRowConfig`) for the frontend half. Either can land first; combined, the feature lights up.

Test plan
- `tests/unit/dataflow/scoped_summary_stats_test.py` (commit cd2ee5b): no-cleaning-no-filter baseline, filter activates `filtered_*` keys, bare keys reflect raw (pre-filter) dataset.

🤖 Generated with Claude Code