Hoist Path-spec StageExample infrastructure (issue #387, Option B) #390

Open
mmcdermott wants to merge 1 commit into dev from feat/filespec-stageexample

@mmcdermott
Owner

Summary

Closes #387 (Option B). The upstream StageExample assumed outputs were either a MEDSDataset under data/*.parquet or a single metadata/codes.parquet — so two derived packages (meds-torch-data and MEDS-extract) independently reinvented the same ~200-LOC subclass: Path-based want_data / want_metadata plus a per-suffix comparator dispatcher.

This PR hoists that scaffolding into the base class so derived packages collapse to ~20 LOC.

Depends on

#389 — the render_content / want_metadata: Path bug-fix prerequisite. That PR is deliberately minimal (rendering dispatch only); this one adds the type widening + check_outputs dispatch + comparator registry on top.

Changes

Type widening

  • want_data: MEDSDataset | Path | None
  • want_metadata: pl.DataFrame | Path | None

A Path is interpreted as a yaml_to_disk spec describing the expected output tree.

from_dir graceful fallback

  • Try MEDSDataset.from_yaml(out_data_fp) / read_metadata_only(out_metadata_fp) first. On ValueError / TypeError, store the Path itself and let check_outputs dispatch to the file-spec path later. Mirrors the existing in_data pattern exactly.
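
The fallback can be pictured with a minimal sketch. `parse_or_keep_path` is a hypothetical helper name used only for illustration (the actual `from_dir` may be structured differently); the parser argument stands in for `MEDSDataset.from_yaml` or `read_metadata_only`.

```python
from pathlib import Path
from typing import Callable, TypeVar, Union

T = TypeVar("T")


def parse_or_keep_path(fp: Path, parser: Callable[[Path], T]) -> Union[T, Path]:
    """Return parser(fp), or fp itself when the file is not MEDS-shaped.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    try:
        return parser(fp)
    except (ValueError, TypeError):
        # Not parseable as a structured object: keep the Path so the
        # output check can treat it as a file-tree spec later.
        return fp
```

Only `ValueError` and `TypeError` are caught, so unrelated failures (for example a missing file) would still surface rather than being silently reinterpreted as a file spec.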

New check_outputs dispatch

  • MEDSDataset / pl.DataFrame branches are byte-identical to before (no regression for existing stages).
  • Path branches go through the new _check_path_spec, which:
    1. materializes the spec into a tempdir via yaml_to_disk.yaml_disk;
    2. walks the expected tree, compares each file to its counterpart under the actual output dir via a per-suffix ComparatorFn;
    3. asserts the actual tree has no unexpected files (unless tolerate_unexpected=True), after filtering out skip_dirs / skip_files.
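
Steps 2 and 3 can be sketched with the standard library alone. This is an illustrative approximation, not the code in this PR: it assumes the expected tree has already been materialized on disk (step 1), simplifies comparators to two arguments, and assumes `skip_dirs` matches directory names at any depth. `check_tree` and `compare_text` are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Mapping

Comparator = Callable[[Path, Path], None]  # simplified: no context argument


def compare_text(expected: Path, actual: Path) -> None:
    """Toy comparator for illustration: exact text equality."""
    if expected.read_text() != actual.read_text():
        raise AssertionError(f"Contents differ: {actual}")


def check_tree(
    expected_root: Path,
    actual_root: Path,
    comparators: Mapping[str, Comparator],
    skip_dirs: frozenset = frozenset({".logs", ".hydra"}),
    skip_files: frozenset = frozenset(),
    tolerate_unexpected: bool = False,
) -> None:
    expected = {
        p.relative_to(expected_root) for p in expected_root.rglob("*") if p.is_file()
    }

    # Step 2: every expected file must exist and match under its comparator.
    for rel in sorted(expected):
        actual_fp = actual_root / rel
        if not actual_fp.is_file():
            raise AssertionError(f"Missing expected file: {rel}")
        if rel.suffix not in comparators:
            raise AssertionError(f"No comparator for suffix {rel.suffix!r}: {rel}")
        comparators[rel.suffix](expected_root / rel, actual_fp)

    if tolerate_unexpected:
        return

    # Step 3: after filtering skipped dirs/files, nothing extra may remain.
    unexpected = []
    for p in actual_root.rglob("*"):
        if not p.is_file():
            continue
        rel = p.relative_to(actual_root)
        if set(rel.parts[:-1]) & skip_dirs or rel.name in skip_files:
            continue
        if rel not in expected:
            unexpected.append(str(rel))
    if unexpected:
        raise AssertionError(f"Unexpected files: {sorted(unexpected)}")
```

Running the expected-file pass before the unexpected-file pass means a content mismatch is reported ahead of stray-file noise, which is usually the more actionable failure.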

Policy fields (dataclass fields, not ClassVars)

  • skip_dirs: frozenset[str] — defaults to {.logs, .hydra} (the usual Hydra / logging byproducts both downstream packages reinvented).
  • skip_files: frozenset[str] — empty default.
  • tolerate_unexpected: bool — defaults to strict.
  • suffix_comparators: dict[str, ComparatorFn] | None — per-instance override of the class default.
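
To make the "dataclass fields, not ClassVars" distinction concrete, here is a small stand-alone sketch. `PolicySketch` is a hypothetical stand-in for the relevant slice of `StageExample`, and the comparator type is simplified to a bare `Callable`.

```python
from dataclasses import dataclass, fields
from typing import Callable, ClassVar, Dict, Optional


@dataclass
class PolicySketch:
    # Instance-level policy: each example can set these in its constructor.
    skip_dirs: frozenset = frozenset({".logs", ".hydra"})
    skip_files: frozenset = frozenset()
    tolerate_unexpected: bool = False
    suffix_comparators: Optional[Dict[str, Callable]] = None

    # Class-level default: a ClassVar is not a dataclass field, so it is
    # shared by all instances and is not an __init__ parameter.
    SUFFIX_COMPARATORS: ClassVar[Dict[str, Callable]] = {}


print([f.name for f in fields(PolicySketch)])
```

Because `frozenset` is immutable, it can be used directly as a field default without `default_factory`.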

Comparator registry (no process-global state)

  • Class-level default: SUFFIX_COMPARATORS: ClassVar[dict[str, ComparatorFn]] = {".parquet": _compare_parquet}.
  • Subclasses register additional comparators via standard class-var override (class MTDStageExample(StageExample): SUFFIX_COMPARATORS = {..., ".nrt": _compare_nrt}).
  • Per-instance suffix_comparators overrides the class default for one-offs (e.g. relaxed tolerance on a specific example).
  • No @register_comparator decorator — explicitly avoided per the design-discussion comment.

Comparator signature

ComparatorFn = Callable[[Path, Path, CompareContext], None]

CompareContext is a frozen dataclass carrying rel: Path (for error messages) and tolerances: Mapping[str, Any] keyed by suffix. df_check_kwargs is plumbed through as ctx.tol_for('.parquet') — so adding a new format doesn't require signature changes.
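
A minimal sketch of that interface, for orientation only. The field names `rel` and `tolerances` and the method `tol_for` come from the description above; returning an empty mapping for an unknown suffix is an assumption, and `compare_number_file` is a hypothetical toy comparator that shows how a format would read its own tolerance bundle.

```python
import math
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class CompareContext:
    rel: Path  # path relative to the output root, used in error messages
    tolerances: Mapping[str, Any] = field(default_factory=dict)

    def tol_for(self, suffix: str) -> Any:
        # Assumption: suffixes without a tolerance bundle get an empty mapping.
        return self.tolerances.get(suffix, {})


ComparatorFn = Callable[[Path, Path, CompareContext], None]


def compare_number_file(expected: Path, actual: Path, ctx: CompareContext) -> None:
    """Toy comparator: each file holds one float, compared within abs_tol."""
    abs_tol = ctx.tol_for(expected.suffix).get("abs_tol", 0.0)
    want, got = float(expected.read_text()), float(actual.read_text())
    if not math.isclose(want, got, abs_tol=abs_tol):
        raise AssertionError(f"{ctx.rel}: expected {want}, got {got}")
```

A new format only needs a new key in `tolerances`; the three-argument comparator signature stays the same.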

XOR relaxation

The want_data XOR want_metadata rule still fires when both are MEDSDataset + pl.DataFrame (the original ambiguity). It's relaxed when either is a Path — extraction stages legitimately describe both data and metadata file trees simultaneously.
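
As a rough sketch of the relaxed rule (not the actual validation code): `validate_wants` is a hypothetical name, the exception type is an assumption, and only the "both set" case is modeled because that is the case described above.

```python
from pathlib import Path
from typing import Any, Optional


def validate_wants(want_data: Optional[Any], want_metadata: Optional[Any]) -> None:
    """Reject 'both set' only when neither side is a Path file spec."""
    if isinstance(want_data, Path) or isinstance(want_metadata, Path):
        # File-tree specs may describe data and metadata together.
        return
    if want_data is not None and want_metadata is not None:
        raise ValueError("Only one of want_data / want_metadata may be set.")
```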

What this doesn't do

  • No changes to example_class — stages that need fully custom rendering (JsonOutputStageExample-style) keep their escape hatch.
  • No changes to render_content: #389 ("Fix render_content blowup when want_metadata is a Path", refs #387) handled the want_metadata: Path rendering bug on its own.
  • No downstream migrations — those land as PRs against the MTD and MEDS-extract repos once this merges.

Test plan

  • 18 new tests in tests/test_filespec_stageexample.py:
    • Path fallback in from_dir + existing MEDSDataset parsing unchanged
    • check_outputs dispatch for Path spec (match / mismatch / missing-file / unknown-suffix)
    • Policy fields (tolerate_unexpected, skip_dirs, skip_files)
    • Class-level comparator registration
    • Per-instance suffix_comparators wins over class default
    • _compare_parquet tolerance bridging
    • MEDSDataset legacy branch un-regressed
  • Full test suite: 177 passed; the 2 failures are pre-existing, parallel-only, and unrelated to this PR.
  • All existing doctests still pass.

Refs #387

Two derived packages (meds-torch-data, MEDS-extract) independently
reinvented the same StageExample subclass — Path-based want_data /
want_metadata plus a per-suffix comparator dispatcher — because the
upstream class assumed outputs were either a MEDSDataset or a single
metadata/codes.parquet. This hoists the shared scaffolding into the
base class so derived packages can collapse to ~20 LOC.

Changes to `StageExample`:

- `want_data: MEDSDataset | Path | None`,
  `want_metadata: pl.DataFrame | Path | None`. A Path is interpreted as
  a `yaml_to_disk` spec describing the expected output tree.
- `from_dir` falls back to storing a Path when `out_data.yaml` /
  `out_metadata.yaml` don't parse as MEDS-shaped, mirroring the existing
  `in_data` graceful-degradation pattern.
- `check_outputs` dispatches on type: MEDSDataset / DataFrame branches
  are unchanged (no regression); Path branches go through the new
  `_check_path_spec`, which materializes the spec via `yaml_disk` and
  compares each expected file against its actual counterpart via a
  per-suffix `ComparatorFn`.
- New dataclass fields: `skip_dirs` (default `{.logs, .hydra}` —
  covers the usual Hydra / logging byproducts), `skip_files`,
  `tolerate_unexpected`, `suffix_comparators` (per-instance override).
- Class-level default registry `SUFFIX_COMPARATORS: ClassVar[dict]`
  seeded with a `.parquet` comparator. Subclasses register additional
  comparators (e.g. `.nrt` for MTD, `.json` / `.yaml` for MEDS-extract)
  via standard class-var override — no process-global state.
- `CompareContext` / `ComparatorFn` give comparators a minimal,
  extensible interface: `(expected_fp, actual_fp, ctx)` where `ctx`
  carries the relative path (for error messages) and a tolerances map
  keyed by suffix. `df_check_kwargs` flows through as
  `ctx.tol_for('.parquet')`, so comparator signatures don't grow when
  new formats want their own tolerance bundles.
- XOR rule (`want_data` XOR `want_metadata`) still enforced when both
  are MEDSDataset/DataFrame, but relaxed when either is a Path —
  extraction stages legitimately describe both data and metadata file
  trees simultaneously.

Tests: 18 new tests in `tests/test_filespec_stageexample.py` cover
Path fallback in `from_dir`, `check_outputs` dispatch, missing-file /
mismatch / unknown-suffix error paths, class-level and per-instance
comparator precedence, tolerance bridging through CompareContext, and
the un-regressed MEDSDataset branch.

No changes to `example_class`; stages that need fully custom rendering
(e.g. `JsonOutputStageExample`) still have that escape hatch.

Refs #387

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 93.90244% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.10%. Comparing base (ce04e87) to head (abb3eb6).
⚠️ Report is 22 commits behind head on dev.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/MEDS_transforms/stages/examples.py | 93.90% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #390      +/-   ##
==========================================
- Coverage   98.23%   98.10%   -0.14%     
==========================================
  Files          54       55       +1     
  Lines        2607     2691      +84     
==========================================
+ Hits         2561     2640      +79     
- Misses         46       51       +5     
