Hoist Path-spec StageExample infrastructure (issue #387, Option B) by mmcdermott · Pull Request #390 · mmcdermott/MEDS_transforms

mmcdermott · 2026-04-22T21:20:45Z

Summary

Closes #387 (Option B). The upstream StageExample assumed outputs were either a MEDSDataset under data/*.parquet or a single metadata/codes.parquet — so two derived packages (meds-torch-data and MEDS-extract) independently reinvented the same ~200-LOC subclass: Path-based want_data / want_metadata plus a per-suffix comparator dispatcher.

This PR hoists that scaffolding into the base class so derived packages collapse to ~20 LOC.

Depends on

#389 — the render_content / want_metadata: Path bug-fix prerequisite. That PR is deliberately minimal (rendering dispatch only); this one adds the type widening + check_outputs dispatch + comparator registry on top.

Changes

Type widening

want_data: MEDSDataset | Path | None
want_metadata: pl.DataFrame | Path | None

A Path is interpreted as a yaml_to_disk spec describing the expected output tree.

`from_dir` graceful fallback

Try MEDSDataset.from_yaml(out_data_fp) / read_metadata_only(out_metadata_fp) first. On ValueError / TypeError, store the Path itself and let check_outputs dispatch to the file-spec path later. Mirrors the existing in_data pattern exactly.
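
As a rough illustration of the fallback described above, the sketch below tries a strict parse first and keeps the Path when parsing fails. `FakeDataset` and `load_want_data` are hypothetical stand-ins, not the real `MEDSDataset` or `from_dir`, and the "MEDS-shaped" test is invented for the demo.

```python
import tempfile
from pathlib import Path
from typing import Optional, Union


class FakeDataset:
    """Hypothetical stand-in for MEDSDataset; not the real class."""

    def __init__(self, text: str):
        self.text = text

    @classmethod
    def from_yaml(cls, fp: Path) -> "FakeDataset":
        text = fp.read_text()
        # Invented rule for the demo: only specs mentioning `data/` parse.
        if "data/" not in text:
            raise ValueError(f"{fp} is not MEDS-shaped")
        return cls(text)


def load_want_data(out_data_fp: Path) -> Optional[Union[FakeDataset, Path]]:
    """Parse as a dataset if possible; otherwise keep the Path as a file spec."""
    if not out_data_fp.is_file():
        return None
    try:
        return FakeDataset.from_yaml(out_data_fp)
    except (ValueError, TypeError):
        # Defer interpretation: a later check step would treat this as a file spec.
        return out_data_fp


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    meds_fp = root / "meds.yaml"
    meds_fp.write_text("data/train/0.parquet: ...\n")
    spec_fp = root / "spec.yaml"
    spec_fp.write_text("tensors/0.nrt: ...\n")

    parsed = load_want_data(meds_fp)      # parses, so a dataset is stored
    fallback = load_want_data(spec_fp)    # parse fails, so the Path is stored
    missing = load_want_data(root / "absent.yaml")  # no file, so None
```

Catching only `ValueError` / `TypeError` keeps unrelated failures (permissions, I/O) loud instead of silently downgrading them to a file spec.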

New `check_outputs` dispatch

MEDSDataset / pl.DataFrame branches are byte-identical to before (no regression for existing stages).
Path branches go through the new _check_path_spec, which:
1. materializes the spec into a tempdir via yaml_to_disk.yaml_disk;
2. walks the expected tree, compares each file to its counterpart under the actual output dir via a per-suffix ComparatorFn;
3. asserts the actual tree has no unexpected files (unless tolerate_unexpected=True), after filtering out skip_dirs / skip_files.
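
The three steps above can be sketched roughly as follows. This is not the PR's `_check_path_spec`: step 1 is replaced by a pre-built expected directory (no `yaml_to_disk` call), `check_tree` and `compare_bytes` are hypothetical names, and skip filtering is left out.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Set

Comparator = Callable[[Path, Path], None]


def compare_bytes(expected_fp: Path, actual_fp: Path) -> None:
    """Toy comparator: exact byte equality."""
    if expected_fp.read_bytes() != actual_fp.read_bytes():
        raise AssertionError(f"{actual_fp.name} differs from expected")


def relative_files(root: Path) -> Set[Path]:
    """All files under root, as paths relative to root."""
    return {p.relative_to(root) for p in root.rglob("*") if p.is_file()}


def check_tree(
    expected_root: Path,
    actual_root: Path,
    comparators: Dict[str, Comparator],
    tolerate_unexpected: bool = False,
) -> None:
    expected = relative_files(expected_root)
    # Step 2: every expected file needs a counterpart and a comparator.
    for rel in sorted(expected):
        actual_fp = actual_root / rel
        if not actual_fp.is_file():
            raise AssertionError(f"missing expected file: {rel}")
        comparator = comparators.get(rel.suffix)
        if comparator is None:
            raise AssertionError(f"no comparator for suffix {rel.suffix!r}")
        comparator(expected_root / rel, actual_fp)
    # Step 3: strict mode rejects files the spec did not mention.
    unexpected = relative_files(actual_root) - expected
    if unexpected and not tolerate_unexpected:
        raise AssertionError(f"unexpected files: {sorted(map(str, unexpected))}")


with tempfile.TemporaryDirectory() as tmp:
    want, got = Path(tmp) / "want", Path(tmp) / "got"
    for root in (want, got):
        (root / "data").mkdir(parents=True)
        (root / "data" / "0.txt").write_text("a,b\n1,2\n")
    comparators = {".txt": compare_bytes}

    check_tree(want, got, comparators)  # identical trees pass silently
    matched = True

    (got / "extra.txt").write_text("stray")
    try:
        check_tree(want, got, comparators)
        strict_error = None
    except AssertionError as err:
        strict_error = str(err)

    check_tree(want, got, comparators, tolerate_unexpected=True)
    tolerated = True
```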

Policy fields (dataclass fields, not ClassVars)

skip_dirs: frozenset[str] — defaults to {.logs, .hydra} (the usual Hydra / logging byproducts both downstream packages reinvented).
skip_files: frozenset[str] — empty default.
tolerate_unexpected: bool — defaults to strict.
suffix_comparators: dict[str, ComparatorFn] | None — per-instance override of the class default.
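
A minimal sketch of how the policy fields might filter the unexpected-file set. `Policy` is a hypothetical stand-in for the relevant slice of `StageExample`, and the matching rules (directory names match any parent component, file names match the final component) are assumptions, not confirmed behavior.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet, List, Set


@dataclass
class Policy:
    """Hypothetical holder for the policy fields listed above."""

    # frozenset is immutable, so it is safe as a plain dataclass default.
    skip_dirs: FrozenSet[str] = frozenset({".logs", ".hydra"})
    skip_files: FrozenSet[str] = frozenset()
    tolerate_unexpected: bool = False

    def is_skipped(self, rel: Path) -> bool:
        # Assumption: skip_dirs matches any parent directory component,
        # skip_files matches the bare file name.
        in_skipped_dir = any(part in self.skip_dirs for part in rel.parts[:-1])
        return in_skipped_dir or rel.name in self.skip_files

    def unexpected(self, actual: Set[Path], expected: Set[Path]) -> List[Path]:
        """Files present in the output but neither expected nor skipped."""
        return sorted(p for p in actual - expected if not self.is_skipped(p))


expected = {Path("data/0.parquet")}
actual = expected | {
    Path(".hydra/config.yaml"),
    Path(".logs/run.log"),
    Path("notes.txt"),
}

# Defaults hide the Hydra / logging byproducts but still flag notes.txt.
default_leftovers = Policy().unexpected(actual, expected)

# A per-example override can whitelist the stray file as well.
custom_leftovers = Policy(skip_files=frozenset({"notes.txt"})).unexpected(actual, expected)
```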

Comparator registry (no process-global state)

Class-level default: SUFFIX_COMPARATORS: ClassVar[dict[str, ComparatorFn]] = {".parquet": _compare_parquet}.
Subclasses register additional comparators via standard class-var override (class MTDStageExample(StageExample): SUFFIX_COMPARATORS = {..., ".nrt": _compare_nrt}).
Per-instance suffix_comparators overrides the class default for one-offs (e.g. relaxed tolerance on a specific example).
No @register_comparator decorator — explicitly avoided per the design-discussion comment.
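
The precedence described above could look roughly like this. The class names, the no-op comparators, and `comparator_for` are hypothetical; the sketch also assumes per-instance entries are layered over the class default rather than replacing it wholesale, which the description does not spell out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, ClassVar, Dict, Optional

Comparator = Callable[[Path, Path], None]


def compare_parquet(expected_fp: Path, actual_fp: Path) -> None:
    """Placeholder for the default .parquet comparator."""


def compare_parquet_relaxed(expected_fp: Path, actual_fp: Path) -> None:
    """Placeholder for a looser one-off .parquet comparator."""


def compare_nrt(expected_fp: Path, actual_fp: Path) -> None:
    """Placeholder for a downstream .nrt comparator."""


@dataclass
class BaseExample:
    # Class-level default registry; ClassVar keeps it out of __init__.
    SUFFIX_COMPARATORS: ClassVar[Dict[str, Comparator]] = {".parquet": compare_parquet}
    # Per-instance override for one-offs.
    suffix_comparators: Optional[Dict[str, Comparator]] = None

    def comparator_for(self, suffix: str) -> Comparator:
        # Assumption: instance entries are merged on top of the class default.
        merged = {**type(self).SUFFIX_COMPARATORS, **(self.suffix_comparators or {})}
        if suffix not in merged:
            raise KeyError(f"no comparator registered for {suffix!r}")
        return merged[suffix]


class NrtExample(BaseExample):
    # Plain class-var override: builds a new dict, the base dict is untouched.
    SUFFIX_COMPARATORS = {**BaseExample.SUFFIX_COMPARATORS, ".nrt": compare_nrt}


base = BaseExample()
downstream = NrtExample()
one_off = NrtExample(suffix_comparators={".parquet": compare_parquet_relaxed})
```

Because the subclass builds its own dict, registering `.nrt` downstream never leaks into the base class or into other subclasses.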

Comparator signature

ComparatorFn = Callable[[Path, Path, CompareContext], None]

CompareContext is a frozen dataclass carrying rel: Path (for error messages) and tolerances: Mapping[str, Any] keyed by suffix. df_check_kwargs is plumbed through as ctx.tol_for('.parquet') — so adding a new format doesn't require signature changes.
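
To make the signature concrete, here is a sketch of a context object and a comparator for an invented `.num` format. Only `rel`, `tolerances`, and `tol_for` come from the description; the empty-bundle default for unknown suffixes, the `.num` format, and the `abs_tol` key are assumptions for illustration.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class CompareContext:
    rel: Path  # relative path, used in error messages
    tolerances: Mapping[str, Any] = field(default_factory=dict)

    def tol_for(self, suffix: str) -> Mapping[str, Any]:
        # Assumption: an unregistered suffix yields an empty tolerance bundle.
        return self.tolerances.get(suffix, {})


ComparatorFn = Callable[[Path, Path, CompareContext], None]


def compare_numbers(expected_fp: Path, actual_fp: Path, ctx: CompareContext) -> None:
    """Toy comparator for a hypothetical `.num` format: one float per file."""
    abs_tol = ctx.tol_for(".num").get("abs_tol", 0.0)
    want, got = float(expected_fp.read_text()), float(actual_fp.read_text())
    if abs(want - got) > abs_tol:
        raise AssertionError(f"{ctx.rel}: expected {want}, got {got} (abs_tol={abs_tol})")


with tempfile.TemporaryDirectory() as tmp:
    want_fp, got_fp = Path(tmp) / "want.num", Path(tmp) / "got.num"
    want_fp.write_text("1.0")
    got_fp.write_text("1.05")
    rel = Path("data/0.num")

    # No tolerance bundle for `.num`: the 0.05 gap is an error.
    try:
        compare_numbers(want_fp, got_fp, CompareContext(rel=rel))
        strict_error = None
    except AssertionError as err:
        strict_error = str(err)

    # Same comparator, same signature, looser tolerance via the context.
    relaxed = CompareContext(rel=rel, tolerances={".num": {"abs_tol": 0.1}})
    compare_numbers(want_fp, got_fp, relaxed)
    relaxed_ok = True
```

Keying tolerances by suffix means a new format only adds a map entry; the three-argument comparator signature stays fixed.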

XOR relaxation

The want_data XOR want_metadata rule still fires when both are MEDSDataset + pl.DataFrame (the original ambiguity). It's relaxed when either is a Path — extraction stages legitimately describe both data and metadata file trees simultaneously.
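
A minimal sketch of the relaxed rule, using a hypothetical `validate_wants` helper and plain sentinel objects in place of a real `MEDSDataset` / `pl.DataFrame`. It covers only the "both set" half of the rule; how the real class treats the case where neither is set is not shown here.

```python
from pathlib import Path
from typing import Any


def validate_wants(want_data: Any, want_metadata: Any) -> None:
    """Reject data + metadata together unless at least one is a Path spec."""
    both_set = want_data is not None and want_metadata is not None
    either_is_path = isinstance(want_data, Path) or isinstance(want_metadata, Path)
    if both_set and not either_is_path:
        raise ValueError(
            "want_data and want_metadata may only be combined when one is a Path spec"
        )


dataset = object()    # stand-in for a parsed MEDSDataset
dataframe = object()  # stand-in for a pl.DataFrame

validate_wants(dataset, None)                       # data only: fine
validate_wants(None, dataframe)                     # metadata only: fine
validate_wants(Path("out_data.yaml"), dataframe)    # one Path: relaxed
validate_wants(Path("out_data.yaml"), Path("out_metadata.yaml"))  # both Paths: relaxed

try:
    validate_wants(dataset, dataframe)              # the original ambiguity
    xor_error = None
except ValueError as err:
    xor_error = str(err)
```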

What this doesn't do

No changes to example_class — stages that need fully custom rendering (JsonOutputStageExample-style) keep their escape hatch.
No changes to render_content: #389 ("Fix render_content blowup when want_metadata is a Path", refs #387) handled the want_metadata: Path rendering bug on its own.
No downstream migrations — those land as PRs against the MTD and MEDS-extract repos once this merges.

Test plan

18 new tests in tests/test_filespec_stageexample.py:
- Path fallback in from_dir + existing MEDSDataset parsing unchanged
- check_outputs dispatch for Path spec (match / mismatch / missing-file / unknown-suffix)
- Policy fields (tolerate_unexpected, skip_dirs, skip_files)
- Class-level comparator registration
- Per-instance suffix_comparators wins over class default
- _compare_parquet tolerance bridging
- MEDSDataset legacy branch un-regressed
Full test suite: 177 passed, 2 pre-existing parallel-only failures unrelated.
All existing doctests still pass.

Refs #387

Two derived packages (meds-torch-data, MEDS-extract) independently reinvented the same StageExample subclass (Path-based want_data / want_metadata plus a per-suffix comparator dispatcher) because the upstream class assumed outputs were either a MEDSDataset or a single metadata/codes.parquet. This hoists the shared scaffolding into the base class so derived packages can collapse to ~20 LOC.

Changes to `StageExample`:
- `want_data: MEDSDataset | Path | None`, `want_metadata: pl.DataFrame | Path | None`. A Path is interpreted as a `yaml_to_disk` spec describing the expected output tree.
- `from_dir` falls back to storing a Path when `out_data.yaml` / `out_metadata.yaml` don't parse as MEDS-shaped, mirroring the existing `in_data` graceful-degradation pattern.
- `check_outputs` dispatches on type: MEDSDataset / DataFrame branches are unchanged (no regression); Path branches go through the new `_check_path_spec`, which materializes the spec via `yaml_disk` and compares each expected file against its actual counterpart via a per-suffix `ComparatorFn`.
- New dataclass fields: `skip_dirs` (default `{.logs, .hydra}`, covering the usual Hydra / logging byproducts), `skip_files`, `tolerate_unexpected`, `suffix_comparators` (per-instance override).
- Class-level default registry `SUFFIX_COMPARATORS: ClassVar[dict]` seeded with a `.parquet` comparator. Subclasses register additional comparators (e.g. `.nrt` for MTD, `.json` / `.yaml` for MEDS-extract) via standard class-var override, with no process-global state.
- `CompareContext` / `ComparatorFn` give comparators a minimal, extensible interface: `(expected_fp, actual_fp, ctx)` where `ctx` carries the relative path (for error messages) and a tolerances map keyed by suffix. `df_check_kwargs` flows through as `ctx.tol_for('.parquet')`, so comparator signatures don't grow when new formats want their own tolerance bundles.
- XOR rule (`want_data` XOR `want_metadata`) still enforced when both are MEDSDataset/DataFrame, but relaxed when either is a Path: extraction stages legitimately describe both data and metadata file trees simultaneously.

Tests: 18 new tests in `tests/test_filespec_stageexample.py` cover Path fallback in `from_dir`, `check_outputs` dispatch, missing-file / mismatch / unknown-suffix error paths, class-level and per-instance comparator precedence, tolerance bridging through CompareContext, and the un-regressed MEDSDataset branch.

No changes to `example_class`; stages that need fully custom rendering (e.g. `JsonOutputStageExample`) still have that escape hatch.

Refs #387

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-04-22T21:22:57Z

Codecov Report

❌ Patch coverage is 93.90244% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.10%. Comparing base (ce04e87) to head (abb3eb6).
⚠️ Report is 22 commits behind head on dev.

Files with missing lines	Patch %	Lines
src/MEDS_transforms/stages/examples.py	93.90%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev     #390      +/-   ##
==========================================
- Coverage   98.23%   98.10%   -0.14%     
==========================================
  Files          54       55       +1     
  Lines        2607     2691      +84     
==========================================
+ Hits         2561     2640      +79     
- Misses         46       51       +5


mmcdermott wants to merge 1 commit into dev from feat/filespec-stageexample
