Hoist Path-spec StageExample infrastructure (issue #387, Option B)#390
Open
mmcdermott wants to merge 1 commit intodevfrom
Open
Hoist Path-spec StageExample infrastructure (issue #387, Option B)#390mmcdermott wants to merge 1 commit intodevfrom
mmcdermott wants to merge 1 commit intodevfrom
Conversation
Two derived packages (meds-torch-data, MEDS-extract) independently
reinvented the same StageExample subclass — Path-based want_data /
want_metadata plus a per-suffix comparator dispatcher — because the
upstream class assumed outputs were either a MEDSDataset or a single
metadata/codes.parquet. This hoists the shared scaffolding into the
base class so derived packages can collapse to ~20 LOC.
Changes to `StageExample`:
- `want_data: MEDSDataset | Path | None`,
`want_metadata: pl.DataFrame | Path | None`. A Path is interpreted as
a `yaml_to_disk` spec describing the expected output tree.
- `from_dir` falls back to storing a Path when `out_data.yaml` /
`out_metadata.yaml` don't parse as MEDS-shaped, mirroring the existing
`in_data` graceful-degradation pattern.
- `check_outputs` dispatches on type: MEDSDataset / DataFrame branches
are unchanged (no regression); Path branches go through the new
`_check_path_spec`, which materializes the spec via `yaml_disk` and
compares each expected file against its actual counterpart via a
per-suffix `ComparatorFn`.
- New dataclass fields: `skip_dirs` (default `{.logs, .hydra}` —
covers the usual Hydra / logging byproducts), `skip_files`,
`tolerate_unexpected`, `suffix_comparators` (per-instance override).
- Class-level default registry `SUFFIX_COMPARATORS: ClassVar[dict]`
seeded with a `.parquet` comparator. Subclasses register additional
comparators (e.g. `.nrt` for MTD, `.json` / `.yaml` for MEDS-extract)
via standard class-var override — no process-global state.
- `CompareContext` / `ComparatorFn` give comparators a minimal,
extensible interface: `(expected_fp, actual_fp, ctx)` where `ctx`
carries the relative path (for error messages) and a tolerances map
keyed by suffix. `df_check_kwargs` flows through as
`ctx.tol_for('.parquet')`, so comparator signatures don't grow when
new formats want their own tolerance bundles.
- XOR rule (`want_data` XOR `want_metadata`) still enforced when both
are MEDSDataset/DataFrame, but relaxed when either is a Path —
extraction stages legitimately describe both data and metadata file
trees simultaneously.
Tests: 18 new tests in `tests/test_filespec_stageexample.py` cover
Path fallback in `from_dir`, `check_outputs` dispatch, missing-file /
mismatch / unknown-suffix error paths, class-level and per-instance
comparator precedence, tolerance bridging through CompareContext, and
the un-regressed MEDSDataset branch.
No changes to `example_class`; stages that need fully custom rendering
(e.g. `JsonOutputStageExample`) still have that escape hatch.
Refs #387
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #390 +/- ##
==========================================
- Coverage 98.23% 98.10% -0.14%
==========================================
Files 54 55 +1
Lines 2607 2691 +84
==========================================
+ Hits 2561 2640 +79
- Misses 46 51 +5 ☔ View full report in Codecov by Sentry. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #387 (Option B). The upstream
StageExampleassumed outputs were either aMEDSDatasetunderdata/*.parquetor a singlemetadata/codes.parquet— so two derived packages (meds-torch-data and MEDS-extract) independently reinvented the same ~200-LOC subclass: Path-basedwant_data/want_metadataplus a per-suffix comparator dispatcher.This PR hoists that scaffolding into the base class so derived packages collapse to ~20 LOC.
Depends on
#389 — the
render_content/want_metadata: Pathbug-fix prerequisite. That PR is deliberately minimal (rendering dispatch only); this one adds the type widening + check_outputs dispatch + comparator registry on top.Changes
Type widening
want_data: MEDSDataset | Path | Nonewant_metadata: pl.DataFrame | Path | NoneA
Pathis interpreted as ayaml_to_diskspec describing the expected output tree.from_dirgraceful fallbackMEDSDataset.from_yaml(out_data_fp)/read_metadata_only(out_metadata_fp)first. OnValueError/TypeError, store thePathitself and letcheck_outputsdispatch to the file-spec path later. Mirrors the existingin_datapattern exactly.New
check_outputsdispatchMEDSDataset/pl.DataFramebranches are byte-identical to before (no regression for existing stages).Pathbranches go through the new_check_path_spec, which:yaml_to_disk.yaml_disk;ComparatorFn;tolerate_unexpected=True), after filtering outskip_dirs/skip_files.Policy fields (dataclass fields, not ClassVars)
skip_dirs: frozenset[str]— defaults to{.logs, .hydra}(the usual Hydra / logging byproducts both downstream packages reinvented).skip_files: frozenset[str]— empty default.tolerate_unexpected: bool— defaults to strict.suffix_comparators: dict[str, ComparatorFn] | None— per-instance override of the class default.Comparator registry (no process-global state)
SUFFIX_COMPARATORS: ClassVar[dict[str, ComparatorFn]] = {".parquet": _compare_parquet}.class MTDStageExample(StageExample): SUFFIX_COMPARATORS = {..., ".nrt": _compare_nrt}).suffix_comparatorsoverrides the class default for one-offs (e.g. relaxed tolerance on a specific example).@register_comparatordecorator — explicitly avoided per the design-discussion comment.Comparator signature
CompareContextis a frozen dataclass carryingrel: Path(for error messages) andtolerances: Mapping[str, Any]keyed by suffix.df_check_kwargsis plumbed through asctx.tol_for('.parquet')— so adding a new format doesn't require signature changes.XOR relaxation
The
want_dataXORwant_metadatarule still fires when both areMEDSDataset+pl.DataFrame(the original ambiguity). It's relaxed when either is aPath— extraction stages legitimately describe both data and metadata file trees simultaneously.What this doesn't do
example_class— stages that need fully custom rendering (JsonOutputStageExample-style) keep their escape hatch.render_content— Fix render_content blowup when want_metadata is a Path (refs #387) #389 handled thewant_metadata: Pathrendering bug on its own.Test plan
tests/test_filespec_stageexample.py:from_dir+ existing MEDSDataset parsing unchangedcheck_outputsdispatch for Path spec (match / mismatch / missing-file / unknown-suffix)tolerate_unexpected,skip_dirs,skip_files)suffix_comparatorswins over class default_compare_parquettolerance bridgingRefs #387