Mechanism: Stages declare flexible_schema input/output contracts (refs #324) #379
mmcdermott wants to merge 3 commits into dev
Adds optional input_schema, output_schema, metadata_input_schema, and metadata_output_schema fields to the Stage class, plumbed through Stage.register. A declared_schemas property exposes the four roles for pipeline-load-time validation (#324) and downstream composer checks (#56). Migrates filter_subjects as a proof of concept: it now declares DataSchema for both input and output. Full migration of the remaining stages is a follow-up; #324 is the mechanism PR.
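The shape described above can be sketched roughly as follows. This is a minimal illustration, not the MEDS-Transforms code: `SketchStage`, `SketchSchema`, and `SketchDataSchema` are hypothetical stand-ins, and using the field names as dict keys is an assumption, since the description does not say which keys `declared_schemas` returns.

```python
from __future__ import annotations

from dataclasses import dataclass


class SketchSchema:
    """Hypothetical stand-in for a flexible_schema schema base class."""


class SketchDataSchema(SketchSchema):
    """Hypothetical stand-in for the schema of MEDS data shards."""


@dataclass
class SketchStage:
    """Hypothetical stand-in for Stage; only the four optional schema fields are modeled."""

    name: str
    input_schema: type[SketchSchema] | None = None
    output_schema: type[SketchSchema] | None = None
    metadata_input_schema: type[SketchSchema] | None = None
    metadata_output_schema: type[SketchSchema] | None = None

    @property
    def declared_schemas(self) -> dict[str, type[SketchSchema] | None]:
        # Expose all four roles in one dict so consumers can iterate over them
        # uniformly. The key names here are an assumption.
        return {
            "input_schema": self.input_schema,
            "output_schema": self.output_schema,
            "metadata_input_schema": self.metadata_input_schema,
            "metadata_output_schema": self.metadata_output_schema,
        }


# Usage: a stage that declares the same schema on both data roles.
stage = SketchStage(
    "filter_subjects",
    input_schema=SketchDataSchema,
    output_schema=SketchDataSchema,
)
```

Roles that a stage leaves undeclared stay `None` in this sketch, which keeps the fields optional in the sense the description gives.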
Codecov Report
✅ All modified and coverable lines are covered by tests.
Coverage diff:
| Metric | dev | #379 | +/- |
|---|---|---|---|
| Coverage | 98.23% | 98.25% | +0.01% |
| Files | 54 | 55 | +1 |
| Lines | 2607 | 2630 | +23 |
| Hits | 2561 | 2584 | +23 |
| Misses | 46 | 46 | |
Pull request overview
This PR adds plumbing for MEDS-Transforms stages to declare optional flexible_schema input/output contracts, and demonstrates the pattern by annotating the filter_subjects stage (per #324).
Changes:
- Extend `Stage` to accept optional `input_schema`, `output_schema`, `metadata_input_schema`, `metadata_output_schema`.
- Add `Stage.declared_schemas` to expose the four schema roles in a consistent dict for downstream consumers.
- Update `filter_subjects` to declare `DataSchema` as both its input and output schema.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/MEDS_transforms/stages/base.py | Adds schema contract fields to Stage plus a declared_schemas accessor (with doctest). |
| src/MEDS_transforms/stages/filter_subjects/filter_subjects.py | Declares DataSchema for filter_subjects via Stage.register(...). |
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.
mmcdermott left a comment
Can this not go straight into dev, but into another branch that is for the broader feature set this is a part of? If all we're adding is this, I'm not sure I see the point.
filter_subjects is a bad POC: input_schema and output_schema are both DataSchema, so the declaration doesn't demonstrate anything beyond the default. Reverted the declaration there and dropped the now-unused DataSchema import.
aggregate_code_metadata is the natural POC: input is DataSchema (MEDS event records), metadata_output is CodeMetadataSchema (codes.parquet). These are genuinely different schemas, so the declaration is meaningful.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
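The contrast drawn in the commit message above can be made concrete with a small sketch. It assumes, as the message implies, that an undeclared role falls back to a default data schema. The stand-in classes, the `ASSUMED_DEFAULT` constant, and the `non_default_roles` helper are all hypothetical and not part of MEDS-Transforms.

```python
class StandInDataSchema:
    """Hypothetical stand-in for DataSchema (MEDS event records)."""


class StandInCodeMetadataSchema:
    """Hypothetical stand-in for CodeMetadataSchema (codes.parquet)."""


# Assumption: a stage that declares nothing is treated as reading and writing this schema.
ASSUMED_DEFAULT = StandInDataSchema


def non_default_roles(declared: dict) -> list:
    """Return the roles whose declared schema is set and differs from the assumed default."""
    return sorted(
        role
        for role, schema in declared.items()
        if schema is not None and schema is not ASSUMED_DEFAULT
    )


# The reverted declaration: both roles restate the assumed default.
filter_subjects_declared = {
    "input_schema": StandInDataSchema,
    "output_schema": StandInDataSchema,
}

# The replacement POC: the metadata output is a genuinely different schema.
aggregate_code_metadata_declared = {
    "input_schema": StandInDataSchema,
    "metadata_output_schema": StandInCodeMetadataSchema,
}

trivial = non_default_roles(filter_subjects_declared)
informative = non_default_roles(aggregate_code_metadata_declared)
```

Under these assumptions, the first declaration yields no non-default roles, while the second yields one, which is the sense in which it is the more useful proof of concept.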
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
# Conflicts:
#	src/MEDS_transforms/stages/aggregate_code_metadata/aggregate_code_metadata.py
#	src/MEDS_transforms/stages/base.py
Summary
Part 1 of #324. Introduces the mechanism by which stages declare their input and output schemas, and demonstrates it on `aggregate_code_metadata`.
- `Stage` now accepts `input_schema`, `output_schema`, `metadata_input_schema`, `metadata_output_schema` (all `flexible_schema.Schema` subclasses, all optional).
- `Stage.register` forwards these kwargs unchanged.
- `Stage.declared_schemas` returns the four roles as a dict for consumers (pipeline-load-time validation, downstream composer checks in #56, "We should support a 'meta-stage' set-up where you can run multiple transformations in a single stage at once, without caching the files in between").
- `aggregate_code_metadata` declares `DataSchema` on `input_schema` and `CodeMetadataSchema` on `metadata_output_schema`: a case where the schemas actually differ, so the declaration carries real information. (An earlier draft annotated `filter_subjects` with `DataSchema` on both sides, but that was just restating the default; reverted and replaced per review.)
Scope
Deliberately minimal. Full stage-by-stage migration is a follow-up so each migration can land with its own targeted review and tests (every stage is doing slightly different shape transforms and the `output_schema` may need to be derived from the stage config for some of them). This PR is the enabling plumbing.
Test plan
- `declared_schemas` doctest covers all four role slots.
- `aggregate_code_metadata`: 7 passed, including the existing example scenarios.
Refs #324
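For orientation, one way a consumer might use such declarations at pipeline load time is sketched below. The PR does not specify how that validation works, so everything here is an assumption: the `find_mismatches` helper, the dict keys, the stand-in schema classes, and the stage names are hypothetical, and treating an undeclared output as "unknown" is just one conservative choice.

```python
from __future__ import annotations


class EventSchema:
    """Hypothetical stand-in for a data schema."""


class OtherSchema:
    """Hypothetical stand-in for a different, incompatible schema."""


def find_mismatches(stages: list[tuple[str, dict[str, type | None]]]) -> list[str]:
    """Walk stages in order and report declared inputs that disagree with the upstream output."""
    problems = []
    current = None  # schema known to flow into the next stage, or None if unknown
    for name, declared in stages:
        expected = declared.get("input_schema")
        # Only compare when both sides are declared; undeclared roles are never flagged.
        if expected is not None and current is not None and expected is not current:
            problems.append(
                f"{name}: expects {expected.__name__}, upstream produces {current.__name__}"
            )
        # Assumption: an undeclared output makes the downstream schema unknown again.
        current = declared.get("output_schema")
    return problems


# Usage with hypothetical stage names.
consistent = [
    ("stage_a", {"output_schema": EventSchema}),
    ("stage_b", {"input_schema": EventSchema}),
]
inconsistent = [
    ("stage_a", {"output_schema": OtherSchema}),
    ("stage_b", {"input_schema": EventSchema}),
]
```

Because undeclared roles are skipped rather than flagged, a check of this kind would remain usable while stages are migrated one at a time, which matches the incremental scope described above.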