
Mechanism: Stages declare flexible_schema input/output contracts (refs #324) #379

Open
mmcdermott wants to merge 3 commits into dev from fix/stage-flexible-schema
Owner

@mmcdermott mmcdermott commented Apr 17, 2026

Summary

Part 1 of #324. Introduces the mechanism by which stages declare their input and output schemas, and demonstrates it on aggregate_code_metadata.

  • Stage now accepts input_schema, output_schema, metadata_input_schema, metadata_output_schema (all flexible_schema.Schema subclasses, all optional).
  • Stage.register forwards these kwargs unchanged.
  • Stage.declared_schemas returns the four roles as a dict for consumers (pipeline-load-time validation, and the downstream composer checks for #56, "We should support a 'meta-stage' set-up where you can run multiple transformations in a single stage at once, without caching the files in between").
  • aggregate_code_metadata declares DataSchema on input_schema and CodeMetadataSchema on metadata_output_schema. The two schemas actually differ, so the declaration carries real information. (An earlier draft annotated filter_subjects with DataSchema on both sides, but that only restated the default; it was reverted and replaced per review.)
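
The shape of the mechanism described above can be sketched roughly as follows. This is a simplified illustration, not the code in src/MEDS_transforms/stages/base.py: StageSketch, the stand-in Schema classes, and the assumption that the dict keys of declared_schemas match the keyword names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


class Schema:
    """Stand-in marker base class (assumption; not flexible_schema.Schema itself)."""


class DataSchema(Schema):
    """Stand-in for the MEDS event-data schema."""


class CodeMetadataSchema(Schema):
    """Stand-in for the MEDS code-metadata schema."""


@dataclass
class StageSketch:
    """Hypothetical, simplified stand-in for Stage with four optional schema slots."""

    name: str
    input_schema: Optional[type[Schema]] = None
    output_schema: Optional[type[Schema]] = None
    metadata_input_schema: Optional[type[Schema]] = None
    metadata_output_schema: Optional[type[Schema]] = None

    @property
    def declared_schemas(self) -> dict[str, Optional[type[Schema]]]:
        # Assumption: keys mirror the keyword names; undeclared roles stay None.
        return {
            "input_schema": self.input_schema,
            "output_schema": self.output_schema,
            "metadata_input_schema": self.metadata_input_schema,
            "metadata_output_schema": self.metadata_output_schema,
        }


# Mirrors the declaration described for aggregate_code_metadata:
# event data in, code metadata out, the other two roles left undeclared.
stage = StageSketch(
    name="aggregate_code_metadata",
    input_schema=DataSchema,
    metadata_output_schema=CodeMetadataSchema,
)
for role, schema in stage.declared_schemas.items():
    print(role, "->", schema.__name__ if schema is not None else None)
```

Keeping all four roles in the returned dict, even when undeclared, would let a consumer tell "not declared" apart from "role does not exist" without extra lookups.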

Scope

Deliberately minimal. Full stage-by-stage migration is a follow-up, so that each migration can land with its own targeted review and tests: every stage performs slightly different shape transforms, and for some of them the output_schema may need to be derived from the stage config. This PR is only the enabling plumbing.
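
As an illustration of what this plumbing could enable later, a pipeline-load-time check might compare the declared schemas of adjacent stages. The sketch below is an assumption about how such a consumer could look, not something this PR implements: find_schema_mismatches, the stage names, the stand-in schema classes, and the identity-based comparison are all hypothetical.

```python
class Schema:
    """Stand-in marker base class (assumption; not flexible_schema.Schema itself)."""


class DataSchema(Schema):
    pass


class CodeMetadataSchema(Schema):
    pass


def find_schema_mismatches(stages):
    """Return (upstream, downstream) name pairs whose declared data schemas disagree.

    `stages` is a list of (name, declared_schemas) pairs in pipeline order.
    Undeclared (None) slots are skipped, since every declaration is optional.
    """
    mismatches = []
    for (up_name, up), (down_name, down) in zip(stages, stages[1:]):
        produced = up.get("output_schema")
        expected = down.get("input_schema")
        if produced is None or expected is None:
            continue  # Nothing declared on one side, so nothing to check.
        if produced is not expected:
            mismatches.append((up_name, down_name))
    return mismatches


pipeline = [
    ("stage_a", {"input_schema": DataSchema, "output_schema": DataSchema}),
    ("stage_b", {"input_schema": CodeMetadataSchema, "output_schema": None}),
    ("stage_c", {"input_schema": DataSchema, "output_schema": DataSchema}),
]
print(find_schema_mismatches(pipeline))  # [('stage_a', 'stage_b')]
```

Comparing schema classes by identity is the simplest possible rule; a real check would probably need a notion of schema compatibility, especially for stages whose output schema depends on their config.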

Test plan

  • Full non-parallel suite: 145 passed.
  • New declared_schemas doctest covers all four role slots.
  • Registered-stage suite for aggregate_code_metadata: 7 passed, including the existing example scenarios.

Refs #324

Adds optional input_schema, output_schema, metadata_input_schema, and
metadata_output_schema fields to the Stage class, plumbed through
Stage.register. A declared_schemas property exposes the four roles for
pipeline-load-time validation (#324) and downstream composer checks
(#56).

Migrates filter_subjects as a proof of concept: it now declares
DataSchema for both input and output. Full migration of the remaining
stages is a follow-up; #324 is the mechanism PR.

codecov Bot commented Apr 17, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.25%. Comparing base (ce04e87) to head (ddd1676).
⚠️ Report is 22 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev     #379      +/-   ##
==========================================
+ Coverage   98.23%   98.25%   +0.01%     
==========================================
  Files          54       55       +1     
  Lines        2607     2630      +23     
==========================================
+ Hits         2561     2584      +23     
  Misses         46       46              


Contributor

Copilot AI left a comment


Pull request overview

This PR adds plumbing for MEDS-Transforms stages to declare optional flexible_schema input/output contracts, and demonstrates the pattern by annotating the filter_subjects stage (per #324).

Changes:

  • Extend Stage to accept optional input_schema, output_schema, metadata_input_schema, metadata_output_schema.
  • Add Stage.declared_schemas to expose the four schema roles in a consistent dict for downstream consumers.
  • Update filter_subjects to declare DataSchema as both its input and output schema.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/MEDS_transforms/stages/base.py | Adds schema contract fields to Stage plus a declared_schemas accessor (with doctest). |
| src/MEDS_transforms/stages/filter_subjects/filter_subjects.py | Declares DataSchema for filter_subjects via Stage.register(...). |

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.


Owner Author

@mmcdermott mmcdermott left a comment


Can this not go straight into dev, but into another branch that is for the broader feature set this is a part of? If all we're adding is this, I'm not sure I see the point.

Comment thread src/MEDS_transforms/stages/filter_subjects/filter_subjects.py Outdated
filter_subjects is a bad POC: input_schema and output_schema are both
DataSchema, so the declaration doesn't demonstrate anything beyond the
default. Reverted the declaration there and dropped the now-unused
DataSchema import.

aggregate_code_metadata is the natural POC: input is DataSchema (MEDS
event records), metadata_output is CodeMetadataSchema (codes.parquet).
These are genuinely different schemas, so the declaration is meaningful.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



# Conflicts:
#	src/MEDS_transforms/stages/aggregate_code_metadata/aggregate_code_metadata.py
#	src/MEDS_transforms/stages/base.py