Handle some cases during infer schema from dataclass #37855
Abacn wants to merge 1 commit into apache:master
Summary of Changes (Gemini Code Assist)
This pull request enhances the robustness and correctness of schema inference within Apache Beam's type hinting system. It refines how schemas are generated for dataclasses by preventing unintended inheritance of schema IDs from base classes, so that each type gets its own schema. It also introduces a defensive check during type inference for generic collections.
Force-pushed from 701c97d to 0129b49.
Force-pushed from 0129b49 to dc89818.
Assigning reviewers: R: @claudevdm for label python.
Force-pushed from bb4cac3 to f54be68.
Force-pushed from 63fce82 to 18199f1.
* For backward compatibility, only infer schema for frozen dataclasses when it's registered with row coder
* Make sure Beam schema ID does not inherit
* Fix IndexOutofBoundError trying to infer type from custom Iterable without type hint
* Fix apache#37862: fixed named tuple and effectively fails dataclass inside union typehint
Force-pushed from 18199f1 to ed6e34b.
def match_is_dataclass(user_type):
  return dataclasses.is_dataclass(user_type) and isinstance(user_type, type)

def match_dataclass_for_row(user_type):
This part fixes item 1: it preserves backward compatibility of the default coder for frozen dataclasses.
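For readers unfamiliar with the distinction, here is a minimal sketch of what "is a dataclass type" and "is a frozen dataclass type" can look like in plain Python. The helper names and the sample classes are hypothetical, and reading `__dataclass_params__` relies on a CPython implementation detail; this is not necessarily how `match_dataclass_for_row` decides.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class FrozenPoint:
  x: int
  y: int


@dataclasses.dataclass
class MutablePoint:
  x: int
  y: int


def is_dataclass_type(user_type):
  # dataclasses.is_dataclass() is also True for instances, so additionally
  # require that the argument is a class.
  return dataclasses.is_dataclass(user_type) and isinstance(user_type, type)


def is_frozen_dataclass_type(user_type):
  # Hypothetical helper: reads the `frozen` flag that @dataclass records on
  # the class (a CPython implementation detail, used here for illustration).
  params = getattr(user_type, '__dataclass_params__', None)
  return is_dataclass_type(user_type) and bool(params and params.frozen)


print(is_dataclass_type(FrozenPoint))           # class: accepted
print(is_dataclass_type(FrozenPoint(1, 2)))     # instance: rejected
print(is_frozen_dataclass_type(FrozenPoint))
print(is_frozen_dataclass_type(MutablePoint))
```

The class-versus-instance check matters because a type hint is always a class, while `dataclasses.is_dataclass` alone would also accept values.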
# Currently registration happens when converting to schema protos, in
# apache_beam.typehints.schemas
self._schema_id = getattr(self._user_type, _BEAM_SCHEMA_ID, None)
if self._schema_id and _BEAM_SCHEMA_ID not in self._user_type.__dict__:
This part (and another check of _BEAM_SCHEMA_ID in schemas.py) fixes item 2: handling inherited dataclasses.
return schema_pb2.FieldType(
    array_type=schema_pb2.ArrayType(element_type=element_type))
arg_types = _get_args(type_)
if len(arg_types) > 0:
This (and the following) fixes item 3: the IndexOutofBoundError.
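A rough sketch of the failure mode, using the standard library's `typing.get_args` as a stand-in for Beam's `_get_args` helper (an assumption; the two may differ). A user class that only defines `__iter__` counts as an Iterable but carries no type arguments, so indexing the first argument unconditionally would fail. The class `Bag` and the helper `element_type_or_any` are hypothetical.

```python
import typing
from collections.abc import Iterable as AbcIterable


class Bag:
  """A user type that is iterable but has no element type hint."""
  def __init__(self, items):
    self._items = list(items)

  def __iter__(self):
    return iter(self._items)


def element_type_or_any(type_):
  # Guard against an empty argument tuple before indexing into it.
  arg_types = typing.get_args(type_)
  if len(arg_types) > 0:
    return arg_types[0]
  return typing.Any


print(issubclass(Bag, AbcIterable))                # recognized as Iterable
print(typing.get_args(Bag))                        # no type arguments
print(element_type_or_any(typing.Iterable[int]))
print(element_type_or_any(Bag))
```

Falling back to `typing.Any` is just one possible choice for the sketch; the point is only that the length check has to come before the index.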
element_types must be a set of schema-aware types whose fields have the
same naming and ordering.
"""
named_fields_and_types = []
This part fixes item 4: the data corruption bug #37862.
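The docstring's requirement (same field names, in the same order) can be sketched with plain named tuples. This is only an illustration of the precondition, with hypothetical classes and a hypothetical helper `same_named_fields`; it is not the implementation in this PR.

```python
import typing


class Red(typing.NamedTuple):
  id: int
  name: str


class Blue(typing.NamedTuple):
  id: int
  name: str


class Green(typing.NamedTuple):
  name: str
  id: int


def same_named_fields(element_types):
  # Compare the ordered (field name, field type) pairs of every member.
  signatures = {
      tuple(typing.get_type_hints(t).items()) for t in element_types
  }
  return len(signatures) == 1


print(same_named_fields([Red, Blue]))   # identical names and ordering
print(same_named_fields([Red, Green]))  # same names, different ordering
```

When the members of a union disagree on this, merging them into one row schema can silently drop or misplace fields, which is why failing loudly is preferable to encoding anyway.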
# Union of dataclasses as type hint currently result in FastPrimitiveCoder
# fails at GBK
@unittest.skip("https://github.com/apache/beam/issues/22085")
The more I dig in, the more gaps I find between type hints and schemas. This skipped test (the dataclass counterpart of the named tuple test above) demonstrates the current failure: CoderRegistry.get_coder does not handle UnionTypeConstraint and always falls back to FastPrimitiveCoder, which cannot encode non-frozen dataclasses. Even if it could, the encoding would not be portable (it is backed by pickle).
I decided to stop here for this Beam release. This PR is sufficient for basic dataclass support in a backward-compatible way, and it fixes two pre-existing bugs that currently also affect named tuples.
LGTM! I would defer the approving review to someone with knowledge of schema IDs for more cases.
Follow-up to #37728, which added support for inferring schemas from dataclasses.
* For backward compatibility, only infer a schema for frozen dataclasses when they are registered with the row coder.
  * Previously a frozen dataclass was accepted by FastPrimitiveCoder (while a non-frozen one raised an error). A coder change isn't upgrade-compatible.
* Make sure the Beam schema ID is not inherited.
  * If a dataclass is inherited, the subclass assumes the Beam schema ID of the base class, which isn't correct.
* Fix IndexOutofBoundError when trying to infer a type from a custom Iterable without a type hint.
  * The change exposed another issue (which previously could also happen for named tuples) for user types that define __iter__ but do not have a type hint for the element type.
* Found a pre-existing bug, "[Bug]: [Python] Union of named fields loss fields after GBK" #37862. This change fixes it for named tuples and makes a dataclass inside a union type hint fail explicitly (instead of silently corrupting data when the bug is triggered).
internal tracker b/492300593