Draft
Conversation
…implemented, and no testing was done
…inst bytes-like patterns
Yehuda-blip
commented
Apr 7, 2023
    == r"(?L:test)"
)


InlineFlag.PATTERN_IS_BYTES_LIKE = False
Author
pre-commit did not like my formatting of these tests and turned them into really line-heavy trees. Originally each call was on its own line.
Yehuda-blip
commented
Apr 7, 2023
def re(pattern: AnyStr, flavor: Optional[Flavor] = None) -> AnyStr:
    # TODO: LRU cache
    if _is_bytes_like(pattern):
        asm.InlineFlag.PATTERN_IS_BYTES_LIKE = True
Author
This might be better placed in some global values-per-compilation dictionary somewhere.
…eature/add_inline_flags # Conflicts: # ke/asm.py
Yehuda-blip
commented
Apr 7, 2023
# reaches the to_regex() method is the first one in the
# sequence (there may be other sequences). This is necessary
# because the parenthesis wrapping and regex legality are
# dependent on the whole flagging expression.
Author
If we drop all the validations, there is always the option to simply compile [ignore_case multiline 'string'] to (?i:(?m:string)) instead of (?im:string). That should make the whole thing a lot simpler, and I very much doubt it has much effect on the performance of the result, if that is even a concern.
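A minimal sketch of that nesting approach, checked against Python's `re`. The helper name `nest_flags` is hypothetical and not part of this PR or of kleenexp; it only illustrates that one scoped group per flag matches the same text as a single combined group.

```python
import re


def nest_flags(flags, body):
    """Hypothetical helper: wrap body in one scoped group per flag,
    with the first flag ending up outermost."""
    for flag in reversed(flags):
        body = f"(?{flag}:{body})"
    return body


nested_src = nest_flags("im", "^foo$")   # one group per flag
combined_src = "(?im:^foo$)"             # single combined group

text = "Foo\nfoo bar\nFOO"
nested_hits = re.findall(nested_src, text)
combined_hits = re.findall(combined_src, text)
```

Both patterns find the same lines in the sample text, so the nested form trades a slightly longer regex for not having to reason about the whole flag sequence at once.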
These 3 flags are incompatible in regex: [ASCII, UNICODE, LOCALE]. This roughly means that the inline flags [a, u, L] cannot be used together at the same nesting level of a flag group. When one of these flags is nested deeper than another, it overrides the outer flag for the substring it affects.
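The behavior described above can be checked directly against Python's `re` module (scoped `a`/`u`/`L` groups need Python 3.7 or later). This is only an illustration of the regex side; the helper `compiles` is introduced here for the example and is not part of the PR.

```python
import re


def compiles(pattern):
    """Return True if Python's re accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Two of the exclusive flags in the same group are rejected.
same_level_ok = compiles(r"(?au:\w+)")

# Nesting them is accepted, and the inner flag wins for its substring.
nested_ok = compiles(r"(?u:(?a:\w+))")

sample = "\u00e9"  # 'é', a word character only under Unicode matching
ascii_inside = re.fullmatch(r"(?u:(?a:\w+))", sample)    # inner ASCII applies
unicode_inside = re.fullmatch(r"(?a:(?u:\w+))", sample)  # inner UNICODE applies
```

Here `ascii_inside` is `None` because the inner ASCII flag overrides the outer UNICODE flag, while `unicode_inside` is a match object.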
There are two problems with this implementation:
1. Python regex allows an empty flag group, (?i:), while this PR does not allow [ignore_case] or [ignore_case ''] (the latter is easy to fix with half a line at compiler.py line 255). However, I don't hate this and think it's probably the better behavior.
2. The bigger problem is that (?u:(?a:somestring)) is allowed in regex, with the ASCII flag overriding the UNICODE flag (the same holds for every nesting of the incompatible flags [ASCII, UNICODE, LOCALE]). Here, [unicode [ascii_only 'somestring']] is not a valid expression, because the nesting depth of an expression is not kept in the parsing flow. Note, however, that this is a kind of generalization of the first problem: the concat operator makes it possible to write [unicode [ascii_only 'somestring']'anything'], so the expression is invalid only when the outer flag is not operating on anything. In the case of [unicode [ascii_only 'somestring']], though, the problem is less obvious when translating from regex to kleenexp.

The options I can think of for this are:
a. Add a nesting depth value to nodes when parsing - I don't want to do this, because I don't understand enough of the parsing and it seems like a major change.
b. Add this to a collection of 'problematic behaviors' somewhere.
c. Remove the unicode and locale_dependent flags completely from kleenexp - kind of an overreaction in my opinion.
d. Ignore, which is what I'm doing now and is probably an inferior solution to b.
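For reference, a rough sketch of what option a could look like if the tree kept its nesting. The `Literal` and `Flagged` node types and the `to_regex` function below are hypothetical and do not reflect kleenexp's actual parser or asm classes; the point is only that when each flagged node validates its own level, [unicode [ascii_only 'somestring']] can be emitted as nested groups.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Inline flags that Python's re refuses to combine in one group.
EXCLUSIVE = {"a", "u", "L"}


@dataclass
class Literal:
    text: str


@dataclass
class Flagged:
    """Hypothetical node: flags set at one nesting level."""
    flags: str
    children: List[Union["Flagged", Literal]]


def to_regex(node):
    if isinstance(node, Literal):
        return re.escape(node.text)
    # Validate only the flags of this level; deeper levels check themselves.
    clash = EXCLUSIVE & set(node.flags)
    if len(clash) > 1:
        raise ValueError(f"incompatible flags at one level: {sorted(clash)}")
    body = "".join(to_regex(child) for child in node.children)
    return f"(?{node.flags}:{body})"


# Roughly [unicode [ascii_only 'somestring']]
tree = Flagged("u", [Flagged("a", [Literal("somestring")])])
pattern = to_regex(tree)
```

With this shape the nested expression compiles to (?u:(?a:somestring)), while putting both flags on a single node is still rejected.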