Skip to content

Feature/add inline flags#34

Draft
Yehuda-blip wants to merge 21 commits intoSonOfLilit:masterfrom
Yehuda-blip:feature/add_inline_flags
Draft

Feature/add inline flags#34
Yehuda-blip wants to merge 21 commits intoSonOfLilit:masterfrom
Yehuda-blip:feature/add_inline_flags

Conversation

@Yehuda-blip
Copy link

@Yehuda-blip Yehuda-blip commented Apr 7, 2023

These 3 flags are incompatible in regex: [ASCII, UNICODE, LOCALE]. This roughly means that the inline flags [a,u,L] cannot be used together in a same-nested-level flagging. When a when one of these flags is nested deeper than another one, it will override the outer flag for the substring it affects.
There are two problems with this implementation:
Python regex allows this behavior - (?i:) - while this PR does not allow [ignore_case] or [ignore_case ''] (the last one is easy to fix with half a line in compiler.py line:255). However, I don't hate this and think it's probably better behavior.
The bigger problem is the fact that (?u:(?a:somestring)) is allowed in regex, with the ASCII overriding the UNICODE flag (same with all examples of the incompatible flags [ASCII, UNICODE, LOCALE] together).
Here, this is not a valid expression - [unicode [ascii_only 'somestring']] - because nesting depth of an expression is not kept in the parsing flow. Note however, this is kind of generalization of the first problem - the concat operator makes it possible to write this [unicode [ascii_only 'somestring']'anything'], so the expression is invalid only if the outer flag is not operating on anything, but in this case - [unicode [ascii_only 'somestring']] - it seems less obvious to realize what's the problem when translating from regex to kleenexp.
The 3 solutions I can think of for this are:
a. Add a nesting depth value to nodes when parsing - I don't want to do, because I don't understand enough of the parsing and this seems a major change.
b. Add this to a collection of 'problematic behaviors' somewhere.
c. Remove the unicode and locale_dependent flags completely from kleenexp - kind of an overreaction in my opinion.
d. Ignore, which is what I'm doing now and is probably an inferior solution to b.

== r"(?L:test)"
)

InlineFlag.PATTERN_IS_BYTES_LIKE = False
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pre-commit did not like my formatting of these tests, and made these really line-heavy trees. Originally each call was in it's own line.

def re(pattern: AnyStr, flavor: Optional[Flavor] = None) -> AnyStr:
# TODO: LRU cache
if _is_bytes_like(pattern):
asm.InlineFlag.PATTERN_IS_BYTES_LIKE = True
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better placed in some global values-per-compilation dictionary somewhere.

# reaches the to_regex() method is the first one in the
# sequence (there may be other sequences). This is necessary
# because the parenthesis wrapping and regex legality are
# dependent on the whole flagging expression.
Copy link
Author

@Yehuda-blip Yehuda-blip Apr 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we drop all the validations, there is always the option to simply compile [ignore_case multiline 'string'] to
(?i:(?m:string)) instead of (?im:string), which should make the whole thing a lot simpler (and I very much doubt has much effect on the performance of the result, if this is even a concern).

@Yehuda-blip Yehuda-blip marked this pull request as draft May 19, 2023 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants