ensure upper() doesn't increase string length #9

lapp0 · 2024-05-23T02:02:08Z

Problem

In master, interegular uses char.upper(), which can convert one char into two, resulting the set of accepts() and strings() being inconsistent.

 >>> 'ß'.upper()
'SS'

accepts() and strings() inconsistency:

   def test_multichar(self):
       fsm = parse_pattern("(?i:ß)").to_fsm()
       strings = set([s[0] for s in fsm.strings()])
       assert fsm.accepts("ß")
       assert not fsm.accepts("SS")
       assert "ß" in strings
>       assert "SS" not in strings
E       AssertionError: assert 'SS' not in {'SS', 'ß'}

This change ensures if a capitalized character isn't of length 1, the original character is used.

This is consistent with the behavior of re:

>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None

rlouf · 2024-06-05T12:01:32Z

Is this good to review?

MegaIng · 2024-06-05T13:09:06Z

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

MegaIng · 2024-06-05T16:18:40Z

Hm, this isn't really enough to match stdlib's re behavior: re.fullmatch(low_s, up_s, re.IGNORECASE), where low_s='ß'; up_s='ẞ', i.e. the uppercase version of ß which does exists does match, meaning I can't get the correct information with just str.upper and str.lower.

I am currently trying to figure out if python exposes case folding information somewhere that I could use.

lapp0 · 2024-06-05T20:33:48Z

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

I can make that change. And yes there is a single character for which this is true for str.lower.

>>> {char: char.upper() for char in iterate_utf8_characters() if len(char) != len(char.upper())}
{'ß': 'SS', 'ŉ': 'ʼN', 'ǰ': 'J̌', 'ΐ': 'Ϊ́', 'ΰ': 'Ϋ́', 'և': 'ԵՒ', 'ẖ': 'H̱', 'ẗ': 'T̈', 'ẘ': 'W̊', 'ẙ': 'Y̊', 'ẚ': 'Aʾ', 'ὐ': 'Υ̓', 'ὒ': 'Υ̓̀', 'ὔ': 'Υ̓́', 'ὖ': 'Υ̓͂', 'ᾀ': 'ἈΙ', 'ᾁ': 'ἉΙ', 'ᾂ': 'ἊΙ', 'ᾃ': 'ἋΙ', 'ᾄ': 'ἌΙ', 'ᾅ': 'ἍΙ', 'ᾆ': 'ἎΙ', 'ᾇ': 'ἏΙ', 'ᾈ': 'ἈΙ', 'ᾉ': 'ἉΙ', 'ᾊ': 'ἊΙ', 'ᾋ': 'ἋΙ', 'ᾌ': 'ἌΙ', 'ᾍ': 'ἍΙ', 'ᾎ': 'ἎΙ', 'ᾏ': 'ἏΙ', 'ᾐ': 'ἨΙ', 'ᾑ': 'ἩΙ', 'ᾒ': 'ἪΙ', 'ᾓ': 'ἫΙ', 'ᾔ': 'ἬΙ', 'ᾕ': 'ἭΙ', 'ᾖ': 'ἮΙ', 'ᾗ': 'ἯΙ', 'ᾘ': 'ἨΙ', 'ᾙ': 'ἩΙ', 'ᾚ': 'ἪΙ', 'ᾛ': 'ἫΙ', 'ᾜ': 'ἬΙ', 'ᾝ': 'ἭΙ', 'ᾞ': 'ἮΙ', 'ᾟ': 'ἯΙ', 'ᾠ': 'ὨΙ', 'ᾡ': 'ὩΙ', 'ᾢ': 'ὪΙ', 'ᾣ': 'ὫΙ', 'ᾤ': 'ὬΙ', 'ᾥ': 'ὭΙ', 'ᾦ': 'ὮΙ', 'ᾧ': 'ὯΙ', 'ᾨ': 'ὨΙ', 'ᾩ': 'ὩΙ', 'ᾪ': 'ὪΙ', 'ᾫ': 'ὫΙ', 'ᾬ': 'ὬΙ', 'ᾭ': 'ὭΙ', 'ᾮ': 'ὮΙ', 'ᾯ': 'ὯΙ', 'ᾲ': 'ᾺΙ', 'ᾳ': 'ΑΙ', 'ᾴ': 'ΆΙ', 'ᾶ': 'Α͂', 'ᾷ': 'Α͂Ι', 'ᾼ': 'ΑΙ', 'ῂ': 'ῊΙ', 'ῃ': 'ΗΙ', 'ῄ': 'ΉΙ', 'ῆ': 'Η͂', 'ῇ': 'Η͂Ι', 'ῌ': 'ΗΙ', 'ῒ': 'Ϊ̀', 'ΐ': 'Ϊ́', 'ῖ': 'Ι͂', 'ῗ': 'Ϊ͂', 'ῢ': 'Ϋ̀', 'ΰ': 'Ϋ́', 'ῤ': 'Ρ̓', 'ῦ': 'Υ͂', 'ῧ': 'Ϋ͂', 'ῲ': 'ῺΙ', 'ῳ': 'ΩΙ', 'ῴ': 'ΏΙ', 'ῶ': 'Ω͂', 'ῷ': 'Ω͂Ι', 'ῼ': 'ΩΙ', 'ﬀ': 'FF', 'ﬁ': 'FI', 'ﬂ': 'FL', 'ﬃ': 'FFI', 'ﬄ': 'FFL', 'ﬅ': 'ST', 'ﬆ': 'ST', 'ﬓ': 'ՄՆ', 'ﬔ': 'ՄԵ', 'ﬕ': 'ՄԻ', 'ﬖ': 'ՎՆ', 'ﬗ': 'ՄԽ'}
>>> {char: char.lower() for char in iterate_utf8_characters() if len(char) != len(char.lower())}
{'İ': 'i̇'}

MegaIng · 2024-06-05T20:37:40Z

Actually, don't worry about it, I will implement this in some other way, thank you for making me aware of this issue, and thank you for providing a fix!

I will properly implement the official unicode CaseFolding.txt information. (specifically the simple combination, i.e. CS so that everything stays the same length). That should take care of everything and fully match stdlibs behavior.

lapp0 · 2024-06-05T20:40:24Z

Thanks so much for ensuring there is a proper fix @MegaIng!!

ensure upper() doesn't increase string length

4a6c8bd

lapp0 mentioned this pull request May 29, 2024

AssertionError in case-insensitive regex containing specific characters (¤ ß İ ŉ ǰ ΐ ΰ) dottxt-ai/outlines#773

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ensure upper() doesn't increase string length #9

ensure upper() doesn't increase string length #9

Uh oh!

lapp0 commented May 23, 2024 •

edited

Loading

Uh oh!

rlouf commented Jun 5, 2024

Uh oh!

MegaIng commented Jun 5, 2024

Uh oh!

MegaIng commented Jun 5, 2024

Uh oh!

lapp0 commented Jun 5, 2024

Uh oh!

MegaIng commented Jun 5, 2024

Uh oh!

lapp0 commented Jun 5, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ensure upper() doesn't increase string length #9

Are you sure you want to change the base?

ensure upper() doesn't increase string length #9

Uh oh!

Conversation

lapp0 commented May 23, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Uh oh!

rlouf commented Jun 5, 2024

Uh oh!

MegaIng commented Jun 5, 2024

Uh oh!

MegaIng commented Jun 5, 2024

Uh oh!

lapp0 commented Jun 5, 2024

Uh oh!

MegaIng commented Jun 5, 2024

Uh oh!

lapp0 commented Jun 5, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lapp0 commented May 23, 2024 •

edited

Loading