Skip to content

Conversation

@lapp0
Copy link

@lapp0 lapp0 commented May 23, 2024

Fixes dottxt-ai/outlines#773

Problem

In master, interegular uses char.upper(), which can convert one char into two, resulting the set of accepts() and strings() being inconsistent.

 >>> 'ß'.upper()
'SS'

accepts() and strings() inconsistency:

   def test_multichar(self):
       fsm = parse_pattern("(?i:ß)").to_fsm()
       strings = set([s[0] for s in fsm.strings()])
       assert fsm.accepts("ß")
       assert not fsm.accepts("SS")
       assert "ß" in strings
>       assert "SS" not in strings
E       AssertionError: assert 'SS' not in {'SS', 'ß'}

This change ensures if a capitalized character isn't of length 1, the original character is used.

This is consistent with the behavior of re:

>>> print(re.match(r"(?i:ß)", "ß"))
<re.Match object; span=(0, 1), match='ß'>
>>> print(re.match(r"(?i:ß)", "SS"))
None

@rlouf
Copy link

rlouf commented Jun 5, 2024

Is this good to review?

@MegaIng
Copy link
Owner

MegaIng commented Jun 5, 2024

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

@MegaIng
Copy link
Owner

MegaIng commented Jun 5, 2024

Hm, this isn't really enough to match stdlib's re behavior: re.fullmatch(low_s, up_s, re.IGNORECASE), where low_s='ß'; up_s='ẞ', i.e. the uppercase version of ß which does exists does match, meaning I can't get the correct information with just str.upper and str.lower.

I am currently trying to figure out if python exposes case folding information somewhere that I could use.

@lapp0
Copy link
Author

lapp0 commented Jun 5, 2024

Looks good, nice catch!

Can you add a link to this PR as a comment in the function? And can similar stuff happen for str.lower? (if not, please note that explictly as a comment next to where _one_char_upper is used)

I can make that change. And yes there is a single character for which this is true for str.lower.

>>> {char: char.upper() for char in iterate_utf8_characters() if len(char) != len(char.upper())}
{'ß': 'SS', 'ʼn': 'ʼN', 'ǰ': 'J̌', 'ΐ': 'Ϊ́', 'ΰ': 'Ϋ́', 'և': 'ԵՒ', 'ẖ': 'H̱', 'ẗ': 'T̈', 'ẘ': 'W̊', 'ẙ': 'Y̊', 'ẚ': 'Aʾ', 'ὐ': 'Υ̓', 'ὒ': 'Υ̓̀', 'ὔ': 'Υ̓́', 'ὖ': 'Υ̓͂', 'ᾀ': 'ἈΙ', 'ᾁ': 'ἉΙ', 'ᾂ': 'ἊΙ', 'ᾃ': 'ἋΙ', 'ᾄ': 'ἌΙ', 'ᾅ': 'ἍΙ', 'ᾆ': 'ἎΙ', 'ᾇ': 'ἏΙ', 'ᾈ': 'ἈΙ', 'ᾉ': 'ἉΙ', 'ᾊ': 'ἊΙ', 'ᾋ': 'ἋΙ', 'ᾌ': 'ἌΙ', 'ᾍ': 'ἍΙ', 'ᾎ': 'ἎΙ', 'ᾏ': 'ἏΙ', 'ᾐ': 'ἨΙ', 'ᾑ': 'ἩΙ', 'ᾒ': 'ἪΙ', 'ᾓ': 'ἫΙ', 'ᾔ': 'ἬΙ', 'ᾕ': 'ἭΙ', 'ᾖ': 'ἮΙ', 'ᾗ': 'ἯΙ', 'ᾘ': 'ἨΙ', 'ᾙ': 'ἩΙ', 'ᾚ': 'ἪΙ', 'ᾛ': 'ἫΙ', 'ᾜ': 'ἬΙ', 'ᾝ': 'ἭΙ', 'ᾞ': 'ἮΙ', 'ᾟ': 'ἯΙ', 'ᾠ': 'ὨΙ', 'ᾡ': 'ὩΙ', 'ᾢ': 'ὪΙ', 'ᾣ': 'ὫΙ', 'ᾤ': 'ὬΙ', 'ᾥ': 'ὭΙ', 'ᾦ': 'ὮΙ', 'ᾧ': 'ὯΙ', 'ᾨ': 'ὨΙ', 'ᾩ': 'ὩΙ', 'ᾪ': 'ὪΙ', 'ᾫ': 'ὫΙ', 'ᾬ': 'ὬΙ', 'ᾭ': 'ὭΙ', 'ᾮ': 'ὮΙ', 'ᾯ': 'ὯΙ', 'ᾲ': 'ᾺΙ', 'ᾳ': 'ΑΙ', 'ᾴ': 'ΆΙ', 'ᾶ': 'Α͂', 'ᾷ': 'Α͂Ι', 'ᾼ': 'ΑΙ', 'ῂ': 'ῊΙ', 'ῃ': 'ΗΙ', 'ῄ': 'ΉΙ', 'ῆ': 'Η͂', 'ῇ': 'Η͂Ι', 'ῌ': 'ΗΙ', 'ῒ': 'Ϊ̀', 'ΐ': 'Ϊ́', 'ῖ': 'Ι͂', 'ῗ': 'Ϊ͂', 'ῢ': 'Ϋ̀', 'ΰ': 'Ϋ́', 'ῤ': 'Ρ̓', 'ῦ': 'Υ͂', 'ῧ': 'Ϋ͂', 'ῲ': 'ῺΙ', 'ῳ': 'ΩΙ', 'ῴ': 'ΏΙ', 'ῶ': 'Ω͂', 'ῷ': 'Ω͂Ι', 'ῼ': 'ΩΙ', 'ff': 'FF', 'fi': 'FI', 'fl': 'FL', 'ffi': 'FFI', 'ffl': 'FFL', 'ſt': 'ST', 'st': 'ST', 'ﬓ': 'ՄՆ', 'ﬔ': 'ՄԵ', 'ﬕ': 'ՄԻ', 'ﬖ': 'ՎՆ', 'ﬗ': 'ՄԽ'}
>>> {char: char.lower() for char in iterate_utf8_characters() if len(char) != len(char.lower())}
{'İ': 'i̇'}

@MegaIng
Copy link
Owner

MegaIng commented Jun 5, 2024

Actually, don't worry about it, I will implement this in some other way, thank you for making me aware of this issue, and thank you for providing a fix!

I will properly implement the official unicode CaseFolding.txt information. (specifically the simple combination, i.e. CS so that everything stays the same length). That should take care of everything and fully match stdlibs behavior.

@lapp0
Copy link
Author

lapp0 commented Jun 5, 2024

Thanks so much for ensuring there is a proper fix @MegaIng!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AssertionError in case-insensitive regex containing specific characters (¤ ß İ ʼn ǰ ΐ ΰ)

3 participants