ICU-3736 UAX44-LM2 loose matching for character names #3932
eggrobin wants to merge 16 commits into unicode-org:main
richgillam
left a comment
Maybe it's just because I'm not awake yet (or maybe I'm not the best choice of reviewer), but I had trouble finding my way through this. A lot of it made sense, but it wasn't clear to me what the actual rules for matching are, and there seemed to be several spots in the code that assumed more knowledge on the part of the reader than I possess. Can you help me understand what's going on here?
        return true;
    } else if (c == '-') {
        if (lastChar == ' ' ||
            (query->is1180 && skeletonIterator == query->skeleton.end() - 2)) {
Some internal documentation in this function would be helpful. Especially here, where I have no idea what query->is1180 is all about.
I will add more comments; in the meantime, see https://www.unicode.org/reports/tr44/#UAX44-LM2.
Wow, that made me dizzy...
Thanks for pointing me to that document. The rules there aren't at all intuitive unless you know the motivation, but I feel more confident that they are indeed what you implemented.
    }
    if (isMedial && i <= query.length() - 2 && uprv_toupper(query[i + 1]) == 'E' &&
        std::string_view(skeletonData, skeletonLimit - skeletonData) == "HANGULJUNGSEONGO") {
        is1180MedialHyphen = true;
Again, I think a comment explaining this part would be helpful...
        if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
            continue;
        }
    }
Why are you taking this out?
Unicode 1 names were removed from ICU in ICU 49.
        if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
            continue;
        }
    }
Again, why is this coming out? Is this code no longer relevant, or did it move somewhere else?
    {uR"([\N{Hangul jungseong O-E}])", u"[ᆀ]"},
    {uR"([\N{Hangul jungseong O -E}])", u"[ᆀ]"},
Can there be whitespace on either side of the hyphen (or both), or only before it?
Whitespace on either side makes it non-medial.
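For readers who, like the reviewer, find the UAX44-LM2 rule hard to picture, here is a minimal standalone sketch of it. This is not the code in this PR and not ICU API: `looseNameSkeleton` is a hypothetical helper, it handles ASCII input only, and it assumes that a hyphen is medial exactly when an ASCII letter or digit sits immediately on both sides of it. It drops case, whitespace, underscores, and medial hyphens, keeps non-medial hyphens, and restores the one medial hyphen that must stay significant (U+1180 HANGUL JUNGSEONG O-E, which would otherwise collide with U+116C HANGUL JUNGSEONG OE).

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not part of ICU): reduces a character name to a
// loose-matching skeleton in the spirit of UAX44-LM2.
std::string looseNameSkeleton(std::string_view name) {
    // Skeleton of U+1180's name up to, but not including, its hyphen.
    constexpr std::string_view kPrefix1180 = "HANGULJUNGSEONGO";
    auto isAlnum = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    auto isIgnorable = [](char c) {
        return c == '_' || std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    std::string skeleton;
    bool medialHyphenAfterPrefix = false;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (isIgnorable(c)) {
            continue;  // Whitespace and underscores never matter.
        }
        if (c == '-') {
            // Assumption: medial means a letter or digit directly on each side.
            const bool medial = i > 0 && i + 1 < name.size() &&
                                isAlnum(name[i - 1]) && isAlnum(name[i + 1]);
            if (medial) {
                // Remember a medial hyphen sitting where U+1180 has one.
                if (std::string_view(skeleton) == kPrefix1180) {
                    medialHyphenAfterPrefix = true;
                }
                continue;  // Medial hyphens are dropped by default.
            }
        }
        skeleton.push_back(
            static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
    }
    // Exception: the medial hyphen of HANGUL JUNGSEONG O-E is significant,
    // but only when the name ends right after the E (O-EO is unaffected).
    if (medialHyphenAfterPrefix &&
        skeleton == std::string(kPrefix1180) + "E") {
        skeleton.insert(kPrefix1180.size(), 1, '-');
    }
    return skeleton;
}
```

Under these assumptions, "Hangul jungseong O-E" and "Hangul jungseong O -E" produce the same skeleton (the second hyphen is non-medial and is kept as written), while "Hangul jungseong OE" produces a different one. Two names match loosely when their skeletons are equal.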
I added some comments; they are pretty long because while writing them I noticed some subtle edge cases (which, as best I can tell, were handled correctly, but are noteworthy)…