ICU-3736 UAX44-LM2 loose matching for character names#3932

Open
eggrobin wants to merge 16 commits into unicode-org:main from eggrobin:𒈬𒁲𒁮

Member

@eggrobin eggrobin commented Apr 8, 2026

Checklist

  • Required: Issue filed: ICU-3736
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable
  • Approver: Feel free to merge on my behalf

@markusicu markusicu self-assigned this Apr 9, 2026
@markusicu markusicu requested a review from richgillam April 9, 2026 16:32
Contributor

@richgillam richgillam left a comment

Maybe it's just because I'm not awake yet (or maybe I'm not the best choice of reviewer), but I had trouble finding my way through this. A lot of it made sense, but it wasn't clear to me what the actual rules for matching are, and there seemed to be several spots in the code that assumed more knowledge on the part of the reader than I possess. Can you help me understand what's going on here?

return true;
} else if (c == '-') {
if (lastChar == ' ' ||
(query->is1180 && skeletonIterator == query->skeleton.end() - 2)) {
Contributor

Some internal documentation in this function would be helpful. Especially here, where I have no idea what query->is1180 is all about.

Member Author

I will add more comments; in the meantime, see https://www.unicode.org/reports/tr44/#UAX44-LM2.

Contributor

Wow, that made me dizzy...

Thanks for pointing me to that document. The rules there aren't at all intuitive unless you know the motivation, but I feel more confident that they are indeed what you implemented.

}
if (isMedial && i <= query.length() - 2 && uprv_toupper(query[i + 1]) == 'E' &&
std::string_view(skeletonData, skeletonLimit - skeletonData) == "HANGULJUNGSEONGO") {
is1180MedialHyphen = true;
Contributor

Again, I think a comment explaining this part would be helpful...

if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
continue;
}
}
Contributor

Why are you taking this out?

Member Author

Unicode 1 names were removed from ICU in ICU 49.

if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
continue;
}
}
Contributor

Again, why is this coming out? Is this code no longer relevant, or did it move somewhere else?

{uR"([\N{Hangul jungseong O-E}])", u"[ᆀ]"},
{uR"([\N{Hangul jungseong O -E}])", u"[ᆀ]"},
Contributor

Can there be whitespace on either side of the hyphen (or both), or only before it?

Member Author

Whitespace on either side makes it non-medial.
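
For readers following the thread, here is a minimal sketch of the UAX44-LM2 rule being discussed: ignore case, whitespace, underscores, and medial hyphens, except the hyphen in U+1180 HANGUL JUNGSEONG O-E. It is not the code in this PR. The helper names `looseNameSkeleton` and `looseNameMatch` are hypothetical, the sketch assumes ASCII-only names, and it treats a hyphen as medial only when it sits directly between two letters or digits in the input. It applies the same normalization to both sides, which is a simplification and does not try to cover the subtler edge cases mentioned later in this conversation.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not an ICU API): reduces a character name to a
// skeleton under a simplified reading of UAX44-LM2.
std::string looseNameSkeleton(std::string_view name) {
    constexpr std::string_view kJungseongO = "HANGULJUNGSEONGO";
    std::string skeleton;
    bool droppedHyphenAfterJungseongO = false;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        if (std::isspace(c) || c == '_') {
            continue;  // Whitespace and underscores are always ignored.
        }
        if (c == '-') {
            // Medial: directly between two letters/digits, no whitespace on either side.
            const bool medial =
                i > 0 && i + 1 < name.size() &&
                std::isalnum(static_cast<unsigned char>(name[i - 1])) &&
                std::isalnum(static_cast<unsigned char>(name[i + 1]));
            if (medial) {
                // Remember a hyphen dropped right after "HANGUL JUNGSEONG O";
                // it may turn out to be the significant hyphen of U+1180.
                if (std::string_view(skeleton) == kJungseongO) {
                    droppedHyphenAfterJungseongO = true;
                }
                continue;
            }
            skeleton.push_back('-');  // Non-medial hyphens are significant.
            continue;
        }
        skeleton.push_back(static_cast<char>(std::toupper(c)));
    }
    // U+1180 exception: restore the hyphen only if the name ends in exactly "O-E",
    // so that names such as "... O-EO" still lose their hyphen.
    if (droppedHyphenAfterJungseongO && skeleton == "HANGULJUNGSEONGOE") {
        skeleton.insert(kJungseongO.size(), 1, '-');
    }
    return skeleton;
}

// Hypothetical helper: two names match loosely if their skeletons are equal.
bool looseNameMatch(std::string_view a, std::string_view b) {
    return looseNameSkeleton(a) == looseNameSkeleton(b);
}
```

Under these assumptions, "Hangul jungseong O-E" and "Hangul jungseong O -E" reduce to the same skeleton, which stays distinct from that of "HANGUL JUNGSEONG OE", and "TIBETAN LETTER -A" does not match "TIBETAN LETTER A" because its hyphen follows a space and is therefore not medial.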

richgillam previously approved these changes Apr 9, 2026

Contributor

@richgillam richgillam left a comment

LOKTM

Member Author

eggrobin commented Apr 10, 2026

I added some comments; they are pretty long because while writing them I noticed some subtle edge cases (which, as best I can tell, were handled correctly, but are noteworthy)…

@eggrobin eggrobin requested a review from richgillam April 13, 2026 14:34
@eggrobin eggrobin requested a review from markusicu April 13, 2026 14:43