ICU-3736 UAX44-LM2 loose matching for character names by eggrobin · Pull Request #3932 · unicode-org/icu

eggrobin · 2026-04-08T11:47:59Z

Checklist

Required: Issue filed: ICU-3736
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-NNNNN Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable
Approver: Feel free to merge on my behalf

richgillam

Maybe it's just because I'm not awake yet (or maybe I'm not the best choice of reviewer), but I had trouble finding my way through this. A lot of it made sense, but it wasn't clear to me what the actual rules for matching are, and there seemed to be several spots in the code that assumed more knowledge on the part of the reader than I possess. Can you help me understand what's going on here?

richgillam · 2026-04-09T18:29:00Z

icu4c/source/common/unames.cpp

+                return true;
+            } else if (c == '-') {
+                if (lastChar == ' ' ||
+                    (query->is1180 && skeletonIterator == query->skeleton.end() - 2)) {


Some internal documentation in this function would be helpful. Especially here, where I have no idea what query->is1180 is all about.

I will add more comments; in the meantime, see https://www.unicode.org/reports/tr44/#UAX44-LM2.

Wow, that made me dizzy...

Thanks for pointing me to that document. The rules there aren't at all intuitive unless you know the motivation, but I feel more confident that they are indeed what you implemented.
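For readers who have not gone through UAX #44 yet: UAX44-LM2 says to ignore case, whitespace, underscores, and all medial hyphens except the one in U+1180 HANGUL JUNGSEONG O-E. The following is a minimal sketch of that rule as a skeleton function, not the code in this PR. `looseSkeleton` is a hypothetical name, and "medial" is assumed here to mean a hyphen whose immediate neighbours are both letters or digits.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, not ICU API: two names are treated as a loose match
// when their skeletons compare equal.
std::string looseSkeleton(std::string_view name) {
    const auto isAlnum = [](char ch) {
        return std::isalnum(static_cast<unsigned char>(ch)) != 0;
    };
    std::string skeleton;
    for (std::size_t i = 0; i < name.size(); ++i) {
        const char c = name[i];
        if (c == '_' || std::isspace(static_cast<unsigned char>(c)) != 0) {
            continue;  // Whitespace and underscores are ignored.
        }
        if (c == '-') {
            // Assumption: medial means a letter or digit on both sides.
            const bool medial = i > 0 && i + 1 < name.size() &&
                                isAlnum(name[i - 1]) && isAlnum(name[i + 1]);
            // The single exception: keep the hyphen of HANGUL JUNGSEONG O-E
            // so that it stays distinct from U+116C HANGUL JUNGSEONG OE.
            // Simplification: the E must be the last character of the input.
            const bool is1180Hyphen = medial && i + 2 == name.size() &&
                                      (name[i + 1] == 'E' || name[i + 1] == 'e') &&
                                      skeleton == "HANGULJUNGSEONGO";
            if (medial && !is1180Hyphen) {
                continue;  // Ordinary medial hyphens are ignored.
            }
        }
        // Case is ignored by folding everything to upper case.
        skeleton.push_back(
            static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
    }
    return skeleton;
}
```

Under these assumptions, `looseSkeleton("Hangul jungseong O-E")` yields `HANGULJUNGSEONGO-E` while `looseSkeleton("HANGUL JUNGSEONG OE")` yields `HANGULJUNGSEONGOE`, so the two Hangul names do not collide. `tibetan mark tsa-phru` does not match TIBETAN MARK TSA -PHRU, because the hyphen in the real name is not medial and is therefore kept.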

richgillam · 2026-04-09T18:31:32Z

icu4c/source/common/unames.cpp

+                }
+                if (isMedial && i <= query.length() - 2 && uprv_toupper(query[i + 1]) == 'E' &&
+                    std::string_view(skeletonData, skeletonLimit - skeletonData) == "HANGULJUNGSEONGO") {
+                    is1180MedialHyphen = true;


Again, I think a comment explaining this part would be helpful...

richgillam · 2026-04-09T18:32:41Z

icu4c/source/common/unames.cpp

-                        if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
-                            continue;
-                        }
-                    }


Why are you taking this out?

Unicode 1 names were removed from ICU in ICU 49.

richgillam · 2026-04-09T18:33:34Z

icu4c/source/common/unames.cpp

-                        if (static_cast<uint8_t>(';') >= tokenCount || tokens[static_cast<uint8_t>(';')] == static_cast<uint16_t>(-1)) {
-                            continue;
-                        }
-                    }


Again, why is this coming out? Is this code no longer relevant, or did it move somewhere else?

richgillam · 2026-04-09T18:36:39Z

icu4c/source/test/intltest/usettest.cpp

            {uR"([\N{Hangul jungseong O-E}])", u"[ᆀ]"},
+            {uR"([\N{Hangul jungseong O -E}])", u"[ᆀ]"},


Can there be whitespace on either side of the hyphen (or both), or only before it?

Whitespace on either side makes it non-medial.
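To make the answer concrete, here is a small sketch of a medial-hyphen check. `isMedialHyphen` is a hypothetical helper, not a function from this PR, and it assumes that a hyphen is medial only when the characters immediately before and after it are letters or digits.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical helper, not ICU API. Returns true when name[i] is a hyphen
// with a letter or digit directly on both sides (an assumption about what
// "medial" means, based on UAX44-LM2).
bool isMedialHyphen(std::string_view name, std::size_t i) {
    if (i >= name.size() || name[i] != '-') {
        return false;  // Not a hyphen at all.
    }
    if (i == 0 || i + 1 == name.size()) {
        return false;  // Leading and trailing hyphens are never medial.
    }
    return std::isalnum(static_cast<unsigned char>(name[i - 1])) != 0 &&
           std::isalnum(static_cast<unsigned char>(name[i + 1])) != 0;
}
```

With this check, the hyphen in `O-E` is medial, while the hyphens in `O -E`, `O- E`, and `O - E` are not. A non-medial hyphen is not ignored, so a query such as `Hangul jungseong O -E` can only match a name that itself retains a hyphen in that position, which is the case for U+1180.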

richgillam

LOKTM

eggrobin · 2026-04-10T16:55:16Z

I added some comments; they are pretty long because while writing them I noticed some subtle edge cases (which, as best I can tell, were handled correctly, but are noteworthy)…

eggrobin added 12 commits April 7, 2026 11:23

Things I forgot to commit

5be74d6

Seems to work?

a444a50

Don’t munge

70edc6e

Add a matcher class instead of the brittle reset

0d9461c

comment

d0b4d4d

failing tests for the algorithmic names

39a8d73

off by one

644d2fb

tests pass

5e1e48b

Test trailing hyphens

75294e1

Some warnings

bb32140

postincrement

c10a5e2

Remove the expandName counterpart of the removed compareName Unicode 1.0 stuff

def7986

markusicu self-assigned this Apr 9, 2026

markusicu requested a review from richgillam April 9, 2026 16:32

richgillam reviewed Apr 9, 2026

richgillam previously approved these changes Apr 9, 2026

comments

e82068f

eggrobin dismissed richgillam’s stale review via e82068f April 10, 2026 16:54

tyop

8e16f59

eggrobin requested a review from richgillam April 13, 2026 14:34

ungrammatical

80f813a

eggrobin requested a review from markusicu April 13, 2026 14:43

Remove dead code, more comments

d5cd008
