Regex cleanup #1299

Open
jowilco wants to merge 22 commits into unicode-org:main from jowilco:Regex_cleanup

@jowilco
Contributor

@jowilco jowilco commented Feb 4, 2026

Currently, CheckProperties validates IndexPropertyRegexes during the PropertyParsingInfo process. However, this implementation is permissive; the build continues even if regexes are missing or if property values fail to match their defined patterns. This allows potential data integrity issues to go undetected during the build cycle.

To ensure more rigorous enforcement, we are transitioning away from CheckProperties as a build step in favor of one or more dedicated, independent tests.

  • TestIndexPropertyRegex: A new, standalone test designed specifically to validate regular expressions against the latest version of the Unicode Character Database (UCD).
  • Versioned error storage: Previously, data loading errors across all UCD versions were aggregated into a single list, so regex failures could not be reported for just one UCD version. As part of this update, these errors are now organized into a Map keyed by UCD version, allowing for more granular debugging and reporting.
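
To illustrate the versioned error storage described above, here is a minimal sketch of errors grouped in a Map keyed by UCD version. The class name `VersionedErrors`, its methods, the version strings, and the error messages are all hypothetical; this is not the code in this PR, only one way such a structure could look.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: data-loading errors grouped by UCD version. */
final class VersionedErrors {
    // Key: a UCD version string (assumed format, e.g. "17.0.0").
    // Value: the errors recorded while loading that version.
    private final Map<String, List<String>> errorsByVersion = new HashMap<>();

    /** Records one error message under the given UCD version. */
    void add(String ucdVersion, String message) {
        errorsByVersion.computeIfAbsent(ucdVersion, v -> new ArrayList<>()).add(message);
    }

    /** Returns the errors recorded for one version, or an empty list if there were none. */
    List<String> forVersion(String ucdVersion) {
        return Collections.unmodifiableList(
                errorsByVersion.getOrDefault(ucdVersion, Collections.emptyList()));
    }

    public static void main(String[] args) {
        VersionedErrors errors = new VersionedErrors();
        errors.add("16.0.0", "kExample: value does not match regex");
        errors.add("17.0.0", "kOther: no regex defined");
        // A test for the latest version can ignore failures from older versions.
        System.out.println(errors.forVersion("17.0.0"));
    }
}
```

With a single aggregated list, a test for the latest UCD would also see failures from older versions; keying by version lets the test ask only for the version it cares about.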

@jowilco jowilco requested a review from eggrobin February 4, 2026 23:36
eggrobin previously approved these changes Feb 5, 2026
@eggrobin
Member

Since you wrote

I think that the code review is probably ready for a real review!

I’ll go ahead and undraft this, lest I forget about it…

@eggrobin eggrobin marked this pull request as ready for review February 11, 2026 21:57
@jowilco jowilco self-assigned this Feb 11, 2026
@markusicu markusicu requested a review from eggrobin February 11, 2026 22:42
null,
ValueCardinality.Unordered,
"Names_List_Cross_Ref"),
PropertyType.String, DerivedPropertyStatus.UCDNonProperty, "Names_List_Cross_Ref"),
Member

The names list cross refs are definitely multi-valued; it looks like this got broken by the IndexPropertyRegex.txt change.

Contributor Author

Sorry - my mistake. Fixed.

PropertyType.String,
DerivedPropertyStatus.UCDNonProperty,
null,
ValueCardinality.Unordered,
Member

This is weird. Do_Not_Emit_Dispreferred is multivalued, but the mapping in the other direction should be single-valued (in fact we have some deprecated characters that are not in DoNotEmit.txt precisely because there would be an ambiguity as to which one is preferred).

PropertyType.String,
DerivedPropertyStatus.Provisional,
null,
ValueCardinality.Unordered,
Member

That seems wrong: UAX60 states

Each Seal character is associated with a single CJK Unified ideograph which may be used to refer to the Seal character.

(emphasis mine.)

Contributor Author

When I was adding these values, the changes to UAX60 hadn't been made yet, so I was using the examples in https://www.unicode.org/L2/L2025/25111-converging-small-seal.pdf, which showed a multi-valued kSEAL_MCJK.

I updated IndexPropertyRegex with the syntax from UAX60.

I did notice two issues with UAX60 though:

  1. kSEAL_Rad is multi-valued
  2. kSEAL_THXSrc is missing the terminating )

null,
ValueCardinality.Unordered,
"Names_List_Alias"),
PropertyType.Miscellaneous, DerivedPropertyStatus.UCDNonProperty, "Names_List_Alias"),
Member

This is also actually multi-valued. Same for the other names list properties below.

DerivedPropertyStatus.Provisional,
null,
ValueCardinality.Unordered,
"cjkGB5"),
Member

From the description in UAX38 this looks single-valued.

Contributor Author

Member

It looks like nearly all changes to this file are spurious; please adjust IndexPropertyRegex.txt accordingly, and if any actually need to change, let’s discuss those in detail.

Contributor Author

Note that I believe most of the changes are actually correct, or at least are now in line with what the corresponding UAX indicates. Take the first change as an example.

Current IndexPropertyRegex:
kMatthews ; SINGLE_VALUED ; [1-9][0-9]{0,3}(a|\.5)?

UAX38:
Delimiter = space

So I changed it to match:
kMatthews ; MULTI_VALUED ; [1-9]\d{0,3}(a|\.5)?
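
To make the practical difference between the two declarations concrete, here is a minimal sketch of how a checker might treat them. The class and method names are hypothetical, and the sample field values used with it are made up for illustration rather than taken from Unihan data; only the regex itself comes from the lines above. Under a single-valued reading the whole field must match; under a multi-valued reading the field is split on the delimiter and each piece must match.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of single-valued vs. multi-valued regex checking. */
final class MultiValuedRegexCheck {
    // Regex taken from the kMatthews line above.
    static final Pattern K_MATTHEWS = Pattern.compile("[1-9]\\d{0,3}(a|\\.5)?");

    /** SINGLE_VALUED reading: the whole field must match the pattern. */
    static boolean matchesSingle(Pattern pattern, String field) {
        return pattern.matcher(field).matches();
    }

    /** MULTI_VALUED reading: split on the delimiter; every piece must match. */
    static boolean matchesMulti(Pattern pattern, String field, String delimiter) {
        // Limit -1 keeps trailing empty pieces, so a dangling delimiter is rejected.
        for (String value : field.split(Pattern.quote(delimiter), -1)) {
            if (!pattern.matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }
}
```

A field with no delimiter in it passes both checks, so declaring a property MULTI_VALUED is strictly more permissive than declaring it SINGLE_VALUED with the same regex.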

Member

Hmm. It looks like UAX38 is very eager to declare a delimiter even when it doesn’t actually use it though.

The description of kMatthews is

The index of this ideograph in Chinese-English Dictionary by Robert H. Mathews, Cambridge: Harvard University Press, 1975.

which sounds single-valued. See also kKoreanName:

The year that corresponds to the 인명용 한자 (人名用漢字) list in which the ideograph first appears, regardless of its readings.

Which sounds decidedly unique, so there isn’t anything to delimit with a space…

I suppose following the Delimiter field makes sense though. If a property that has a declared delimiter turns out to never actually be multi-valued in the data, I could do some special handling elsewhere to treat it as single-valued (this is important for performance, inter alia).
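
A minimal sketch of what that special handling could look like, assuming the property's values are available as a map from code point to raw field text. The class name, method name, and map shape are assumptions for illustration, not the unicodetools API.

```java
import java.util.Map;

/** Hypothetical sketch: does a declared delimiter ever occur in the data? */
final class ObservedCardinality {
    /** Returns true if at least one value actually contains the declared delimiter. */
    static boolean isActuallyMultiValued(Map<Integer, String> valuesByCodePoint, String delimiter) {
        for (String value : valuesByCodePoint.values()) {
            if (value.contains(delimiter)) {
                return true;
            }
        }
        return false;
    }
}
```

If this returns false for a property declared MULTI_VALUED, the property could be treated as single-valued downstream while the regex file still follows the Delimiter field.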

Contributor Author

Or we could "fix" UAX38...

Member

@markusicu markusicu left a comment

The changes look ok, but there is no context.

Please write a PR description for what you are trying to achieve.
Is there an issue to link to?
