Conversation
…those back and adding the missing regexes
…are Miscellaneous
…dd them to the build yaml
unicodetools/src/test/java/org/unicode/propstest/TestIndexPropertyRegex.java
Show resolved
Hide resolved
…Relation, so reverting back to something closer to the original
|
Since you wrote
I’ll go ahead and undraft this, lest I forget about it… |
| null, | ||
| ValueCardinality.Unordered, | ||
| "Names_List_Cross_Ref"), | ||
| PropertyType.String, DerivedPropertyStatus.UCDNonProperty, "Names_List_Cross_Ref"), |
There was a problem hiding this comment.
The names list cross refs are definitely multi-valued; it looks like this got broken by the IndexPropertyRegex.txt change.
There was a problem hiding this comment.
Sorry - my mistake. Fixed.
| PropertyType.String, | ||
| DerivedPropertyStatus.UCDNonProperty, | ||
| null, | ||
| ValueCardinality.Unordered, |
There was a problem hiding this comment.
This is weird. Do_Not_Emit_Dispreferred is multivalued, but the mapping in the other direction should be single-valued (in fact we have some deprecated characters that are not in DoNotEmit.txt precisely because there would be an ambiguity as to which one is preferred).
| PropertyType.String, | ||
| DerivedPropertyStatus.Provisional, | ||
| null, | ||
| ValueCardinality.Unordered, |
There was a problem hiding this comment.
That seems wrong: UAX60 states
Each Seal character is associated with a single CJK Unified ideograph which may be used to refer to the Seal character.
(emphasis mine.)
There was a problem hiding this comment.
When I was adding these values, the changes to UAX60 hadn't been made yet, so I was using the examples in https://www.unicode.org/L2/L2025/25111-converging-small-seal.pdf which showed a multi-valued kSEAL_MCJK
I updated IndexPropertyRegex with the syntax from UAX60
I did notice two issues with UAX60 though:
- kSEAL_Rad is multi-valued
- kSEAL_THXSrc is missing the terminating )
| null, | ||
| ValueCardinality.Unordered, | ||
| "Names_List_Alias"), | ||
| PropertyType.Miscellaneous, DerivedPropertyStatus.UCDNonProperty, "Names_List_Alias"), |
There was a problem hiding this comment.
This is also actually multi-valued. Same for the other names list properties below.
| DerivedPropertyStatus.Provisional, | ||
| null, | ||
| ValueCardinality.Unordered, | ||
| "cjkGB5"), |
There was a problem hiding this comment.
From the description in UAX38 this looks single-valued.
There was a problem hiding this comment.
See the change in https://www.unicode.org/reports/tr38/proposed.html#kGB5
There was a problem hiding this comment.
It looks like nearly all changes to this file are spurious; please adjust IndexPropertyRegex.txt accordingly, and if any actually need to change, let’s discuss those in detail.
There was a problem hiding this comment.
Note that I believe that most of the changes are actually correct, or at least are now in line with what the corresponding UAX indicates. Take the first change as an example.
Current IndexPropertyRegex
kMatthews ; SINGLE_VALUED ; [1-9][0-9]{0,3}(a|\.5)?
UAX44
Delimiter = space
So I changed it to match:
kMatthews ; MULTI_VALUED ; [1-9]\d{0,3}(a|\.5)?
There was a problem hiding this comment.
Hmm. It looks like UAX38 is very eager to declare a delimiter even when it doesn’t actually use it though.
The description of kMatthews is
The index of this ideograph in Chinese-English Dictionary by Robert H. Mathews, Cambridge: Harvard University Press, 1975.
which sounds single-valued. See also kKoreanName:
The year that corresponds to the 인명용 한자 (人名用漢字) list in which the ideograph first appears, regardless of its readings.
Which sounds decidedly unique, so there isn’t anything to delimit with a space…
I suppose following the Delimiter field makes sense though. If a property that has a declared delimiter turns out to never actually be multi-valued in the data, I could do some special handling elsewhere to treat it as single-valued (this is important for performance, inter alia).
markusicu
left a comment
There was a problem hiding this comment.
The changes look ok, but there is no context.
Please write a PR description for what you are trying to achieve.
Is there an issue to link to?
Currently, CheckProperties validates IndexPropertyRegexes during the PropertyParsingInfo process. However, this implementation is permissive; the build continues even if regexes are missing or if property values fail to match their defined patterns. This allows potential data integrity issues to go undetected during the build cycle.
To ensure more rigorous enforcement, we are transitioning away from CheckProperties as a build step in favor of one or more dedicated, independent tests.