Arabic mark reordering description could mislead #172

@bobh0303

I appreciate the inclusion of the UTR#53 discussion in opentype-shaping-arabic.md, and I appreciate the desire to simplify the wording to make it easier to understand.

However, the wording currently in the document could mislead readers.

First, readers will need to understand that the first step of the UTR#53 AMTRA algorithm is to perform NFD normalization. As far as I can tell, this step is completely omitted from opentype-shaping-arabic.md.

I am aware that the authors of HarfBuzz disagree with NFD, preferring to use NFC. But in either case, the mark order is normalized, which is essential to the remainder of the algorithm.
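As a minimal sketch of that point, the following Python snippet uses the standard `unicodedata` module to show both normalization forms putting two marks into canonical (ascending combining class) order. The input string is a hypothetical example chosen for illustration; it says nothing about how HarfBuzz itself is implemented.

```python
import unicodedata

# BEH followed by SHADDA (ccc=33) and then KASRA (ccc=32):
# the two marks are typed in non-canonical order.
typed = "\u0628\u0651\u0650"

nfd = unicodedata.normalize("NFD", typed)
nfc = unicodedata.normalize("NFC", typed)

# Canonical combining classes of the marks after normalization.
ccc_after = [unicodedata.combining(ch) for ch in nfd[1:]]

print([f"U+{ord(ch):04X}" for ch in nfd], ccc_after, nfd == nfc)
```

For this input neither form composes or decomposes anything, so NFD and NFC return the same string, with KASRA now ahead of SHADDA.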

Secondly, opentype-shaping-arabic.md then gives the following statement:

> Second, move any subsequence of combining-class-230 characters that begins with a 230_MCM character to the beginning of the sequence, before all "Shadda" characters. The subsequence must be moved as a group.

This is misleading because it says to move the entire sequence of combining-class-230 marks, when in fact UTR#53 says to move only the initial MCMs from that sequence:

> If a sequence of ccc=230 characters begins with any MCM characters, move the sequence of such MCM characters to the beginning of S (before any characters with ccc=33). (emphasis added)

There is a similarly-worded statement for combining-class-220 marks which similarly diverges from UTR#53.
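The difference between the two readings of the ccc=230 step can be sketched in Python as follows. This is an illustration under stated assumptions, not a full AMTRA implementation: `MCM_SUBSET` and `move_230_step` are hypothetical names, the set holds only two of the MCM characters (the full list is in UTR#53), and the input is assumed to be a single, already canonically ordered sequence S, so all ccc=230 marks are contiguous.

```python
import unicodedata

# Illustrative subset only: U+0654 HAMZA ABOVE and U+0655 HAMZA BELOW.
# The complete MCM list is defined in UTR#53.
MCM_SUBSET = {"\u0654", "\u0655"}

def move_230_step(s, whole_run):
    """Apply only the ccc=230 step to S, under one of two readings.

    whole_run=True  : move the entire ccc=230 run (the wording questioned here).
    whole_run=False : move only the leading MCM characters (the UTR#53 wording).
    """
    marks = list(s)
    # Locate the start of the ccc=230 run, if there is one.
    start = next(
        (i for i, ch in enumerate(marks) if unicodedata.combining(ch) == 230),
        None,
    )
    # Nothing moves unless the run begins with an MCM character.
    if start is None or marks[start] not in MCM_SUBSET:
        return s
    end = start
    while end < len(marks) and unicodedata.combining(marks[end]) == 230:
        if not whole_run and marks[end] not in MCM_SUBSET:
            break  # stop at the first non-MCM mark
        end += 1
    # Move the selected marks to the beginning of S.
    return "".join(marks[start:end] + marks[:start] + marks[end:])

# S = SHADDA (ccc=33), HAMZA ABOVE (ccc=230, MCM), MADDAH ABOVE (ccc=230, not MCM)
S = "\u0651\u0654\u0653"
as_whole_run = move_230_step(S, whole_run=True)
as_utr53 = move_230_step(S, whole_run=False)
```

Under the whole-run reading, MADDAH ABOVE is carried in front of SHADDA along with HAMZA ABOVE; under the UTR#53 reading, only HAMZA ABOVE moves and MADDAH ABOVE stays after SHADDA.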

Finally, it is important to realize that the algorithm -- just like Unicode normalization itself -- isn't applied to the complete sequence of marks that might be adjacent to an Arabic letter. Rather, it is applied to each maximal-length substring, S, of non-starter (D107) characters. The nuance is that U+034F COMBINING GRAPHEME JOINER (CGJ) is a mark but is a Starter in Unicode terms (since it has ccc=0). In essence, CGJ characters divide a mark sequence into pieces, and mark reordering (whether part of normalization or AMTRA) happens within those pieces but not across their boundaries.

(I'll admit that this last point may be more nuanced than you want to cover within this document. But the concept applies to Unicode normalization in general, so might be appropriate somewhere else in your documents.)
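The effect of CGJ on the substrings S can be sketched in Python with the standard `unicodedata` module. The helper name `nonstarter_runs` and the sample strings are assumptions made for illustration only.

```python
import unicodedata
from itertools import groupby

def nonstarter_runs(text):
    """Return each maximal run of non-starter (ccc != 0) characters in text."""
    return [
        "".join(group)
        for is_nonstarter, group in groupby(
            text, key=lambda ch: unicodedata.combining(ch) != 0
        )
        if is_nonstarter
    ]

# BEH + SHADDA + KASRA, with and without a CGJ (U+034F, ccc=0) between the marks.
without_cgj = "\u0628\u0651\u0650"
with_cgj = "\u0628\u0651\u034F\u0650"

runs_without = nonstarter_runs(without_cgj)  # one run holding both marks
runs_with = nonstarter_runs(with_cgj)        # two runs, split at the CGJ

# Normalization reorders marks within a run, never across the CGJ.
nfd_without = unicodedata.normalize("NFD", without_cgj)
nfd_with = unicodedata.normalize("NFD", with_cgj)
```

Without the CGJ, the two marks form one run and NFD swaps them into canonical order; with the CGJ, each mark is its own run and the string is left unchanged.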

Regards,
Bob
