Arabic mark reordering description could mislead

I appreciate the inclusion of the UTR#53 discussion in [opentype-shaping-arabic.md](https://github.com/n8willis/opentype-shaping-documents/blob/master/opentype-shaping-arabic.md#stage-1-transient-reordering-of-modifier-combining-marks), and I appreciate the desire to simplify the wording to make it easier to understand.

However, the current wording in the document could mislead readers.

First, readers will need to understand that the first step of the [UTR#53 AMTRA algorithm](https://www.unicode.org/reports/tr53/tr53-7.html#AMTRA_Specification) is to perform NFD normalization. As far as I can tell, this step is completely omitted from `opentype-shaping-arabic.md`.

I am aware that the authors of HarfBuzz [disagree with NFD](https://github.com/harfbuzz/harfbuzz/issues/3179), preferring to use NFC. But in either case, the mark order is normalized, which is essential to the remainder of the algorithm.
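To illustrate why the normalization step matters, here is a minimal Python sketch using only the standard `unicodedata` module. The sample string (BEH followed by SHADDA and FATHA typed in non-canonical order) is my own example, not something taken from the document. Both normalization forms put the marks into canonical order by combining class before anything else happens:

```python
import unicodedata

# BEH + SHADDA (ccc=33) + FATHA (ccc=30): marks typed in non-canonical order.
typed = "\u0628\u0651\u064E"

nfd = unicodedata.normalize("NFD", typed)
nfc = unicodedata.normalize("NFC", typed)

# Both forms sort the non-starters by combining class, so FATHA (30)
# ends up before SHADDA (33). Nothing composes here, so NFC equals NFD.
for label, text in (("typed", typed), ("NFD", nfd), ("NFC", nfc)):
    print(label, [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in text])
```

The later AMTRA steps assume this canonical order as their input, which is why leaving the step out changes the result.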

Secondly, `opentype-shaping-arabic.md` then gives the following statement:

> Second, move any subsequence of combining-class-230 characters that begins with a 230_MCM character to the beginning of the sequence, before all "Shadda" characters. The subsequence must be moved as a group.

This is misleading because it says to move the entire sequence of combining-class-230 marks, when in fact UTR#53 says to move only the initial MCMs from that sequence:

> If a sequence of ccc=230 characters begins with any MCM characters, **move the sequence of such MCM characters** to the beginning of S (before any characters with ccc=33). _(emphasis added)_

There is a similarly-worded statement for combining-class-220 marks which similarly diverges from UTR#53.
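The difference between the two readings can be sketched in Python. This covers only the ccc=230 step in isolation, applied to a run of marks that is assumed to be already normalized. The function name `move_230_run`, the `whole_run` flag, and the `MCM_230_SUBSET` set are hypothetical names for this illustration, and the set is only a partial list of ccc=230 MCMs (see UTR#53 for the full one):

```python
import unicodedata

# Assumed, partial list of ccc=230 MCM characters, for illustration only.
MCM_230_SUBSET = {"\u0654", "\u0658", "\u06DC", "\u06E7", "\u06E8"}

def move_230_run(marks: str, whole_run: bool) -> str:
    """Move marks from the ccc=230 run to the front of `marks`.

    whole_run=False: move only the leading MCMs (my reading of UTR#53).
    whole_run=True:  move the entire ccc=230 run (the document's wording).
    """
    ccc = [unicodedata.combining(c) for c in marks]
    if 230 not in ccc:
        return marks
    start = ccc.index(230)
    if marks[start] not in MCM_230_SUBSET:
        return marks  # the run does not begin with an MCM, so nothing moves
    end = start
    while (end < len(marks) and ccc[end] == 230
           and (whole_run or marks[end] in MCM_230_SUBSET)):
        end += 1
    return marks[start:end] + marks[:start] + marks[end:]

# SHADDA (33), HAMZA ABOVE (230, MCM), MADDAH ABOVE (230, not an MCM)
run = "\u0651\u0654\u0653"
utr53_reading = move_230_run(run, whole_run=False)
document_reading = move_230_run(run, whole_run=True)
```

With this input, the UTR#53 reading leaves MADDAH ABOVE after the SHADDA, while the document's wording moves it in front of the SHADDA along with the HAMZA ABOVE.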

Finally, it is important to realize that the algorithm -- just like Unicode normalization itself -- isn't applied to the _complete sequence of marks_ that might be adjacent to an Arabic letter. Rather, it is applied to _each maximal-length substring, S, of non-starter (D107) characters_. The nuance is that a U+034F: COMBINING GRAPHEME JOINER is a _mark_ but is a _Starter_ in Unicode terms (since it has ccc=0). In essence, CGJ characters divide a mark sequence up into pieces and then mark reordering (whether part of normalization or AMTRA) happens within those pieces, but not across their boundaries.  

(I'll admit that this last point may be more nuanced than you want to cover within _this_ document. But the concept applies to Unicode normalization in general, so might be appropriate somewhere else in your documents.) 
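For what it's worth, the CGJ behavior is easy to see in a short Python sketch using the standard `unicodedata` module. The helper name `nonstarter_runs` is hypothetical, and the sample strings are my own:

```python
import unicodedata
from itertools import groupby

def nonstarter_runs(text: str) -> list[str]:
    """Return each maximal run of non-starter (ccc != 0) characters."""
    keyed = groupby(text, key=lambda c: unicodedata.combining(c) != 0)
    return ["".join(group) for is_nonstarter, group in keyed if is_nonstarter]

# BEH + SHADDA + FATHA, with and without a CGJ (U+034F, ccc=0) between the marks.
without_cgj = "\u0628\u0651\u064E"
with_cgj = "\u0628\u0651\u034F\u064E"

runs_without = nonstarter_runs(without_cgj)  # one run holding both marks
runs_with = nonstarter_runs(with_cgj)        # two runs, split at the CGJ

# Normalization reorders marks within a run, but never across the CGJ.
reordered = unicodedata.normalize("NFD", without_cgj)
blocked = unicodedata.normalize("NFD", with_cgj)
```

Without the CGJ the two marks form one run and are reordered. With it, each mark sits in its own run and the typed order survives normalization.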

Regards,
Bob
