Fix double-encoded entity references flattened during XML round-trip #663

Merged: ronaldtse merged 1 commit into main from fix/double-encoded-entity-references on Apr 29, 2026
@ronaldtse
Contributor

When XML contains double-encoded entities (e.g. &amp;lt; to represent
literal text &lt;), the round-trip through model serialization flattened
them by one level, producing semantically different or invalid XML.

Root cause: add_text_with_entities treated ALL entity-like patterns in
text content as EntityReference nodes, including standard XML entities
(lt, gt, amp, apos, quot) and numeric character references. This caused
&amp;lt; to become &lt; (i.e. <) after round-trip.
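
To make the "one level" wording concrete, here is a minimal sketch of the flattening. The `decode_once` helper and the `PREDEFINED` table are hypothetical, written only for this illustration; they are not part of the library.

```ruby
# Hypothetical helper: decode exactly one level of the five predefined
# XML entities, which is what a single parse pass does to text content.
PREDEFINED = {
  "&lt;" => "<", "&gt;" => ">", "&amp;" => "&",
  "&apos;" => "'", "&quot;" => '"'
}.freeze

def decode_once(text)
  text.gsub(/&(?:lt|gt|amp|apos|quot);/) { |entity| PREDEFINED.fetch(entity) }
end

source    = "&amp;lt;"             # as written in the XML file
parsed    = decode_once(source)    # text the parser hands over: "&lt;"
flattened = decode_once(parsed)    # an extra decode changes it to "<"

puts parsed
puts flattened
```

A faithful round-trip has to write `parsed` back out as `&amp;lt;`; emitting it as `&lt;` is the flattened form described above.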

Three fixes in add_text_with_entities:
- Exclude standard XML entities from EntityReference creation; treat as
  text nodes so the serializer handles proper escaping
- Use #match instead of #match? for the entity name check (match? does
  not populate $~ or the capture globals such as $1 in Ruby, silently
  breaking the exclusion check)
- Tighten entity name regex to require letter-initial names per the XML
  spec, preventing invalid entity references like &1; from shell syntax
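
The #match versus #match? point can be reproduced in isolation. This is a small sketch, not the code from add_text_with_entities; the pattern and the two method names are assumptions made for the example.

```ruby
# Assumed pattern for the illustration: a whole-string entity reference
# with an ASCII letter-initial name.
ENTITY_RE = /\A&([A-Za-z][A-Za-z0-9._-]*);\z/
STANDARD  = %w[lt gt amp apos quot].freeze

# Buggy shape: match? returns true/false but leaves $~ (and so $1) unset.
def standard_via_match_p?(token)
  return false unless ENTITY_RE.match?(token)
  STANDARD.include?($1)   # $1 is nil here, so this is always false
end

# Fixed shape: match sets $~, so $1 holds the captured entity name.
def standard_via_match?(token)
  return false unless ENTITY_RE.match(token)
  STANDARD.include?($1)
end

puts standard_via_match_p?("&lt;")  # false: the exclusion never fires
puts standard_via_match?("&lt;")    # true
```

Because the buggy form never raises, the exclusion check simply evaluates to false for every input, which is why the failure was silent.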

Non-standard entities (copy, nbsp, mdash, etc.) continue to be preserved
as EntityReference nodes as before.
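
Taken together, the behaviour described above could be sketched as a splitter that turns text into text segments and entity-reference segments. Everything here is hypothetical (`split_text_with_entities`, the segment tuples, and the ASCII-only name pattern, which is narrower than the XML Name production); the real method builds document nodes rather than arrays.

```ruby
STANDARD_ENTITIES = %w[lt gt amp apos quot].freeze
# Letter-initial names only, so "&1;" and numeric references like "&#160;"
# do not match and stay in the surrounding text (assumed simplification).
ENTITY_PATTERN = /&([A-Za-z][A-Za-z0-9._-]*);/

def split_text_with_entities(text)
  segments = []
  pos = 0
  text.scan(ENTITY_PATTERN) do
    m = Regexp.last_match
    name = m[1]
    # Standard entities stay as plain text; the serializer escapes them.
    next if STANDARD_ENTITIES.include?(name)

    segments << [:text, text[pos...m.begin(0)]] if m.begin(0) > pos
    segments << [:entity, name]
    pos = m.end(0)
  end
  segments << [:text, text[pos..]] if pos < text.length
  segments
end

p split_text_with_entities("a &lt; b &copy; 2>&1; x")
```

In this sketch only `&copy;` becomes an entity segment; `&lt;` and the shell-style `&1;` remain inside the neighbouring text segments.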

Fixes: rfc8792, rfc8846, rfc9052, rfc9095, rfc9108, rfc9338, rfc9683,
rfc9700, rfc9788, rfc9953 round-trip failures

fix: Oga adapter CDATA serialization in plan-based path

The build_moxml_node method always created text nodes, ignoring the
cdata flag on XmlElement. This caused 5 pre-existing OgaAdapter test
failures in cdata_spec.rb on GHA.
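
As a rough illustration of the cdata flag check, the sketch below picks a CDATA section or an escaped text node for String children. `ElementSketch`, `build_child_node`, and `render` are stand-ins invented for this example; they are not the XmlElement or Moxml APIs.

```ruby
# Hypothetical stand-in for an element that carries a cdata flag.
ElementSketch = Struct.new(:name, :children, :cdata, keyword_init: true)

# Choose the node kind for one child; the bug described above corresponds
# to always taking the :text branch for String children.
def build_child_node(element, child)
  return [:cdata, child] if child.is_a?(String) && element.cdata
  return [:text, child] if child.is_a?(String)

  [:element, child.name]
end

# Minimal rendering: text is escaped, CDATA is emitted verbatim.
# (Content containing "]]>" would need splitting; omitted in this sketch.)
def render(node)
  kind, value = node
  case kind
  when :cdata then "<![CDATA[#{value}]]>"
  when :text  then value.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
  else "<#{value}/>"
  end
end

with_cdata = ElementSketch.new(name: "note", children: ["a < b"], cdata: true)
plain      = ElementSketch.new(name: "note", children: ["a < b"], cdata: false)

puts render(build_child_node(with_cdata, "a < b"))
puts render(build_child_node(plain, "a < b"))
```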

fix: Oga adapter CDATA in mixed content and xmlns deduplication

- Check xml_element.cdata flag when creating String child nodes in
  build_moxml_node, creating CDATA sections instead of text nodes
  when cdata is true (matching Nokogiri adapter behavior)
- Filter xmlns attributes in regular attribute iteration to prevent
  duplicate namespace declarations, matching Nokogiri's logic
- Fixes pre-existing CI failures with Canon 0.2.x
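
The xmlns filtering can be pictured as separating namespace declarations from ordinary attributes before the attributes are written. This is a sketch over a plain Hash; `split_attributes` is a hypothetical name, and the adapter itself works on its own attribute objects.

```ruby
# Separate namespace declarations ("xmlns" and "xmlns:prefix") from regular
# attributes so the declarations are emitted once, by the namespace path only.
def split_attributes(attrs)
  namespaces, regular = attrs.partition do |name, _value|
    name == "xmlns" || name.start_with?("xmlns:")
  end
  [namespaces.to_h, regular.to_h]
end

namespaces, regular = split_attributes(
  "xmlns" => "urn:example:a",
  "xmlns:x" => "urn:example:x",
  "id" => "1"
)

p namespaces
p regular
```

Iterating only over `regular` in the attribute loop is what prevents the same declaration from being added a second time.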
@ronaldtse force-pushed the fix/double-encoded-entity-references branch from 189b3bc to 945dc34 on April 29, 2026 12:14
@ronaldtse merged commit 538fce5 into main on Apr 29, 2026
75 of 88 checks passed
@ronaldtse deleted the fix/double-encoded-entity-references branch on April 29, 2026 15:42