
Fix MarkdownV2 escape handling and align tags with spec #829

Open
zeynalnia wants to merge 1 commit into gram-js:master from zeynalnia:fix/markdownv2-escape-handling

@zeynalnia zeynalnia commented Apr 19, 2026

Closes #830.

Summary

The MarkdownV2Parser.parse regex chain departed from the Telegram MarkdownV2
spec in several places. Most importantly, backslash escapes were ignored
entirely, so input like 1\.5 produced literal 1\.5 and \*not bold\* was still
parsed as bold. In addition, the italic delimiter was - (non-spec) instead of
_, blockquote syntax was unsupported, and HTML special characters in plain
text confused the downstream HTML parser.

This PR rewrites the markdown→HTML transform inside the existing
markdown → HTML → HTMLParser pipeline, splits it into two reusable exported
functions, and fills the missing tag/attribute coverage on the HTML side.

Changes

gramjs/extensions/markdownv2.ts

Rewritten as a 7-stage pipeline:

  1. Extract protected regions (pre / code / link / custom-emoji) into
    \u0000{n}\u0000 placeholders. Inside these, only the spec's local escape
    rules apply (\\ and \` for pre/code; \\ and \) for link URL).
  2. Mask remaining \X escapes with \u0001{n}\u0001 placeholders so the
    markup regexes in stage 3 can't consume escaped delimiters.
  3. HTML-escape & and < in user content (not >, since blockquote
    detection still needs it; not ", harmless in text).
  4. Run span-markup regexes — underline __ resolved before italic _ per
    spec greediness; switched italic to spec-mandated _.
  5. Line-level blockquote pass; a final line ending in || marks the quote as
    expandable (<blockquote expandable> → MessageEntityBlockquote.collapsed = true).
  6. Unmask escapes (re-escaping < and & as entities so HTML stays valid),
    then restore protected regions.
  7. Hand off to HTMLParser.parse.
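
For reviewers, here is a minimal sketch of the masking idea behind stages 2, 4, and 6, reduced to bold markup only. It is not the code in this PR: the names maskEscapes, unmaskEscapes, boldToHtml, and render are hypothetical, and the escape regex is simplified to printable ASCII.

```typescript
// Illustrative sketch only; all names are hypothetical, not the PR's code.
const MASK = "\u0001";

// Stage 2 (simplified): replace each \X escape with an indexed placeholder
// so later markup regexes cannot see the escaped delimiter.
function maskEscapes(input: string): { masked: string; literals: string[] } {
  const literals: string[] = [];
  const masked = input.replace(/\\([\x21-\x7e])/g, (_m: string, ch: string) => {
    literals.push(ch);
    return `${MASK}${literals.length - 1}${MASK}`;
  });
  return { masked, literals };
}

// Escaped < and & must come back as entities so the HTML stays valid.
function htmlEscape(ch: string): string {
  return ch === "&" ? "&amp;" : ch === "<" ? "&lt;" : ch;
}

// Stage 6 (simplified): put the escaped characters back as literals.
function unmaskEscapes(text: string, literals: string[]): string {
  return text.replace(/\u0001(\d+)\u0001/g, (_m: string, i: string) =>
    htmlEscape(literals[Number(i)])
  );
}

// Stage 4 (simplified): a single span rule standing in for the full set.
function boldToHtml(text: string): string {
  return text.replace(/\*([^*\n]+)\*/g, "<b>$1</b>");
}

function render(input: string): string {
  // Strip raw placeholder delimiters from user input first.
  const clean = input.replace(/[\u0000\u0001]/g, "");
  const { masked, literals } = maskEscapes(clean);
  return unmaskEscapes(boldToHtml(masked), literals);
}
```

With this sketch, render("\\*not bold\\* and *bold*") leaves the escaped asterisks as literal text and wraps only the unescaped pair in <b>.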

Two new public exports replace the all-in-one class:

  • markdownV2ToHtml(message: string): string — markdown → Telegram HTML.
  • htmlToMarkdownV2(html: string): string — inverse, accepting every
    spec-listed tag form (<strong> / <em> / <ins> / <strike> /
    <del> / <tg-spoiler> / <span class="tg-spoiler">) and decoding the
    four named entities (&amp; &lt; &gt; &quot;).

MarkdownV2Parser.parse / .unparse are now thin wrappers over the two.
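
As an illustration of the entity-decoding step described for htmlToMarkdownV2, the sketch below decodes the four named entities in a single pass. The helper name decodeEntities is hypothetical and this is not necessarily how the PR implements it. The single pass matters: decoding &amp; in a separate earlier step would turn &amp;lt; into < instead of &lt;.

```typescript
// Illustrative sketch only; decodeEntities is a hypothetical helper name.
const NAMED_ENTITIES = new Map<string, string>([
  ["&amp;", "&"],
  ["&lt;", "<"],
  ["&gt;", ">"],
  ["&quot;", '"'],
]);

// Decode the four named entities in one left-to-right pass, so the output
// of one replacement is never re-scanned as the start of another entity.
function decodeEntities(html: string): string {
  return html.replace(/&(?:amp|lt|gt|quot);/g, (m: string) => NAMED_ENTITIES.get(m) ?? m);
}
```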

Also: input is sanitized for raw \u0000/\u0001 (used internally as
placeholder delimiters); the pre-language identifier regex was widened to
accept c++, c#, etc.; the custom-emoji URL match accepts extra query
parameters.
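
To make the widened pre-language match concrete, here is a rough sketch of how a first line such as c++ or c# could be accepted as a language identifier. The pattern LANG_RE and the helper splitPreBody are assumptions for illustration; the actual regex in the PR may differ.

```typescript
// Illustrative sketch only; LANG_RE and splitPreBody are hypothetical.
// The character class is an assumption: word characters plus + # and -.
const LANG_RE = /^[A-Za-z0-9_+#-]+$/;

// Split the text between the ``` fences into an optional language and code.
// A first line that does not look like an identifier is treated as code.
function splitPreBody(body: string): { language: string; code: string } {
  const nl = body.indexOf("\n");
  if (nl === -1) {
    return { language: "", code: body };
  }
  const firstLine = body.slice(0, nl);
  if (LANG_RE.test(firstLine)) {
    return { language: firstLine, code: body.slice(nl + 1) };
  }
  return { language: "", code: body };
}
```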

gramjs/extensions/html.ts

  • onopentag now recognizes the spec-alternative tags missing from
    HTMLParser.parse: <tg-spoiler>, <span class="tg-spoiler">, <ins>
    (underline), <strike> (strikethrough).
  • unparse now emits <tg-spoiler> instead of the library-internal
    <spoiler> so its output is valid Telegram HTML, and emits
    <blockquote expandable> when collapsed === true so the flag survives
    round-trips.
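
The tag coverage above can be pictured as a lookup from tag name (plus the class attribute for span) to an entity kind. The sketch below is only a stand-in for that idea: EntityKind, TAG_KINDS, and entityKindForTag are hypothetical names, and the real onopentag handler builds GramJS message entities rather than returning strings.

```typescript
// Illustrative sketch only; the types and names here are hypothetical.
type EntityKind = "bold" | "italic" | "underline" | "strike" | "spoiler";

// Every tag form that maps to the same entity kind shares one entry value.
const TAG_KINDS = new Map<string, EntityKind>([
  ["b", "bold"], ["strong", "bold"],
  ["i", "italic"], ["em", "italic"],
  ["u", "underline"], ["ins", "underline"],
  ["s", "strike"], ["strike", "strike"], ["del", "strike"],
  ["tg-spoiler", "spoiler"],
]);

function entityKindForTag(
  name: string,
  attrs: Record<string, string> = {}
): EntityKind | undefined {
  const tag = name.toLowerCase();
  if (tag === "span") {
    // A span only counts as markup when it carries the spoiler class.
    return attrs["class"] === "tg-spoiler" ? "spoiler" : undefined;
  }
  return TAG_KINDS.get(tag);
}
```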

Tests

__tests__/extensions/MarkdownV2.spec.ts rewritten with 103 new specs across
14 describe blocks; the original Markdown and HTML suites still pass:

  • Span basics + nesting + multi-span + spanning newlines
  • Inline code + escapes + literal markup chars + HTML chars
  • Pre + language detection + multi-line + escapes + non-identifier first line
  • Inline link + URL escapes + label markup + mention + malformed
  • Custom emoji + non-emoji bang-link fallback + label markup
  • Blockquote single / multi / mid-line literal > / escaped > / span markup
    inside / two separate groups
  • Expandable blockquote single / multi / interaction with spoiler
  • Backslash escapes for every spec-listed special char + double backslash +
    trailing lone \ + non-special chars + protected-region escape semantics +
    literal < / & round-trip
  • HTML chars in plain text rendered as text (not interpreted as HTML)
  • Edge cases: empty input, lone delimiters, malformed code/pre/link, literal
    control chars in input
  • Direct tests of markdownV2ToHtml and htmlToMarkdownV2
  • Round-trip tests: parse → unparse → parse preserves entity types and
    offsets for every supported entity, including the collapsed flag on
    expandable blockquotes

Known limitation

htmlToMarkdownV2 does not yet emit MarkdownV2 backslash-escapes for special
characters in surrounding plain text (only inside protected regions). So
round-tripping HTML whose plain-text portions contain literal *_~|... is
not guaranteed. This is documented in the function's JSDoc and is a logical
follow-up.
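
One possible shape for that follow-up, offered as a sketch and not as part of this PR: escape the special characters listed in the MarkdownV2 spec, plus the backslash itself, whenever plain text is emitted. The function name escapeMarkdownV2 is hypothetical.

```typescript
// Illustrative sketch only; escapeMarkdownV2 is a hypothetical helper.
// Characters the MarkdownV2 spec requires to be escaped in plain text,
// plus the backslash so existing backslashes survive a round-trip.
const MDV2_SPECIAL = /[_*\[\]()~`>#+\-=|{}.!\\]/g;

function escapeMarkdownV2(text: string): string {
  // "$&" inserts the matched character after the added backslash.
  return text.replace(MDV2_SPECIAL, "\\$&");
}
```

Applied only to plain-text runs (not inside pre, code, or link URLs, which have their own escape rules), this would let text such as 1.5 * 2 survive HTML → MarkdownV2 → HTML.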

Test plan

  • npx jest — all 136 tests pass (4 pre-existing skips), no regressions
    in the HTML / Markdown / crypto suites.
  • Manual verification against a live Telegram chat with bold + italic +
    backslash-escaped chars + blockquote + expandable blockquote + custom
    emoji.


Development

Successfully merging this pull request may close these issues.

MarkdownV2Parser.parse ignores backslash escapes and deviates from the spec in several places
