Fix MarkdownV2 escape handling and align tags with spec #829
Open
zeynalnia wants to merge 1 commit into gram-js:master
Closes #830.
Summary
The `MarkdownV2Parser.parse` regex chain departed from the Telegram MarkdownV2 spec in several places. Most importantly, backslash escapes were ignored entirely, so input like `1\.5` produced literal `1\.5` and `\*not bold\*` was still parsed as bold. The italic delimiter was `-` (non-spec) instead of `_`, blockquote syntax was unsupported, and HTML special characters in plain text confused the downstream HTML parser.
This PR rewrites the markdown→HTML transform inside the existing `markdown → HTML → HTMLParser` pipeline, splits it into two reusable exported functions, and fills the missing tag/attribute coverage on the HTML side.
Changes
`gramjs/extensions/markdownv2.ts`

Rewritten as a 6-stage pipeline:
- Protected regions (pre, code, link URLs) are swapped for `\u0000{n}\u0000` placeholders. Inside these, only the spec's local escape rules apply (`\\` and `` \` `` for pre/code; `\\` and `\)` for link URL).
- `\X` escapes are swapped for `\u0001{n}\u0001` placeholders so the markup regexes in stage 3 can't consume escaped delimiters.
- `&` and `<` in user content are HTML-escaped (not `>`, since blockquote detection still needs it; not `"`, harmless in text).
- Markup is converted, with underline `__` resolved before italic `_` per spec greediness; switched italic to spec-mandated `_`.
- Blockquotes are detected; a trailing `||` marks the quote as expandable (`<blockquote expandable>` → `MessageEntityBlockquote.collapsed = true`).
- Escaped characters are restored (with `<` and `&` as entities so HTML stays valid), then protected regions are restored.

The result is passed to `HTMLParser.parse`.

Two new public exports replace the all-in-one class:
- `markdownV2ToHtml(message: string): string`: markdown → Telegram HTML.
- `htmlToMarkdownV2(html: string): string`: inverse, accepting every spec-listed tag form (`<strong>`/`<em>`/`<ins>`/`<strike>`/`<del>`/`<tg-spoiler>`/`<span class="tg-spoiler">`) and decoding the four named entities (`&amp;` `&lt;` `&gt;` `&quot;`).

`MarkdownV2Parser.parse` / `.unparse` are now thin wrappers over the two.

Also: input is sanitized for raw `\u0000`/`\u0001` (used internally as placeholder delimiters); the pre-language identifier regex was widened to accept `c++`, `c#`, etc.; the custom-emoji URL match accepts extra query parameters.
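To make the placeholder idea above concrete, here is a minimal sketch of the technique, not the code from this PR. The name `sketchMarkdownV2ToHtml` is hypothetical, only inline code is treated as a protected region, and the first two stages are merged into one left-to-right scan (an assumption of this sketch) so that an escaped backtick can never open a code span. Blockquotes, links, and `pre` blocks are left out.

```typescript
// Hypothetical sketch of placeholder-based escape handling; NOT the PR's code.
const PROT = "\u0000"; // delimiter for protected-region placeholders
const ESC = "\u0001"; // delimiter for escaped-character placeholders

// Escape only & and <, mirroring the choice described in the PR text.
const htmlEscape = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;");

function sketchMarkdownV2ToHtml(input: string): string {
  // Strip raw control characters so user input cannot forge placeholders.
  let text = input.replace(/[\u0000\u0001]/g, "");
  const protectedRegions: string[] = [];
  const escapedChars: string[] = [];

  // One scan: either a \X escape or an inline code span, whichever comes first.
  text = text.replace(
    /\\([\u0002-\u007e])|`((?:\\.|[^`\\])*)`/g,
    (_m: string, esc: string | undefined, code: string | undefined) => {
      if (esc !== undefined) {
        escapedChars.push(esc);
        return `${ESC}${escapedChars.length - 1}${ESC}`;
      }
      // Inside code, only \\ and \` are treated as escapes.
      const literal = (code ?? "").replace(/\\([\\`])/g, "$1");
      protectedRegions.push(`<code>${htmlEscape(literal)}</code>`);
      return `${PROT}${protectedRegions.length - 1}${PROT}`;
    }
  );

  // HTML-escape the remaining user content; placeholders are unaffected.
  text = htmlEscape(text);

  // Markup regexes now only see unescaped delimiters.
  // Underline (__) is resolved before italic (_).
  text = text
    .replace(/\*([^*]+)\*/g, "<b>$1</b>")
    .replace(/__([^_]+)__/g, "<u>$1</u>")
    .replace(/_([^_]+)_/g, "<i>$1</i>")
    .replace(/~([^~]+)~/g, "<s>$1</s>");

  // Restore escaped characters (as entities where needed), then code spans.
  text = text.replace(/\u0001(\d+)\u0001/g, (_m: string, i: string) =>
    htmlEscape(escapedChars[Number(i)] ?? "")
  );
  return text.replace(
    /\u0000(\d+)\u0000/g,
    (_m: string, i: string) => protectedRegions[Number(i)] ?? ""
  );
}
```

With this sketch, `sketchMarkdownV2ToHtml("1\\.5")` yields `1.5` and `sketchMarkdownV2ToHtml("\\*not bold\\*")` yields `*not bold*`, the two cases the Summary calls out.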
`gramjs/extensions/html.ts`

`onopentag` now recognizes the spec-alternative tags missing from `HTMLParser.parse`: `<tg-spoiler>`, `<span class="tg-spoiler">`, `<ins>` (underline), `<strike>` (strikethrough).

`unparse` now emits `<tg-spoiler>` instead of the library-internal `<spoiler>` so its output is valid Telegram HTML, and emits `<blockquote expandable>` when `collapsed === true` so the flag survives round-trips.
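The tag aliasing described above can be pictured as a small lookup. The following is a sketch under assumptions, not the library's implementation: `EntityKind`, `TagInfo`, `classifyTag`, and `openingTag` are hypothetical names, and the `(name, attribs)` shape merely mirrors what an `onopentag`-style callback receives.

```typescript
// Hypothetical names throughout; this only illustrates the alias mapping.
type EntityKind =
  | "bold" | "italic" | "underline" | "strike" | "spoiler" | "blockquote";

interface TagInfo {
  kind: EntityKind;
  collapsed?: boolean; // only meaningful for blockquotes
}

// Map a tag name plus attributes to an entity kind, accepting spec aliases.
function classifyTag(
  name: string,
  attribs: Record<string, string>
): TagInfo | undefined {
  switch (name.toLowerCase()) {
    case "b": case "strong": return { kind: "bold" };
    case "i": case "em": return { kind: "italic" };
    case "u": case "ins": return { kind: "underline" };
    case "s": case "strike": case "del": return { kind: "strike" };
    case "tg-spoiler": return { kind: "spoiler" };
    case "span":
      // A span is only meaningful when it carries the spoiler class.
      return attribs["class"] === "tg-spoiler" ? { kind: "spoiler" } : undefined;
    case "blockquote":
      return { kind: "blockquote", collapsed: "expandable" in attribs };
    default:
      return undefined;
  }
}

// Emit one canonical opening tag per kind, keeping the expandable flag.
function openingTag(info: TagInfo): string {
  switch (info.kind) {
    case "bold": return "<b>";
    case "italic": return "<i>";
    case "underline": return "<u>";
    case "strike": return "<s>";
    case "spoiler": return "<tg-spoiler>";
    case "blockquote":
      return info.collapsed ? "<blockquote expandable>" : "<blockquote>";
  }
}
```

Parsing many aliases but emitting a single canonical form per kind is what lets a parse → unparse → parse cycle settle on stable output.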
Tests
`__tests__/extensions/MarkdownV2.spec.ts` rewritten: 103 new specs across 14 describe blocks, plus the original Markdown and HTML suites still pass. Coverage includes:
- blockquotes: `>` / escaped `>` / span markup inside / two separate groups
- escapes: trailing lone `\` + non-special chars + protected-region escape semantics + literal `<`/`&` round-trip
- control chars in input
- `markdownV2ToHtml` and `htmlToMarkdownV2`
- round-trips: parse → unparse → parse preserves entity types and offsets for every supported entity, including the `collapsed` flag on expandable blockquotes
Known limitation
`htmlToMarkdownV2` does not yet emit MarkdownV2 backslash-escapes for special characters in surrounding plain text (only inside protected regions). So round-tripping HTML whose plain-text portions contain literal `*_~|...` is not guaranteed. This is documented in the function's JSDoc and is a logical follow-up.
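For reference, the follow-up could build on a plain-text escaper along these lines. This is a sketch, not part of this PR: `escapeMarkdownV2Text` is a hypothetical name, and the character set is the list of characters the Telegram MarkdownV2 spec requires escaping outside entities, plus the backslash itself.

```typescript
// Hypothetical helper: backslash-escape MarkdownV2 special characters
// in plain text so they are not read as markup when parsed again.
const MDV2_SPECIAL = /[_*\[\]()~`>#+\-=|{}.!\\]/g;

function escapeMarkdownV2Text(text: string): string {
  // "$&" inserts the matched character after the added backslash.
  return text.replace(MDV2_SPECIAL, "\\$&");
}
```

Such a helper would apply only to text outside protected regions, since pre/code and link URLs follow their own narrower escape rules.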
Test plan
- `npx jest`: all 136 tests pass (4 pre-existing skips), no regressions in the HTML / Markdown / crypto suites.
- backslash-escaped chars + blockquote + expandable blockquote + custom emoji.