fix(markdown): preserve HTML comments as text during markdown parsing #7722
weilinzung wants to merge 2 commits into ueberdosis:main
🦋 Changeset detected. Latest commit: da5a9be. The changes in this PR will be included in the next version bump. This PR includes changesets to release 72 packages.
✅ Deploy Preview for tiptap-embed ready!
Pull request overview
Fixes @tiptap/markdown parsing so HTML comments (<!-- ... -->) are preserved (as text) instead of being dropped by the browser DOM parsing step, improving markdown round-tripping and preventing data loss for comment-based metadata.
Changes:
- Intercepts HTML comment tokens in MarkdownManager.parseHTMLToken and converts them into text (paragraph-wrapped for block tokens).
- Adds unit tests covering block, inline, multiline, and whitespace-preserving comment scenarios.
- Updates the Markdown parse demo content and adds a changeset for a patch release.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/markdown/src/MarkdownManager.ts | Adds HTML-comment detection and a text-node fallback to preserve comment content during parse. |
| packages/markdown/tests/mixed-html.spec.ts | Adds regression tests for comment preservation in different positions/forms. |
| demos/src/Markdown/Parse/React/index.jsx | Extends the demo markdown sample with HTML and “Hidden HTML Comments” examples. |
| .changeset/calm-cycles-hear.md | Declares a patch release for preserving HTML comments during markdown parsing. |

@bdbch, I would appreciate it if you could review this. This is for our AI use case: we feed it Markdown content with hidden comments to replace.

      return {
        type: 'paragraph',
        content: [
          {
            type: 'text',
            text: html,
          },
        ],
      }
    }

Not a fan of this being a hardcoded paragraph - it's rare, but people can override the default paragraph type name, which would cause issues here. I think it's safer to use the default content node from the schema.
Also - I am not sure if this should be a default. By default I assume users would NOT expect comments to appear in their parsed content. Either we do this as a global option on the markdown manager (which can be controlled as an editor option) OR we add an option to the parse functions.
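
Both review points can be sketched together, under stated assumptions: the wrapper node is resolved from the schema instead of hardcoding `'paragraph'`, and the behaviour sits behind an opt-in flag. `SchemaLike` is a minimal stand-in for the part of ProseMirror's `Schema` this relies on (`topNodeType.contentMatch.defaultType`); `preserveHtmlComments` and `commentToJSON` are hypothetical names for illustration, not existing Tiptap options or functions.

```typescript
// Minimal stand-in for the slice of ProseMirror's Schema used here.
interface SchemaLike {
  topNodeType: { contentMatch: { defaultType: { name: string } | null } }
}

interface JSONNode {
  type: string
  text?: string
  content?: JSONNode[]
}

// Hypothetical option name; not an existing Tiptap setting.
interface CommentOptions {
  preserveHtmlComments: boolean
}

function commentToJSON(
  html: string,
  isBlock: boolean,
  schema: SchemaLike,
  options: CommentOptions,
): JSONNode | null {
  // Opt-in: unless enabled, comments stay out of the parsed content.
  if (!options.preserveHtmlComments) return null

  const text: JSONNode = { type: 'text', text: html }
  if (!isBlock) return text

  // Ask the schema which block it would create by default at the top level.
  const wrapper = schema.topNodeType.contentMatch.defaultType
  if (!wrapper) return null
  return { type: wrapper.name, content: [text] }
}

// Usage with a schema whose default block was renamed to "para".
const schema: SchemaLike = {
  topNodeType: { contentMatch: { defaultType: { name: 'para' } } },
}
const result = commentToJSON('<!-- id: 1 -->', true, schema, {
  preserveHtmlComments: true,
})
```

With this shape, a renamed paragraph node is picked up automatically, and callers who never set the flag see no comment text in their content.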
That’s a fair point on the hardcoding—relying on the schema's default content node is definitely safer and cleaner.
Regarding the default behavior: the reason I was leaning toward a schema-based approach is that the DOM parser is currently dropping HTML comments entirely. By the time I run editor.markdown.parse(s), the data is already gone.
How heavy of a lift do you think it would be to introduce a default schema for this? If that feels too intrusive, I’m open to the global option on the markdown manager, provided we can ensure the parser doesn't strip the comments before they hit the manager. I feel like markedjs with sanitize-html could handle this already, but TipTap is using the custom parseHTMLToken.
But isn't the problem here that your comments are lost when parsing into your editor via Markdown? Aren't the comments part of your markdown when parsing in (where you still would have them and would be able to keep them via an option?)
> How heavy of a lift do you think it would be to introduce a default schema for this?

I think those comments by default should not be part of the content at all, as most people expect comments to just be comments, not content. I wonder if you could add a custom parser for Markdown that catches HTML comments before the default lexer picks them up as HTML content - that way you could turn them into a "Comment" node or something on your end without relying on the editor core supporting it.
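
A rough sketch of what such a tokenizer could look like. The object is shaped like the custom-extension objects that marked documents (`name`, `level`, `start`, `tokenizer`); the token name `htmlComment` is made up, and how an object like this would be registered with @tiptap/markdown is an assumption that is not shown here.

```typescript
// Token shape produced by this sketch; "htmlComment" is a made-up token name.
interface HtmlCommentToken {
  type: 'htmlComment'
  raw: string // exact source consumed, including the delimiters
  text: string // comment body without <!-- and -->
}

// Only matches when the comment sits at the very start of the remaining source.
const COMMENT_AT_START = /^<!--([\s\S]*?)-->/

const htmlCommentTokenizer = {
  name: 'htmlComment',
  level: 'block' as const,
  // Report where the next candidate begins so the lexer can stop there.
  start(src: string): number | undefined {
    const index = src.indexOf('<!--')
    return index === -1 ? undefined : index
  },
  // Return a token when the source starts with a comment, otherwise nothing.
  tokenizer(src: string): HtmlCommentToken | undefined {
    const match = COMMENT_AT_START.exec(src)
    if (!match) return undefined
    return { type: 'htmlComment', raw: match[0], text: (match[1] ?? '').trim() }
  },
}

const token = htmlCommentTokenizer.tokenizer('<!-- status: draft -->\n# Title')
```

The resulting token could then be mapped to a user-defined comment node, which keeps the feature entirely outside the editor core.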
Actually, the difficulty is that when we're in the clipboardTextParser hook (as seen in my PasteMarkdown extension below), we are dealing with a raw string. When we call editor.markdown.parse(s), the underlying lexer treats HTML comments as html_block or html_inline tokens.

    import { Extension } from '@tiptap/core';
    import { Fragment, Slice } from '@tiptap/pm/model';
    import { Plugin } from '@tiptap/pm/state';

    export const PasteMarkdown = Extension.create({
      name: 'pasteMarkdown',
      addProseMirrorPlugins() {
        const editor = this.editor;
        return [
          new Plugin({
            props: {
              clipboardTextParser(text, _, __, view): Slice {
                const s = String(text ?? '');
                const { schema } = view.state;
                // Markdown → ProseMirror
                if (s.trim() && editor.markdown) {
                  try {
                    // Without a schema node or a custom parser rule,
                    // HTML comments are lost here during the parse conversion.
                    const json = editor.markdown.parse(s);
                    const node = schema.nodeFromJSON(json);
                    return new Slice(node.content, 0, 0);
                  } catch {
                    // Fall through to default paste if parsing fails
                  }
                }
                // Default paste: insert the raw string as plain text
                return new Slice(Fragment.from(schema.text(s)), 0, 0);
              },
            },
          }),
        ];
      },
    });

By default, the parser handles these by converting them to DOM nodes and then into ProseMirror nodes. Since standard schemas don't have a spec for HTML comments, they are simply dropped during this conversion. To preserve them, we would need a node in the schema to "catch" them before they vanish.
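
To illustrate why the comments vanish, here is a deliberately simplified model of a DOM-to-document conversion. It is not ProseMirror's actual code; `DomLike` and `collectText` are hypothetical names. The point is only that a converter with branches for element and text nodes has nowhere to put a comment node.

```typescript
// Standard DOM nodeType values.
const ELEMENT_NODE = 1
const TEXT_NODE = 3
const COMMENT_NODE = 8

// Hypothetical, minimal stand-in for a DOM node.
interface DomLike {
  nodeType: number
  nodeValue?: string
  childNodes?: DomLike[]
}

// Collect the text a converter keeps if it only understands elements and text.
function collectText(node: DomLike, out: string[] = []): string[] {
  if (node.nodeType === TEXT_NODE) {
    out.push(node.nodeValue ?? '')
  } else if (node.nodeType === ELEMENT_NODE) {
    for (const child of node.childNodes ?? []) collectText(child, out)
  }
  // Any other node type, including COMMENT_NODE, has no branch and is skipped.
  return out
}

const body: DomLike = {
  nodeType: ELEMENT_NODE,
  childNodes: [
    { nodeType: TEXT_NODE, nodeValue: 'Hello ' },
    { nodeType: COMMENT_NODE, nodeValue: ' replace-me ' },
    { nodeType: TEXT_NODE, nodeValue: 'world' },
  ],
}
const kept = collectText(body)
```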
I definitely hear you on comments not being "content" by default. Regarding the custom parser, perhaps we could introduce a configuration option in the Markdown Manager to define how specific tokens are handled.
For example, we could provide a parseHooks or tokenListeners option:

    // Example of a potential configuration option
    editor.configure({
      extensions: [
        Markdown.configure({
          // An option like this would allow us to intercept the token
          // and map it to a custom node without modifying core.
          parseOptions: {
            html_block: (state, token) => {
              if (isComment(token.content)) {
                state.addNode('comment', { value: token.content });
              }
            },
          },
        }),
      ],
    });

How do you feel about providing a way to register a custom comment node in the schema that the markdown manager can optionally target? This keeps the "content" aspect opt-in while preventing the data loss I'm seeing in the paste hook. Or is this option already able to handle it?
Changes Overview
HTML comments (<!-- ... -->) passed through editor.markdown.parse() were silently dropped because the browser DOM parser strips comment nodes before generateJSON processes them. This fix preserves them as plain text so comment content is not lost.

It should behave similarly to markedJs.
Implementation Approach
Added a regex check in parseHTMLToken that intercepts comment tokens before they reach the DOM parser, returning them as plain text nodes instead. Block comments are wrapped in a paragraph; inline comments are returned as bare text nodes.
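
The approach described above could look roughly like the following. This is a sketch under assumptions, not the PR's actual diff: the regex, the `HtmlToken` shape, and the function names are illustrative, and the comment text is kept verbatim from the token.

```typescript
// Illustrative token shape; the real lexer token has more fields.
interface HtmlToken {
  raw: string
  block: boolean
}

interface JSONNode {
  type: string
  text?: string
  content?: JSONNode[]
}

// Matches input that consists only of HTML comment markup, across lines.
const HTML_COMMENT = /^\s*<!--[\s\S]*?-->\s*$/

function isHtmlComment(html: string): boolean {
  return HTML_COMMENT.test(html)
}

// Returns a text-based node for comments, or null so the caller can fall
// back to the regular DOM-based HTML handling.
function parseCommentToken(token: HtmlToken): JSONNode | null {
  if (!isHtmlComment(token.raw)) return null
  const text: JSONNode = { type: 'text', text: token.raw }
  return token.block ? { type: 'paragraph', content: [text] } : text
}

const blockNode = parseCommentToken({ raw: '<!-- line 1\nline 2 -->', block: true })
const inlineNode = parseCommentToken({ raw: '<!-- note -->', block: false })
const notComment = parseCommentToken({ raw: '<div>hi</div>', block: true })
```

Returning `null` for anything that is not a comment leaves existing HTML parsing untouched, so the change only affects tokens that would otherwise be dropped.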
Testing Done
Added 4 unit tests in mixed-html.spec.ts.

Verification Steps
- Run pnpm dev → open http://localhost:3000/Markdown/Parse/React
- Click Parse Markdown: the Hidden HTML Comments section should render the comment text visibly in the editor
- Run pnpm -w -F @tiptap/markdown test mixed-html

Additional Notes
Related Issues
Fixes #7720