feat!: Replace regex parser with flexmark-java for CommonMark/GFM #69

Open
joewiz wants to merge 2 commits into eXist-db:master from joewiz:feature/commonmark-gfm

Conversation

@joewiz
Member

@joewiz joewiz commented Mar 18, 2026

Summary

  • Replace the pure-XQuery regex-based markdown parser with a Java extension module using flexmark-java 0.64.8 for full CommonMark and GitHub Flavored Markdown compliance
  • New API: md:parse() returns md:* XML elements (not HTML), md:to-html() renders to HTML (from string or parsed nodes), md:serialize() round-trips back to markdown
  • Configurable parser profiles and extensions via an options map: md:parse($md, map { "profile": "commonmark" })
  • Fenced code blocks preserve language labels in @language attribute — critical for the forthcoming eXist-db Sandbox app's interactive xquery code editors
  • GFM extensions: tables, strikethrough, task lists, autolinks
  • Build system switched from npm/Gulp to Maven with exist-apps-parent and kuberam-expath-plugin

Breaking changes

  • md:parse() now returns md:* XML elements instead of HTML — use md:to-html() for HTML output
  • Custom config maps (e.g., TEI) are replaced by transforming md:parse() output with XQuery typeswitch or XSLT (see README for a complete TEI example)
  • Build requires Java 21 and Maven (previously Node.js and Gulp)
  • Minimum eXist-db version: 6.2.0

New API

import module namespace md = "http://exist-db.org/xquery/markdown";

(: Parse markdown to XML :)
md:parse("# Hello **world**")
<md:document xmlns:md="http://exist-db.org/xquery/markdown">
  <md:heading level="1">Hello <md:strong>world</md:strong></md:heading>
</md:document>
(: Render to HTML — pass a string or parsed nodes :)
md:to-html("## Hello")              (: → <h2>Hello</h2> :)
md:to-html($doc//md:paragraph)      (: → <p>...</p> :)

(: Round-trip back to markdown :)
md:serialize(md:parse("# Hello"))   (: → "# Hello" :)

(: Parser options — choose profile and extensions :)
md:parse($md, map { "profile": "commonmark", "extensions": () })
md:parse($md, map { "extensions": ("tables", "autolink") })

Supported profiles: commonmark, github (default), kramdown, markdown, pegdown, fixed-indent, multi-markdown.
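
As a rough illustration of how an options map like the ones above could be resolved on the Java side, here is a minimal sketch. The profile and extension names come from this PR; the class `ParserOptionsSketch`, the record `ResolvedOptions`, and the choice to reject unknown names are assumptions made for the example, not the module's actual code. The `"hard-wraps"` option is left out for brevity.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: resolve an options map into a profile and extension list.
class ParserOptionsSketch {
    // Names as listed in this PR description and commit message.
    static final Set<String> PROFILES = Set.of("commonmark", "github", "kramdown",
            "markdown", "pegdown", "fixed-indent", "multi-markdown");
    static final List<String> DEFAULT_EXTENSIONS =
            List.of("tables", "strikethrough", "tasklist", "autolink");

    // Hypothetical value type; the real module would build a flexmark Parser instead.
    record ResolvedOptions(String profile, List<String> extensions) {}

    static ResolvedOptions resolve(Map<String, Object> options) {
        // "github" is the documented default profile.
        String profile = (String) options.getOrDefault("profile", "github");
        if (!PROFILES.contains(profile)) {
            // Assumption: unknown names are rejected; the PR does not say what happens.
            throw new IllegalArgumentException("Unknown profile: " + profile);
        }
        // All four GFM extensions are enabled unless the caller overrides the list.
        @SuppressWarnings("unchecked")
        List<String> extensions =
                (List<String>) options.getOrDefault("extensions", DEFAULT_EXTENSIONS);
        for (String ext : extensions) {
            if (!DEFAULT_EXTENSIONS.contains(ext)) {
                throw new IllegalArgumentException("Unknown extension: " + ext);
            }
        }
        return new ResolvedOptions(profile, List.copyOf(extensions));
    }

    public static void main(String[] args) {
        // Empty map: default profile with all extensions.
        System.out.println(resolve(Map.of()));
        // Strict CommonMark with no extensions, as in the md:parse example above.
        System.out.println(resolve(Map.of("profile", "commonmark", "extensions", List.of())));
    }
}
```

Keeping the defaults in one place means an empty map and an omitted `$options` argument behave the same way, which matches the "github (default)" wording above.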

Migration from 2.x

| 2.x | 3.0 |
| --- | --- |
| markdown:parse($md) → HTML | md:to-html($md) → HTML |
| markdown:parse($md, $config) → custom output | md:parse($md) → XML, then transform with XQuery/XSLT |
| Pure XQuery (regex-based) | Java (flexmark-java) |
| npm/Gulp build | Maven build |

For most users, replacing markdown:parse(...) with md:to-html(...) is sufficient. For custom output formats (e.g., TEI), see the README for a complete typeswitch example.

Issues fixed

#4 — Problems parsing XQuery code blocks

Curly braces and other XQuery syntax in fenced code blocks were mangled by the regex parser's label handler. flexmark treats code block content as opaque text:

md:parse('```xquery
map { "k1": array { "v1", "v2" }, "k2": "v3" }
```')
<md:fenced-code language="xquery">map { "k1": array { "v1", "v2" }, "k2": "v3" }</md:fenced-code>

#18 — Curly braces in fenced code blocks are mangled

Same root cause as #4. The regex parser's {label: value} syntax handler was applied inside code blocks, replacing {$i * 2} with <span itemprop="$i * 2">. Now fixed:

md:parse('```xquery
for $i in 1 to 10
return
    <li>{$i * 2}</li>
```')
<md:fenced-code language="xquery">for $i in 1 to 10
return
    &lt;li&gt;{$i * 2}&lt;/li&gt;</md:fenced-code>
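
To make the failure mode in #4 and #18 concrete, the following Java sketch contrasts a substitution applied to the whole input with one that skips fenced regions. The `LABEL` pattern is a hypothetical stand-in for the old label handler (its real regex is not shown in this PR), and the fence-skipping loop only illustrates the idea of treating code as opaque; flexmark-java itself is a full block parser and does not work this way.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of why a global regex pass mangles fenced code.
class FenceAwareSketch {
    // Stand-in for the old {label} handler; an assumption, not the original regex.
    static final Pattern LABEL = Pattern.compile("\\{([^}]+)\\}");
    // Matches a complete fenced block, including its content.
    static final Pattern FENCE = Pattern.compile("(?s)```.*?```");

    // Applies the label substitution everywhere, including inside code fences.
    static String naive(String md) {
        return LABEL.matcher(md).replaceAll(m ->
                Matcher.quoteReplacement("<span itemprop=\"" + m.group(1) + "\">"));
    }

    // Applies the substitution only outside fences; fenced text is copied untouched.
    static String fenceAware(String md) {
        StringBuilder out = new StringBuilder();
        Matcher fence = FENCE.matcher(md);
        int last = 0;
        while (fence.find()) {
            out.append(naive(md.substring(last, fence.start())));
            out.append(fence.group());
            last = fence.end();
        }
        out.append(naive(md.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "```xquery\n<li>{$i * 2}</li>\n```";
        System.out.println(naive(input));      // braces inside the fence are rewritten
        System.out.println(fenceAware(input)); // fence content is left as written
    }
}
```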

#6 — First paragraph missing

The regex parser dropped the first paragraph when input contained only a single block element. Now fixed:

md:parse("xx")
<md:document><md:paragraph>xx</md:paragraph></md:document>
count(md:parse("* a
* b
* c")//md:list-item)

3 (was 0 with the old parser)

#17 — Parsing of mark element in inline HTML

Partially addressed. The <mark> element is no longer dropped from output. In md:parse(), inline HTML is captured in md:html-inline elements that preserve the original tags. However, md:to-html() escapes inline HTML per XML serialization rules rather than passing it through as raw HTML — a known limitation of XML-based output.

#19 — Markdown interleaved in HTML blocks is mangled

Now follows CommonMark spec behavior. As noted in the issue, CommonMark itself does not support markdown interleaved inside HTML blocks — the spec treats HTML blocks as opaque. The old parser attempted this but produced mangled output (extra <body/> elements, ejected paragraphs). The new parser produces correct CommonMark-compliant output.

#14 (PR) — Code blocks properly identified

Superseded. The regex fix is no longer relevant since the entire parser is replaced.

#22, #24, #30 (PRs) — Dependabot Gulp bumps

Superseded. Gulp is no longer used in the build system.

Test plan

  • 58 XQSuite tests pass against eXist-db 7.0.0-SNAPSHOT (joewiz/exist, next branch)
  • Installation smoke tests confirm all functions are available
  • Structural round-trip tests confirm md:parse(md:serialize(md:parse($input))) preserves structure
  • Parser options tests confirm profile/extension selection works
  • Test installation on eXist-db 6.2.0
  • Verify eXist-db Sandbox integration with fenced code block language labels

🤖 Generated with Claude Code

@joewiz force-pushed the feature/commonmark-gfm branch from 14731fa to 7fa7c41, March 18, 2026 21:26
Comment thread .github/workflows/exist.yml Outdated
uses: actions/setup-node@v6
with:
node-version: ${{ matrix.node-version }}
node-version: 22
Contributor

Use LTS/*

Member Author

Sounds good.

Comment thread .github/workflows/exist.yml Outdated

# Run XQSuite tests
- name: Run XQSuite tests
run: npx mocha test/xqs/*.js --exit
Contributor

0 tests are executed on CI.

Member Author

Good catch. I'll work on this.

Comment thread pom.xml Outdated

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.build.source>8</project.build.source>
Contributor

I'd be OK with dropping J8 compat and making this J21, eXist 7 only.

Member Author

eXist 7+ it is.

Comment thread pom.xml Outdated
<project.build.source>8</project.build.source>
<project.build.target>8</project.build.target>

<exist.java-api.version>6.4.1</exist.java-api.version>
Contributor

This is a ghost setting inherited from monex; let's not repeat it.

Can we have one setting for min processor version and compatibility?

Member Author

You're right, I modeled this on monex, since that came to mind first, but I will switch my modeling to semver.xq, which came to mind second. I don't know what project represents best practice. If you have a suggestion, I'll take it.

Comment thread pom.xml Outdated

<build>
<plugins>
<!-- Create uber-jar with flexmark dependencies shaded in -->
Contributor

Personally, I'm not a fan of uber jars.

Member Author

Ok, I'll de-uber.

/**
* Build a Parser configured from an XQuery options map.
*
* Supported options:
Contributor

These need to be in the README, and in the xqDoc function documentation.

Member Author

The README has it in https://github.com/joewiz/exist-markdown/blob/ca1e9b0c56b71d7d84730097fd46f18b9bd25dcb/README.md#parser-options. For embedding xqdoc in Java modules, could you point me to a good example of this?

}

static void buildXml(final Node node, final MemTreeBuilder builder) {
if (node instanceof Heading) {
Contributor

Switch?

Member Author

Can do.
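
For readers following the thread, this is roughly what replacing an `instanceof` chain with a pattern-matching `switch` looks like on Java 21. The node types below are hypothetical records standing in for flexmark's AST classes, and `toXml` returns a string rather than driving a `MemTreeBuilder`, so this is only a sketch of the shape of the change, not the PR's `buildXml`. It also does no XML escaping.

```java
// Hypothetical sketch: pattern-matching switch (Java 21) instead of an instanceof chain.
class NodeSwitchSketch {
    // Stand-ins for flexmark node classes; names and fields are assumptions.
    interface Node {}
    record Heading(int level, String text) implements Node {}
    record Paragraph(String text) implements Node {}
    record Text(String value) implements Node {}

    static String toXml(Node node) {
        return switch (node) {
            case Heading h ->
                    "<md:heading level=\"" + h.level() + "\">" + h.text() + "</md:heading>";
            case Paragraph p -> "<md:paragraph>" + p.text() + "</md:paragraph>";
            case Text t -> t.value();
            // The interface is not sealed, so a default branch is required.
            default -> throw new IllegalArgumentException(
                    "Unhandled node type: " + node.getClass().getName());
        };
    }

    public static void main(String[] args) {
        System.out.println(toXml(new Heading(1, "Hello")));
        System.out.println(toXml(new Paragraph("xx")));
    }
}
```

Because a library's node hierarchy is usually open rather than sealed, the compiler cannot check exhaustiveness, so the `default` branch carries the "unknown node" handling that the last `else` of an `instanceof` chain would otherwise hold.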

Contributor

See above about v6 compat.

Member Author

Will take out.

Comment thread test/xqs/test-suite.xqm
Contributor

I would use stronger assertions. assertTrue makes debugging hard in case of failures, as it hides the actual output.

Member Author

Will work on this.

@joewiz
Member Author

joewiz commented Mar 19, 2026

@duncdrum I've addressed all of your suggestions in b5a9b8f. Please let me know if anything is lacking or if you spot anything else.

@duncdrum
Contributor

@joewiz Looking good. The XQSuite execution on CI still looks odd: 0 passing. Or is this a display glitch on my phone?

@joewiz
Member Author

joewiz commented Mar 19, 2026

@duncdrum No, you're right. Looking into it...

@joewiz force-pushed the feature/commonmark-gfm branch 4 times, most recently from bf61655 to dbdd858, March 20, 2026 02:33
joewiz and others added 2 commits March 20, 2026 17:56
Replace the pure-XQuery regex-based markdown parser with a Java
extension module using flexmark-java 0.64.8 for full CommonMark and
GitHub Flavored Markdown compliance.

BREAKING CHANGES:
- md:parse() now returns md:* XML elements instead of HTML
- Custom config maps replaced by XQuery typeswitch on md:parse() output
- Requires eXist-db 7.0.0+

New API:
- md:parse($markdown) → document-node() with md:* elements
- md:parse($markdown, $options) → with parser profile/extension options
- md:to-html($input) → HTML nodes (accepts string or md:* nodes)
- md:to-html($input, $options) → with parser options
- md:serialize($nodes) → markdown string (round-trip)

Parser options map supports:
- "profile": commonmark, github (default), kramdown, markdown, etc.
- "extensions": tables, strikethrough, tasklist, autolink (all default)
- "hard-wraps": boolean

Type constants resolved via reflection (TypeCompat) to ensure binary
compatibility across different eXist-db 7.0.0-SNAPSHOT builds.

Fixes eXist-db#4 (XQuery code blocks mangled), eXist-db#6 (first paragraph missing),
eXist-db#18 (curly braces mangled in code blocks). Partially addresses eXist-db#17
(inline HTML mark element). Follows CommonMark spec for eXist-db#19 (HTML
blocks). Supersedes PR eXist-db#14.

58 XQSuite tests with %test:assertXPath assertions covering
installation, parsing, HTML rendering, serialization, structural
round-trips, and parser options.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
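
The reflection note in the commit message above can be illustrated with a small Java sketch. The Java compiler copies the value of a `static final int` compile-time constant into every class that references it, so a module compiled against one build keeps the old number even if the host later changes it; reading the field reflectively defers the lookup to runtime. That this is the motivation for `TypeCompat` is an assumption, and `HostTypes`, its value, and `resolve` are hypothetical names for the example.

```java
import java.lang.reflect.Field;

// Hypothetical sketch: read a static int constant at runtime instead of compile time.
class TypeCompatSketch {
    // Stand-in for a host class whose constant values may differ between builds.
    public static class HostTypes {
        public static final int STRING = 22; // arbitrary illustrative value
    }

    // Looks up a public static int field by name on the given class.
    static int resolve(Class<?> holder, String name) {
        try {
            Field field = holder.getField(name);
            return field.getInt(null); // null receiver because the field is static
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Constant not found: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve(HostTypes.class, "STRING"));
        System.out.println(resolve(Integer.class, "MAX_VALUE"));
    }
}
```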
- Add pom.xml using exist-apps-parent and kuberam-expath-plugin for
  EXPath/XAR packaging with Java module registration
- Add xar-assembly.xml with individual flexmark dependency sets and
  Java module component registration
- Target eXist-db 7.0.0-SNAPSHOT / Java 21
- CI workflow: Maven build, docker pull for fresh image, readiness
  check before XQSuite tests, run xqSuite.js via node (not mocha)
- Preserve package name URI (http://exist-db.org/apps/markdown) and
  abbrev ("markdown") for upgrade continuity with 2.x releases
- Update package.json metadata to match v3.0.0
- Use repo.exist-db.org repositories for releases and snapshots
- Supersedes Dependabot PRs eXist-db#22, eXist-db#24, eXist-db#30 (Gulp no longer used)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joewiz force-pushed the feature/commonmark-gfm branch from dbdd858 to 6ddf940, March 20, 2026 21:57
2 participants