docs: add raw markdown files for LLM crawling by jessiemongeon1 · Pull Request #207 · MystenLabs/move-book

jessiemongeon1 · 2026-02-10T23:08:59Z

Generates a /markdown folder that contains markdown files served in the browser as raw markdown for LLM crawling. Since the regular docs files also use .md format, these files needed to be in a separate folder, and since they are for crawling purposes only, the different URL prefix should be okay.

site/src/scripts/copy-markdown-files.js

+  cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
+
+  // Remove self-closing JSX tags
+  cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');


Check failure · Copilot Autofix

General approach: avoid relying on a single-pass, multi-character regex to fully sanitize/remove JSX/HTML components, especially when the intention is to prevent any HTML/JSX tags from surviving. Either (1) remove all angle-bracketed tags more robustly, or (2) repeatedly apply the existing tag-removal regexes until no changes occur, so that partially exposed fragments cannot re-form unsafe tags.

Best fix with minimal functional change here: keep the existing semantics (remove imports, convert <Card>, strip <Cards> wrappers, unwrap general JSX tags but keep their contents, remove self-closing tags, compress newlines) but ensure that tag removal is applied until the string stabilizes. This addresses the “multi-character sanitization” issue, because any time the first round of replacements creates new matches (e.g., a sequence evolving into <script), subsequent rounds will remove those too. We can implement this by turning the main body of cleanMdxComponents into a loop that re-applies the replacement steps until the text stops changing.

Concretely, in site/src/scripts/copy-markdown-files.js, modify cleanMdxComponents:

- Keep its signature and call sites unchanged.
- Inside, replace the current straight-line chain of cleaned = cleaned.replace(...) calls with:
  - A do { ... } while (cleaned !== previous); loop where each iteration:
    - Stores the current value in previous.
    - Applies all existing replacement steps (imports, <Card>, <Cards>, generic JSX wrapper, self-closing tags, newline compression) in the same order.
- Keep the final return cleaned.trim(); outside the loop.

No new imports or extra helper methods are required.
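To illustrate why a single pass can leave tags behind, here is a minimal sketch. The helper name applyUntilStable and the maxRounds cap are hypothetical and not part of the PR; only the unwrap regex is quoted from the diff. Nested tags with the same name are enough to show the effect: the lazy match pairs the outer opening tag with the inner closing tag, so one pass leaves a complete tag in the output.

```javascript
// Hypothetical helper (not in the PR): re-applies a list of string
// transforms until the text stops changing or a safety cap is reached.
function applyUntilStable(text, steps, maxRounds = 100) {
  let current = text;
  for (let round = 0; round < maxRounds; round++) {
    // Run every step once, in order, on the current text.
    const next = steps.reduce((acc, step) => step(acc), current);
    if (next === current) return current; // fixed point reached
    current = next;
  }
  return current; // cap reached; return the last result
}

// The generic "unwrap" rule quoted from the PR diff.
const unwrapTags = (s) => s.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');

const nested = '<div><div>text</div></div>';
const singlePass = unwrapTags(nested);
const stable = applyUntilStable(nested, [unwrapTags]);

console.log(singlePass); // <div>text</div>
console.log(stable);     // text
```

The cap is only a guard against a rule set that never settles; with the rules in this PR each round should shrink the text, so the loop is expected to stop on its own.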

damirka · 2026-02-11T08:35:14Z

site/package.json

    "docusaurus": "docusaurus",
    "start": "docusaurus start",
-    "build": "docusaurus build",
+    "build": "node src/scripts/copy-markdown-files.js; docusaurus build",


I think we should do something like build-prod to include these two and not break existing build flows during regular work / updates / rebuilds.

jessiemongeon1 · 2026-02-11

Sure, we can handle this however you'd like for this repo, since you own it. I simply applied what we're currently using for all other docs sites.

damirka · 2026-02-11T08:37:13Z

site/src/scripts/copy-markdown-files.js

+ * Removes or simplifies MDX/JSX components for cleaner markdown
+ */
+function cleanMdxComponents(content) {
+  let cleaned = content;


Isn't there a library like strip html that can do this? I'm worried that we'll eventually hit a condition that's not covered by these rules. And there's no reliable way to test.

jessiemongeon1 · 2026-02-11

I'm not sure. Our DevRel team wrote this script, so I trusted their judgement on this.
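On the testing concern, one option is a small table of input/output pairs that pins down what the rules do today, including the cases they miss. The sketch below is an assumption about how that could look: the function body is reconstructed from the removed lines of the diff in this PR (the original single-pass version), and the cases array is hypothetical.

```javascript
// Reconstructed from the removed lines of the diff in this PR
// (single-pass version); the real file may differ in details.
function cleanMdxComponents(content) {
  let cleaned = content;
  cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
  cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
  cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
  cleaned = cleaned.replace(/<\/Cards>/g, '');
  cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
  cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
  cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
  return cleaned.trim();
}

// Hypothetical cases: each pairs an input with what the rules produce today.
const cases = [
  { name: 'import line', input: "import Tabs from '@theme/Tabs';\n\n# Title", expected: '# Title' },
  { name: 'card list', input: '<Cards>\n<Card title="A" href="/a" />\n</Cards>', expected: '- [A](/a)' },
  { name: 'simple wrapper', input: '<Tabs>content</Tabs>', expected: 'content' },
  // Gaps in the current rules, recorded so a change in behaviour is visible:
  { name: 'nested same-name tags', input: '<div><div>x</div></div>', expected: '<div>x</div>' },
  { name: 'tags inside inline code', input: 'Use `<b>bold</b>` here', expected: 'Use `bold` here' },
];

// Collect the cases whose actual output differs from the recorded one.
const failures = cases.filter((c) => cleanMdxComponents(c.input) !== c.expected);
console.log(`${cases.length - failures.length}/${cases.length} cases match`);
```

The last two cases are not desired behaviour. They document that a single pass leaves an inner tag behind and that tags inside inline code are stripped too, which is the kind of uncovered condition a parser-based library would be expected to handle.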

damirka · 2026-02-11T08:37:48Z

site/src/scripts/copy-markdown-files.js

+  });
+}
+
+console.log('📝 Starting markdown export...');


Nit: we don't use emojis anywhere else.

jessiemongeon1 · 2026-02-11

See the previous comment: you're free to rework this however you'd like, given that you own this repo.

damirka

One thing that worries me: is it possible that a page marked as draft: true would be copied and made available for fetching? That's undesirable.
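One way to guard against this would be to check each file's front matter before copying it. The sketch below assumes draft pages are flagged with a draft: true line in a leading front matter block, as the comment describes. The helper isDraft is hypothetical and not part of the PR, and it uses a plain regex rather than a YAML parser, so it only recognises the simple one-line form.

```javascript
// Hypothetical helper (not in the PR): returns true when the front matter
// at the top of a page contains a `draft: true` line.
function isDraft(content) {
  // Front matter is the block between the leading pair of `---` lines.
  const match = content.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return false; // no front matter, so not marked as draft
  return /^draft:\s*true\s*$/m.test(match[1]);
}

const draftPage = '---\ntitle: WIP\ndraft: true\n---\n\n# Work in progress';
const livePage = '---\ntitle: Intro\n---\n\n# Intro';

console.log(isDraft(draftPage)); // true
console.log(isDraft(livePage));  // false
```

A copy script could call such a check on each file's contents and skip the file when it returns true.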

damirka · 2026-03-12T12:08:53Z

Hey @jessiemongeon1, closing this one in favour of #209!

add raw markdown files for llm crawling

9f03aa5

jessiemongeon1 requested a review from damirka February 10, 2026 23:09

github-advanced-security bot found potential problems Feb 10, 2026


damirka reviewed Feb 11, 2026


damirka closed this Mar 12, 2026

@@ -18,26 +18,31 @@
              */
             function cleanMdxComponents(content) {
               let cleaned = content;
+              let previous;
-              // Remove import statements
-              cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
+              do {
+                previous = cleaned;
-              // Convert Card components to markdown links
-              cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
+                // Remove import statements
+                cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
-              // Remove Cards wrapper
-              cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
-              cleaned = cleaned.replace(/<\/Cards>/g, '');
+                // Convert Card components to markdown links
+                cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
-              // Remove other common JSX components but keep their content
-              cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
+                // Remove Cards wrapper
+                cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
+                cleaned = cleaned.replace(/<\/Cards>/g, '');
-              // Remove self-closing JSX tags
-              cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
+                // Remove other common JSX components but keep their content
+                cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
-              // Clean up excessive newlines
-              cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
+                // Remove self-closing JSX tags
+                cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
+                // Clean up excessive newlines
+                cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
+              } while (cleaned !== previous);
               return cleaned.trim();
             }

jessiemongeon1 wants to merge 1 commit into main from raw-markdown