docs: add raw markdown files for LLM crawling #207

Closed
jessiemongeon1 wants to merge 1 commit into main from raw-markdown

Conversation

@jessiemongeon1
Collaborator

Generates a /markdown folder that contains markdown files served in the browser as raw markdown for LLM crawling. Since the regular docs files also use .md format, these files needed to be in a separate folder, and since they are for crawling purposes only, the different URL prefix should be okay.

cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');

// Remove self-closing JSX tags
cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');

Check failure

Code scanning / CodeQL

Incomplete multi-character sanitization (severity: High)

This string may still contain <script, which may cause an HTML element injection vulnerability.

Copilot Autofix

AI about 2 months ago

General approach: avoid relying on a single-pass, multi-character regex to fully sanitize/remove JSX/HTML components, especially when the intention is to prevent any HTML/JSX tags from surviving. Either (1) remove all angle-bracketed tags more robustly, or (2) repeatedly apply the existing tag-removal regexes until no changes occur, so that partially exposed fragments cannot re-form unsafe tags.

Best fix with minimal functional change here: keep the existing semantics (remove imports, convert <Card>, strip <Cards> wrappers, unwrap general JSX tags but keep their contents, remove self-closing tags, compress newlines) but ensure that tag removal is applied until the string stabilizes. This addresses the “multi-character sanitization” issue, because any time the first round of replacements creates new matches (e.g., a sequence evolving into <script), subsequent rounds will remove those too. We can implement this by turning the main body of cleanMdxComponents into a loop that re-applies the replacement steps until the text stops changing.

Concretely, in site/src/scripts/copy-markdown-files.js, modify cleanMdxComponents:

  • Keep its signature and call sites unchanged.
  • Inside, replace the current straight-line chain of cleaned = cleaned.replace(...) calls with:
    • A do { ... } while (cleaned !== previous); loop where each iteration:
      • Stores the current value in previous.
      • Applies all existing replacement steps (imports, <Card>, <Cards>, generic JSX wrapper, self-closing tags, newline compression) in the same order.
  • Keep the final return cleaned.trim(); outside the loop.

No new imports or extra helper methods are required.
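
To make the single-pass concern concrete, the sketch below applies the two tag-stripping regexes from the script once, and then repeatedly until the output stops changing. The helper names `stripOnce` and `stripUntilStable` and the sample inputs are illustrative assumptions, not part of the PR.

```javascript
// Illustrative helpers (hypothetical names); the regexes are the two
// tag-stripping patterns quoted from copy-markdown-files.js.
function stripOnce(text) {
  return text
    .replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2') // unwrap paired tags
    .replace(/<\w+[^>]*\/>/g, '');              // drop self-closing tags
}

function stripUntilStable(text) {
  let cleaned = text;
  let previous;
  do {
    previous = cleaned;
    cleaned = stripOnce(cleaned);
  } while (cleaned !== previous);
  return cleaned;
}

// Removing the inner <x/> fragments re-forms a <script> element.
const nested = '<<x/>script>alert(1)<<x/>/script>';
console.log(stripOnce(nested));        // <script>alert(1)</script>
console.log(stripUntilStable(nested)); // alert(1)

// Without a closing tag, neither regex matches the re-formed tag.
const unclosed = '<<x/>script>alert(1)';
console.log(stripUntilStable(unclosed)); // <script>alert(1)
```

In this sketch the loop handles the re-formed pair but not the unclosed tag, so repetition alone may not remove every `<script` sequence. Escaping or removing any remaining `<` characters would be one conservative fallback.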


Suggested changeset 1
site/src/scripts/copy-markdown-files.js

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/site/src/scripts/copy-markdown-files.js b/site/src/scripts/copy-markdown-files.js
--- a/site/src/scripts/copy-markdown-files.js
+++ b/site/src/scripts/copy-markdown-files.js
@@ -18,26 +18,31 @@
  */
 function cleanMdxComponents(content) {
   let cleaned = content;
+  let previous;
 
-  // Remove import statements
-  cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
+  do {
+    previous = cleaned;
 
-  // Convert Card components to markdown links
-  cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
+    // Remove import statements
+    cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
 
-  // Remove Cards wrapper
-  cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
-  cleaned = cleaned.replace(/<\/Cards>/g, '');
+    // Convert Card components to markdown links
+    cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
 
-  // Remove other common JSX components but keep their content
-  cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
+    // Remove Cards wrapper
+    cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
+    cleaned = cleaned.replace(/<\/Cards>/g, '');
 
-  // Remove self-closing JSX tags
-  cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
+    // Remove other common JSX components but keep their content
+    cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
 
-  // Clean up excessive newlines
-  cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
+    // Remove self-closing JSX tags
+    cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
 
+    // Clean up excessive newlines
+    cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
+  } while (cleaned !== previous);
+
   return cleaned.trim();
 }
 
EOF

Copilot is powered by AI and may make mistakes. Always verify output.

     "docusaurus": "docusaurus",
     "start": "docusaurus start",
-    "build": "docusaurus build",
+    "build": "node src/scripts/copy-markdown-files.js; docusaurus build",
Collaborator

I think we should do something like build-prod to include these two and not break existing build flows during regular work / updates / rebuilds.

Collaborator Author

Sure, we can work this however you'd like for this repo since you own it. I simply applied what we're using for all other docs sites at this time.

 * Removes or simplifies MDX/JSX components for cleaner markdown
 */
function cleanMdxComponents(content) {
  let cleaned = content;
Collaborator

Isn't there a library like strip html that can do this? I'm worried that we'll eventually hit a condition that's not covered by these rules. And there's no reliable way to test.

Collaborator Author

I'm not sure. Our DevRel team wrote this script, so I trusted their judgement on this.

});
}

console.log('📝 Starting markdown export...');
Collaborator

Nit: we don't use emojis anywhere else.

Collaborator Author

See prev. comment regarding freedom to rework this however you'd like given you own this repo.

Collaborator

@damirka damirka left a comment

One thing that worries me: is it possible that a page marked as draft: true would be copied and available for fetching? That's undesirable.
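
A guard against this could check each file's front matter before copying. The sketch below is a minimal version under stated assumptions: `isDraft` is a hypothetical helper that does not exist in the PR, and it only recognizes a literal `draft: true` line inside a leading `---` block. A real script might prefer a proper YAML front matter parser.

```javascript
// Hypothetical helper, not part of the PR: returns true when the leading
// front matter block contains a `draft: true` line.
function isDraft(content) {
  // Capture the text between the opening and closing `---` fences.
  const match = /^---\r?\n([\s\S]*?)\r?\n---/.exec(content);
  if (!match) return false; // no front matter, so not a draft
  return /^draft:\s*true\s*$/m.test(match[1]);
}

// Invented sample pages for illustration.
const draftPage = '---\ntitle: A\ndraft: true\n---\n# A';
const publishedPage = '---\ntitle: B\n---\ndraft: true appears only in the body';

console.log(isDraft(draftPage));     // true
console.log(isDraft(publishedPage)); // false
```

The copy step could then skip any file for which `isDraft` returns true, assuming it reads each file's content before writing it out.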

@damirka
Collaborator

damirka commented Mar 12, 2026

Hey @jessiemongeon1, closing this one in favour of #209!

@damirka damirka closed this Mar 12, 2026
3 participants