docs: add raw markdown files for LLM crawling #207
jessiemongeon1 wants to merge 1 commit into main
  cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');

  // Remove self-closing JSX tags
  cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
Check failure
Code scanning / CodeQL
Incomplete multi-character sanitization High
Copilot Autofix
AI about 2 months ago
General approach: avoid relying on a single-pass, multi-character regex to fully sanitize/remove JSX/HTML components, especially when the intention is to prevent any HTML/JSX tags from surviving. Either (1) remove all angle-bracketed tags more robustly, or (2) repeatedly apply the existing tag-removal regexes until no changes occur, so that partially exposed fragments cannot re-form unsafe tags.
Best fix with minimal functional change here: keep the existing semantics (remove imports, convert <Card>, strip <Cards> wrappers, unwrap general JSX tags but keep their contents, remove self-closing tags, compress newlines) but ensure that tag removal is applied until the string stabilizes. This addresses the “multi-character sanitization” issue: whenever one round of replacements creates new matches (e.g., removing an inner tag leaves behind fragments that join into <script>), subsequent rounds will remove those too. We can implement this by turning the main body of cleanMdxComponents into a loop that re-applies the replacement steps until the text stops changing.
Concretely, in site/src/scripts/copy-markdown-files.js, modify cleanMdxComponents:
- Keep its signature and call sites unchanged.
- Inside, replace the current straight-line chain of cleaned = cleaned.replace(...) calls with:
  - A do { ... } while (cleaned !== previous); loop where each iteration:
    - Stores the current value in previous.
    - Applies all existing replacement steps (imports, <Card>, <Cards>, generic JSX wrapper, self-closing tags, newline compression) in the same order.
- Keep the final return cleaned.trim(); outside the loop.
No new imports or extra helper methods are required.
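To illustrate why a single pass can be incomplete, here is a minimal sketch that uses only the two tag-removal regexes from the script. The helper names stripOnce and stripUntilStable and the sample input are hypothetical, chosen for illustration; they are not part of the PR.

```javascript
// Hypothetical helper: one pass of the two tag-removal steps from the script.
function stripOnce(text) {
  let out = text;
  // Unwrap paired tags but keep their content
  out = out.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
  // Remove self-closing tags
  out = out.replace(/<\w+[^>]*\/>/g, '');
  return out;
}

// Hypothetical helper: repeat the pass until the string stops changing.
function stripUntilStable(text) {
  let current = text;
  let previous;
  do {
    previous = current;
    current = stripOnce(current);
  } while (current !== previous);
  return current;
}

// Removing the inner <b/> tags lets the surrounding fragments join into a new tag.
const input = '<<b/>script>alert(1)</<b/>script>';

console.log(stripOnce(input));        // <script>alert(1)</script>
console.log(stripUntilStable(input)); // alert(1)
```

The looped version trades a few extra passes over the string for the guarantee that none of the listed patterns still matches when the function returns.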
@@ -18,26 +18,31 @@
  */
 function cleanMdxComponents(content) {
   let cleaned = content;
+  let previous;

-  // Remove import statements
-  cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
+  do {
+    previous = cleaned;

-  // Convert Card components to markdown links
-  cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
+    // Remove import statements
+    cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');

-  // Remove Cards wrapper
-  cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
-  cleaned = cleaned.replace(/<\/Cards>/g, '');
+    // Convert Card components to markdown links
+    cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');

-  // Remove other common JSX components but keep their content
-  cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
+    // Remove Cards wrapper
+    cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
+    cleaned = cleaned.replace(/<\/Cards>/g, '');

-  // Remove self-closing JSX tags
-  cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
+    // Remove other common JSX components but keep their content
+    cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');

-  // Clean up excessive newlines
-  cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
+    // Remove self-closing JSX tags
+    cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');

+    // Clean up excessive newlines
+    cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
+  } while (cleaned !== previous);
+
   return cleaned.trim();
 }
     "docusaurus": "docusaurus",
     "start": "docusaurus start",
-    "build": "docusaurus build",
+    "build": "node src/scripts/copy-markdown-files.js; docusaurus build",
I think we should do something like build-prod to include these two and not break existing build flows during regular work / updates / rebuilds.
Sure, we can work this however you'd like for this repo since you own it. I simply applied what we're using for all other docs sites at this time.
 * Removes or simplifies MDX/JSX components for cleaner markdown
 */
function cleanMdxComponents(content) {
  let cleaned = content;
Isn't there a library like strip html that can do this? I'm worried that we'll eventually hit a condition that's not covered by these rules. And there's no reliable way to test.
I'm not sure; our DevRel team wrote this script, so I trusted their judgement on this.
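On the testing concern, one possible approach is a small table of input/expected pairs run against the function. The sketch below copies the straight-line replacement steps shown in this PR; the cases table and the runCases helper are hypothetical and only cover three of the rules.

```javascript
// Copy of the straight-line replacement steps from the PR, for illustration.
function cleanMdxComponents(content) {
  let cleaned = content;
  // Remove import statements
  cleaned = cleaned.replace(/^import\s+.*?from\s+['"].*?['"];?\s*$/gm, '');
  // Convert Card components to markdown links
  cleaned = cleaned.replace(/<Card[^>]*title="([^"]*)"[^>]*href="([^"]*)"[^>]*\/>/g, '- [$1]($2)');
  // Remove Cards wrapper
  cleaned = cleaned.replace(/<Cards[^>]*>/g, '');
  cleaned = cleaned.replace(/<\/Cards>/g, '');
  // Remove other common JSX components but keep their content
  cleaned = cleaned.replace(/<(\w+)[^>]*>(.*?)<\/\1>/gs, '$2');
  // Remove self-closing JSX tags
  cleaned = cleaned.replace(/<\w+[^>]*\/>/g, '');
  // Clean up excessive newlines
  cleaned = cleaned.replace(/\n{3,}/g, '\n\n');
  return cleaned.trim();
}

// Hypothetical test table: each rule gets at least one input/expected pair.
const cases = [
  { name: 'import line', input: "import Foo from './foo';\n\n# Title", expected: '# Title' },
  { name: 'Card to link', input: '<Card title="Intro" href="/intro" />', expected: '- [Intro](/intro)' },
  { name: 'wrapper unwrapped', input: '<Tip>Use **bold**</Tip>', expected: 'Use **bold**' },
];

// Hypothetical helper: returns the names of the cases whose output differs.
function runCases(fn, table) {
  return table.filter((c) => fn(c.input) !== c.expected).map((c) => c.name);
}

const failures = runCases(cleanMdxComponents, cases);
console.log(failures.length === 0 ? 'all cases pass' : `failing: ${failures.join(', ')}`);
```

A table like this does not prove the rules are complete, but each uncovered condition that turns up later can be added as one more row.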
  });
}

console.log('📝 Starting markdown export...');
Nit: we don't use emojis anywhere else.
See my previous comment: you're free to rework this however you'd like, given that you own this repo.
damirka left a comment
One thing that worries me: is it possible that a page marked as draft: true would be copied and available for fetching? That's undesirable.
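One way the export script could guard against this, as a sketch only: inspect each file's front matter and skip the file when it declares draft: true. The isDraft helper below is hypothetical and is not part of the PR; it assumes front matter delimited by --- lines at the very top of the file and a plain draft: true entry.

```javascript
// Hypothetical helper: returns true when the front matter block contains "draft: true".
function isDraft(content) {
  // Front matter is assumed to be the block between the first two '---' lines.
  const match = content.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return false;
  // Only a top-level "draft: true" line inside the front matter counts.
  return /^draft:[ \t]*true[ \t]*$/m.test(match[1]);
}

const draftPage = '---\ntitle: Work in progress\ndraft: true\n---\n# Hidden page';
const publicPage = '---\ntitle: Guide\n---\nThis text mentions draft: true in the body.';

console.log(isDraft(draftPage));  // true
console.log(isDraft(publicPage)); // false
```

A full YAML parser would handle quoting and nesting more robustly; the regex keeps the sketch dependency-free.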
Hey @jessiemongeon1, closing this one in favour of #209!
Generates a /markdown folder that contains markdown files served in the browser as raw markdown for LLM crawling. Since the regular docs files also use .md format, these files needed to be in a separate folder, and since they are for crawling purposes only, the different URL prefix should be okay.