feat: add web URL content fetcher for chat context #2111

Stijnus merged 1 commit into stackblitz-labs:main
Conversation
Add ability to fetch and inject web page content into chat as context. Includes SSRF protection (blocks private IPs, localhost), content extraction (strips scripts/styles/nav), and a clean popover UI. Reimplements the concept from PR stackblitz-labs#1703 without the issues of that PR (duplicated ChatBox, dual API routes, SSRF vulnerability, window.prompt UX).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
This PR adds a web URL content fetcher feature that allows users to fetch content from web pages and inject it into the chat context. The implementation includes a globe button in the chat toolbar that opens a popover for URL input, server-side content fetching with HTML parsing, and SSRF protection attempts. This is a cleaner reimplementation of PR #1703, avoiding code duplication and following Remix conventions.
Changes:
- Added URL validation utilities with SSRF protection (IPv4 private IP blocking)
- Implemented server-side API route for fetching and extracting web page content
- Created client-side popover UI component for URL input with loading/error states
- Integrated the web search feature into the chat interface
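
For readers unfamiliar with the pattern, here is a minimal sketch of what an IPv4-focused guard like `isAllowedUrl` can look like. The ranges and structure are illustrative assumptions based on the PR description, not the PR's actual code:

```typescript
// Hypothetical sketch of an SSRF guard in the spirit of the PR's isAllowedUrl.
// Blocks non-HTTP(S) schemes, localhost, and common private IPv4 ranges.
const PRIVATE_IPV4 = [
  /^10\./,                      // 10.0.0.0/8
  /^127\./,                     // loopback
  /^192\.168\./,                // 192.168.0.0/16
  /^172\.(1[6-9]|2\d|3[01])\./, // 172.16.0.0/12
  /^169\.254\./,                // link-local (cloud metadata endpoints)
  /^0\.0\.0\.0$/,               // unspecified
];

function isAllowedUrl(raw: string): boolean {
  let url: URL;

  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable URL
  }

  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return false; // block file:, ftp:, data:, etc.
  }

  const host = url.hostname.toLowerCase();

  if (host === 'localhost' || host === '::1' || host === '[::1]') {
    return false;
  }

  return !PRIVATE_IPV4.some((re) => re.test(host));
}
```

Note that a hostname check alone does not catch DNS names that resolve to private addresses; a production guard would also validate the resolved IP.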
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| app/utils/url.ts | New utility for URL validation with SSRF protection (IPv4 only) |
| app/routes/api.web-search.ts | Server-side API route for fetching URLs and extracting HTML content |
| app/components/chat/WebSearch.client.tsx | Client-side popover component for URL input and result handling |
| app/components/chat/ChatBox.tsx | Added WebSearch button to chat toolbar |
| app/components/chat/BaseChat.tsx | Added onWebSearchResult prop passthrough |
| app/components/chat/Chat.client.tsx | Handler for prepending fetched content to chat input |
```ts
console.error('Web search error:', error);

return json({ error: error instanceof Error ? error.message : 'Failed to fetch URL' }, { status: 500 });
```
The error message on line 102 could expose internal error details to the client. Error messages might contain sensitive information about the internal system, network configuration, or implementation details that could help an attacker.

Consider using a generic error message for production environments, while logging the detailed error server-side:

```ts
return json({ error: 'Failed to fetch URL' }, { status: 500 });
```

The detailed error from `error.message` is already being logged on line 100 for debugging purposes.
Suggested change:

```diff
- return json({ error: error instanceof Error ? error.message : 'Failed to fetch URL' }, { status: 500 });
+ return json({ error: 'Failed to fetch URL' }, { status: 500 });
```
```ts
export async function action({ request }: ActionFunctionArgs) {
  if (request.method !== 'POST') {
    return json({ error: 'Method not allowed' }, { status: 405 });
  }

  try {
    const { url } = (await request.json()) as { url?: string };

    if (!url || typeof url !== 'string') {
      return json({ error: 'URL is required' }, { status: 400 });
    }

    if (!isAllowedUrl(url)) {
      return json({ error: 'URL is not allowed. Only public HTTP/HTTPS URLs are accepted.' }, { status: 400 });
    }

    const response = await fetch(url, {
      headers: FETCH_HEADERS,
      signal: AbortSignal.timeout(10_000),
    });

    if (!response.ok) {
      return json({ error: `Failed to fetch URL: ${response.status} ${response.statusText}` }, { status: 502 });
    }

    const contentType = response.headers.get('content-type') || '';

    if (!contentType.includes('text/html') && !contentType.includes('text/plain')) {
      return json({ error: 'URL must point to an HTML or text page' }, { status: 400 });
    }

    const html = await response.text();
    const title = extractTitle(html);
    const description = extractMetaDescription(html);
    const content = extractTextContent(html);

    return json({
      success: true,
      data: {
        title,
        description,
        content: content.length > MAX_CONTENT_LENGTH ? content.slice(0, MAX_CONTENT_LENGTH) + '...' : content,
        sourceUrl: url,
      },
    });
  } catch (error) {
    if (error instanceof DOMException && error.name === 'TimeoutError') {
      return json({ error: 'Request timed out after 10 seconds' }, { status: 504 });
    }

    console.error('Web search error:', error);

    return json({ error: error instanceof Error ? error.message : 'Failed to fetch URL' }, { status: 500 });
  }
}
```
There's no rate limiting or abuse prevention mechanism. A user could rapidly make many requests to external URLs, potentially:
- Using the server as a proxy for DDoS attacks
- Consuming excessive bandwidth and resources
- Making the server appear as the source of attacks to external services
Consider implementing rate limiting at the API route level, possibly using the user's IP address or session identifier. Other API routes in the codebase may have similar protections you can reference.
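
A minimal sketch of the kind of in-memory limiter the reviewer suggests. The window size, request limit, and client key are illustrative assumptions, not taken from this codebase:

```typescript
// Hypothetical sliding-window rate limiter keyed by client identifier
// (e.g. IP address). A real deployment would use a shared store instead
// of process memory so limits survive restarts and multiple workers.
const WINDOW_MS = 60_000;  // 1-minute window (assumed)
const MAX_REQUESTS = 10;   // max fetches per window (assumed)
const hits = new Map<string, number[]>();

function isRateLimited(clientKey: string, now = Date.now()): boolean {
  // Keep only timestamps inside the current window.
  const recent = (hits.get(clientKey) ?? []).filter((t) => now - t < WINDOW_MS);

  if (recent.length >= MAX_REQUESTS) {
    hits.set(clientKey, recent);
    return true; // caller should respond with 429 Too Many Requests
  }

  recent.push(now);
  hits.set(clientKey, recent);

  return false;
}
```

In the route, a guard like `if (isRateLimited(clientIp)) return json({ error: 'Too many requests' }, { status: 429 });` before the fetch would be enough for a first pass.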
```ts
function extractTextContent(html: string): string {
  return html
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
    .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
    .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
    .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
    .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/&nbsp;/g, ' ')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
```
The HTML entity decoding is incomplete. The code only handles a few common entities (`&nbsp;`, `&amp;`, `&lt;`, `&gt;`, `&quot;`, `&#39;`), but HTML has hundreds of named entities (e.g., `&apos;`, `&mdash;`, `&copy;`) and numeric character references (e.g., `&#160;`, `&#8217;`).

This will result in unreadable content when fetching pages that use other entities. Consider using a more comprehensive approach or at least adding support for:

- `&apos;` (apostrophe)
- Numeric character references, with regexes like `/&#(\d+);/g` and `/&#x([0-9a-fA-F]+);/g`
- More common entities like `&mdash;`, `&ndash;`, `&hellip;`, etc.
Suggested change (replacing the block quoted above; note `String.fromCodePoint` is used rather than `String.fromCharCode`, since only the former rejects invalid code points and handles characters outside the Basic Multilingual Plane):

```ts
function decodeHtmlEntities(text: string): string {
  const namedEntities: Record<string, string> = {
    '&nbsp;': ' ',
    '&amp;': '&',
    '&lt;': '<',
    '&gt;': '>',
    '&quot;': '"',
    '&apos;': "'",
    '&#39;': "'",
    '&mdash;': '—',
    '&ndash;': '–',
    '&hellip;': '…',
  };

  // Replace named entities we know about
  let result = text.replace(
    /&(nbsp|amp|lt|gt|quot|apos|mdash|ndash|hellip);|&#39;/g,
    (match) => namedEntities[match] ?? match,
  );

  // Decimal numeric character references: &#160;
  result = result.replace(/&#(\d+);/g, (_, dec: string) => {
    const codePoint = Number(dec);

    if (!Number.isFinite(codePoint) || codePoint <= 0) {
      return _;
    }

    try {
      return String.fromCodePoint(codePoint);
    } catch {
      return _;
    }
  });

  // Hex numeric character references: &#x2014;
  result = result.replace(/&#x([0-9a-fA-F]+);/g, (_, hex: string) => {
    const codePoint = parseInt(hex, 16);

    if (!Number.isFinite(codePoint) || codePoint <= 0) {
      return _;
    }

    try {
      return String.fromCodePoint(codePoint);
    } catch {
      return _;
    }
  });

  return result;
}

function extractTextContent(html: string): string {
  const withoutTags = html
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
    .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
    .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
    .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
    .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
    .replace(/<[^>]+>/g, ' ');

  const decoded = decodeHtmlEntities(withoutTags);

  return decoded
```
```ts
const match = html.match(/<meta[^>]*name=["']description["'][^>]*content=["']([^"']*)["'][^>]*>/i);

if (match) {
  return match[1].trim();
}

// Try reverse attribute order
const altMatch = html.match(/<meta[^>]*content=["']([^"']*)["'][^>]*name=["']description["'][^>]*>/i);

return altMatch ? altMatch[1].trim() : '';
```
The regex for extracting the meta description doesn't handle escaped quotes within attribute values. For example:

```html
<meta name="description" content="She said \"Hello\" to me">
```

The pattern `["']([^"']*)["']` will match from the first quote to the escaped quote, stopping prematurely. This could result in incomplete or incorrect description extraction.

Consider using more robust HTML parsing logic or a pattern that handles escaped quotes, though regex-based HTML parsing has inherent limitations.
Suggested change:

```diff
- const match = html.match(/<meta[^>]*name=["']description["'][^>]*content=["']([^"']*)["'][^>]*>/i);
-
- if (match) {
-   return match[1].trim();
- }
-
- // Try reverse attribute order
- const altMatch = html.match(/<meta[^>]*content=["']([^"']*)["'][^>]*name=["']description["'][^>]*>/i);
-
- return altMatch ? altMatch[1].trim() : '';
+ const match = html.match(/<meta[^>]*name=["']description["'][^>]*content=(["'])((?:\\.|(?!\1).)*)\1[^>]*>/i);
+
+ if (match) {
+   return match[2].trim();
+ }
+
+ // Try reverse attribute order
+ const altMatch = html.match(/<meta[^>]*content=(["'])((?:\\.|(?!\1).)*)\1[^>]*name=["']description["'][^>]*>/i);
+
+ return altMatch ? altMatch[2].trim() : '';
```
```ts
function extractTextContent(html: string): string {
  return html
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
    .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
    .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
    .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
    .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
```
The regex patterns for removing script, style, nav, header, and footer tags are not comprehensive and can be bypassed. These patterns fail to handle:

- Self-closing tags: `<script src="..." />` won't be matched
- Multiline or nested tags: complex nested structures may not be handled correctly
- Tag attributes with angle brackets: `<script data-content="<test>">` could break the pattern
- Case sensitivity: while the `i` flag is used, the pattern structure makes assumptions about tag content

The regex approach to HTML parsing is fundamentally flawed for security purposes. Consider marking script/style removal as best-effort content cleaning rather than a security boundary, and document that malicious HTML could potentially inject unwanted content into the LLM context.
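
One way to act on that "best-effort, not a security boundary" framing is to label fetched text as untrusted when it is injected into the prompt, so the model can distinguish page content from user instructions. A hypothetical sketch; the wrapper function and its marker strings are not part of the PR:

```typescript
// Illustrative only: fence untrusted web content off from the user's own
// prompt with explicit delimiters. This does not prevent prompt injection,
// but it gives the model (and system prompt) something to anchor on.
function wrapFetchedContent(sourceUrl: string, text: string): string {
  return [
    `[Web content from ${sourceUrl} — untrusted, for reference only]`,
    text,
    '[End of web content]',
  ].join('\n');
}
```

The system prompt could then instruct the model to treat anything between those markers as data, never as instructions.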
```ts
  /^0\.0\.0\.0$/, // Unspecified
];

const BLOCKED_HOSTNAMES = new Set(['localhost', '[::1]', '0.0.0.0']);
```
The [::1] entry in BLOCKED_HOSTNAMES should be just ::1 without brackets. The URL parsing in new URL() automatically removes brackets from IPv6 addresses when extracting the hostname. For example, new URL('http://[::1]:3000').hostname returns ::1, not [::1]. This means IPv6 localhost addresses are not being blocked.
Suggested change:

```diff
- const BLOCKED_HOSTNAMES = new Set(['localhost', '[::1]', '0.0.0.0']);
+ const BLOCKED_HOSTNAMES = new Set(['localhost', '::1', '0.0.0.0']);
```
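
Runtimes are not uniform on whether `URL.hostname` strips the brackets from an IPv6 literal (legacy parsers strip them; the WHATWG serialization keeps them), so a defensive blocklist can simply carry both spellings. A sketch, not the PR's code:

```typescript
// Carry both the bracketed and bare forms of IPv6 loopback so the check
// holds regardless of how the URL parser serializes the hostname.
const BLOCKED_HOSTNAMES = new Set(['localhost', '::1', '[::1]', '0.0.0.0']);

function isBlockedHost(raw: string): boolean {
  return BLOCKED_HOSTNAMES.has(new URL(raw).hostname.toLowerCase());
}
```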
```ts
    },
  });
} catch (error) {
  if (error instanceof DOMException && error.name === 'TimeoutError') {
```
The timeout error detection is incomplete. In modern JavaScript environments, a fetch timeout typically throws a `TimeoutError` from the `AbortSignal`, but depending on the runtime it may instead surface as an `AbortError` (a `DOMException` with name `'AbortError'`).

The error handling should check for both:

```ts
if (error instanceof DOMException && (error.name === 'TimeoutError' || error.name === 'AbortError'))
```

Or check the error message, as some environments throw different error types for timeouts.
Suggested change:

```diff
- if (error instanceof DOMException && error.name === 'TimeoutError') {
+ if (error instanceof DOMException && (error.name === 'TimeoutError' || error.name === 'AbortError')) {
```
```ts
  setUrl('');
  setIsOpen(false);
} catch (error) {
  toast.error(error instanceof Error ? error.message : 'Failed to fetch URL');
```
The error message from the API is displayed directly to the user via toast without sanitization. While React generally escapes content, if the API returns malicious HTML or scripts in error messages, this could potentially be a security issue depending on how react-toastify handles the content.
Consider explicitly sanitizing the error message or ensuring it's displayed as plain text only.
```ts
    // Update the input via the same mechanism as handleInputChange
    const syntheticEvent = {
      target: { value: newInput },
    } as React.ChangeEvent<HTMLTextAreaElement>;
    handleInputChange(syntheticEvent);
  },
  [input, handleInputChange],
```
Creating a synthetic event to call `handleInputChange` is unnecessary since `setInput` is available from the `useChat` hook (line 125). Using `setInput` directly would be cleaner and more maintainable:

```ts
const newInput = currentInput.length > 0 ? `${result}\n\n${currentInput}` : result;
setInput(newInput);
```

This avoids creating artificial event objects and makes the code more straightforward. The current approach works but adds unnecessary complexity.
Suggested change:

```diff
- // Update the input via the same mechanism as handleInputChange
- const syntheticEvent = {
-   target: { value: newInput },
- } as React.ChangeEvent<HTMLTextAreaElement>;
- handleInputChange(syntheticEvent);
- },
- [input, handleInputChange],
+ setInput(newInput);
+ },
+ [input, setInput],
```
```ts
.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
.replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
.replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
.replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
```
The regex patterns for removing HTML tags have a fundamental flaw - they use negated character classes [^<] which will match newlines and can cause catastrophic backtracking on large HTML documents. The pattern <script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script> tries to match everything between script tags but the nested quantifiers create exponential complexity.
For large HTML responses, this could cause:
- Extremely slow processing (potentially minutes for megabytes of HTML)
- High CPU usage
- Denial of service
Consider using a simpler approach like:

```ts
.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, ' ')
.replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, ' ')
```

The `[\s\S]*?` pattern with non-greedy matching is much more efficient and still handles most cases correctly.
Suggested change:

```diff
- .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
- .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
- .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
- .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
- .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
+ .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, ' ')
+ .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, ' ')
+ .replace(/<nav\b[^>]*>[\s\S]*?<\/nav>/gi, ' ')
+ .replace(/<header\b[^>]*>[\s\S]*?<\/header>/gi, ' ')
+ .replace(/<footer\b[^>]*>[\s\S]*?<\/footer>/gi, ' ')
```
Summary
This is a clean reimplementation of the concept from PR #1703, addressing all the issues found in that PR:

- A single `api.web-search.ts` route following Remix conventions
- A popover UI instead of `window.prompt()`
- Native `fetch` and regex-based HTML parsing
- No `package-lock.json` churn, no logo files, no `build.d.ts`

New files:

- `app/utils/url.ts` — URL validation with SSRF protection
- `app/routes/api.web-search.ts` — Server-side URL fetcher with HTML content extraction
- `app/components/chat/WebSearch.client.tsx` — Popover UI component with loading/error states

Modified files:

- `app/components/chat/ChatBox.tsx` — Added WebSearch button to toolbar
- `app/components/chat/BaseChat.tsx` — Pass-through for `onWebSearchResult` prop
- `app/components/chat/Chat.client.tsx` — Handler that prepends fetched content to input

Test plan:

- `http://localhost:3000`, `http://127.0.0.1`, `http://169.254.169.254` — should be rejected

🤖 Generated with Claude Code