feat: add web URL content fetcher for chat context #2111

Merged
Stijnus merged 1 commit into stackblitz-labs:main from Stijnus:feat/web-url-fetcher
Feb 7, 2026

Stijnus (Collaborator) commented Feb 5, 2026

Summary

  • Adds a globe button in the chat toolbar that opens a URL input popover
  • Fetches web page content server-side and injects it as context into the chat input
  • Includes SSRF protection (blocks private IPs, localhost, link-local addresses)

This is a clean reimplementation of the concept from PR #1703, addressing all the issues found in that PR:

  • No component duplication — uses the existing ChatBox component, no copy-paste
  • Single API route — one api.web-search.ts route following Remix conventions
  • SSRF protection — URL validation blocks private/internal IPs and localhost
  • Clean UX — popover with text input instead of window.prompt()
  • No new dependencies — uses native fetch and regex-based HTML parsing
  • No unrelated changes — no package-lock.json, no logo files, no build.d.ts

New files

  • app/utils/url.ts — URL validation with SSRF protection
  • app/routes/api.web-search.ts — Server-side URL fetcher with HTML content extraction
  • app/components/chat/WebSearch.client.tsx — Popover UI component with loading/error states
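
The URL validation described for app/utils/url.ts could look roughly like the sketch below. This is not the code from the PR: the function name `isAllowedUrlSketch`, the pattern list, and the blocked-host set are assumptions chosen to match the ranges named in the summary (private IPs, localhost, link-local).

```typescript
// Hypothetical sketch of URL validation with basic SSRF checks.
// Names and patterns are illustrative, not copied from app/utils/url.ts.
const PRIVATE_IPV4_PATTERNS: RegExp[] = [
  /^127\./, // loopback
  /^10\./, // RFC 1918 private range
  /^172\.(1[6-9]|2\d|3[01])\./, // RFC 1918 private range
  /^192\.168\./, // RFC 1918 private range
  /^169\.254\./, // link-local, includes cloud metadata endpoints
  /^0\.0\.0\.0$/, // unspecified address
];

// Both bracketed and bare forms are listed so the check does not depend
// on how the URL parser serializes IPv6 hostnames.
const BLOCKED_HOSTS = new Set(['localhost', '[::1]', '::1']);

function isAllowedUrlSketch(raw: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }

  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return false; // only plain web URLs are accepted
  }

  const host = parsed.hostname.toLowerCase();
  if (BLOCKED_HOSTS.has(host)) {
    return false;
  }

  return !PRIVATE_IPV4_PATTERNS.some((pattern) => pattern.test(host));
}
```

A hostname check of this kind only inspects the literal URL. It does not cover DNS names that resolve to private addresses or redirects to internal hosts, so it is best treated as a first filter rather than complete protection.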

Modified files

  • app/components/chat/ChatBox.tsx — Added WebSearch button to toolbar
  • app/components/chat/BaseChat.tsx — Pass-through for onWebSearchResult prop
  • app/components/chat/Chat.client.tsx — Handler that prepends fetched content to input

Test plan

  • Click globe icon, enter a URL (e.g. https://example.com), verify content appears in textarea
  • Test SSRF protection: try http://localhost:3000, http://127.0.0.1, http://169.254.169.254 — should be rejected
  • Test error handling: try an invalid URL, a non-HTML URL, a URL that 404s
  • Verify popover closes on click outside and Escape key
  • Verify button is disabled while streaming

🤖 Generated with Claude Code

Add ability to fetch and inject web page content into chat as context.
Includes SSRF protection (blocks private IPs, localhost), content
extraction (strips scripts/styles/nav), and a clean popover UI.

Reimplements the concept from PR stackblitz-labs#1703 without the issues (duplicated
ChatBox, dual API routes, SSRF vulnerability, window.prompt UX).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 5, 2026 21:55

Copilot AI left a comment

Pull request overview

This PR adds a web URL content fetcher feature that allows users to fetch content from web pages and inject it into the chat context. The implementation includes a globe button in the chat toolbar that opens a popover for URL input, server-side content fetching with HTML parsing, and SSRF protection attempts. This is a cleaner reimplementation of PR #1703, avoiding code duplication and following Remix conventions.

Changes:

  • Added URL validation utilities with SSRF protection (IPv4 private IP blocking)
  • Implemented server-side API route for fetching and extracting web page content
  • Created client-side popover UI component for URL input with loading/error states
  • Integrated the web search feature into the chat interface

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 18 comments.

File | Description
--- | ---
app/utils/url.ts | New utility for URL validation with SSRF protection (IPv4 only)
app/routes/api.web-search.ts | Server-side API route for fetching URLs and extracting HTML content
app/components/chat/WebSearch.client.tsx | Client-side popover component for URL input and result handling
app/components/chat/ChatBox.tsx | Added WebSearch button to chat toolbar
app/components/chat/BaseChat.tsx | Added onWebSearchResult prop passthrough
app/components/chat/Chat.client.tsx | Handler for prepending fetched content to chat input


console.error('Web search error:', error);

return json({ error: error instanceof Error ? error.message : 'Failed to fetch URL' }, { status: 500 });

Copilot AI Feb 5, 2026

The error message on line 102 could expose internal error details to the client. Error messages might contain sensitive information about the internal system, network configuration, or implementation details that could help an attacker.

Consider using a generic error message for production environments, while logging the detailed error server-side:

return json({ error: 'Failed to fetch URL' }, { status: 500 });

The detailed error from error.message is already being logged on line 100 for debugging purposes.

Suggested change
- return json({ error: error instanceof Error ? error.message : 'Failed to fetch URL' }, { status: 500 });
+ return json({ error: 'Failed to fetch URL' }, { status: 500 });

Copilot uses AI. Check for mistakes.
Comment on lines +50 to +104
export async function action({ request }: ActionFunctionArgs) {
  if (request.method !== 'POST') {
    return json({ error: 'Method not allowed' }, { status: 405 });
  }

  try {
    const { url } = (await request.json()) as { url?: string };

    if (!url || typeof url !== 'string') {
      return json({ error: 'URL is required' }, { status: 400 });
    }

    if (!isAllowedUrl(url)) {
      return json({ error: 'URL is not allowed. Only public HTTP/HTTPS URLs are accepted.' }, { status: 400 });
    }

    const response = await fetch(url, {
      headers: FETCH_HEADERS,
      signal: AbortSignal.timeout(10_000),
    });

    if (!response.ok) {
      return json({ error: `Failed to fetch URL: ${response.status} ${response.statusText}` }, { status: 502 });
    }

    const contentType = response.headers.get('content-type') || '';

    if (!contentType.includes('text/html') && !contentType.includes('text/plain')) {
      return json({ error: 'URL must point to an HTML or text page' }, { status: 400 });
    }

    const html = await response.text();
    const title = extractTitle(html);
    const description = extractMetaDescription(html);
    const content = extractTextContent(html);

    return json({
      success: true,
      data: {
        title,
        description,
        content: content.length > MAX_CONTENT_LENGTH ? content.slice(0, MAX_CONTENT_LENGTH) + '...' : content,
        sourceUrl: url,
      },
    });
  } catch (error) {
    if (error instanceof DOMException && error.name === 'TimeoutError') {
      return json({ error: 'Request timed out after 10 seconds' }, { status: 504 });
    }

    console.error('Web search error:', error);

    return json({ error: error instanceof Error ? error.message : 'Failed to fetch URL' }, { status: 500 });
  }
}

Copilot AI Feb 5, 2026

There's no rate limiting or abuse prevention mechanism. A user could rapidly make many requests to external URLs, potentially:

  1. Using the server as a proxy for DDoS attacks
  2. Consuming excessive bandwidth and resources
  3. Making the server appear as the source of attacks to external services

Consider implementing rate limiting at the API route level, possibly using the user's IP address or session identifier. Other API routes in the codebase may have similar protections you can reference.
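
As one possible shape for the rate limiting suggested above, here is a minimal fixed-window limiter keyed by a client identifier. Everything in it is hypothetical: the class name, the limits, and the idea of keying by IP or session are assumptions, and nothing here reflects an existing helper in the codebase.

```typescript
// Hypothetical in-memory fixed-window rate limiter.
// State is per process, so it would not be shared across server instances.
interface WindowState {
  windowStart: number; // timestamp (ms) at which the current window began
  count: number; // requests seen in the current window
}

class FixedWindowRateLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the request identified by `key` is allowed at time `now`.
  allow(key: string, now: number = Date.now()): boolean {
    const state = this.windows.get(key);

    // Start a fresh window on first use or once the previous one has expired.
    if (!state || now - state.windowStart >= this.windowMs) {
      this.windows.set(key, { windowStart: now, count: 1 });
      return true;
    }

    if (state.count >= this.maxRequests) {
      return false;
    }

    state.count += 1;
    return true;
  }
}
```

In the route, a call such as `limiter.allow(clientKey)` before the outbound `fetch` could return a 429 response when it yields false. A shared store would be needed if the app runs on more than one instance.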

Comment on lines +32 to +45
function extractTextContent(html: string): string {
  return html
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
    .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
    .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
    .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
    .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
    .replace(/<[^>]+>/g, ' ')
    .replace(/&nbsp;/g, ' ')
    .replace(/&amp;/g, '&')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#x27;/g, "'")

Copilot AI Feb 5, 2026

The HTML entity decoding is incomplete. The code only handles a few common entities (&nbsp;, &amp;, &lt;, &gt;, &quot;, &#x27;), but HTML has hundreds of named entities (e.g., &apos;, &mdash;, &copy;) and numeric character references (e.g., &#160;, &#8217;).

This will result in unreadable content when fetching pages that use other entities. Consider using a more comprehensive approach or at least adding support for:

  • &apos; (apostrophe)
  • Numeric character references with a regex like /&#(\d+);/g and /&#x([0-9a-fA-F]+);/g
  • More common entities like &mdash;, &ndash;, &hellip;, etc.
Suggested change
- function extractTextContent(html: string): string {
-   return html
-     .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
-     .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
-     .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
-     .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
-     .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
-     .replace(/<[^>]+>/g, ' ')
-     .replace(/&nbsp;/g, ' ')
-     .replace(/&amp;/g, '&')
-     .replace(/&lt;/g, '<')
-     .replace(/&gt;/g, '>')
-     .replace(/&quot;/g, '"')
-     .replace(/&#x27;/g, "'")
+ function decodeHtmlEntities(text: string): string {
+   const namedEntities: Record<string, string> = {
+     '&nbsp;': ' ',
+     '&amp;': '&',
+     '&lt;': '<',
+     '&gt;': '>',
+     '&quot;': '"',
+     '&#x27;': "'",
+     '&apos;': "'",
+     '&mdash;': '—',
+     '&ndash;': '–',
+     '&hellip;': '…',
+   };
+   // Replace named entities we know about
+   let result = text.replace(
+     /&(nbsp|amp|lt|gt|quot|apos|mdash|ndash|hellip);|&#x27;/g,
+     (match) => namedEntities[match] ?? match,
+   );
+   // Decimal numeric character references: &#160;
+   result = result.replace(/&#(\d+);/g, (_, dec: string) => {
+     const codePoint = Number(dec);
+     if (!Number.isFinite(codePoint) || codePoint <= 0) {
+       return _;
+     }
+     try {
+       // fromCodePoint handles values above 0xFFFF and throws RangeError on invalid ones
+       return String.fromCodePoint(codePoint);
+     } catch {
+       return _;
+     }
+   });
+   // Hex numeric character references: &#x2014;
+   result = result.replace(/&#x([0-9a-fA-F]+);/g, (_, hex: string) => {
+     const codePoint = parseInt(hex, 16);
+     if (!Number.isFinite(codePoint) || codePoint <= 0) {
+       return _;
+     }
+     try {
+       // fromCodePoint handles values above 0xFFFF and throws RangeError on invalid ones
+       return String.fromCodePoint(codePoint);
+     } catch {
+       return _;
+     }
+   });
+   return result;
+ }
+ function extractTextContent(html: string): string {
+   const withoutTags = html
+     .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
+     .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
+     .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
+     .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
+     .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
+     .replace(/<[^>]+>/g, ' ');
+   const decoded = decodeHtmlEntities(withoutTags);
+   return decoded

Comment on lines +20 to +29
  const match = html.match(/<meta[^>]*name=["']description["'][^>]*content=["']([^"']*)["'][^>]*>/i);

  if (match) {
    return match[1].trim();
  }

  // Try reverse attribute order
  const altMatch = html.match(/<meta[^>]*content=["']([^"']*)["'][^>]*name=["']description["'][^>]*>/i);

  return altMatch ? altMatch[1].trim() : '';

Copilot AI Feb 5, 2026

The regex for extracting meta description doesn't handle escaped quotes within attribute values. For example:

<meta name="description" content="She said \"Hello\" to me">

The pattern ["']([^"']*)["'] will match from the first quote to the escaped quote, stopping prematurely. This could result in incomplete or incorrect description extraction.

Consider using more robust HTML parsing logic or a pattern that handles escaped quotes, though regex-based HTML parsing has inherent limitations.

Suggested change
- const match = html.match(/<meta[^>]*name=["']description["'][^>]*content=["']([^"']*)["'][^>]*>/i);
- if (match) {
-   return match[1].trim();
- }
- // Try reverse attribute order
- const altMatch = html.match(/<meta[^>]*content=["']([^"']*)["'][^>]*name=["']description["'][^>]*>/i);
- return altMatch ? altMatch[1].trim() : '';
+ const match = html.match(/<meta[^>]*name=["']description["'][^>]*content=(["'])((?:\\.|(?!\1).)*)\1[^>]*>/i);
+ if (match) {
+   return match[2].trim();
+ }
+ // Try reverse attribute order
+ const altMatch = html.match(/<meta[^>]*content=(["'])((?:\\.|(?!\1).)*)\1[^>]*name=["']description["'][^>]*>/i);
+ return altMatch ? altMatch[2].trim() : '';
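
A related case worth covering is a value that contains the other quote character, such as an apostrophe inside a double-quoted `content`. HTML attribute values do not use backslash escapes; a literal quote inside a value is normally written as a character reference such as `&quot;`. The sketch below takes a two-step approach under that assumption: find the description `<meta>` tag first, then read `content` up to the same quote that opened it. The function name is hypothetical and this is still best-effort regex parsing.

```typescript
// Hypothetical sketch: quote-aware extraction of <meta name="description">.
function extractMetaDescriptionSketch(html: string): string {
  // Step 1: locate the first <meta> tag whose name attribute is "description".
  const tag = html.match(/<meta\b[^>]*\bname=(["'])description\1[^>]*>/i);
  if (!tag) {
    return '';
  }

  // Step 2: read content from that tag, closing on the quote that opened it,
  // so the other quote character may appear inside the value.
  const content = tag[0].match(/\bcontent=(["'])([\s\S]*?)\1/i);
  return content ? content[2].trim() : '';
}
```

Because the second match runs on the tag text alone, attribute order no longer needs a separate pattern. A `>` inside an attribute value would still cut the tag short.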

Comment on lines +32 to +38
function extractTextContent(html: string): string {
  return html
    .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
    .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
    .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
    .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
    .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')

Copilot AI Feb 5, 2026

The regex patterns for removing script, style, nav, header, and footer tags are not comprehensive and can be bypassed. These patterns fail to handle:

  1. Self-closing tags: <script src="..." /> won't be matched
  2. Multiline or nested tags: Complex nested structures may not be handled correctly
  3. Tag attributes with angle brackets: <script data-content="<test>"> could break the pattern
  4. Case sensitivity: While the i flag is used, the pattern structure makes assumptions about tag content

The regex approach to HTML parsing is fundamentally flawed for security purposes. Consider marking script/style removal as best-effort content cleaning rather than a security boundary, and document that malicious HTML could potentially inject unwanted content into the LLM context.

/^0\.0\.0\.0$/, // Unspecified
];

const BLOCKED_HOSTNAMES = new Set(['localhost', '[::1]', '0.0.0.0']);

Copilot AI Feb 5, 2026

The [::1] entry in BLOCKED_HOSTNAMES should be just ::1 without brackets. The URL parsing in new URL() automatically removes brackets from IPv6 addresses when extracting the hostname. For example, new URL('http://[::1]:3000').hostname returns ::1, not [::1]. This means IPv6 localhost addresses are not being blocked.

Suggested change
- const BLOCKED_HOSTNAMES = new Set(['localhost', '[::1]', '0.0.0.0']);
+ const BLOCKED_HOSTNAMES = new Set(['localhost', '::1', '0.0.0.0']);
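
Whether `hostname` includes the brackets depends on the URL implementation. My understanding is that parsers following the WHATWG URL Standard, including those in Node and browsers, keep them. It is worth confirming the behavior in the target runtime before relying on either form. A check that normalizes the hostname first works in both cases. The helper names and the set below are hypothetical, not taken from the PR.

```typescript
// Hypothetical sketch: normalize the hostname so the blocklist lookup works
// whether or not the URL parser keeps brackets around IPv6 literals.
const BLOCKED_HOSTNAMES_SKETCH = new Set(['localhost', '::1', '0.0.0.0']);

function normalizeHostname(hostname: string): string {
  // Lowercase, then drop one pair of surrounding brackets if present.
  return hostname.toLowerCase().replace(/^\[(.*)\]$/, '$1');
}

function isBlockedHostname(rawUrl: string): boolean {
  const { hostname } = new URL(rawUrl);
  return BLOCKED_HOSTNAMES_SKETCH.has(normalizeHostname(hostname));
}
```

Logging `new URL('http://[::1]:3000').hostname` once in the deployment runtime is a quick way to see which form the parser produces.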

},
});
} catch (error) {
if (error instanceof DOMException && error.name === 'TimeoutError') {

Copilot AI Feb 5, 2026

The timeout detection may be too narrow. AbortSignal.timeout() is specified to reject with a DOMException named 'TimeoutError', but depending on the runtime environment a timed-out fetch may instead surface as a DOMException named 'AbortError'.

The error handling should check for both:

if (error instanceof DOMException && (error.name === 'TimeoutError' || error.name === 'AbortError'))

Or check the error message, as some environments throw different error types for timeouts.

Suggested change
- if (error instanceof DOMException && error.name === 'TimeoutError') {
+ if (error instanceof DOMException && (error.name === 'TimeoutError' || error.name === 'AbortError')) {
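
A small predicate can make this check easier to test and keeps it independent of the error class. The sketch below is an assumption about how that could look, with a hypothetical function name. It matches on `name` only, so it also accepts runtimes that throw a plain `Error` rather than a `DOMException`.

```typescript
// Hypothetical helper: decide whether a caught value looks like a timeout.
// Matching on `name` avoids depending on the DOMException class being used.
function isTimeoutLikeError(error: unknown): boolean {
  if (typeof error !== 'object' || error === null) {
    return false;
  }

  const name = (error as { name?: unknown }).name;
  return name === 'TimeoutError' || name === 'AbortError';
}
```

Treating `AbortError` as a timeout is a trade-off: if the request is ever cancelled through a manually controlled `AbortController`, that cancellation would also be reported as a timeout.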

setUrl('');
setIsOpen(false);
} catch (error) {
toast.error(error instanceof Error ? error.message : 'Failed to fetch URL');

Copilot AI Feb 5, 2026

The error message from the API is displayed directly to the user via toast without sanitization. While React generally escapes content, if the API returns malicious HTML or scripts in error messages, this could potentially be a security issue depending on how react-toastify handles the content.

Consider explicitly sanitizing the error message or ensuring it's displayed as plain text only.

Comment on lines +602 to +608
    // Update the input via the same mechanism as handleInputChange
    const syntheticEvent = {
      target: { value: newInput },
    } as React.ChangeEvent<HTMLTextAreaElement>;
    handleInputChange(syntheticEvent);
  },
  [input, handleInputChange],

Copilot AI Feb 5, 2026

Creating a synthetic event to call handleInputChange is unnecessary since setInput is available from the useChat hook (line 125). Using setInput directly would be cleaner and more maintainable:

const newInput = currentInput.length > 0 ? `${result}\n\n${currentInput}` : result;
setInput(newInput);

This avoids creating artificial event objects and makes the code more straightforward. The current approach works but adds unnecessary complexity.

Suggested change
-     // Update the input via the same mechanism as handleInputChange
-     const syntheticEvent = {
-       target: { value: newInput },
-     } as React.ChangeEvent<HTMLTextAreaElement>;
-     handleInputChange(syntheticEvent);
-   },
-   [input, handleInputChange],
+     setInput(newInput);
+   },
+   [input, setInput],
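
The prepend step itself can be pulled into a pure function, which makes it testable without React. The function name below is hypothetical; the logic mirrors the expression quoted in the comment above.

```typescript
// Hypothetical pure helper: put fetched page content ahead of whatever the
// user has already typed, separated by a blank line.
function prependFetchedContent(result: string, currentInput: string): string {
  return currentInput.length > 0 ? `${result}\n\n${currentInput}` : result;
}
```

Inside the handler this would be used as `setInput(prependFetchedContent(result, input))`, assuming `setInput` comes from the `useChat` hook as the comment states.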

Comment on lines +34 to +38
.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
.replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
.replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
.replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
.replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')

Copilot AI Feb 5, 2026

The regex patterns for removing HTML tags have a fundamental flaw - they use negated character classes [^<] which will match newlines and can cause catastrophic backtracking on large HTML documents. The pattern <script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script> tries to match everything between script tags but the nested quantifiers create exponential complexity.

For large HTML responses, this could cause:

  1. Extremely slow processing (potentially minutes for megabytes of HTML)
  2. High CPU usage
  3. Denial of service

Consider using a simpler approach like:

.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, ' ')
.replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, ' ')

The [\s\S]*? pattern with non-greedy matching is much more efficient and still handles most cases correctly.

Suggested change
- .replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, ' ')
- .replace(/<style\b[^<]*(?:(?!<\/style>)<[^<]*)*<\/style>/gi, ' ')
- .replace(/<nav\b[^<]*(?:(?!<\/nav>)<[^<]*)*<\/nav>/gi, ' ')
- .replace(/<header\b[^<]*(?:(?!<\/header>)<[^<]*)*<\/header>/gi, ' ')
- .replace(/<footer\b[^<]*(?:(?!<\/footer>)<[^<]*)*<\/footer>/gi, ' ')
+ .replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, ' ')
+ .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, ' ')
+ .replace(/<nav\b[^>]*>[\s\S]*?<\/nav>/gi, ' ')
+ .replace(/<header\b[^>]*>[\s\S]*?<\/header>/gi, ' ')
+ .replace(/<footer\b[^>]*>[\s\S]*?<\/footer>/gi, ' ')
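
Since the five patterns differ only in the tag name, they could also be generated from a list. The sketch below is one way to do that; `STRIPPED_TAGS` and `stripBlocks` are hypothetical names, and the patterns it builds are the same lazy `[\s\S]*?` form as in the suggestion.

```typescript
// Hypothetical sketch: build the block-stripping patterns from a tag list.
const STRIPPED_TAGS = ['script', 'style', 'nav', 'header', 'footer'] as const;

function stripBlocks(html: string): string {
  let out = html;

  for (const tag of STRIPPED_TAGS) {
    // Lazy [\s\S]*? stops at the first matching closing tag.
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?<\\/${tag}>`, 'gi');
    out = out.replace(pattern, ' ');
  }

  return out;
}
```

An opening tag with no closing tag can still make each match attempt scan to the end of the input. Capping the response size before extraction is a reasonable companion measure.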

@Stijnus Stijnus merged commit 2e254ac into stackblitz-labs:main Feb 7, 2026
22 checks passed