fix: prevent SSRF in URL scraping by blocking private/internal IPs #91

Open
tranquac wants to merge 1 commit into yoheinakajima:main from tranquac:fix/ssrf-url-scrape

Conversation


@tranquac tranquac commented Mar 27, 2026

Summary

Prevent SSRF in the URL scraping function by validating that URLs don't resolve to private or internal IP addresses.

Problem

The scrape_text_from_url function fetches user-supplied URLs without any SSRF protection:

user_input = request.json.get("user_input", "")
if user_input.startswith("http"):
    user_input = scrape_text_from_url(user_input)

def scrape_text_from_url(url):
    response = requests.get(url)  # No IP validation

An attacker can:

  • Access cloud metadata: {"user_input": "http://169.254.169.254/latest/meta-data/"} → leaks IAM credentials
  • Scan internal network services
  • Access internal admin interfaces

Fix

Added an is_safe_url() function that resolves the hostname and checks every resolved IP address using Python's ipaddress module:

  • Blocks private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Blocks loopback (127.0.0.0/8)
  • Blocks link-local (169.254.0.0/16 — cloud metadata)
  • Allows only http/https schemes
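The block list above maps directly onto stdlib ipaddress properties. A minimal standalone probe of the same categories (illustrative only; the PR's actual check lives in is_safe_url() in main.py):

```python
import ipaddress

def is_blocked(ip_str):
    # Mirror of the PR's block list: private, loopback, and link-local ranges.
    ip = ipaddress.ip_address(ip_str)
    return ip.is_private or ip.is_loopback or ip.is_link_local

print(is_blocked("10.0.0.1"))         # private (10.0.0.0/8)   -> True
print(is_blocked("127.0.0.1"))        # loopback               -> True
print(is_blocked("169.254.169.254"))  # cloud metadata address -> True
print(is_blocked("8.8.8.8"))          # public address         -> False
```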

Impact

  • Type: Server-Side Request Forgery (CWE-918)
  • Affected endpoint: POST with user_input starting with http
  • Risk: Cloud credential theft, internal network scanning
  • OWASP: A10:2021 — Server-Side Request Forgery

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced web scraping security by validating URLs and blocking requests to private or local network addresses.

Signed-off-by: tranquac <tranquac@users.noreply.github.com>

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

The change adds URL safety validation to prevent web scraping from accessing private networks or invalid URLs. A new is_safe_url() function validates schemes, resolves hostnames to IP addresses, and blocks private/loopback addresses. The scrape_text_from_url() function now checks URL safety before scraping.

Changes

Cohort / File(s) Summary
URL Safety Validation
main.py
Added is_safe_url(url) function that validates URL schemes (http/https only), resolves hostnames to IP addresses, and blocks requests to private/loopback/link-local addresses. Updated scrape_text_from_url(url) to invoke this validation and short-circuit with an error message if the URL is unsafe; otherwise proceeds with existing scraping logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A safety check hops into view,
No private IPs shall slip through!
URLs parse, hostnames resolve,
A security puzzle we now solve.
Safe scraping, as it should be true! 🔐

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and specifically describes the main change, adding SSRF prevention by blocking private/internal IPs in URL scraping, which matches the changeset's primary objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main.py (1)

59-69: ⚠️ Potential issue | 🟡 Minor

Add a timeout to prevent indefinite hanging.

The requests.get() call has no timeout, allowing a malicious or slow server to cause the request to hang indefinitely. This is a denial-of-service vector.

Suggested fix
-    response = requests.get(url)
+    response = requests.get(url, timeout=10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 59 - 69, The scrape_text_from_url function currently
calls requests.get(url) without a timeout; update the call to include a sensible
timeout value (e.g., timeout=10) and handle timeout and general request
exceptions (requests.exceptions.Timeout and
requests.exceptions.RequestException) by logging the error (using logging) and
returning an appropriate error string instead of hanging; ensure you modify the
requests.get invocation and add try/except around it in scrape_text_from_url so
timeouts and network errors are caught and reported.
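The handling described above can be sketched in stdlib terms. The PR itself uses requests; urllib.request stands in here so the sketch is self-contained, and timeout=10 is the reviewer's example value, not a project setting:

```python
import logging
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout=10):
    """Fetch a URL with a bounded timeout, returning an error string on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except socket.timeout:
        # Slow or unresponsive server: fail fast instead of hanging forever.
        logging.error("Request timed out: %s", url)
        return "Error: request timed out."
    except (urllib.error.URLError, ValueError) as exc:
        # DNS failures, connection errors, and malformed URLs land here.
        logging.error("Request failed: %s", exc)
        return "Error: could not fetch URL."
```

With requests, the equivalent is requests.get(url, timeout=10) wrapped in a try/except over requests.exceptions.Timeout and requests.exceptions.RequestException.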
🧹 Nitpick comments (1)
main.py (1)

37-40: Move imports to the top of the file.

These standard library imports should be grouped with the other standard library imports at the top of the file (lines 1-5) to follow PEP 8 conventions and improve maintainability.

Suggested placement

Move these imports to the top of the file, after line 5:

import argparse
import ipaddress
import json
import logging
import os
import re
import socket
from urllib.parse import urlparse

Then remove lines 37-40.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 37 - 40, Move the standard-library imports ipaddress,
socket and from urllib.parse import urlparse up into the existing top-of-file
imports block (grouped with argparse, json, logging, os, re, etc.) so all stdlib
imports are together per PEP8; then remove the duplicate import lines currently
located around lines 37-40 to avoid redeclaration and keep a single, ordered
import section at the top of the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@main.py`:
- Around line 42-57: The DNS rebinding TOCTOU in is_safe_url() can be bypassed
because DNS is resolved again when making the request; change the flow so you
resolve the hostname once (inside is_safe_url or a new
resolve_and_validate(hostname) helper), verify every returned address is not
private/loopback/link-local/reserved (use ip.is_private, ip.is_loopback,
ip.is_link_local and ip.is_reserved), and then make the HTTP request to the
validated numeric IP (or create a socket bound to that IP) while supplying the
original hostname in the Host header so requests.get() does not trigger a second
DNS lookup; update call sites that use is_safe_url() to use the resolved
IP/connection info instead of relying on a second hostname resolution.

---

Outside diff comments:
In `@main.py`:
- Around line 59-69: The scrape_text_from_url function currently calls
requests.get(url) without a timeout; update the call to include a sensible
timeout value (e.g., timeout=10) and handle timeout and general request
exceptions (requests.exceptions.Timeout and
requests.exceptions.RequestException) by logging the error (using logging) and
returning an appropriate error string instead of hanging; ensure you modify the
requests.get invocation and add try/except around it in scrape_text_from_url so
timeouts and network errors are caught and reported.

---

Nitpick comments:
In `@main.py`:
- Around line 37-40: Move the standard-library imports ipaddress, socket and
from urllib.parse import urlparse up into the existing top-of-file imports block
(grouped with argparse, json, logging, os, re, etc.) so all stdlib imports are
together per PEP8; then remove the duplicate import lines currently located
around lines 37-40 to avoid redeclaration and keep a single, ordered import
section at the top of the module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da4b0510-f060-45f4-8415-7fe0e5ec3585

📥 Commits

Reviewing files that changed from the base of the PR and between 65b179d and fae9bcf.

📒 Files selected for processing (1)
  • main.py

Comment on lines +42 to +57
def is_safe_url(url):
    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https'):
            return False
        hostname = parsed.hostname
        if not hostname:
            return False
        for addr_info in socket.getaddrinfo(hostname, None):
            ip = ipaddress.ip_address(addr_info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local:
                return False
        return True
    except (socket.gaierror, ValueError):
        return False
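The function can be exercised locally. The snippet below copies it verbatim with its imports so it runs standalone; probes using literal IPs or a rejected scheme need no network access:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https'):
            return False
        hostname = parsed.hostname
        if not hostname:
            return False
        for addr_info in socket.getaddrinfo(hostname, None):
            ip = ipaddress.ip_address(addr_info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local:
                return False
        return True
    except (socket.gaierror, ValueError):
        return False

print(is_safe_url("http://169.254.169.254/latest/meta-data/"))  # metadata IP -> False
print(is_safe_url("http://127.0.0.1:8080/admin"))               # loopback    -> False
print(is_safe_url("ftp://example.com/file"))                    # bad scheme  -> False
```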

⚠️ Potential issue | 🟠 Major

DNS rebinding vulnerability (TOCTOU): validation can be bypassed.

The current implementation resolves DNS in is_safe_url(), but requests.get() performs a separate DNS lookup. An attacker can exploit DNS rebinding: the first lookup returns a safe public IP, then TTL expires and the second lookup (during the actual request) returns a private IP like 169.254.169.254.

Additionally, consider checking ip.is_reserved to block other special-use addresses (documentation, benchmarking, etc.).
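The is_reserved suggestion covers the IETF-reserved IPv4 block (240.0.0.0/4), which the three checks in the current patch would not necessarily catch on their own. A minimal probe:

```python
import ipaddress

# is_reserved flags the IETF-reserved 240.0.0.0/4 range; whether
# is_private also covers it varies by Python version, so checking
# is_reserved explicitly is the safer choice.
print(ipaddress.ip_address("240.0.0.1").is_reserved)  # -> True
print(ipaddress.ip_address("8.8.8.8").is_reserved)    # -> False
```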

Recommended fix: resolve once and pass IP to requests
 import ipaddress
 import socket
 from urllib.parse import urlparse
+
+class UnsafeURLError(Exception):
+    pass

-def is_safe_url(url):
-    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
+def resolve_and_validate_url(url):
+    """
+    Validate URL and resolve to a safe IP address.
+    Returns (validated_url, resolved_ip) or raises UnsafeURLError.
+    """
     try:
         parsed = urlparse(url)
         if parsed.scheme not in ('http', 'https'):
-            return False
+            raise UnsafeURLError("Invalid scheme")
         hostname = parsed.hostname
         if not hostname:
-            return False
-        for addr_info in socket.getaddrinfo(hostname, None):
+            raise UnsafeURLError("No hostname")
+        
+        # Resolve and validate all IPs
+        addr_infos = socket.getaddrinfo(hostname, None)
+        safe_ip = None
+        for addr_info in addr_infos:
             ip = ipaddress.ip_address(addr_info[4][0])
-            if ip.is_private or ip.is_loopback or ip.is_link_local:
-                return False
-        return True
-    except (socket.gaierror, ValueError):
-        return False
+            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
+                raise UnsafeURLError(f"Blocked IP: {ip}")
+            if safe_ip is None:
+                safe_ip = str(ip)
+        
+        return url, safe_ip
+    except (socket.gaierror, ValueError) as e:
+        raise UnsafeURLError(f"Resolution failed: {e}")

 def scrape_text_from_url(url):
-    if not is_safe_url(url):
+    try:
+        validated_url, resolved_ip = resolve_and_validate_url(url)
+    except UnsafeURLError as e:
         return "Error: URL points to a private or internal address."
-    response = requests.get(url)
+    
+    # Use resolved IP directly with Host header to prevent DNS rebinding
+    parsed = urlparse(validated_url)
+    port = parsed.port or (443 if parsed.scheme == 'https' else 80)
+    ip_url = f"{parsed.scheme}://{resolved_ip}:{port}{parsed.path or '/'}"
+    if parsed.query:
+        ip_url += f"?{parsed.query}"
+    
+    response = requests.get(ip_url, headers={'Host': parsed.hostname}, timeout=10)

Alternatively, consider using a battle-tested library like ssrf-guard or implementing the fix at the network layer with egress firewall rules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 42 - 57, The DNS rebinding TOCTOU in is_safe_url() can
be bypassed because DNS is resolved again when making the request; change the
flow so you resolve the hostname once (inside is_safe_url or a new
resolve_and_validate(hostname) helper), verify every returned address is not
private/loopback/link-local/reserved (use ip.is_private, ip.is_loopback,
ip.is_link_local and ip.is_reserved), and then make the HTTP request to the
validated numeric IP (or create a socket bound to that IP) while supplying the
original hostname in the Host header so requests.get() does not trigger a second
DNS lookup; update call sites that use is_safe_url() to use the resolved
IP/connection info instead of relying on a second hostname resolution.
