fix: prevent SSRF in URL scraping by blocking private/internal IPs #91
tranquac wants to merge 1 commit into yoheinakajima:main from
Conversation
Signed-off-by: tranquac <tranquac@users.noreply.github.com>
📝 Walkthrough
The change adds URL safety validation to prevent web scraping from accessing private networks or invalid URLs.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
main.py (1)
59-69: ⚠️ Potential issue | 🟡 Minor
Add a timeout to prevent indefinite hanging.
The `requests.get()` call has no timeout, allowing a malicious or slow server to cause the request to hang indefinitely. This is a denial-of-service vector.
Suggested fix:
- response = requests.get(url)
+ response = requests.get(url, timeout=10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.py` around lines 59 - 69, The scrape_text_from_url function currently calls requests.get(url) without a timeout; update the call to include a sensible timeout value (e.g., timeout=10) and handle timeout and general request exceptions (requests.exceptions.Timeout and requests.exceptions.RequestException) by logging the error (using logging) and returning an appropriate error string instead of hanging; ensure you modify the requests.get invocation and add try/except around it in scrape_text_from_url so timeouts and network errors are caught and reported.
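A minimal sketch of what the prompt asks for, assuming the function simply returns the fetched body (the real extraction logic in main.py is not shown in this diff):

```python
import logging

import requests

def scrape_text_from_url(url):
    # Bound the wait so a slow or malicious server cannot hang the caller.
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        logging.error("Request to %s timed out", url)
        return "Error: request timed out."
    except requests.exceptions.RequestException as exc:
        # Covers DNS failures, refused connections, HTTP errors, etc.
        logging.error("Request to %s failed: %s", url, exc)
        return f"Error: could not fetch URL ({exc})."
    return response.text
```

Both exception branches return an error string instead of propagating, matching the prompt's "report, don't hang" requirement.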
🧹 Nitpick comments (1)
main.py (1)
37-40: Move imports to the top of the file.
These standard library imports should be grouped with the other standard library imports at the top of the file (lines 1-5) to follow PEP 8 conventions and improve maintainability.
Suggested placement
Move these imports to the top of the file, after line 5:
import argparse
import ipaddress
import json
import logging
import os
import re
import socket
from urllib.parse import urlparse
Then remove lines 37-40.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@main.py` around lines 37 - 40, Move the standard-library imports ipaddress, socket and from urllib.parse import urlparse up into the existing top-of-file imports block (grouped with argparse, json, logging, os, re, etc.) so all stdlib imports are together per PEP8; then remove the duplicate import lines currently located around lines 37-40 to avoid redeclaration and keep a single, ordered import section at the top of the module.
def is_safe_url(url):
    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https'):
            return False
        hostname = parsed.hostname
        if not hostname:
            return False
        for addr_info in socket.getaddrinfo(hostname, None):
            ip = ipaddress.ip_address(addr_info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local:
                return False
        return True
    except (socket.gaierror, ValueError):
        return False
DNS rebinding vulnerability (TOCTOU): validation can be bypassed.
The current implementation resolves DNS in is_safe_url(), but requests.get() performs a separate DNS lookup. An attacker can exploit DNS rebinding: the first lookup returns a safe public IP, then TTL expires and the second lookup (during the actual request) returns a private IP like 169.254.169.254.
Additionally, consider checking ip.is_reserved to block other special-use addresses. Note that in CPython's ipaddress module the documentation and benchmarking ranges already register as is_private, so for IPv4 is_reserved mainly adds the 240.0.0.0/4 block.
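The ipaddress predicates this review relies on can be checked directly; `is_blocked` below is an illustrative name for the combined check, not code from the PR:

```python
import ipaddress

def is_blocked(ip_str):
    # Mirrors the predicate set suggested in the review.
    ip = ipaddress.ip_address(ip_str)
    return ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved

# Cloud metadata endpoint (link-local) and loopback are blocked.
print(is_blocked("169.254.169.254"))  # True
print(is_blocked("127.0.0.1"))        # True
# RFC 1918 and the documentation range 192.0.2.0/24 both count as is_private.
print(is_blocked("10.0.0.1"))         # True
print(is_blocked("192.0.2.1"))        # True
# An ordinary public address passes.
print(is_blocked("8.8.8.8"))          # False
```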
Recommended fix: resolve once and pass IP to requests
 import ipaddress
 import socket
 from urllib.parse import urlparse
+
+class UnsafeURLError(Exception):
+    pass

-def is_safe_url(url):
-    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
+def resolve_and_validate_url(url):
+    """
+    Validate URL and resolve to a safe IP address.
+    Returns (validated_url, resolved_ip) or raises UnsafeURLError.
+    """
     try:
         parsed = urlparse(url)
         if parsed.scheme not in ('http', 'https'):
-            return False
+            raise UnsafeURLError("Invalid scheme")
         hostname = parsed.hostname
         if not hostname:
-            return False
-        for addr_info in socket.getaddrinfo(hostname, None):
+            raise UnsafeURLError("No hostname")
+
+        # Resolve and validate all IPs
+        addr_infos = socket.getaddrinfo(hostname, None)
+        safe_ip = None
+        for addr_info in addr_infos:
             ip = ipaddress.ip_address(addr_info[4][0])
-            if ip.is_private or ip.is_loopback or ip.is_link_local:
-                return False
-        return True
-    except (socket.gaierror, ValueError):
-        return False
+            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
+                raise UnsafeURLError(f"Blocked IP: {ip}")
+            if safe_ip is None:
+                safe_ip = str(ip)
+
+        return url, safe_ip
+    except (socket.gaierror, ValueError) as e:
+        raise UnsafeURLError(f"Resolution failed: {e}")

 def scrape_text_from_url(url):
-    if not is_safe_url(url):
+    try:
+        validated_url, resolved_ip = resolve_and_validate_url(url)
+    except UnsafeURLError as e:
         return "Error: URL points to a private or internal address."
-    response = requests.get(url)
+
+    # Use resolved IP directly with Host header to prevent DNS rebinding
+    parsed = urlparse(validated_url)
+    port = parsed.port or (443 if parsed.scheme == 'https' else 80)
+    ip_url = f"{parsed.scheme}://{resolved_ip}:{port}{parsed.path or '/'}"
+    if parsed.query:
+        ip_url += f"?{parsed.query}"
+
+    response = requests.get(ip_url, headers={'Host': parsed.hostname}, timeout=10)

Alternatively, consider using a battle-tested library like ssrf-guard or implementing the fix at the network layer with egress firewall rules.
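For reference, the validation half of the suggested diff collapses into roughly this self-contained helper (names follow the diff; the Host-header request step is left out so the sketch stays network-free):

```python
import ipaddress
import socket
from urllib.parse import urlparse

class UnsafeURLError(Exception):
    pass

def resolve_and_validate_url(url):
    """Validate scheme and hostname, resolve once, and reject unsafe IPs."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise UnsafeURLError("Invalid scheme")
    if not parsed.hostname:
        raise UnsafeURLError("No hostname")
    try:
        addr_infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror as e:
        raise UnsafeURLError(f"Resolution failed: {e}")
    for info in addr_infos:
        ip = ipaddress.ip_address(info[4][0])
        if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
            raise UnsafeURLError(f"Blocked IP: {ip}")
    # Every resolved address is public; pin the first one for the actual connect.
    return url, str(ipaddress.ip_address(addr_infos[0][4][0]))
```

The caller would then connect to the returned IP while sending the original hostname in the Host header, so no second DNS lookup can swap in a private address.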
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@main.py` around lines 42 - 57, The DNS rebinding TOCTOU in is_safe_url() can
be bypassed because DNS is resolved again when making the request; change the
flow so you resolve the hostname once (inside is_safe_url or a new
resolve_and_validate(hostname) helper), verify every returned address is not
private/loopback/link-local/reserved (use ip.is_private, ip.is_loopback,
ip.is_link_local and ip.is_reserved), and then make the HTTP request to the
validated numeric IP (or create a socket bound to that IP) while supplying the
original hostname in the Host header so requests.get() does not trigger a second
DNS lookup; update call sites that use is_safe_url() to use the resolved
IP/connection info instead of relying on a second hostname resolution.
Summary
Prevent SSRF in the URL scraping function by validating that URLs don't resolve to private or internal IP addresses.
Problem
The `scrape_text_from_url` function fetches user-supplied URLs without any SSRF protection. An attacker can:
{"user_input": "http://169.254.169.254/latest/meta-data/"} → leaks IAM credentials
Fix
Added an `is_safe_url()` function that resolves the hostname and checks all resolved IPs against Python's `ipaddress` module.
Impact
`user_input` starting with `http`
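The check can be exercised directly; this sketch reproduces `is_safe_url` as it appears in the diff above and probes it with the metadata URL from the example payload:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https"):
            return False
        if not parsed.hostname:
            return False
        for addr_info in socket.getaddrinfo(parsed.hostname, None):
            ip = ipaddress.ip_address(addr_info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local:
                return False
        return True
    except (socket.gaierror, ValueError):
        return False

# The cloud metadata endpoint from the example payload is rejected...
print(is_safe_url("http://169.254.169.254/latest/meta-data/"))  # False
# ...as are loopback targets and non-HTTP schemes.
print(is_safe_url("http://127.0.0.1:8080/admin"))  # False
print(is_safe_url("file:///etc/passwd"))           # False
```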