fix: prevent SSRF in URL scraping by blocking private/internal IPs #91

Open
tranquac wants to merge 1 commit into yoheinakajima:main from tranquac:fix/ssrf-url-scrape

Conversation


@tranquac tranquac commented Mar 27, 2026

Summary

Prevent SSRF in the URL scraping function by validating that URLs don't resolve to private or internal IP addresses.

Problem

The scrape_text_from_url function fetches user-supplied URLs without any SSRF protection:

user_input = request.json.get("user_input", "")
if user_input.startswith("http"):
    user_input = scrape_text_from_url(user_input)

def scrape_text_from_url(url):
    response = requests.get(url)  # No IP validation

An attacker can:

  • Access cloud metadata: {"user_input": "http://169.254.169.254/latest/meta-data/"} → leaks IAM credentials
  • Scan internal network services
  • Access internal admin interfaces

Fix

Added an is_safe_url() function that resolves the hostname and checks every resolved IP address using Python's ipaddress module:

  • Blocks private ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16)
  • Blocks loopback (127.0.0.0/8)
  • Blocks link-local (169.254.0.0/16 — cloud metadata)
  • Allows only http/https schemes
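The block list above maps directly onto stdlib ipaddress properties. A minimal standalone probe of the same categories (illustrative only; the PR's actual check lives in is_safe_url() in main.py):

```python
import ipaddress

def is_blocked(ip_str):
    # Mirror of the PR's block list: private, loopback, and link-local ranges.
    ip = ipaddress.ip_address(ip_str)
    return ip.is_private or ip.is_loopback or ip.is_link_local

print(is_blocked("10.0.0.1"))         # private (10.0.0.0/8)   -> True
print(is_blocked("127.0.0.1"))        # loopback               -> True
print(is_blocked("169.254.169.254"))  # cloud metadata address -> True
print(is_blocked("8.8.8.8"))          # public address         -> False
```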

Impact

  • Type: Server-Side Request Forgery (CWE-918)
  • Affected endpoint: POST with user_input starting with http
  • Risk: Cloud credential theft, internal network scanning
  • OWASP: A10:2021 — Server-Side Request Forgery

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced web scraping security by validating URLs and blocking requests to private or local network addresses.

Signed-off-by: tranquac <tranquac@users.noreply.github.com>

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

The change adds URL safety validation to prevent web scraping from accessing private networks or invalid URLs. A new is_safe_url() function validates schemes, resolves hostnames to IP addresses, and blocks private/loopback addresses. The scrape_text_from_url() function now checks URL safety before scraping.

Changes

Cohort / File(s) Summary
URL Safety Validation
main.py
Added is_safe_url(url) function that validates URL schemes (http/https only), resolves hostnames to IP addresses, and blocks requests to private/loopback/link-local addresses. Updated scrape_text_from_url(url) to invoke this validation and short-circuit with an error message if the URL is unsafe; otherwise proceeds with existing scraping logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A safety check hops into view,
No private IPs shall slip through!
URLs parse, hostnames resolve,
A security puzzle we now solve.
Safe scraping, as it should be true! 🔐

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and specifically describes the main change, adding SSRF prevention by blocking private/internal IPs in URL scraping, which matches the changeset's primary objective.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
main.py (1)

59-69: ⚠️ Potential issue | 🟡 Minor

Add a timeout to prevent indefinite hanging.

The requests.get() call has no timeout, allowing a malicious or slow server to cause the request to hang indefinitely. This is a denial-of-service vector.

Suggested fix
-    response = requests.get(url)
+    response = requests.get(url, timeout=10)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 59 - 69, The scrape_text_from_url function currently
calls requests.get(url) without a timeout; update the call to include a sensible
timeout value (e.g., timeout=10) and handle timeout and general request
exceptions (requests.exceptions.Timeout and
requests.exceptions.RequestException) by logging the error (using logging) and
returning an appropriate error string instead of hanging; ensure you modify the
requests.get invocation and add try/except around it in scrape_text_from_url so
timeouts and network errors are caught and reported.
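The handling described above can be sketched in stdlib terms. The PR itself uses requests; urllib.request stands in here so the sketch is self-contained, and timeout=10 is the reviewer's example value, not a project setting:

```python
import logging
import socket
import urllib.error
import urllib.request

def fetch_with_timeout(url, timeout=10):
    """Fetch a URL with a bounded timeout, returning an error string on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except socket.timeout:
        # Slow or unresponsive server: fail fast instead of hanging forever.
        logging.error("Request timed out: %s", url)
        return "Error: request timed out."
    except (urllib.error.URLError, ValueError) as exc:
        # DNS failures, connection errors, and malformed URLs land here.
        logging.error("Request failed: %s", exc)
        return "Error: could not fetch URL."
```

With requests, the equivalent is requests.get(url, timeout=10) wrapped in a try/except over requests.exceptions.Timeout and requests.exceptions.RequestException.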
🧹 Nitpick comments (1)
main.py (1)

37-40: Move imports to the top of the file.

These standard library imports should be grouped with the other standard library imports at the top of the file (lines 1-5) to follow PEP 8 conventions and improve maintainability.

Suggested placement

Move these imports to the top of the file, after line 5:

import argparse
import ipaddress
import json
import logging
import os
import re
import socket
from urllib.parse import urlparse

Then remove lines 37-40.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 37 - 40, Move the standard-library imports ipaddress,
socket and from urllib.parse import urlparse up into the existing top-of-file
imports block (grouped with argparse, json, logging, os, re, etc.) so all stdlib
imports are together per PEP8; then remove the duplicate import lines currently
located around lines 37-40 to avoid redeclaration and keep a single, ordered
import section at the top of the module.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@main.py`:
- Around line 42-57: The DNS rebinding TOCTOU in is_safe_url() can be bypassed
because DNS is resolved again when making the request; change the flow so you
resolve the hostname once (inside is_safe_url or a new
resolve_and_validate(hostname) helper), verify every returned address is not
private/loopback/link-local/reserved (use ip.is_private, ip.is_loopback,
ip.is_link_local and ip.is_reserved), and then make the HTTP request to the
validated numeric IP (or create a socket bound to that IP) while supplying the
original hostname in the Host header so requests.get() does not trigger a second
DNS lookup; update call sites that use is_safe_url() to use the resolved
IP/connection info instead of relying on a second hostname resolution.

---

Outside diff comments:
In `@main.py`:
- Around line 59-69: The scrape_text_from_url function currently calls
requests.get(url) without a timeout; update the call to include a sensible
timeout value (e.g., timeout=10) and handle timeout and general request
exceptions (requests.exceptions.Timeout and
requests.exceptions.RequestException) by logging the error (using logging) and
returning an appropriate error string instead of hanging; ensure you modify the
requests.get invocation and add try/except around it in scrape_text_from_url so
timeouts and network errors are caught and reported.

---

Nitpick comments:
In `@main.py`:
- Around line 37-40: Move the standard-library imports ipaddress, socket and
from urllib.parse import urlparse up into the existing top-of-file imports block
(grouped with argparse, json, logging, os, re, etc.) so all stdlib imports are
together per PEP8; then remove the duplicate import lines currently located
around lines 37-40 to avoid redeclaration and keep a single, ordered import
section at the top of the module.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: da4b0510-f060-45f4-8415-7fe0e5ec3585

📥 Commits

Reviewing files that changed from the base of the PR and between 65b179d and fae9bcf.

📒 Files selected for processing (1)
  • main.py

Comment on lines +42 to +57
def is_safe_url(url):
    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https'):
            return False
        hostname = parsed.hostname
        if not hostname:
            return False
        for addr_info in socket.getaddrinfo(hostname, None):
            ip = ipaddress.ip_address(addr_info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local:
                return False
        return True
    except (socket.gaierror, ValueError):
        return False
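The function can be exercised locally. The snippet below copies it verbatim with its imports so it runs standalone; probes using literal IPs or a rejected scheme need no network access:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_safe_url(url):
    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
    try:
        parsed = urlparse(url)
        if parsed.scheme not in ('http', 'https'):
            return False
        hostname = parsed.hostname
        if not hostname:
            return False
        for addr_info in socket.getaddrinfo(hostname, None):
            ip = ipaddress.ip_address(addr_info[4][0])
            if ip.is_private or ip.is_loopback or ip.is_link_local:
                return False
        return True
    except (socket.gaierror, ValueError):
        return False

print(is_safe_url("http://169.254.169.254/latest/meta-data/"))  # metadata IP -> False
print(is_safe_url("http://127.0.0.1:8080/admin"))               # loopback    -> False
print(is_safe_url("ftp://example.com/file"))                    # bad scheme  -> False
```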

⚠️ Potential issue | 🟠 Major

DNS rebinding vulnerability (TOCTOU): validation can be bypassed.

The current implementation resolves DNS in is_safe_url(), but requests.get() performs a separate DNS lookup. An attacker can exploit DNS rebinding: the first lookup returns a safe public IP, then TTL expires and the second lookup (during the actual request) returns a private IP like 169.254.169.254.

Additionally, consider checking ip.is_reserved to block other special-use addresses (documentation, benchmarking, etc.).
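The is_reserved suggestion covers the IETF-reserved IPv4 block (240.0.0.0/4), which the three checks in the current patch would not necessarily catch on their own. A minimal probe:

```python
import ipaddress

# is_reserved flags the IETF-reserved 240.0.0.0/4 range; whether
# is_private also covers it varies by Python version, so checking
# is_reserved explicitly is the safer choice.
print(ipaddress.ip_address("240.0.0.1").is_reserved)  # -> True
print(ipaddress.ip_address("8.8.8.8").is_reserved)    # -> False
```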

Recommended fix: resolve once and pass IP to requests
 import ipaddress
 import socket
 from urllib.parse import urlparse
+
+class UnsafeURLError(Exception):
+    pass

-def is_safe_url(url):
-    """Check if URL is safe to fetch (not pointing to private/internal addresses)."""
+def resolve_and_validate_url(url):
+    """
+    Validate URL and resolve to a safe IP address.
+    Returns (validated_url, resolved_ip) or raises UnsafeURLError.
+    """
     try:
         parsed = urlparse(url)
         if parsed.scheme not in ('http', 'https'):
-            return False
+            raise UnsafeURLError("Invalid scheme")
         hostname = parsed.hostname
         if not hostname:
-            return False
-        for addr_info in socket.getaddrinfo(hostname, None):
+            raise UnsafeURLError("No hostname")
+        
+        # Resolve and validate all IPs
+        addr_infos = socket.getaddrinfo(hostname, None)
+        safe_ip = None
+        for addr_info in addr_infos:
             ip = ipaddress.ip_address(addr_info[4][0])
-            if ip.is_private or ip.is_loopback or ip.is_link_local:
-                return False
-        return True
-    except (socket.gaierror, ValueError):
-        return False
+            if ip.is_private or ip.is_loopback or ip.is_link_local or ip.is_reserved:
+                raise UnsafeURLError(f"Blocked IP: {ip}")
+            if safe_ip is None:
+                safe_ip = str(ip)
+        
+        return url, safe_ip
+    except (socket.gaierror, ValueError) as e:
+        raise UnsafeURLError(f"Resolution failed: {e}")

 def scrape_text_from_url(url):
-    if not is_safe_url(url):
+    try:
+        validated_url, resolved_ip = resolve_and_validate_url(url)
+    except UnsafeURLError as e:
         return "Error: URL points to a private or internal address."
-    response = requests.get(url)
+    
+    # Use resolved IP directly with Host header to prevent DNS rebinding
+    parsed = urlparse(validated_url)
+    port = parsed.port or (443 if parsed.scheme == 'https' else 80)
+    ip_url = f"{parsed.scheme}://{resolved_ip}:{port}{parsed.path or '/'}"
+    if parsed.query:
+        ip_url += f"?{parsed.query}"
+    
+    response = requests.get(ip_url, headers={'Host': parsed.hostname}, timeout=10)

Alternatively, consider using a battle-tested library like ssrf-guard or implementing the fix at the network layer with egress firewall rules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@main.py` around lines 42 - 57, The DNS rebinding TOCTOU in is_safe_url() can
be bypassed because DNS is resolved again when making the request; change the
flow so you resolve the hostname once (inside is_safe_url or a new
resolve_and_validate(hostname) helper), verify every returned address is not
private/loopback/link-local/reserved (use ip.is_private, ip.is_loopback,
ip.is_link_local and ip.is_reserved), and then make the HTTP request to the
validated numeric IP (or create a socket bound to that IP) while supplying the
original hostname in the Host header so requests.get() does not trigger a second
DNS lookup; update call sites that use is_safe_url() to use the resolved
IP/connection info instead of relying on a second hostname resolution.
