EDUCATIONAL USE ONLY: This tool is designed for educational purposes, security research, and legitimate OSINT investigations. Users must comply with all applicable laws and ethical guidelines. Misuse for stalking, harassment, or illegal activities is strictly prohibited.
A multi-agent OSINT tool built with LangGraph that searches across Google, social media platforms, and enrichment APIs in parallel, then chains the results through an analysis and report-generation stage powered by Claude (Anthropic).
- Parallel LangGraph workflow: Google search and social media search run simultaneously; results feed a single analysis → report pipeline
- Platform coverage: Google (Tavily → SerpAPI → free fallback), GitHub, Reddit, Twitter/X (with dork fallback), YouTube, LinkedIn, Instagram (dork fallback), Facebook (dork fallback), SoundCloud (dork fallback)
- API enrichment: HIBP breach detection, Hunter.io email discovery
- Resilient execution: Automatic retry with exponential backoff, disk-based caching to avoid rate limits
- Input validation: Sanitization and validation of all user inputs
- Streamlit web UI: interactive interface with dual tabs (Investigate / Reports) for real-time log streaming and viewing saved reports inline
- CLI mode: scriptable via `python main.py`
- Advanced analysis (optional): timeline correlation, network analysis, deep content analysis
- Docker-first: a single image runs the app, unit tests, and UI tests
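The parallel fan-out/fan-in shape described above can be sketched with standard-library threads (a toy analogue with placeholder search functions; the real pipeline is a LangGraph graph built in `graph/workflow.py`):

```python
from concurrent.futures import ThreadPoolExecutor

def google_search(target: str) -> dict:
    return {"google": f"results for {target}"}   # placeholder for the real searcher

def social_search(target: str) -> dict:
    return {"social": f"profiles for {target}"}  # placeholder for the real searcher

def analyze(results: dict) -> dict:
    # Fan-in: both result sets are correlated in a single analysis step
    return {"findings": sorted(results)}

def report(analysis: dict) -> str:
    return f"Report: {analysis['findings']}"

def run_pipeline(target: str) -> str:
    # Fan-out: Google and social search run at the same time
    with ThreadPoolExecutor() as pool:
        google = pool.submit(google_search, target)
        social = pool.submit(social_search, target)
        merged = {**google.result(), **social.result()}
    # Then a single analysis -> report chain
    return report(analyze(merged))
```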
- Python 3.11+ (or Docker)
- Required: Anthropic API key
- Optional: additional API keys for richer results (see Environment)
```bash
# Copy and fill in your API keys
cp .env.example .env   # then edit .env and set ANTHROPIC_API_KEY

# Build and start the web UI
docker compose up --build
```

Open http://localhost:8501 in your browser.
CLI via Docker:

```bash
docker compose run --rm osint-tool python main.py "John Doe"
```

Local (non-Docker) setup:

```bash
# Install dependencies
pip install -r requirements.txt

# Copy and configure environment
cp .env.example .env   # then edit .env

# Start the web UI
streamlit run app.py

# Or use the CLI
python main.py "John Doe"
```

Copy `.env.example` to `.env` and configure:
```bash
# Required
ANTHROPIC_API_KEY=sk-ant-...
LLM_MODEL=claude-sonnet-4-6   # default; change to use a different Claude model

# Optional — the tool works without these, but results improve significantly
TAVILY_API_KEY=...            # Best Google results (free tier available)
SERPAPI_KEY=...               # Google results fallback ($50/mo, 5000 searches)
GITHUB_TOKEN=...              # Higher rate limits (free personal access token)
TWITTER_BEARER_TOKEN=...      # Twitter timeline access (free tier available)
YOUTUBE_API_KEY=...           # YouTube channel data (free, 10 000 units/day)
HUNTER_API_KEY=...            # Email discovery (free tier: 25 searches/month)
HIBP_API_KEY=...              # Breach detection ($3.50/month)

# Logging
LOG_LEVEL=INFO                # DEBUG, INFO, WARNING, ERROR
```

Getting API keys:
| Key | Where to get it | Cost |
|---|---|---|
| ANTHROPIC_API_KEY | console.anthropic.com | Pay-per-use |
| GITHUB_TOKEN | GitHub → Settings → Developer settings → Personal access tokens | Free |
| YOUTUBE_API_KEY | console.developers.google.com → YouTube Data API v3 | Free (quota) |
| TWITTER_BEARER_TOKEN | developer.twitter.com | Free tier |
| HUNTER_API_KEY | hunter.io/api | Free tier (25/mo) |
| HIBP_API_KEY | haveibeenpwned.com/API/Key | $3.50/mo |
| TAVILY_API_KEY | tavily.com | Free tier available |
| SERPAPI_KEY | serpapi.com | $50/mo |
Quick start recommendation: enable `TAVILY_API_KEY`, `GITHUB_TOKEN`, and `YOUTUBE_API_KEY` first — all have free tiers and require no approval.
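At startup, required vs. optional keys can be checked along these lines (an illustrative sketch, not the tool's actual loader):

```python
import os

REQUIRED_KEYS = ["ANTHROPIC_API_KEY"]
OPTIONAL_KEYS = ["TAVILY_API_KEY", "SERPAPI_KEY", "GITHUB_TOKEN",
                 "TWITTER_BEARER_TOKEN", "YOUTUBE_API_KEY",
                 "HUNTER_API_KEY", "HIBP_API_KEY"]

def load_config(env=None):
    """Fail fast on missing required keys; collect whatever optional keys are set."""
    env = os.environ if env is None else env
    missing = [k for k in REQUIRED_KEYS if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required keys: {', '.join(missing)}")
    return {k: env[k] for k in REQUIRED_KEYS + OPTIONAL_KEYS if env.get(k)}
```

The optional keys simply go unused when absent, which mirrors the "works without these, but results improve" behavior described above.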
DigitalFootprintInvestigator/
├── graph/
│ ├── nodes/
│ │ ├── _timing.py # Shared log_start / log_done helpers
│ │ ├── search.py # Google and social search nodes (run in parallel)
│ │ ├── analysis.py # Data correlation and pattern extraction
│ │ ├── advanced.py # Optional timeline / network / content analysis
│ │ └── report.py # Claude-powered report generation
│ ├── state.py # LangGraph state TypedDict
│ └── workflow.py # Graph construction and MemorySaver checkpointing
├── tools/
│ ├── search_tools.py # Google search and platform scrapers
│ └── api_tools.py # HIBP, Hunter.io, YouTube, Twitter wrappers
├── utils/
│ ├── llm.py # Shared ChatAnthropic factory
│ ├── logger.py # Logging setup
│ ├── cache.py # Disk-based caching for API calls
│ ├── retry.py # Exponential backoff retry logic
│ ├── validation.py # Input validation and sanitization
│ └── models.py # Pydantic data models
├── tests/
│ ├── conftest.py # Playwright session/page fixtures
│ ├── healer.py # Self-healing Playwright page wrapper
│ ├── unit/ # 215 unit tests (no browser or live API required)
│ └── ui/ # 25 Playwright browser tests
├── app.py # Streamlit web UI
├── main.py # CLI entry point
├── config.yaml # Platform and analysis settings
├── pytest.ini # Test markers and paths
├── .env.example # Environment variable template
├── requirements.txt # Python dependencies
├── Dockerfile # Single image used by all three services
├── docker-compose.yml # Services: osint-tool, unit-tests, tests
├── pyproject.toml # Bandit security scan config
├── .pre-commit-config.yaml # Pre-commit hooks
└── .dockerignore # Docker build exclusions
```bash
# Unit tests (no browser, no API key needed)
python -m pytest tests/unit/ -v

# UI tests (starts Streamlit automatically; requires a Playwright browser)
playwright install chromium   # first time only
python -m pytest tests/ui/ -m "not integration" -v

# Integration tests (require ANTHROPIC_API_KEY and a full workflow run)
python -m pytest tests/ui/ -m integration -v
```

In Docker:

```bash
# Unit tests — no running app needed
docker compose run --rm unit-tests

# UI tests — automatically starts the app and waits for it to be healthy
docker compose run --rm tests
```

Integration tests are excluded by default in Docker (`-m "not integration"`). To run them:

```bash
docker compose run --rm tests pytest tests/ui/ -v
```
config.yaml controls platforms and advanced analysis defaults. The Streamlit sidebar and CLI flags override these values per run.
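Conceptually, that per-run override is just a shallow merge (hypothetical helper shown for illustration; key names match `config.yaml`):

```python
def effective_settings(config_defaults: dict, overrides: dict) -> dict:
    """Merge per-run CLI/sidebar overrides over config.yaml defaults.

    A value of None in overrides means "not set for this run",
    so the config.yaml default is kept.
    """
    merged = dict(config_defaults)
    merged.update({k: v for k, v in overrides.items() if v is not None})
    return merged
```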
Advanced analysis (each option adds an extra LLM pass; all off by default):
```yaml
advanced_analysis:
  timeline_correlation: false    # Build a chronological activity timeline
  network_analysis: false        # Map relationships between accounts
  deep_content_analysis: false   # Sentiment, topics, behavioral patterns
```

CLI flags:

```bash
python main.py "Jane Smith" --timeline --network --deep
```

Usage examples:

```bash
# Name
python main.py "Jane Smith"

# Email
python main.py "jane.smith@example.com"

# Username
python main.py "@janesmith"

# With advanced analysis
python main.py "Jane Smith" --timeline --network --deep
```

Reports are saved to `reports/` with a timestamp, e.g. `reports/Jane_Smith_20260227_143022.md`.
Pre-commit hooks run automatically on git commit:
```bash
pip install pre-commit
pre-commit install
```

Hooks: detect-secrets, ruff (lint + format), merge-conflict detection, large-file guard, YAML/JSON/TOML validation, debug-statement blocking, and bandit security scanning. Bandit config lives in pyproject.toml.
To run all hooks manually: pre-commit run --all-files
- Add a search function to `tools/search_tools.py` following the `_search_github` / `_search_reddit` pattern.
- Register it in the `_search_platform` dispatch table in the same file.
- Optionally add platform config to `config.yaml`.
- Create `graph/nodes/my_node.py`:

```python
from graph.state import OSINTState
from graph.nodes._timing import log_start, log_done

def my_node(state: OSINTState) -> dict:
    start = log_start("My Node")
    # ... process state ...
    log_done("My Node", start)
    return {"my_key": result}
```

- Register it in `graph/workflow.py`:

```python
from .nodes.my_node import my_node

workflow.add_node("my_node", my_node)
workflow.add_edge("analysis", "my_node")
```

"No Anthropic API key found" — create `.env` from `.env.example` and set `ANTHROPIC_API_KEY`.
Google searches return no results — the free googlesearch-python library is rate-limited and unreliable. Add TAVILY_API_KEY or SERPAPI_KEY to .env for consistent results (Tavily is tried first, then SerpAPI, then the free fallback).
Page refresh hangs in Docker on Windows — Streamlit’s file watcher conflicts with Docker volume mounts. fileWatcherType = "none" is set in .streamlit/config.toml; restore it if it gets removed.
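For reference, the setting in `.streamlit/config.toml` should look like this:

```toml
[server]
fileWatcherType = "none"
```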
Module not found — run pip install -r requirements.txt and confirm Python 3.11+.
Console warning: missing ScriptRunContext — harmless startup warning, handled in app.py.
Console warning: file_cache is only supported with oauth2client<4.0.0 — harmless warning from the Google API client library.
Docker Desktop won’t start (WSL error) — the Ubuntu WSL distro may have auto-shut down. Run wsl -d Ubuntu in a terminal first, wait a few seconds, then retry Docker Desktop.
EDUCATIONAL AND LEGITIMATE PURPOSES ONLY
Appropriate uses: security research, due diligence, investigative journalism, personal privacy audits, OSINT methodology research, identity verification with consent.
Prohibited uses: stalking or harassment, unauthorized surveillance, identity theft, doxxing, any illegal activity.
All searches use publicly available information only. Users are responsible for compliance with:
- Local privacy laws (GDPR, CCPA, etc.)
- Platform Terms of Service (including Twitter/X, Reddit, YouTube, and others)
- The Computer Fraud and Abuse Act (CFAA) and equivalent local laws
The consent checkbox in the UI is not a legal shield — you remain fully responsible for how you use this tool.
By using this tool, you agree to use it responsibly and ethically.
The tool includes built-in optimizations:
- Caching: API responses cached for 1-24 hours to reduce quota usage and improve speed. Repeated searches are significantly faster.
- Retry logic: Failed API calls automatically retry up to 3 times with exponential backoff for resilience against network issues.
- Input validation: All inputs sanitized and validated before processing to prevent errors and improve security.
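The retry behavior can be sketched as a decorator (parameter names are illustrative; the project's actual implementation lives in `utils/retry.py`):

```python
import functools
import time

def retry(max_attempts=3, base_delay=1.0, backoff=2.0, sleep=time.sleep):
    """Retry a function with exponential backoff (1s, 2s, 4s, ...)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise          # out of attempts: re-raise the last error
                    sleep(delay)
                    delay *= backoff
        return wrapper
    return decorator
```

The injectable `sleep` parameter keeps the decorator unit-testable without real delays.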
Cache files are stored in .cache/ and automatically expire based on TTL. To clear cache: rm -rf .cache/ (Unix) or rmdir /s .cache (Windows).
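The TTL expiry rule can be illustrated with an in-memory analogue (the real cache in `utils/cache.py` is disk-based; this sketch only shows the expiry logic):

```python
import time

class TTLCache:
    """Cache entries expire once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]    # expired: evict and miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock())
```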
This project is for educational purposes. Use responsibly and in accordance with applicable laws.