THE Email Scraper — Quickstart & Developer Guide

Purpose

Extract contact emails and related pages for a list of companies (Excel input). Supports static pages, sitemaps and dynamic pages using Playwright.

Prerequisites

Python 3.10 or newer (logs show the project running on Python 3.13).
A virtual environment is strongly recommended.

Canonical quickstart (cross-platform)

Create a virtual environment and activate it

POSIX (bash / zsh / macOS / Linux):

python -m venv .venv
source .venv/bin/activate

PowerShell (Windows):

python -m venv .venv
. .\.venv\Scripts\Activate.ps1

Install dependencies

```
pip install -r requirements.txt
```

Install Playwright browsers (required for dynamic rendering)

```
# inside the activated virtualenv
playwright install
```

If you see errors like "Executable doesn't exist...", re-run this step inside the activated virtualenv.

Configure environment (create a .env file in the project root)

Minimum recommended variables:

GOOGLE_API_KEY and GOOGLE_CX_ID (for Google site search path)
MAX_WORKERS, PROCESS_PDFS, DOMAIN_SCORE_THRESHOLD, etc. — see scraper/config.py for defaults.

Run the scraper

# example usage
scraper test_input.xlsx results.xlsx
# or set environment overrides inline (POSIX)
MAX_WORKERS=8 scraper test_input.xlsx results.xlsx
# or (PowerShell)
$env:MAX_WORKERS=8; scraper test_input.xlsx results.xlsx

Expected flow (high level)

CLI reads Excel input -> Orchestrator scores & resolves domains -> static fetcher or Playwright BrowserService renders pages -> parsers extract emails (optionally from PDFs) -> results written to results.xlsx.
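
The parsing stage of this flow can be illustrated with a minimal sketch. It assumes a deliberately simple regular expression and a hypothetical helper named `extract_emails`; the project's actual parsers may use different patterns and filtering rules.

```python
import re

# Simple illustrative pattern (an assumption, not the project's own regex).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique, lowercased email addresses in first-seen order."""
    seen: dict[str, None] = {}
    for match in EMAIL_RE.findall(text):
        # dict keys keep insertion order, which gives ordered de-duplication
        seen.setdefault(match.lower(), None)
    return list(seen)


html = '<a href="mailto:Info@example.com">Info@example.com</a> or sales@example.org.'
print(extract_emails(html))
```

Lowercasing before de-duplication means `Info@example.com` and `info@example.com` count as one address, which is usually what a contact list needs.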

Playwright note

Playwright is required for dynamic rendering and login flows (for example, Canvas). After adding or updating Playwright, always run `playwright install` inside the activated virtualenv to download the browser executables.

Minimal .env example

GOOGLE_API_KEY=your_key
GOOGLE_CX_ID=your_cx
MAX_WORKERS=8
PROCESS_PDFS=False
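
Values in a .env file are plain strings, so numeric and boolean settings need explicit conversion. The sketch below shows one way to do that with the standard library only. The helper names `parse_env` and `as_bool` and the fallback defaults are hypothetical; the real handling lives in scraper/config.py and may differ.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


sample = """GOOGLE_API_KEY=your_key
GOOGLE_CX_ID=your_cx
MAX_WORKERS=8
PROCESS_PDFS=False
"""

settings = parse_env(sample)
max_workers = int(settings.get("MAX_WORKERS", "4"))  # "4" is a hypothetical fallback
process_pdfs = as_bool(settings.get("PROCESS_PDFS", "false"))
print(max_workers, process_pdfs)
```

The explicit `as_bool` matters because `bool("False")` is `True` in Python: any non-empty string is truthy.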

Troubleshooting

Error: "Executable doesn't exist..." — run playwright install inside the activated virtualenv.
Missing Google keys — set GOOGLE_API_KEY and GOOGLE_CX_ID or use a non‑Google lookup path if available.
Windows multiprocessing issues with high worker counts — reduce MAX_WORKERS in .env or via the CLI.
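
Two of the checks above can be run before a scrape starts. The following is a sketch under stated assumptions: the function names and the Windows cap of 4 workers are hypothetical and are not taken from the project.

```python
import sys
from collections.abc import Mapping


def missing_google_keys(env: Mapping[str, str]) -> list[str]:
    """List the Google settings that are absent or empty."""
    return [key for key in ("GOOGLE_API_KEY", "GOOGLE_CX_ID") if not env.get(key)]


def effective_workers(requested: int, platform: str = sys.platform, windows_cap: int = 4) -> int:
    """Clamp the worker count on Windows; windows_cap is an assumed value."""
    requested = max(1, requested)
    if platform.startswith("win"):
        return min(requested, windows_cap)
    return requested


print(missing_google_keys({"GOOGLE_API_KEY": "your_key"}))
print(effective_workers(8, platform="win32"))
```

Passing `platform` explicitly keeps the function testable on any operating system.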

Developer workflow

Branch from main for features (e.g., feature/canvas-scraper).
Run formatting and linting (e.g., black, flake8) before committing.
Add unit tests with pytest; Playwright browser install is required in CI if tests rely on it.
For local debugging, set LOGLEVEL=DEBUG or run with a single-row test_input.xlsx.
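
How the project reads LOGLEVEL is not shown here. As a rough sketch, an environment-driven log level can be wired up with the standard `logging` module as follows; the function name `configure_logging` and the fallback to INFO are assumptions.

```python
import logging
import os
from collections.abc import Mapping


def configure_logging(env: Mapping[str, str] = os.environ) -> int:
    """Set the root log level from LOGLEVEL, falling back to INFO."""
    name = env.get("LOGLEVEL", "INFO").upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        # Unknown names fall back to INFO instead of raising
        level = logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace any handlers configured earlier
    )
    return level


configure_logging({"LOGLEVEL": "debug"})
logging.getLogger("scraper").debug("debug logging enabled")
```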

Config reference

Defaults and keys are in scraper/config.py (see the Config class). Important values: MAX_WORKERS, PROCESS_PDFS, GOOGLE_API_KEY, GOOGLE_CX_ID, MAX_URLS_PER_SITEMAP, timeouts and user agents.
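
To illustrate what a limit such as MAX_URLS_PER_SITEMAP typically controls, here is a minimal sketch that reads a sitemap and stops after a fixed number of URLs. The function `sitemap_urls` is hypothetical and the project's sitemap handling may work differently; only the sitemap XML namespace is standard.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str, max_urls: int) -> list[str]:
    """Return at most max_urls <loc> values from a sitemap document."""
    if max_urls <= 0:
        return []
    root = ET.fromstring(xml_text)
    urls: list[str] = []
    for loc in root.iterfind(".//sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
            if len(urls) >= max_urls:
                break
    return urls


sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/contact</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(sitemap_urls(sample_sitemap, max_urls=2))
```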

CI hints

Ensure playwright install runs as a CI step and cache the downloaded browsers between runs to save time.

Next steps

Possible additions: a README.dev.md with extended developer notes, or a scripts/dev_setup.ps1 script that automates venv creation, dependency installation and `playwright install`.
