Crawler

A Node.js and Puppeteer application that finds and validates email addresses by crawling the pages linked from search engine results, using either a specific search key or randomly generated ones.

Built in February 2020. The application automates the discovery of email addresses across the web: it generates search keys, crawls the resulting pages, validates each address it finds, and stores the results in MongoDB.

Features

Core Capabilities

🔍 Multi-Search Engine Support: Crawls Bing and Google search results
🤖 Headless Browser: Uses Puppeteer.js for real browser-based page rendering
✉️ Smart Email Validation: Advanced validation with automatic typo correction
🗄️ MongoDB Storage: Stores and deduplicates email addresses
🔄 Auto-Restart Monitor: Automatically restarts on failures or timeouts
🎯 Flexible Goals: Stop based on email count, time duration, or links crawled
📊 Real-Time Statistics: Live console status updates with progress tracking

Technical Excellence

🧪 Development Mode: Test with local sources without making real requests
🚫 Smart Filtering: Configurable domain and email filters
📝 Comprehensive Logging: Logs all emails and links to TXT files
🇮🇱 Hebrew Support: Built-in Hebrew search key generation
🧹 Gibberish Detection: Filters out randomly generated email addresses

Developer Experience

Real-Time Feedback: Detailed console status lines for monitoring progress
Resilience: Automatic restart logic for handling network timeouts or crashes
Testability: Sandbox and specialized test scripts for validating individual components
Configuration Flexibility: Modular settings for search, filtering, and goal management

Getting Started

Prerequisites

Node.js (v14 or higher)
MongoDB (v4 or higher)
npm or pnpm

Installation

Clone the repository:

git clone https://github.com/orassayag/crawler.git
cd crawler

Install dependencies:

npm install

Ensure MongoDB is running:

mongod

For production mode with Puppeteer:

npm run preload

Quick Start

Test Mode (Development)

# Edit src/settings/settings.js
# Set IS_PRODUCTION_MODE: false
# Set GOAL_VALUE: 10
npm start

Production Mode

# Edit src/settings/settings.js
# Set IS_PRODUCTION_MODE: true
# Configure search engines and keys
npm run preload
npm start

Type y when prompted to confirm settings and start crawling.

Usage

Starting the Application

The main entry point is the monitor script, which ensures the crawler remains active:

npm start

Goal Management

Configure your crawling goals in src/settings/settings.js:

EMAIL_ADDRESSES: Stop after finding X valid emails.
MINUTES: Stop after running for X minutes.
LINKS: Stop after crawling X links.
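
As a rough illustration of how the three goal types could be checked, here is a minimal sketch. The function name `isGoalReached` and the field names on `stats` are hypothetical and are not taken from the project's source; only the goal type names come from the settings described above.

```javascript
// Hypothetical goal check. Only the three goal type names come from the README.
const GoalType = Object.freeze({
  EMAIL_ADDRESSES: 'EMAIL_ADDRESSES',
  MINUTES: 'MINUTES',
  LINKS: 'LINKS',
});

// stats is an assumed shape: counters plus start/now timestamps in milliseconds.
function isGoalReached(goalType, goalValue, stats) {
  switch (goalType) {
    case GoalType.EMAIL_ADDRESSES:
      return stats.savedEmailAddresses >= goalValue;
    case GoalType.MINUTES:
      return (stats.nowMs - stats.startedAtMs) / 60000 >= goalValue;
    case GoalType.LINKS:
      return stats.crawledLinks >= goalValue;
    default:
      throw new Error(`Unknown goal type: ${goalType}`);
  }
}
```

Passing the current time in through `stats` rather than calling `Date.now()` inside the function keeps the check easy to test.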

Data Extraction

The application automatically extracts emails from the page source of search results and subsequent links, applying multi-stage validation.
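
The extraction step can be pictured as a pattern match over the raw page source. The sketch below is an assumption about the general approach, not the project's actual implementation; `EMAIL_PATTERN` and `extractEmailAddresses` are illustrative names, and the pattern is deliberately loose because stricter validation is expected to run afterwards.

```javascript
// Loose pattern for candidate addresses; it is not a full validator.
const EMAIL_PATTERN = /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+/g;

function extractEmailAddresses(pageSource) {
  const matches = pageSource.match(EMAIL_PATTERN) || [];
  // Lowercase and de-duplicate while keeping first-seen order.
  return [...new Set(matches.map((match) => match.toLowerCase()))];
}
```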

Configuration

Edit src/settings/settings.js to configure:

Core Settings

IS_PRODUCTION_MODE: Use real crawling (true) or test mode (false)
GOAL_TYPE: Stop condition - EMAIL_ADDRESSES, MINUTES, or LINKS
GOAL_VALUE: Target value for the goal
IS_DROP_COLLECTION: Clear database before starting

Search Configuration

SEARCH_KEY: Static search term or null for random keys
IS_ADVANCE_SEARCH_KEYS: Use advanced Hebrew keys or basic static keys
Search engines configured in src/configurations/files/searchEngines.configuration.js
Search keys configured in src/configurations/files/searchKeys.configuration.js
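
To make the relationship between SEARCH_KEY and the search engines concrete, here is a hedged sketch of choosing a key and building a search URL. The `SEARCH_ENGINES` map, `pickSearchKey`, and `buildSearchUrl` are hypothetical names; the project reads its engines and keys from the configuration files listed above, and their real shape may differ.

```javascript
// Public search endpoints, used here only as example values.
const SEARCH_ENGINES = {
  bing: 'https://www.bing.com/search',
  google: 'https://www.google.com/search',
};

// Returns the static key when one is configured, otherwise a random key from the list.
function pickSearchKey(staticKey, searchKeys, random = Math.random) {
  if (staticKey) return staticKey;
  return searchKeys[Math.floor(random() * searchKeys.length)];
}

// Builds the results URL for one engine, letting URL handle query encoding.
function buildSearchUrl(engineName, searchKey) {
  const url = new URL(SEARCH_ENGINES[engineName]);
  url.searchParams.set('q', searchKey);
  return url.toString();
}
```

For example, `buildSearchUrl('bing', pickSearchKey('job developer', []))` produces a Bing results URL for the static key, while passing `null` as the first argument falls back to a random entry from the list.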

Filtering

Email filters: src/configurations/files/filterEmailAddress.configuration.js
Link filters: src/configurations/files/filterLinkDomains.configuration.js
File extensions: src/configurations/files/filterFileExtensions.configuration.js
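
A link filter built on lists like these might look as follows. This is a sketch under assumptions: the list contents are invented examples and `isLinkAllowed` is a hypothetical name, not a function from the project.

```javascript
// Example filter lists; the real ones live in the configuration files listed above.
const FILTERED_DOMAINS = ['facebook.com', 'youtube.com'];
const FILTERED_EXTENSIONS = ['.pdf', '.jpg', '.zip'];

function isLinkAllowed(link) {
  let url;
  try {
    url = new URL(link);
  } catch {
    return false; // Not an absolute URL.
  }
  const host = url.hostname.toLowerCase();
  const path = url.pathname.toLowerCase();
  // Block the domain itself and any of its subdomains.
  const domainBlocked = FILTERED_DOMAINS.some(
    (domain) => host === domain || host.endsWith(`.${domain}`)
  );
  const extensionBlocked = FILTERED_EXTENSIONS.some((extension) => path.endsWith(extension));
  return !domainBlocked && !extensionBlocked;
}
```

Matching on the parsed hostname rather than the whole URL string avoids rejecting pages that merely mention a filtered domain in their path or query.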

See INSTRUCTIONS.md for detailed configuration options.

Available Scripts

Main Application

npm start              # Start crawler with monitoring
npm run backup         # Backup the project
npm run domains        # Count email domains from results

Testing Scripts

npm run val            # Validate single email address
npm run valmany        # Validate multiple email addresses
npm run valdebug       # Debug email validation
npm run typos          # Test typo detection and correction
npm run link           # Test link crawling
npm run session        # Test session with predefined links
npm run generator      # Test email address generation
npm run cases          # Run email validation test cases
npm run sand           # General testing sandbox

Project Structure

crawler/
├── src/
│   ├── monitor/              # Application entry point with restart logic
│   ├── scripts/              # Executable scripts
│   │   ├── crawl.script.js   # Main crawling script
│   │   ├── backup.script.js  # Backup script
│   │   └── domains.script.js # Domain counter script
│   ├── logics/               # Business logic orchestration
│   │   └── crawl.logic.js    # Core crawling logic
│   ├── services/             # Service layer
│   │   ├── crawlLink.service.js          # Link crawling
│   │   ├── crawlEmailAddress.service.js  # Email extraction
│   │   ├── emailAddressValidation.service.js # Email validation
│   │   ├── mongoDatabase.service.js      # Database operations
│   │   ├── puppeteer.service.js          # Browser automation
│   │   └── search.service.js             # Search key generation
│   ├── configurations/       # Configuration files
│   │   └── files/
│   │       ├── searchEngines.configuration.js
│   │       ├── searchKeys.configuration.js
│   │       ├── filterEmailAddress.configuration.js
│   │       ├── filterLinkDomains.configuration.js
│   │       └── filterFileExtensions.configuration.js
│   ├── settings/             # Application settings
│   │   └── settings.js       # Main settings file
│   ├── core/                 # Core models and enums
│   │   ├── models/           # Data models
│   │   └── enums/            # Enumerations
│   ├── utils/                # Utility functions
│   └── tests/                # Test files
├── dist/                     # Output files (generated)
│   ├── production/           # Production mode outputs
│   └── development/          # Development mode outputs
├── sources/                  # Test sources for development mode
├── INSTRUCTIONS.md           # Detailed setup and usage guide
├── CONTRIBUTING.md           # Contribution guidelines
└── package.json

Architecture

Directory Structure

The project follows a modular structure:

src/monitor/: Process management and auto-restart logic.
src/scripts/: Top-level execution scripts for different tasks.
src/logics/: Orchestration of complex business workflows.
src/services/: Atomic business logic (Puppeteer, MongoDB, Search).
src/configurations/: Static and dynamic rule sets.
src/settings/: Centralized application configuration.
src/core/: Shared models, enums, and data structures.

Architecture Principles

Separation of Concerns: Process management (Monitor) is separated from business logic (Logics) and infrastructure (Services).
Goal-Oriented Design: Execution is driven by configurable targets (emails, time, links).
Fault Tolerance: Automatic monitoring and restart mechanisms handle runtime failures.
Stateless Services: Most services are designed to be stateless, relying on the logic layer for state management.

Design Patterns

Monitor Pattern: A wrapper process handles the lifecycle of the main crawler.
Service Layer: Decouples business logic from implementation details like Puppeteer or MongoDB.
Configuration-Driven Development: Application behavior is primarily controlled through configuration files rather than code changes.
Singleton Pattern: Core services like database and loggers are managed as singletons.
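
The monitor idea (restart on failure or timeout) can be sketched in-process as below. This is a simplification for illustration only: the project lists forever-monitor among its dependencies and supervises a separate process, whereas `runWithRestarts` is a hypothetical helper that wraps an async function.

```javascript
// Runs an async task, restarting it when it throws or exceeds the timeout.
// Note: a timed-out task is abandoned, not cancelled; a real monitor would kill the process.
async function runWithRestarts(task, { maxRestarts = 3, timeoutMs = 60000 } = {}) {
  for (let restarts = 0; ; restarts++) {
    let timer;
    const timeout = new Promise((_, reject) => {
      timer = setTimeout(() => reject(new Error('Task timed out')), timeoutMs);
    });
    try {
      const result = await Promise.race([task(restarts), timeout]);
      return { result, restarts };
    } catch (error) {
      // Give up once the restart budget is spent.
      if (restarts >= maxRestarts) throw error;
      console.error(`Restarting after failure: ${error.message}`);
    } finally {
      clearTimeout(timer);
    }
  }
}
```

The returned `restarts` count corresponds to the kind of counter shown as `Restarts` in the console status line.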

How It Works

graph TB
    A[Start Monitor] --> B[Confirm Settings]
    B --> C{MongoDB Connected?}
    C -->|No| D[Exit with Error]
    C -->|Yes| E[Start Crawl Logic]

    E --> F[Generate Search Key]
    F --> G[Build Search Engine URL]
    G --> H[Fetch Search Results with Puppeteer]

    H --> I[Extract Links from Results]
    I --> J[Filter Links]
    J --> K{More Links?}

    K -->|Yes| L[Fetch Page with Puppeteer]
    L --> M[Extract Email Addresses]
    M --> N[Validate Each Email]

    N --> O{Valid Email?}
    O -->|Yes| P[Check if Exists in DB]
    O -->|No| Q{Can Fix Typo?}

    Q -->|Yes| P
    Q -->|No| R[Log as Invalid]

    P --> S{Exists?}
    S -->|No| T[Save to MongoDB]
    S -->|Yes| U[Skip - Already Exists]

    T --> V[Log to TXT File]
    V --> K
    U --> K
    R --> K

    K -->|No| W{Goal Reached?}
    W -->|No| X[Next Process]
    W -->|Yes| Y[End & Log Statistics]

    X --> F

    Y --> Z[Close Puppeteer]
    Z --> AA[Exit Successfully]

    subgraph "Email Validation"
        N --> N1[Check Format]
        N1 --> N2[Check Common Typos]
        N2 --> N3[Validate Domain]
        N3 --> N4[Gibberish Detection]
        N4 --> N5[Final Validation]
    end

    subgraph "Monitoring"
        BB[Monitor Process] --> CC{Timeout?}
        CC -->|Yes| DD[Auto Restart]
        CC -->|No| BB
        DD --> E
    end

Architecture Flow

Monitor Layer: Manages process lifecycle and auto-restart
Crawl Logic: Orchestrates the crawling process
Search Service: Generates search keys and builds search URLs
Crawl Link Service: Fetches and extracts links from search engines
Puppeteer Service: Handles browser automation
Crawl Email Service: Extracts emails from page sources
Email Validation Service: Validates and corrects emails
MongoDB Service: Handles database operations
Log Service: Manages console output and file logging
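
The way these layers cooperate on a batch of links can be sketched as a single loop with its dependencies injected. Everything here is an assumption made for illustration: `crawlLinks` and the four injected members are hypothetical names, and a `Set` stands in for the MongoDB collection.

```javascript
// Each dependency is injected so the loop can be exercised without a browser or database.
async function crawlLinks(links, { fetchPageSource, extractEmailAddresses, validate, store }) {
  const stats = { crawled: 0, errors: 0, saved: 0, exists: 0, invalid: 0 };
  for (const link of links) {
    let pageSource;
    try {
      pageSource = await fetchPageSource(link); // e.g. a Puppeteer-backed fetch
      stats.crawled++;
    } catch {
      stats.errors++; // The link failed to load; move on to the next one.
      continue;
    }
    for (const candidate of extractEmailAddresses(pageSource)) {
      const emailAddress = validate(candidate); // Returns a cleaned address or null.
      if (!emailAddress) {
        stats.invalid++;
      } else if (store.has(emailAddress)) {
        stats.exists++;
      } else {
        store.add(emailAddress); // Stand-in for saving to the database.
        stats.saved++;
      }
    }
  }
  return stats;
}
```

The counters mirror the categories in the console status lines (saved, exists, invalid, error), which makes it straightforward to report progress after each link.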

Email Validation Features

The email validation service includes:

Format Validation: Checks proper email structure
Typo Correction: Automatically fixes common typos (e.g., gmial.com → gmail.com)
Domain Validation: Verifies domain endings and structure
Gibberish Detection: Filters out randomly generated strings
Common Domain Recognition: Special handling for Gmail, Hotmail, etc.
Character Validation: Removes invalid characters
Length Validation: Enforces min/max length constraints
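
A few of these checks (format, length, and typo correction) can be combined into a small function. The sketch below is not the project's validation service; the typo map, the format pattern, the 254-character limit, and the name `validateEmailAddress` are all assumptions chosen for illustration.

```javascript
// Example typo map; the project's actual list is not shown in this README.
const DOMAIN_TYPOS = {
  'gmial.com': 'gmail.com',
  'gmai.com': 'gmail.com',
  'hotmial.com': 'hotmail.com',
};
const FORMAT = /^[a-z0-9._%+-]+@[a-z0-9-]+(\.[a-z0-9-]+)+$/;
const MAX_LENGTH = 254; // A commonly cited upper bound, assumed here.

// Returns { emailAddress, isFixed } for a usable address, or null when it cannot be used.
function validateEmailAddress(input) {
  const cleaned = input.trim().toLowerCase();
  if (cleaned.length > MAX_LENGTH || !FORMAT.test(cleaned)) return null;
  // FORMAT guarantees exactly one "@", so the split yields two parts.
  const [localPart, domain] = cleaned.split('@');
  const fixedDomain = DOMAIN_TYPOS[domain] || domain;
  return { emailAddress: `${localPart}@${fixedDomain}`, isFixed: fixedDomain !== domain };
}
```

Returning an `isFixed` flag lets the caller decide whether the address belongs with the valid results or with the auto-corrected ones.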

Console Status Example

===IMPORTANT SETTINGS===
SEARCH ENGINES: bing, google
DATABASE: crawl032021
IS_PRODUCTION_MODE: true
IS_DROP_COLLECTION: false
GOAL_TYPE: MINUTES
GOAL_VALUE: 700
========================

===[SETTINGS] Mode: PRODUCTION | Plan: STANDARD | Database: crawl032021 | Active Methods: LINKS,CRAWL===
===[GENERAL] Time: 00.00:05:23 | Goal: MINUTES | Progress: 5/700 (00.71%) | Status: CRAWL | Restarts: 0===
===[PROCESS] Process: 3/10,000 | Page: 1/1 | Engine: Bing | Key: job developer===
===[LINK] Crawl: ✅  15 | Total: 42 | Filter: 27 | Error: 0 | Current: 3/15===
===[EMAIL ADDRESS] Save: ✅  12 | Total: 28 | Database: 15,927 | Exists: 14 | Invalid: ❌  2===

Output Files

All output files are saved in dist/production/YYYYMMDD_HHMMSS/ or dist/development/:

valid_email_addresses.txt - Successfully validated emails
fix_email_addresses.txt - Emails that were auto-corrected
invalid_email_addresses.txt - Invalid emails that couldn't be fixed
crawl_links.txt - All crawled page URLs
crawl_error_links.txt - URLs that failed to load
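
For reference, a timestamped folder name in the YYYYMMDD_HHMMSS form can be produced as sketched below. The function names are hypothetical, and the use of local time is an assumption, since the README does not say which time zone the project uses.

```javascript
// Builds a folder name such as 20210314_093005 from a Date, using local time.
function formatRunFolder(date) {
  const pad = (value) => String(value).padStart(2, '0');
  const day = `${date.getFullYear()}${pad(date.getMonth() + 1)}${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}${pad(date.getMinutes())}${pad(date.getSeconds())}`;
  return `${day}_${time}`;
}

// Production runs get their own timestamped folder; development uses a fixed one.
function outputDirectory(isProductionMode, date = new Date()) {
  return isProductionMode ? `dist/production/${formatRunFolder(date)}/` : 'dist/development/';
}
```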

Development

Running Tests

# Test email validation
npm run val

# Test link crawling
npm run link

# Test email generation
npm run generator

# Test typo correction
npm run typos

Development Mode

Set IS_PRODUCTION_MODE: false in settings to:

Use local HTML sources instead of real requests
Test without Puppeteer
Avoid rate limiting from search engines
Debug faster without network delays

Best Practices

Polite Crawling: Respect robots.txt and avoid aggressive crawling frequencies.
Goal Setting: Start with smaller goals (e.g., 50 emails) to verify settings before long runs.
Database Maintenance: Regularly backup your MongoDB collections using npm run backup.
Search Key Diversity: Use advanced search keys to improve discovery rates and avoid search engine pattern detection.
Monitor Output: Keep an eye on the console status line to ensure the crawler is making progress.

Contributing

Contributions to this project are released to the public under the project's open source license.

Everyone is welcome to contribute. Contributing doesn't just mean submitting pull requests; there are many other ways to get involved, including answering questions and reporting issues.

See CONTRIBUTING.md for detailed guidelines.

Built With

Node.js - JavaScript runtime
Puppeteer - Headless browser automation
MongoDB - Database
Mongoose - MongoDB object modeling
Axios - HTTP client
forever-monitor - Process monitoring

License

This application is released under the MIT license. See the LICENSE file for details.

Author

Or Assayag - Initial work - orassayag
Email: orassayag@gmail.com
GitHub: https://github.com/orassayag
StackOverflow: https://stackoverflow.com/users/4442606/or-assayag?tab=profile
LinkedIn: https://linkedin.com/in/orassayag

Acknowledgments

Built for educational and research purposes
Respects robots.txt and implements rate limiting
Uses user-agent rotation to avoid detection
Implements polite crawling practices
