Crawler

A Node.js + Puppeteer.js application to fetch and validate email addresses by crawling web pages from search engine results using specific and random search keys.

Built in February 2020. This application automates the process of discovering email addresses across the web using intelligent search strategies, advanced validation, and MongoDB storage.

Features

Core Capabilities

  • 🔍 Multi-Search Engine Support: Crawls Bing and Google search results
  • 🤖 Headless Browser: Uses Puppeteer.js for real browser-based page rendering
  • ✉️ Smart Email Validation: Advanced validation with automatic typo correction
  • 🗄️ MongoDB Storage: Stores and deduplicates email addresses
  • 🔄 Auto-Restart Monitor: Automatically restarts on failures or timeouts
  • 🎯 Flexible Goals: Stop based on email count, time duration, or links crawled
  • 📊 Real-Time Statistics: Live console status updates with progress tracking

Technical Excellence

  • 🧪 Development Mode: Test with local sources without making real requests
  • 🚫 Smart Filtering: Configurable domain and email filters
  • 📝 Comprehensive Logging: Logs all emails and links to TXT files
  • 🇮🇱 Hebrew Support: Built-in Hebrew search key generation
  • 🧹 Gibberish Detection: Filters out randomly generated email addresses

Developer Experience

  • Real-Time Feedback: Detailed console status lines for monitoring progress
  • Resilience: Automatic restart logic for handling network timeouts or crashes
  • Testability: Sandbox and specialized test scripts for validating individual components
  • Configuration Flexibility: Modular settings for search, filtering, and goal management

Getting Started

Prerequisites

  • Node.js (v14 or higher)
  • MongoDB (v4 or higher)
  • npm or pnpm

Installation

  1. Clone the repository:
git clone https://github.com/orassayag/crawler.git
cd crawler
  2. Install dependencies:
npm install
  3. Ensure MongoDB is running:
mongod
  4. For production mode with Puppeteer:
npm run preload

Quick Start

Test Mode (Development)

# Edit src/settings/settings.js
# Set IS_PRODUCTION_MODE: false
# Set GOAL_VALUE: 10
npm start

Production Mode

# Edit src/settings/settings.js
# Set IS_PRODUCTION_MODE: true
# Configure search engines and keys
npm run preload
npm start

Type y when prompted to confirm settings and start crawling.

Usage

Starting the Application

The main entry point is the monitor script, which ensures the crawler remains active:

npm start

Goal Management

Configure your crawling goals in src/settings/settings.js:

  • EMAIL_ADDRESSES: Stop after finding X valid emails.
  • MINUTES: Stop after running for X minutes.
  • LINKS: Stop after crawling X links.
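The three goal types can be pictured as a single check that runs after each crawl step. The sketch below is illustrative only: `isGoalReached` and the `stats` field names (`savedEmailAddresses`, `elapsedMinutes`, `crawledLinks`) are hypothetical and are not taken from the repository; only `GOAL_TYPE`, `GOAL_VALUE`, and the three goal names come from the settings described here.

```javascript
// Hypothetical helper; the real goal logic in the repository may differ.
const GoalType = Object.freeze({
  EMAIL_ADDRESSES: 'EMAIL_ADDRESSES',
  MINUTES: 'MINUTES',
  LINKS: 'LINKS',
});

// Returns true once the configured goal has been met.
// `stats` is an assumed shape used only for this illustration.
function isGoalReached(settings, stats) {
  const { GOAL_TYPE, GOAL_VALUE } = settings;
  switch (GOAL_TYPE) {
    case GoalType.EMAIL_ADDRESSES:
      return stats.savedEmailAddresses >= GOAL_VALUE;
    case GoalType.MINUTES:
      return stats.elapsedMinutes >= GOAL_VALUE;
    case GoalType.LINKS:
      return stats.crawledLinks >= GOAL_VALUE;
    default:
      throw new Error(`Unknown GOAL_TYPE: ${GOAL_TYPE}`);
  }
}

// Example: a 700-minute run that has been active for 5 minutes.
const reached = isGoalReached(
  { GOAL_TYPE: GoalType.MINUTES, GOAL_VALUE: 700 },
  { savedEmailAddresses: 12, elapsedMinutes: 5, crawledLinks: 15 }
);
console.log(reached);
```

Failing loudly on an unknown goal type is a deliberate choice in this sketch, so a typo in the settings file is caught before a long run starts.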

Data Extraction

The application extracts email addresses from the page source of each search results page and of the pages it links to, then runs every candidate through multi-stage validation.
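As a rough picture of the extraction step, a regular expression can pull candidate addresses out of raw page source. This is a minimal sketch under stated assumptions: `extractEmailAddresses` and `EMAIL_PATTERN` are hypothetical names, and the pattern is a deliberately simplified one rather than the expression the crawler actually uses.

```javascript
// Simplified pattern for illustration; it is not a full RFC 5322 matcher
// and is an assumption, not the crawler's real expression.
const EMAIL_PATTERN =
  /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*\.[a-zA-Z]{2,}/g;

// Returns the unique, lowercased candidates found in the page source,
// in first-seen order. Validation is a separate, later step.
function extractEmailAddresses(pageSource) {
  const matches = pageSource.match(EMAIL_PATTERN) || [];
  return [...new Set(matches.map((match) => match.toLowerCase()))];
}

const html =
  '<a href="mailto:Info@Example.com">Info@Example.com</a> sales@shop.co.il';
console.log(extractEmailAddresses(html));
```

The extractor is intentionally permissive: it only collects candidates, and rejecting bad ones is left to the validation stage.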

Configuration

Edit src/settings/settings.js to configure:

Core Settings

  • IS_PRODUCTION_MODE: Use real crawling (true) or test mode (false)
  • GOAL_TYPE: Stop condition - EMAIL_ADDRESSES, MINUTES, or LINKS
  • GOAL_VALUE: Target value for the goal
  • IS_DROP_COLLECTION: Clear database before starting

Search Configuration

  • SEARCH_KEY: Static search term or null for random keys
  • IS_ADVANCE_SEARCH_KEYS: Use advanced Hebrew keys or basic static keys
  • Search engines configured in src/configurations/files/searchEngines.configuration.js
  • Search keys configured in src/configurations/files/searchKeys.configuration.js
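The choice between a static and a random search key, followed by building the search URL, might look like the sketch below. `pickSearchKey` and `buildSearchUrl` are hypothetical names, and the `q` query parameter is an assumption based on the public Bing and Google search URLs; the actual URL templates live in the search engine configuration file and may differ.

```javascript
// Hypothetical helper: use the static SEARCH_KEY when it is set,
// otherwise pick a random key from the configured list.
// `random` is injectable so the choice can be made deterministic in tests.
function pickSearchKey(settings, keys, random = Math.random) {
  if (settings.SEARCH_KEY !== null && settings.SEARCH_KEY !== undefined) {
    return settings.SEARCH_KEY;
  }
  if (keys.length === 0) {
    throw new Error('No search keys configured');
  }
  return keys[Math.floor(random() * keys.length)];
}

// Hypothetical helper: append the key as the `q` query parameter.
// URLSearchParams handles the encoding (spaces become "+").
function buildSearchUrl(baseUrl, searchKey) {
  const url = new URL(baseUrl);
  url.searchParams.set('q', searchKey);
  return url.toString();
}

const key = pickSearchKey({ SEARCH_KEY: null }, ['job developer', 'contact us']);
console.log(buildSearchUrl('https://www.bing.com/search', key));
```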

Filtering

  • Email filters: src/configurations/files/filterEmailAddress.configuration.js
  • Link filters: src/configurations/files/filterLinkDomains.configuration.js
  • File extensions: src/configurations/files/filterFileExtensions.configuration.js
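To make the link filters concrete, here is a minimal sketch of how a domain and file-extension filter could be applied to a crawled link. The lists and the `isLinkAllowed` function are hypothetical examples, not the contents of the configuration files above.

```javascript
// Hypothetical filter lists; the real ones live in the configuration files.
const FILTERED_DOMAINS = ['facebook.com', 'youtube.com'];
const FILTERED_EXTENSIONS = ['.pdf', '.jpg', '.zip'];

// Returns false for malformed URLs, filtered domains (including their
// subdomains), and paths ending in a filtered file extension.
function isLinkAllowed(link) {
  let url;
  try {
    url = new URL(link);
  } catch {
    return false;
  }
  const host = url.hostname.toLowerCase();
  const isFilteredDomain = FILTERED_DOMAINS.some(
    (domain) => host === domain || host.endsWith(`.${domain}`)
  );
  const path = url.pathname.toLowerCase();
  const isFilteredExtension = FILTERED_EXTENSIONS.some((extension) =>
    path.endsWith(extension)
  );
  return !isFilteredDomain && !isFilteredExtension;
}

console.log(isLinkAllowed('https://example.org/contact'));
console.log(isLinkAllowed('https://www.facebook.com/some-page'));
```

Matching on the parsed hostname rather than on the raw string avoids false positives such as a filtered domain name appearing in a query string.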

See INSTRUCTIONS.md for detailed configuration options.

Available Scripts

Main Application

npm start              # Start crawler with monitoring
npm run backup         # Backup the project
npm run domains        # Count email domains from results

Testing Scripts

npm run val            # Validate single email address
npm run valmany        # Validate multiple email addresses
npm run valdebug       # Debug email validation
npm run typos          # Test typo detection and correction
npm run link           # Test link crawling
npm run session        # Test session with predefined links
npm run generator      # Test email address generation
npm run cases          # Run email validation test cases
npm run sand           # General testing sandbox

Project Structure

crawler/
├── src/
│   ├── monitor/              # Application entry point with restart logic
│   ├── scripts/              # Executable scripts
│   │   ├── crawl.script.js   # Main crawling script
│   │   ├── backup.script.js  # Backup script
│   │   └── domains.script.js # Domain counter script
│   ├── logics/               # Business logic orchestration
│   │   └── crawl.logic.js    # Core crawling logic
│   ├── services/             # Service layer
│   │   ├── crawlLink.service.js          # Link crawling
│   │   ├── crawlEmailAddress.service.js  # Email extraction
│   │   ├── emailAddressValidation.service.js # Email validation
│   │   ├── mongoDatabase.service.js      # Database operations
│   │   ├── puppeteer.service.js          # Browser automation
│   │   └── search.service.js             # Search key generation
│   ├── configurations/       # Configuration files
│   │   ├── searchEngines.configuration.js
│   │   ├── searchKeys.configuration.js
│   │   ├── filterEmailAddress.configuration.js
│   │   └── filterLinkDomains.configuration.js
│   ├── settings/             # Application settings
│   │   └── settings.js       # Main settings file
│   ├── core/                 # Core models and enums
│   │   ├── models/           # Data models
│   │   └── enums/            # Enumerations
│   ├── utils/                # Utility functions
│   └── tests/                # Test files
├── dist/                     # Output files (generated)
│   ├── production/           # Production mode outputs
│   └── development/          # Development mode outputs
├── sources/                  # Test sources for development mode
├── INSTRUCTIONS.md           # Detailed setup and usage guide
├── CONTRIBUTING.md           # Contribution guidelines
└── package.json

Architecture

Directory Structure

The project follows a modular structure:

  • src/monitor/: Process management and auto-restart logic.
  • src/scripts/: Top-level execution scripts for different tasks.
  • src/logics/: Orchestration of complex business workflows.
  • src/services/: Atomic business logic (Puppeteer, MongoDB, Search).
  • src/configurations/: Static and dynamic rule sets.
  • src/settings/: Centralized application configuration.
  • src/core/: Shared models, enums, and data structures.

Architecture Principles

  • Separation of Concerns: UI (Monitor) is separated from Business Logic (Logics) and Infrastructure (Services).
  • Goal-Oriented Design: Execution is driven by configurable targets (emails, time, links).
  • Fault Tolerance: Automatic monitoring and restart mechanisms handle runtime failures.
  • Stateless Services: Most services are designed to be stateless, relying on the logic layer for state management.

Design Patterns

  • Monitor Pattern: A wrapper process handles the lifecycle of the main crawler.
  • Service Layer: Decouples business logic from implementation details like Puppeteer or MongoDB.
  • Configuration-Driven Development: Application behavior is primarily controlled through configuration files rather than code changes.
  • Singleton Pattern: Core services like database and loggers are managed as singletons.
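The Monitor Pattern can be illustrated with a small in-process supervisor. This is a simplified sketch, not the project's monitor: the real one most likely supervises a separate process and also reacts to timeouts, whereas the hypothetical `runWithRestarts` below only retries a synchronous task that throws.

```javascript
// Hypothetical supervisor: run `task`, and restart it when it throws,
// up to `maxRestarts` times. The current restart count is passed to the
// task so it can resume or log accordingly.
function runWithRestarts(task, maxRestarts) {
  let restarts = 0;
  for (;;) {
    try {
      return { result: task(restarts), restarts };
    } catch (error) {
      if (restarts >= maxRestarts) {
        throw error; // Give up once the restart budget is spent.
      }
      restarts += 1;
    }
  }
}

// Example: a task that fails twice before succeeding.
let attempts = 0;
const outcome = runWithRestarts(() => {
  attempts += 1;
  if (attempts < 3) {
    throw new Error('simulated timeout');
  }
  return 'done';
}, 5);
console.log(outcome);
```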

How It Works

graph TB
    A[Start Monitor] --> B[Confirm Settings]
    B --> C{MongoDB Connected?}
    C -->|No| D[Exit with Error]
    C -->|Yes| E[Start Crawl Logic]

    E --> F[Generate Search Key]
    F --> G[Build Search Engine URL]
    G --> H[Fetch Search Results with Puppeteer]

    H --> I[Extract Links from Results]
    I --> J[Filter Links]
    J --> K{More Links?}

    K -->|Yes| L[Fetch Page with Puppeteer]
    L --> M[Extract Email Addresses]
    M --> N[Validate Each Email]

    N --> O{Valid Email?}
    O -->|Yes| P[Check if Exists in DB]
    O -->|No| Q{Can Fix Typo?}

    Q -->|Yes| P
    Q -->|No| R[Log as Invalid]

    P --> S{Exists?}
    S -->|No| T[Save to MongoDB]
    S -->|Yes| U[Skip - Already Exists]

    T --> V[Log to TXT File]
    V --> K
    U --> K
    R --> K

    K -->|No| W{Goal Reached?}
    W -->|No| X[Next Process]
    W -->|Yes| Y[End & Log Statistics]

    X --> F

    Y --> Z[Close Puppeteer]
    Z --> AA[Exit Successfully]

    subgraph "Email Validation"
        N --> N1[Check Format]
        N1 --> N2[Check Common Typos]
        N2 --> N3[Validate Domain]
        N3 --> N4[Gibberish Detection]
        N4 --> N5[Final Validation]
    end

    subgraph "Monitoring"
        BB[Monitor Process] --> CC{Timeout?}
        CC -->|Yes| DD[Auto Restart]
        CC -->|No| BB
        DD --> E
    end

Architecture Flow

  1. Monitor Layer: Manages process lifecycle and auto-restart
  2. Crawl Logic: Orchestrates the crawling process
  3. Search Service: Generates search keys and builds search URLs
  4. Crawl Link Service: Fetches and extracts links from search engines
  5. Puppeteer Service: Handles browser automation
  6. Crawl Email Service: Extracts emails from page sources
  7. Email Validation Service: Validates and corrects emails
  8. MongoDB Service: Handles database operations
  9. Log Service: Manages console output and file logging
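Steps 6 to 8 above (extract, validate, store) can be sketched as one loop that classifies each candidate as saved, already existing, or invalid. The sketch makes several assumptions: `processEmailAddresses` is a hypothetical name, a `Set` stands in for the MongoDB collection, and the validator is passed in as a plain function.

```javascript
// Hypothetical pipeline step. `store` is any object with has/add
// (a Set here, standing in for the database); `isValid` is the validator.
function processEmailAddresses(candidates, store, isValid) {
  const stats = { total: 0, save: 0, exists: 0, invalid: 0 };
  for (const candidate of candidates) {
    stats.total += 1;
    const emailAddress = candidate.trim().toLowerCase();
    if (!isValid(emailAddress)) {
      stats.invalid += 1;
      continue;
    }
    if (store.has(emailAddress)) {
      stats.exists += 1; // Duplicate: skip without saving again.
      continue;
    }
    store.add(emailAddress);
    stats.save += 1;
  }
  return stats;
}

const store = new Set(['a@example.com']);
const stats = processEmailAddresses(
  ['A@example.com', 'b@example.com', 'bad', 'b@example.com'],
  store,
  (emailAddress) => emailAddress.includes('@')
);
console.log(stats);
```

The counters mirror the Save, Exists, and Invalid figures shown in the console status line.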

Email Validation Features

The email validation service includes:

  • Format Validation: Checks proper email structure
  • Typo Correction: Automatically fixes common typos (e.g., gmial.com → gmail.com)
  • Domain Validation: Verifies domain endings and structure
  • Gibberish Detection: Filters out randomly generated strings
  • Common Domain Recognition: Special handling for Gmail, Hotmail, etc.
  • Character Validation: Removes invalid characters
  • Length Validation: Enforces min/max length constraints
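A few of the checks above (format, length, and domain typo correction) can be combined in a short sketch. Everything here is an assumption for illustration: the typo table, the length limits, the simplified format pattern, and the `validateEmailAddress` name are not taken from the validation service, and gibberish detection is left out.

```javascript
// Hypothetical typo table and limits; the real service has its own rules.
const DOMAIN_TYPOS = new Map([
  ['gmial.com', 'gmail.com'],
  ['gmai.com', 'gmail.com'],
  ['hotmial.com', 'hotmail.com'],
]);
const MIN_LENGTH = 6;
const MAX_LENGTH = 254;
const FORMAT = /^[a-z0-9._%+-]+@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,}$/;

// Returns the normalized input, the corrected address (or null when no
// correction was applied), and whether the final address is valid.
function validateEmailAddress(input) {
  const original = input.trim().toLowerCase();
  const atIndex = original.lastIndexOf('@');
  if (atIndex < 1) {
    return { original, fixed: null, isValid: false };
  }
  const local = original.slice(0, atIndex);
  const domain = original.slice(atIndex + 1);
  const candidate = `${local}@${DOMAIN_TYPOS.get(domain) || domain}`;
  const isValid =
    candidate.length >= MIN_LENGTH &&
    candidate.length <= MAX_LENGTH &&
    FORMAT.test(candidate);
  const fixed = isValid && candidate !== original ? candidate : null;
  return { original, fixed, isValid };
}

console.log(validateEmailAddress('John@gmial.com'));
console.log(validateEmailAddress('no-at-sign'));
```

Returning the corrected address separately from the original makes it possible to log corrected entries apart from addresses that were valid as found.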

Console Status Example

===IMPORTANT SETTINGS===
SEARCH ENGINES: bing, google
DATABASE: crawl032021
IS_PRODUCTION_MODE: true
IS_DROP_COLLECTION: false
GOAL_TYPE: MINUTES
GOAL_VALUE: 700
========================

===[SETTINGS] Mode: PRODUCTION | Plan: STANDARD | Database: crawl032021 | Active Methods: LINKS,CRAWL===
===[GENERAL] Time: 00.00:05:23 | Goal: MINUTES | Progress: 5/700 (00.71%) | Status: CRAWL | Restarts: 0===
===[PROCESS] Process: 3/10,000 | Page: 1/1 | Engine: Bing | Key: job developer===
===[LINK] Crawl: ✅  15 | Total: 42 | Filter: 27 | Error: 0 | Current: 3/15===
===[EMAIL ADDRESS] Save: ✅  12 | Total: 28 | Database: 15,927 | Exists: 14 | Invalid: ❌  2===

Output Files

All output files are saved in dist/production/YYYYMMDD_HHMMSS/ or dist/development/:

  • valid_email_addresses.txt - Successfully validated emails
  • fix_email_addresses.txt - Emails that were auto-corrected
  • invalid_email_addresses.txt - Invalid emails that couldn't be fixed
  • crawl_links.txt - All crawled page URLs
  • crawl_error_links.txt - URLs that failed to load
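The timestamped directory name can be derived from the start time of a run. The sketch below shows one way to do it; `buildOutputDirectory` is a hypothetical name, and the use of local time rather than UTC is an assumption.

```javascript
// Hypothetical helper: builds dist/production/YYYYMMDD_HHMMSS for
// production runs and dist/development otherwise. Uses local time.
function buildOutputDirectory(isProductionMode, date = new Date()) {
  if (!isProductionMode) {
    return 'dist/development';
  }
  const pad = (value) => String(value).padStart(2, '0');
  const day = `${date.getFullYear()}${pad(date.getMonth() + 1)}${pad(date.getDate())}`;
  const time = `${pad(date.getHours())}${pad(date.getMinutes())}${pad(date.getSeconds())}`;
  return `dist/production/${day}_${time}`;
}

// Example: a run started on 5 March 2021 at 14:07:09 local time.
console.log(buildOutputDirectory(true, new Date(2021, 2, 5, 14, 7, 9)));
```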

Development

Running Tests

# Test email validation
npm run val

# Test link crawling
npm run link

# Test email generation
npm run generator

# Test typo correction
npm run typos

Development Mode

Set IS_PRODUCTION_MODE: false in settings to:

  • Use local HTML sources instead of real requests
  • Test without Puppeteer
  • Avoid rate limiting from search engines
  • Debug faster without network delays

Best Practices

  • Polite Crawling: Respect robots.txt and avoid aggressive crawling frequencies.
  • Goal Setting: Start with smaller goals (e.g., 50 emails) to verify settings before long runs.
  • Database Maintenance: Regularly back up your MongoDB collections using npm run backup.
  • Search Key Diversity: Use advanced search keys to improve discovery rates and avoid search engine pattern detection.
  • Monitor Output: Keep an eye on the console status line to ensure the crawler is making progress.

Contributing

Contributions to this project are released to the public under the project's open source license.

Everyone is welcome to contribute. Contributing doesn't just mean submitting pull requests—there are many different ways to get involved, including answering questions and reporting issues.

See CONTRIBUTING.md for detailed guidelines.

License

This application has an MIT license - see the LICENSE file for details.

Acknowledgments

  • Built for educational and research purposes
  • Respects robots.txt and implements rate limiting
  • Uses user-agent rotation to avoid detection
  • Implements polite crawling practices
