logimaxx/site_crawler

Repository files navigation

Site Crawler

A Node.js website crawler that finds broken links and performs SEO analysis on each page, generating comprehensive HTML and JSON reports.

Features

  • 🔍 Website Crawling: Automatically discovers and crawls all pages on a website
  • 🔗 Broken Link Detection: Identifies broken links (4xx, 5xx status codes, network errors)
  • 📊 SEO Analysis: Analyzes each page for:
    • Title tags (presence, length)
    • Meta descriptions (presence, length)
    • Heading structure (H1, H2, H3)
    • Image alt text
    • Word count
    • Open Graph tags
    • Language attributes
    • Canonical URLs
    • Viewport meta tags
    • And more...
  • 📄 Report Generation: Creates detailed HTML and JSON reports
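As an illustration, the broken-link rule described above (4xx and 5xx status codes, plus network errors) can be sketched as a small predicate. This is a hypothetical helper, not the crawler's actual code:

```javascript
// Hypothetical sketch of how a link check result might be classified.
// `status` is the HTTP status code, or null when the request failed
// entirely (DNS failure, timeout, connection refused).
function isBrokenLink(status) {
  if (status === null) return true;      // network error counts as broken
  return status >= 400 && status < 600;  // 4xx client and 5xx server errors
}
```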

Installation

npm install

Usage

Command Line

node index.js <url> [options]

Linux Launcher (GUI)

For Linux users with a desktop environment, you can use the provided launcher script:

./crawler.sh

The launcher provides a graphical interface using zenity that:

  • Prompts for the website URL
  • Shows a progress dialog while crawling
  • Saves reports to the reports/ directory with timestamps
  • Optionally opens the generated report in your default browser

Requirements for launcher:

  • zenity (for GUI dialogs)
  • xdg-open (for opening the report)

Note: You may need to adjust the PROJECT_DIR variable in crawler.sh to match your installation path, and make the script executable:

chmod +x crawler.sh

Examples

# Basic usage (https:// is added automatically if missing)
node index.js example.com
node index.js https://example.com

# Limit number of pages to crawl
node index.js example.com --max-pages 50

# Set custom timeout
node index.js example.com --timeout 15000

# Specify output path
node index.js example.com --output ./my-report.html

# Combine options
node index.js example.com --max-pages 200 --timeout 20000 --output ./reports/example-report.html
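The automatic https:// prefixing shown in the first example could be implemented along these lines. `normalizeUrl` is a hypothetical name, not taken from the crawler's source:

```javascript
// Hypothetical sketch of the "https:// is added automatically" behavior.
function normalizeUrl(input) {
  // Prepend https:// unless a scheme is already present.
  return /^https?:\/\//i.test(input) ? input : `https://${input}`;
}
```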

Options

  • --max-pages <number>: Maximum number of pages to crawl (default: 100)
  • --timeout <number>: Request timeout in milliseconds (default: 10000)
  • --output <path>: Output path for report files (default: ./report.html)
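A minimal sketch of how these three options and their documented defaults might be parsed. The parsing is hand-rolled for illustration, and `parseCliOptions` is a hypothetical name:

```javascript
// Hypothetical CLI parsing sketch: first positional argument is the URL,
// and the three documented options fall back to the documented defaults.
function parseCliOptions(argv) {
  const opts = { maxPages: 100, timeout: 10000, output: "./report.html" };
  const positionals = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--max-pages") opts.maxPages = Number(argv[++i]);
    else if (arg === "--timeout") opts.timeout = Number(argv[++i]);
    else if (arg === "--output") opts.output = argv[++i];
    else positionals.push(arg);
  }
  opts.url = positionals[0];
  return opts;
}
```

In a real script this would be called as `parseCliOptions(process.argv.slice(2))`.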

Output

The crawler generates two files:

  1. HTML Report (report.html): A beautiful, interactive HTML report with:

    • Summary statistics
    • List of broken links
    • SEO analysis for each page
    • Detailed issue listings
  2. JSON Report (report.json): Machine-readable JSON data with all crawl results
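One plausible way the JSON report path could be derived from the --output HTML path, swapping the extension. This is an assumption about the implementation, not confirmed by the source:

```javascript
// Hypothetical sketch: the JSON report sits alongside the HTML report,
// with .html replaced by .json.
function jsonReportPath(htmlPath) {
  return htmlPath.replace(/\.html?$/i, "") + ".json";
}
```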

Report Contents

Summary

  • Total pages crawled
  • Pages with errors
  • Total and unique links found
  • Broken links count
  • SEO issues (critical and warnings)
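As a sketch, these summary figures could be computed from per-page crawl results like this. The page and issue shapes below are assumed for illustration and may not match the crawler's actual data model:

```javascript
// Hypothetical summary computation. Each page is assumed to carry the
// links found on it, an error (or null), and a list of SEO issues with
// a severity of "critical" or "warning".
function buildSummary(pages, brokenLinks) {
  const allLinks = pages.flatMap((p) => p.links);
  const countIssues = (severity) =>
    pages.reduce(
      (n, p) => n + p.issues.filter((i) => i.severity === severity).length,
      0
    );
  return {
    totalPages: pages.length,
    pagesWithErrors: pages.filter((p) => p.error).length,
    totalLinks: allLinks.length,
    uniqueLinks: new Set(allLinks).size,
    brokenLinks: brokenLinks.length,
    criticalIssues: countIssues("critical"),
    warnings: countIssues("warning"),
  };
}
```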

Broken Links

  • URL
  • HTTP status code
  • Status text/error message
  • Pages where the link was found

SEO Analysis

For each page:

  • Title tag analysis
  • Meta description analysis
  • Heading structure (H1, H2, H3)
  • Image alt text compliance
  • Word count
  • Language attributes
  • Open Graph tags
  • Issues and recommendations
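A rough sketch of a couple of these checks. The regexes and the 60-character title threshold are illustrative assumptions; a real crawler would use a proper HTML parser:

```javascript
// Hypothetical sketch of title, meta-description, and H1 checks on a
// raw HTML string. Severity labels mirror the report's critical/warning split.
function checkSeo(html) {
  const issues = [];
  const title = (html.match(/<title>([^<]*)<\/title>/i) || [])[1] || "";
  if (!title) {
    issues.push({ severity: "critical", message: "Missing <title> tag" });
  } else if (title.length > 60) {
    issues.push({ severity: "warning", message: "Title longer than 60 characters" });
  }
  if (!/<meta[^>]+name=["']description["']/i.test(html)) {
    issues.push({ severity: "critical", message: "Missing meta description" });
  }
  const h1Count = (html.match(/<h1[\s>]/gi) || []).length;
  if (h1Count !== 1) {
    issues.push({ severity: "warning", message: `Expected exactly one <h1>, found ${h1Count}` });
  }
  return issues;
}
```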

Requirements

  • Node.js 18+ (ES modules support)
  • Internet connection for crawling

For Linux launcher (crawler.sh):

  • zenity (for GUI dialogs)
  • xdg-open (for opening reports in browser)

License

MIT
