logimaxx/site_crawler

Repository files navigation

Site Crawler

A Node.js website crawler that finds broken links and performs SEO analysis on each page, generating comprehensive HTML and JSON reports.

Features

  • 🔍 Website Crawling: Automatically discovers and crawls all pages on a website
  • 🔗 Broken Link Detection: Identifies broken links (4xx, 5xx status codes, network errors)
  • 📊 SEO Analysis: Analyzes each page for:
    • Title tags (presence, length)
    • Meta descriptions (presence, length)
    • Heading structure (H1, H2, H3)
    • Image alt text
    • Word count
    • Open Graph tags
    • Language attributes
    • Canonical URLs
    • Viewport meta tags
    • And more...
  • 📄 Report Generation: Creates detailed HTML and JSON reports
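As an illustration, the broken-link rule described above (4xx and 5xx status codes, plus network errors) can be sketched as a small predicate. This is a hypothetical helper, not the crawler's actual code:

```javascript
// Hypothetical sketch of how a link check result might be classified.
// `status` is the HTTP status code, or null when the request failed
// entirely (DNS failure, timeout, connection refused).
function isBrokenLink(status) {
  if (status === null) return true;      // network error counts as broken
  return status >= 400 && status < 600;  // 4xx client and 5xx server errors
}
```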

Installation

npm install

Usage

Command Line

node index.js <url> [options]

Linux Launcher (GUI)

For Linux users with a desktop environment, you can use the provided launcher script:

./crawler.sh

The launcher provides a graphical interface using zenity that:

  • Prompts for the website URL
  • Shows a progress dialog while crawling
  • Saves reports to the reports/ directory with timestamps
  • Optionally opens the generated report in your default browser

Requirements for launcher:

  • zenity (for GUI dialogs)
  • xdg-open (for opening the report)

Note: You may need to adjust the PROJECT_DIR variable in crawler.sh to match your installation path, and make the script executable:

chmod +x crawler.sh

Examples

# Basic usage (https:// is added automatically if missing)
node index.js example.com
node index.js https://example.com

# Limit number of pages to crawl
node index.js example.com --max-pages 50

# Set custom timeout
node index.js example.com --timeout 15000

# Specify output path
node index.js example.com --output ./my-report.html

# Combine options
node index.js example.com --max-pages 200 --timeout 20000 --output ./reports/example-report.html
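The automatic https:// prefixing shown in the first example could be implemented along these lines. `normalizeUrl` is a hypothetical name, not taken from the crawler's source:

```javascript
// Hypothetical sketch of the "https:// is added automatically" behavior.
function normalizeUrl(input) {
  // Prepend https:// unless a scheme is already present.
  return /^https?:\/\//i.test(input) ? input : `https://${input}`;
}
```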

Options

  • --max-pages <number>: Maximum number of pages to crawl (default: 100)
  • --timeout <number>: Request timeout in milliseconds (default: 10000)
  • --output <path>: Output path for report files (default: ./report.html)
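A minimal sketch of how these three options and their documented defaults might be parsed. The parsing is hand-rolled for illustration, and `parseCliOptions` is a hypothetical name:

```javascript
// Hypothetical CLI parsing sketch: first positional argument is the URL,
// and the three documented options fall back to the documented defaults.
function parseCliOptions(argv) {
  const opts = { maxPages: 100, timeout: 10000, output: "./report.html" };
  const positionals = [];
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === "--max-pages") opts.maxPages = Number(argv[++i]);
    else if (arg === "--timeout") opts.timeout = Number(argv[++i]);
    else if (arg === "--output") opts.output = argv[++i];
    else positionals.push(arg);
  }
  opts.url = positionals[0];
  return opts;
}
```

In a real script this would be called as `parseCliOptions(process.argv.slice(2))`.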

Output

The crawler generates two files:

  1. HTML Report (report.html): A beautiful, interactive HTML report with:

    • Summary statistics
    • List of broken links
    • SEO analysis for each page
    • Detailed issue listings
  2. JSON Report (report.json): Machine-readable JSON data with all crawl results
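One plausible way the JSON report path could be derived from the --output HTML path, swapping the extension. This is an assumption about the implementation, not confirmed by the source:

```javascript
// Hypothetical sketch: the JSON report sits alongside the HTML report,
// with .html replaced by .json.
function jsonReportPath(htmlPath) {
  return htmlPath.replace(/\.html?$/i, "") + ".json";
}
```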

Report Contents

Summary

  • Total pages crawled
  • Pages with errors
  • Total and unique links found
  • Broken links count
  • SEO issues (critical and warnings)
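As a sketch, these summary figures could be computed from per-page crawl results like this. The page and issue shapes below are assumed for illustration and may not match the crawler's actual data model:

```javascript
// Hypothetical summary computation. Each page is assumed to carry the
// links found on it, an error (or null), and a list of SEO issues with
// a severity of "critical" or "warning".
function buildSummary(pages, brokenLinks) {
  const allLinks = pages.flatMap((p) => p.links);
  const countIssues = (severity) =>
    pages.reduce(
      (n, p) => n + p.issues.filter((i) => i.severity === severity).length,
      0
    );
  return {
    totalPages: pages.length,
    pagesWithErrors: pages.filter((p) => p.error).length,
    totalLinks: allLinks.length,
    uniqueLinks: new Set(allLinks).size,
    brokenLinks: brokenLinks.length,
    criticalIssues: countIssues("critical"),
    warnings: countIssues("warning"),
  };
}
```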

Broken Links

  • URL
  • HTTP status code
  • Status text/error message
  • Pages where the link was found

SEO Analysis

For each page:

  • Title tag analysis
  • Meta description analysis
  • Heading structure (H1, H2, H3)
  • Image alt text compliance
  • Word count
  • Language attributes
  • Open Graph tags
  • Issues and recommendations
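A rough sketch of a couple of these checks. The regexes and the 60-character title threshold are illustrative assumptions; a real crawler would use a proper HTML parser:

```javascript
// Hypothetical sketch of title, meta-description, and H1 checks on a
// raw HTML string. Severity labels mirror the report's critical/warning split.
function checkSeo(html) {
  const issues = [];
  const title = (html.match(/<title>([^<]*)<\/title>/i) || [])[1] || "";
  if (!title) {
    issues.push({ severity: "critical", message: "Missing <title> tag" });
  } else if (title.length > 60) {
    issues.push({ severity: "warning", message: "Title longer than 60 characters" });
  }
  if (!/<meta[^>]+name=["']description["']/i.test(html)) {
    issues.push({ severity: "critical", message: "Missing meta description" });
  }
  const h1Count = (html.match(/<h1[\s>]/gi) || []).length;
  if (h1Count !== 1) {
    issues.push({ severity: "warning", message: `Expected exactly one <h1>, found ${h1Count}` });
  }
  return issues;
}
```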

Requirements

  • Node.js 18+ (ES modules support)
  • Internet connection for crawling

For Linux launcher (crawler.sh):

  • zenity (for GUI dialogs)
  • xdg-open (for opening reports in browser)

License

MIT
