Web Crawler

A distributed web crawling system for indexing the internet with a focus on politeness, scalability, and efficient URL management.

Codebase Stats

Metric        Value
Total SLOC    10,239
Source Files  55
.ts           5,959
.md           2,170
.tsx          1,632
.yml          179
.json         152

Features

  • URL Frontier with Priority Queue: Manages URLs to crawl with three priority levels (high, medium, low)
  • Distributed Workers: Multiple crawler workers that can run in parallel
  • Politeness: Respects robots.txt directives and implements per-domain rate limiting
  • Deduplication: Uses Redis sets for efficient URL deduplication (see the sketch after this list)
  • Content Extraction: Parses HTML to extract links, titles, and metadata
  • Real-time Dashboard: Monitor crawl progress, view statistics, and manage the crawler
  • Admin Interface: Add seed URLs, recover stale jobs, and manage system settings
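
For illustration, the Redis-set deduplication mentioned above might look like the following sketch (assuming ioredis; the key name visited:urls and the helper isNewUrl are illustrative, not the project's actual API):

import { createHash } from 'node:crypto';
import Redis from 'ioredis';

const redis = new Redis(); // defaults to localhost:6379

// SADD returns 1 when the member is new and 0 when it already existed,
// so one round trip both records the URL and answers "seen before?".
async function isNewUrl(url: string): Promise<boolean> {
  const hash = createHash('sha256').update(new URL(url).href).digest('hex');
  return (await redis.sadd('visited:urls', hash)) === 1;
}

Hashing the normalized URL keeps set members at a fixed, small size regardless of URL length.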

Architecture

                    +------------------+
                    |   Frontend       |
                    |   (Dashboard)    |
                    +--------+---------+
                             |
                    +--------v---------+
                    |   API Server     |
                    |   (Express.js)   |
                    +--------+---------+
                             |
         +-------------------+-------------------+
         |                   |                   |
+--------v-------+  +--------v-------+  +--------v-------+
| Worker 1       |  | Worker 2       |  | Worker N       |
| - Fetch pages  |  | - Fetch pages  |  | - Fetch pages  |
| - Parse HTML   |  | - Parse HTML   |  | - Parse HTML   |
| - Extract links|  | - Extract links|  | - Extract links|
+----------------+  +----------------+  +----------------+
         |                   |                   |
         +-------------------+-------------------+
                             |
              +-----------------------------+
              |           Data Layer        |
              |  +----------+ +----------+  |
              |  |PostgreSQL| |  Redis   |  |
              |  | Frontier | | Dedup/   |  |
              |  | Metadata | | Cache    |  |
              |  +----------+ +----------+  |
              +-----------------------------+

Tech Stack

  • Frontend: TypeScript, React 19, Vite, TanStack Router, Zustand, Tailwind CSS
  • Backend: Node.js, Express.js, TypeScript
  • Database: PostgreSQL (URL frontier, crawled pages metadata)
  • Cache/Dedup: Redis (visited URLs set, rate limiting, robots.txt cache)
  • Parsing: Cheerio (HTML parsing)
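
For reference, link and title extraction with Cheerio typically looks like this sketch (illustrative only; the project's actual extractor may differ):

import * as cheerio from 'cheerio';

// Parse fetched HTML, returning the page title and absolute link targets.
function extractLinks(html: string, pageUrl: string) {
  const $ = cheerio.load(html);
  const title = $('title').text().trim();
  const links: string[] = [];
  $('a[href]').each((_, el) => {
    const href = $(el).attr('href');
    if (!href) return;
    try {
      links.push(new URL(href, pageUrl).href); // resolve relative URLs
    } catch {
      // skip malformed hrefs
    }
  });
  return { title, links };
}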

Getting Started

Prerequisites

  • Node.js 20+
  • Docker and Docker Compose
  • npm or yarn

Quick Start with Docker

The easiest way to run the entire system:

# Clone and navigate to project
cd web-crawler

# Start all services (PostgreSQL, Redis, API, 3 workers, frontend)
docker-compose up -d

# View logs
docker-compose logs -f

# Access the dashboard
open http://localhost:5173

Local Development Setup

For development, you can run the infrastructure in Docker and the application natively:

1. Start Infrastructure

# Start PostgreSQL and Redis only
docker-compose -f docker-compose.dev.yml up -d

2. Setup Backend

cd backend

# Install dependencies
npm install

# Run database migrations
npm run db:migrate

# Seed initial URLs (optional)
npm run db:seed

# Start API server (port 3001)
npm run dev

3. Start Crawler Workers

Open additional terminals to run workers:

cd backend

# Terminal 2: Start worker 1
npm run dev:worker1

# Terminal 3: Start worker 2
npm run dev:worker2

# Terminal 4: Start worker 3
npm run dev:worker3

4. Setup Frontend

cd frontend

# Install dependencies
npm install

# Start development server (port 5173)
npm run dev

5. Access Dashboard

Open http://localhost:5173 in your browser.

Running Distributed Crawlers

To run multiple crawler instances on different ports/machines:

# Terminal 1: API server
PORT=3001 npm run dev

# Terminal 2-4: Workers with different IDs
WORKER_ID=1 npm run dev:worker
WORKER_ID=2 npm run dev:worker
WORKER_ID=3 npm run dev:worker

For a true distributed deployment, configure all workers to connect to the same PostgreSQL and Redis instances by setting environment variables:

export POSTGRES_HOST=your-db-host
export REDIS_HOST=your-redis-host
export WORKER_ID=unique-worker-id
npm run start:worker

API Documentation

Health Check

GET /health

Returns the health status of the API and connected services.

Statistics

GET /api/stats

Returns comprehensive crawl statistics including:

  • Pages crawled/failed
  • Bytes downloaded
  • Links discovered
  • Frontier status
  • Active workers
  • Recent pages
  • Top domains

GET /api/stats/timeseries?hours=24

Returns time-series data for charts.

URL Frontier

GET /api/frontier/stats

Returns frontier statistics (pending, in-progress, completed, failed counts).

GET /api/frontier/urls?limit=50&status=pending

Returns URLs from the frontier with optional status filter.

POST /api/frontier/add
Content-Type: application/json

{
  "urls": ["https://example.com", "https://another.com"],
  "priority": 2
}

Adds URLs to the frontier. Priority: 1 (low), 2 (medium), 3 (high).
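
For example, from a TypeScript script (assuming the API server from this README is running locally on port 3001):

// POST a batch of URLs to the frontier at medium priority.
const res = await fetch('http://localhost:3001/api/frontier/add', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    urls: ['https://example.com', 'https://another.com'],
    priority: 2, // medium
  }),
});
console.log(await res.json());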

POST /api/frontier/seed
Content-Type: application/json

{
  "urls": ["https://example.com"],
  "priority": 3
}

Adds seed URLs with high priority.

POST /api/frontier/recover?minutes=10

Recovers stale in-progress URLs (useful after worker crashes).
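
Conceptually, recovery returns URLs that have been in progress too long to the pending state. A sketch of the idea with node-postgres (table and column names here are assumptions, not the actual schema):

import { Pool } from 'pg';

const pool = new Pool(); // reads the PG* environment variables

// Reset in-progress URLs claimed more than `minutes` ago back to pending.
async function recoverStale(minutes: number): Promise<number> {
  const result = await pool.query(
    `UPDATE frontier
        SET status = 'pending', claimed_at = NULL
      WHERE status = 'in_progress'
        AND claimed_at < now() - make_interval(mins => $1)`,
    [minutes]
  );
  return result.rowCount ?? 0;
}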

DELETE /api/frontier/clear

Clears the entire frontier (use with caution).

Crawled Pages

GET /api/pages?limit=50&offset=0&domain=example.com&search=keyword

Returns crawled pages with optional filtering.

GET /api/pages/:urlHash

Returns details for a specific crawled page.

GET /api/pages/domain/:domain?limit=50

Returns pages for a specific domain.

Domains

GET /api/domains?limit=50&offset=0&sortBy=page_count&order=desc

Returns all crawled domains.

GET /api/domains/:domain

Returns details for a specific domain.

GET /api/domains/:domain/robots

Returns the cached robots.txt for a domain.

POST /api/domains/:domain/refresh-robots

Forces a refresh of the robots.txt cache.

PUT /api/domains/:domain/settings
Content-Type: application/json

{
  "crawlDelay": 2.0,
  "isAllowed": true
}

Updates domain-specific settings.
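
For example, to slow crawling of a single domain from a script (again assuming a local API server):

// Ask the crawler to wait 2 seconds between requests to example.com.
await fetch('http://localhost:3001/api/domains/example.com/settings', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ crawlDelay: 2.0, isAllowed: true }),
});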

Configuration

Environment Variables

Backend/API

Variable           Default      Description
PORT               3001         API server port
NODE_ENV           development  Environment mode
POSTGRES_HOST      localhost    PostgreSQL host
POSTGRES_PORT      5432         PostgreSQL port
POSTGRES_DB        webcrawler   Database name
POSTGRES_USER      postgres     Database user
POSTGRES_PASSWORD  postgres     Database password
REDIS_HOST         localhost    Redis host
REDIS_PORT         6379         Redis port

Crawler Workers

Variable            Default            Description
WORKER_ID           1                  Unique worker identifier
CRAWLER_USER_AGENT  WebCrawlerBot/1.0  User agent string
CRAWLER_DELAY       1000               Default delay between requests (ms)
MAX_CONCURRENT      10                 Max concurrent requests per worker
REQUEST_TIMEOUT     30000              Request timeout (ms)
MAX_PAGE_SIZE       10485760           Max page size to download (10 MB)
ROBOTS_CACHE_TTL    3600               robots.txt cache TTL (seconds)
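
These variables map onto the worker configuration roughly as follows (an illustrative loader; field names are assumptions, see src/config.ts for the real shape):

// Defaults mirror the table above.
export const workerConfig = {
  workerId: process.env.WORKER_ID ?? '1',
  userAgent: process.env.CRAWLER_USER_AGENT ?? 'WebCrawlerBot/1.0',
  crawlDelayMs: Number(process.env.CRAWLER_DELAY ?? 1000),
  maxConcurrent: Number(process.env.MAX_CONCURRENT ?? 10),
  requestTimeoutMs: Number(process.env.REQUEST_TIMEOUT ?? 30_000),
  maxPageSizeBytes: Number(process.env.MAX_PAGE_SIZE ?? 10_485_760),
  robotsCacheTtlSec: Number(process.env.ROBOTS_CACHE_TTL ?? 3600),
};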

Project Structure

web-crawler/
├── backend/
│   ├── src/
│   │   ├── config.ts           # Configuration
│   │   ├── server.ts           # API server entry point
│   │   ├── worker.ts           # Crawler worker entry point
│   │   ├── models/
│   │   │   ├── database.ts     # PostgreSQL setup
│   │   │   └── redis.ts        # Redis setup
│   │   ├── routes/
│   │   │   ├── frontier.ts     # Frontier API routes
│   │   │   ├── stats.ts        # Statistics API routes
│   │   │   ├── pages.ts        # Pages API routes
│   │   │   └── domains.ts      # Domains API routes
│   │   ├── services/
│   │   │   ├── frontier.ts     # URL Frontier service
│   │   │   ├── crawler.ts      # Crawler worker service
│   │   │   ├── robots.ts       # Robots.txt service
│   │   │   └── stats.ts        # Statistics service
│   │   ├── utils/
│   │   │   └── url.ts          # URL utilities
│   │   └── scripts/
│   │       ├── migrate.ts      # Database migration
│   │       └── seed.ts         # Seed URLs
│   ├── package.json
│   ├── tsconfig.json
│   ├── Dockerfile
│   └── Dockerfile.worker
├── frontend/
│   ├── src/
│   │   ├── main.tsx            # Entry point
│   │   ├── App.tsx             # Main app component
│   │   ├── router.tsx          # TanStack Router setup
│   │   ├── components/         # Reusable components
│   │   ├── routes/             # Page components
│   │   ├── stores/             # Zustand stores
│   │   ├── services/           # API client
│   │   └── types/              # TypeScript types
│   ├── package.json
│   ├── vite.config.ts
│   ├── tailwind.config.js
│   ├── Dockerfile
│   └── nginx.conf
├── docker-compose.yml          # Full stack deployment
├── docker-compose.dev.yml      # Development (infra only)
├── architecture.md             # System design docs
├── claude.md                   # Development notes
└── README.md                   # This file

Politeness Features

  1. robots.txt Compliance: Fetches and caches robots.txt for each domain and respects Disallow directives
  2. Crawl-Delay: Honors the Crawl-delay directive from robots.txt
  3. Per-Domain Rate Limiting: Uses Redis locks to ensure only one in-flight request per domain at a time (see the sketch after this list)
  4. User-Agent: Sends a proper User-Agent string identifying the crawler
  5. Request Timeout: Limits time spent on slow servers
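
A minimal sketch of the per-domain lock idea with ioredis (the key scheme and TTL are illustrative, not the project's exact implementation):

import Redis from 'ioredis';

const redis = new Redis();

// SET ... NX PX acquires the lock only if no other worker holds it; the
// TTL guarantees it expires even if the holder crashes mid-request.
async function tryLockDomain(domain: string, ttlMs = 30_000): Promise<boolean> {
  const ok = await redis.set(`lock:domain:${domain}`, '1', 'PX', ttlMs, 'NX');
  return ok === 'OK';
}

async function unlockDomain(domain: string): Promise<void> {
  await redis.del(`lock:domain:${domain}`);
}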

Testing

# Backend
cd backend
npm run lint
npm run type-check

# Frontend
cd frontend
npm run lint
npm run type-check

Future Enhancements

  • Content storage (S3/local filesystem)
  • Near-duplicate detection (SimHash)
  • Distributed work stealing for load balancing
  • Prometheus metrics integration
  • Sitemap.xml parsing
  • JavaScript rendering (Puppeteer integration)
  • API authentication and rate limiting

License

MIT
