AI News WebCrawler

An intelligent web crawler that aggregates and processes AI-related news from multiple high-quality sources. It monitors RSS feeds, Reddit submissions, and other news APIs to collect relevant AI and technology articles, then scores each article for its potential as video content.

🚀 Features

Multi-source aggregation: Supports RSS feeds, Reddit API, and web scraping
Tiered source organization: Sources organized by reliability and quality (Tier 1-3)
Intelligent scoring: Articles scored based on keywords, source authority, engagement, recency, and title quality
Automated scheduling: Built-in support for scheduled crawling tasks
Data persistence: SQLite database for storing collected articles
Deduplication: Smart duplicate detection using URL normalization and title similarity
Daily reports: Automated HTML and JSON report generation
Configurable architecture: Easy-to-modify configuration system
CLI interface: User-friendly command-line interface with rich output support
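
To make the deduplication feature concrete, here is a minimal sketch of URL normalization combined with title similarity. It is not the project's src/utils/deduplicator.py: the function names, the `url` and `title` article keys, the `utm_` tracking prefix, and the 0.9 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of tracking-parameter prefixes; the project's list may differ.
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def titles_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """Compare case-folded titles with difflib's ratio (the threshold is an assumption)."""
    ratio = SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()
    return ratio >= threshold


def is_duplicate(candidate: dict, seen: list) -> bool:
    """Treat a candidate as a duplicate if its normalized URL or its title matches a seen article."""
    url = normalize_url(candidate["url"])
    return any(
        url == normalize_url(item["url"])
        or titles_similar(candidate["title"], item["title"])
        for item in seen
    )
```

Checking the URL first is cheap and exact; the title comparison catches the same story republished under a different link.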

🛠️ Installation

Prerequisites

Python 3.8 or higher
pip (Python package installer)
Git (for cloning the repository)

Step-by-Step Installation

Step 1: Clone the Repository

git clone <repository-url>
cd AI_WebScrapper

Step 2: Create a Virtual Environment (Recommended)

Windows:

python -m venv venv
venv\Scripts\activate

Linux/Mac:

python3 -m venv venv
source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Or install as a package:

pip install -e .

Step 4: Configure Environment Variables

Copy the example environment file:

# Windows
copy .env.example .env

# Linux/Mac
cp .env.example .env

Edit .env and fill in your configuration:
- Reddit API credentials (required for Reddit sources):
  - Visit https://www.reddit.com/prefs/apps to create a Reddit application
  - Add your REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET
  - Set REDDIT_USER_AGENT to a descriptive name for your app (e.g., "AI_News_Crawler/1.0")
- Optional configurations:
  - DATABASE_PATH: Path to SQLite database (default: data/news_crawler.db)
  - OUTPUT_DIR: Directory for output files (default: output/)
  - LOG_LEVEL: Logging level (default: INFO)
  - REQUEST_TIMEOUT: Request timeout in seconds (default: 30)
  - REQUEST_DELAY: Delay between requests in seconds (default: 1.0)
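
As an illustration of how these optional settings and their defaults fit together, the sketch below reads them from the environment using only the standard library. `load_settings` is a hypothetical helper, not the project's config.py, which may load and validate the values differently.

```python
import os


def load_settings(env=None) -> dict:
    """Read the optional settings, falling back to the defaults listed above."""
    env = os.environ if env is None else env
    return {
        "DATABASE_PATH": env.get("DATABASE_PATH", "data/news_crawler.db"),
        "OUTPUT_DIR": env.get("OUTPUT_DIR", "output/"),
        "LOG_LEVEL": env.get("LOG_LEVEL", "INFO").upper(),
        # Numeric settings arrive as strings and need explicit conversion.
        "REQUEST_TIMEOUT": int(env.get("REQUEST_TIMEOUT", "30")),
        "REQUEST_DELAY": float(env.get("REQUEST_DELAY", "1.0")),
    }
```

Passing a plain dict as `env` makes the defaults easy to check without touching the real environment.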

Step 5: Verify Installation

python src/main.py --list-sources

This should display all configured sources organized by tier.

🚀 Quick Start

Manual Scraping

Scrape a specific tier:

python src/main.py --manual --tier 1

Scrape a specific source:

python src/main.py --manual --source "TechCrunch"

Test Mode (No Database Saving)

Test scraping without saving to database:

python src/main.py --test --tier 1

Automated Scheduling

Start the automated scheduler:

python src/main.py --auto

The scheduler will run continuously and scrape sources at scheduled times:

Tier 1: Every 6 hours (0:00, 6:00, 12:00, 18:00 UTC)
Tier 2: Every 12 hours (8:00, 20:00 UTC)
Tier 3: Twice daily (9:00, 17:00 UTC)
Daily report: Once a day (22:00 UTC)

Press Ctrl+C to stop the scheduler.
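
To see when a given job would next fire under the schedule above, the following sketch computes the next run time from the listed UTC hours. `SCHEDULE_HOURS_UTC` and `next_run` are hypothetical names; the actual scheduler in src/utils/scheduler.py may be built on a scheduling library and work differently.

```python
from datetime import datetime, time, timedelta, timezone

# UTC hours copied from the schedule above; the dict name is hypothetical.
SCHEDULE_HOURS_UTC = {
    "tier_1": (0, 6, 12, 18),
    "tier_2": (8, 20),
    "tier_3": (9, 17),
    "daily_report": (22,),
}


def next_run(job: str, now: datetime) -> datetime:
    """Return the first scheduled time strictly after `now` (which must be timezone-aware)."""
    now = now.astimezone(timezone.utc)
    for day_offset in (0, 1):  # today first, then tomorrow
        day = now.date() + timedelta(days=day_offset)
        for hour in SCHEDULE_HOURS_UTC[job]:
            candidate = datetime.combine(day, time(hour), tzinfo=timezone.utc)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: every job runs at least once a day")
```

For example, `next_run("tier_2", datetime.now(timezone.utc))` returns the upcoming 08:00 or 20:00 UTC slot, whichever comes first.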

Generate Reports

Generate a daily report:

python src/main.py --report --date 2024-11-28

Or generate the report for today:

python src/main.py --report

📖 Usage

CLI Commands

List All Sources

python src/main.py --list-sources

Displays all configured sources organized by tier, showing:

Source name
Source type (RSS, Reddit, API)
Base URL
Whether a scraper is available

Manual Scraping

Scrape by Tier:

python src/main.py --manual --tier 1
python src/main.py --manual --tier 2
python src/main.py --manual --tier 3

Scrape Specific Source:

python src/main.py --manual --source "TechCrunch"
python src/main.py --manual --source "Hacker News"

With Verbose Logging:

python src/main.py --manual --tier 1 --verbose

Test Mode

Test scraping without saving to database:

python src/main.py --test --tier 1
python src/main.py --test --source "The Verge"

Automated Scheduling

Start the automated scheduler:

python src/main.py --auto

The scheduler keeps running until you stop it with Ctrl+C. While it runs, it will:

Scrape sources at scheduled intervals
Save articles to the database
Generate daily reports
Log all activities

Generate Reports

Today's Report:

python src/main.py --report

Specific Date:

python src/main.py --report --date 2024-11-28

Reports are saved as both JSON and HTML files in the output/ directory.

Programmatic Usage

Access Configuration

from config import DATABASE_PATH, OUTPUT_DIR, LOG_LEVEL
from src.config.sources import ALL_SOURCES, TIER_1_SOURCES
from src.config.keywords import HIGH_VALUE_KEYWORDS, MEDIUM_VALUE_KEYWORDS

# View all configured sources
print(f"Total sources configured: {len(ALL_SOURCES)}")
for source in ALL_SOURCES:
    print(f"- {source['name']} (Tier {source['priority']}, Type: {source['type']})")

# Access high-value keywords
print(f"\nHigh-value keywords: {len(HIGH_VALUE_KEYWORDS)} keywords")
print(HIGH_VALUE_KEYWORDS[:5])  # Print first 5

Using Scrapers Directly

from config import DATABASE_PATH
from src.scrapers import TechCrunchScraper
from src.utils.storage import init_database, insert_article
from src.utils.filters import calculate_video_score

# Initialize the database (creates it if it does not exist yet)
init_database(DATABASE_PATH)

# Create scraper instance
scraper = TechCrunchScraper()

# Fetch and parse the source page
html = scraper.get_page(scraper.scrape_url)
soup = scraper.parse_html(html)

# Extract articles
articles = scraper.extract_articles(soup)

# Score each article and save it
for article in articles:
    article['source'] = 'TechCrunch'
    article['score'] = calculate_video_score(article)
    insert_article(DATABASE_PATH, article)

Using the Scheduler

from src.utils.scheduler import ScraperScheduler
from config import DATABASE_PATH, OUTPUT_DIR

# Create scheduler
scheduler = ScraperScheduler(db_path=DATABASE_PATH, output_dir=OUTPUT_DIR)

# Start automated scheduling
scheduler.start()

# ... scheduler runs in background ...

# Stop scheduler
scheduler.stop()

⚙️ Configuration

Environment Variables

Create a .env file in the project root with the following variables:

# Reddit API Configuration (Required for Reddit sources)
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=AI_News_Crawler/1.0

# Database Configuration
DATABASE_PATH=data/news_crawler.db

# Output Configuration
OUTPUT_DIR=output

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=logs/crawler.log

# Request Configuration
REQUEST_TIMEOUT=30
REQUEST_DELAY=1.0
MAX_RETRIES=3

# User Agent
USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36

# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_REQUESTS_PER_MINUTE=60

Source Configuration

Sources are configured in src/config/sources.py. Each source includes:

name: Source name
base_url: Base URL of the source
rss_feeds: List of RSS feed URLs (for RSS sources)
type: Source type (rss, reddit, api)
priority: Tier number (1, 2, or 3)
rate_limit: Rate limiting delay in seconds
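
For orientation, an entry with these fields might look like the sketch below. The values are placeholders rather than a copy of anything in src/config/sources.py, and `validate_source` is a hypothetical helper for catching typos before a new entry is added.

```python
# Illustrative entry only; the name, URLs, and rate limit are placeholders.
EXAMPLE_SOURCE = {
    "name": "Example Tech News",
    "base_url": "https://example.com",
    "rss_feeds": ["https://example.com/feed.xml"],
    "type": "rss",
    "priority": 2,
    "rate_limit": 2.0,
}

REQUIRED_FIELDS = ("name", "base_url", "type", "priority", "rate_limit")
VALID_TYPES = {"rss", "reddit", "api"}


def validate_source(source: dict) -> list:
    """Return a list of problems found in a source entry (empty if it looks valid)."""
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in source]
    if source.get("type") not in VALID_TYPES:
        problems.append(f"unknown type: {source.get('type')!r}")
    if source.get("priority") not in (1, 2, 3):
        problems.append(f"priority must be 1, 2, or 3, got {source.get('priority')!r}")
    if source.get("type") == "rss" and not source.get("rss_feeds"):
        problems.append("rss sources need at least one entry in rss_feeds")
    return problems
```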

Keyword Configuration

Keywords are configured in src/config/keywords.py:

HIGH_VALUE_KEYWORDS: Critical AI/ML topics (60+ keywords)
MEDIUM_VALUE_KEYWORDS: Important tech topics (40+ keywords)

🏗️ Architecture

The project follows a modular architecture:

AI_WebScrapper/
├── src/                        # Main source code
│   ├── config/                 # Configuration modules
│   │   ├── sources.py         # News source configurations
│   │   └── keywords.py        # Keyword lists for filtering
│   ├── scrapers/              # Web scraping modules
│   │   ├── base_scraper.py    # Base scraper class
│   │   ├── techcrunch_scraper.py
│   │   ├── venturebeat_scraper.py
│   │   └── ...                # Other scraper implementations
│   ├── utils/                 # Utility functions
│   │   ├── storage.py         # Database operations
│   │   ├── filters.py         # Scoring and filtering
│   │   ├── deduplicator.py    # Duplicate detection
│   │   ├── scheduler.py       # Automated scheduling
│   │   └── reporter.py        # Report generation
│   └── main.py                # CLI entry point
├── data/                      # Data directory (database files)
├── output/                    # Output directory (reports, JSON)
├── logs/                      # Log files
├── tests/                     # Test suite
├── docs/                      # Documentation
│   ├── ARCHITECTURE.md        # System architecture details
│   ├── ADDING_SOURCES.md      # Guide for adding new sources
│   └── API.md                 # API reference
├── config.py                  # Global configuration
├── requirements.txt           # Python dependencies
├── setup.py                   # Package installation
├── .env.example               # Environment variables template
└── README.md                  # This file

For detailed architecture information, see docs/ARCHITECTURE.md.

📰 News Sources

The crawler is configured with 15 news sources organized into three tiers:

Tier 1: Premium High-Quality Sources (3 sources)

The Verge
TechCrunch
Ars Technica

Tier 2: Reliable Mainstream Sources (5 sources)

Wired
MIT Technology Review
The Guardian - Technology
Reuters - Technology
BBC Technology

Tier 3: Specialized Tech/AI Sources (7 sources)

Hacker News (API)
Reddit - r/MachineLearning
Reddit - r/artificial
Reddit - r/singularity
IEEE Spectrum
VentureBeat AI
ZDNet AI

🔍 Scoring Algorithm

Articles are scored on a 0-10 scale based on:

Keyword Match (3.0 points): Contains high-value AI/ML keywords
Source Authority (2.0 points): Based on source tier
Engagement (2.0 points): Normalized upvotes/comments
Recency (1.5 points): Published within 24/48 hours
Title Quality (1.5 points): Numbers, action words, controversy terms

Articles with scores ≥ 7.5 are considered high-priority.
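
The weights above sum to 10.0, so one way to read the algorithm is as a weighted sum of five components, each normalized to the range 0 to 1. The sketch below shows only that combination step under this assumption. How each component is actually computed is up to `calculate_video_score` in src/utils/filters.py, and the names `WEIGHTS`, `combine_score`, and `is_high_priority` are hypothetical.

```python
# Maximum points per component, taken from the list above (they sum to 10.0).
WEIGHTS = {
    "keyword_match": 3.0,
    "source_authority": 2.0,
    "engagement": 2.0,
    "recency": 1.5,
    "title_quality": 1.5,
}
HIGH_PRIORITY_THRESHOLD = 7.5


def combine_score(components: dict) -> float:
    """Combine component values in [0, 1] into a 0-10 score; missing components count as 0."""
    total = 0.0
    for name, max_points in WEIGHTS.items():
        value = min(max(components.get(name, 0.0), 0.0), 1.0)  # clamp to [0, 1]
        total += value * max_points
    return round(total, 2)


def is_high_priority(score: float) -> bool:
    """Apply the high-priority cutoff stated above."""
    return score >= HIGH_PRIORITY_THRESHOLD
```

Under this reading, an article with a full keyword match, a top-tier source, half the engagement points, and full recency, but no title bonus, scores 3.0 + 2.0 + 1.0 + 1.5 + 0.0 = 7.5 and just reaches high priority.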

For detailed scoring information, see docs/ARCHITECTURE.md.

🐛 Troubleshooting

Common Issues

1. Reddit API Errors

Problem: praw.exceptions.PRAWException or authentication errors

Solution:

Verify your Reddit API credentials in .env
Ensure REDDIT_USER_AGENT is set correctly
Check that your Reddit app has the correct permissions
Visit https://www.reddit.com/prefs/apps to verify your app settings

2. Database Locked Errors

Problem: sqlite3.OperationalError: database is locked

Solution:

Ensure only one instance of the crawler is running
Close any database viewers or tools accessing the database
Wait a few seconds and retry
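
If the error persists when accessing the database from your own scripts, SQLite's busy timeout and write-ahead logging can reduce lock contention. The sketch below uses only the standard `sqlite3` module. `open_connection` is a hypothetical helper, and whether src/utils/storage.py already configures its connections this way is not stated here.

```python
import sqlite3


def open_connection(db_path: str, busy_timeout_s: float = 10.0) -> sqlite3.Connection:
    """Open SQLite so that a blocked connection waits for the lock instead of failing at once."""
    # `timeout` is how long sqlite3 waits on a locked database before raising.
    conn = sqlite3.connect(db_path, timeout=busy_timeout_s)
    # WAL mode lets readers proceed while a single writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn
```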

3. No Articles Extracted

Problem: Scraper runs but extracts 0 articles

Solution:

Check if the website structure has changed (inspect HTML)
Verify the scraper's CSS selectors are still valid
Enable verbose logging: --verbose
Check logs in logs/crawler.log

4. Import Errors

Problem: ModuleNotFoundError or import errors

Solution:

Ensure virtual environment is activated
Reinstall dependencies: pip install -r requirements.txt
Verify Python version: python --version (should be 3.8+)

5. Rate Limiting / 429 Errors

Problem: Too many requests errors

Solution:

Increase REQUEST_DELAY in .env
Set RATE_LIMIT_ENABLED=true in .env
Reduce scraping frequency in scheduler
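
When writing a custom scraper, a common way to handle 429 responses is to retry with exponential backoff and honor a numeric Retry-After header when the server sends one. The sketch below is a generic illustration, not the crawler's built-in behavior. `fetch_with_backoff` and the `send` callable are hypothetical, and the defaults simply mirror MAX_RETRIES=3 and REQUEST_DELAY=1.0 from the example .env.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def fetch_with_backoff(
    send: Callable[[], Response],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `send`, retrying on HTTP 429 with exponential backoff.

    A numeric Retry-After header, when present, overrides the computed delay.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_retries:
            return status, headers, body
        retry_after = headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        sleep(delay)
    raise AssertionError("unreachable")
```

Here `send` wraps whatever HTTP client you use and returns the status code, headers, and body; injecting `sleep` keeps the function easy to test without real waiting.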

Getting Help

Check the logs in logs/crawler.log
Enable verbose logging: --verbose
Review docs/ARCHITECTURE.md for system details
Check docs/API.md for function reference
Open an issue on GitHub with:
- Error message
- Steps to reproduce
- Log file excerpt
- Python version and OS

🤝 Contributing

We welcome contributions! Here's how you can help:

Adding New Sources

See docs/ADDING_SOURCES.md for a complete guide on adding new scrapers.

Reporting Issues

Check if the issue already exists
Create a new issue with:
- Clear description
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Python version)

Submitting Pull Requests

Fork the repository
Create a feature branch: git checkout -b feature/amazing-feature
Make your changes
Add tests if applicable
Commit your changes: git commit -m 'Add amazing feature'
Push to the branch: git push origin feature/amazing-feature
Open a Pull Request

Code Style

Follow PEP 8 style guide
Use type hints where possible
Add docstrings to functions and classes
Write tests for new features

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

BeautifulSoup for HTML parsing
PRAW for Reddit API access
All the news sources for providing RSS feeds and APIs

📚 Additional Documentation

Architecture Overview (docs/ARCHITECTURE.md) - System design and data flow
Adding Sources Guide (docs/ADDING_SOURCES.md) - How to add new scrapers
API Reference (docs/API.md) - Complete function reference
Changelog (CHANGELOG.md) - Version history

Made with ❤️ for AI news enthusiasts
