An intelligent web crawler designed to aggregate and process AI-related news from multiple high-quality sources. The crawler monitors RSS feeds, Reddit submissions, and other news APIs to collect relevant AI and technology news articles, scoring them for video content potential.
- Multi-source aggregation: Supports RSS feeds, Reddit API, and web scraping
- Tiered source organization: Sources organized by reliability and quality (Tier 1-3)
- Intelligent scoring: Articles scored based on keywords, source authority, engagement, recency, and title quality
- Automated scheduling: Built-in support for scheduled crawling tasks
- Data persistence: SQLite database for storing collected articles
- Deduplication: Smart duplicate detection using URL normalization and title similarity
- Daily reports: Automated HTML and JSON report generation
- Configurable architecture: Easy-to-modify configuration system
- CLI interface: User-friendly command-line interface with rich output support
- Python 3.8 or higher
- pip (Python package installer)
- Git (for cloning the repository)
git clone <repository-url>
cd AI_WebScrapper

Windows:
python -m venv venv
venv\Scripts\activate

Linux/Mac:
python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt

Or install as a package:
pip install -e .
- Copy the example environment file:
  # Windows
  copy .env.example .env
  # Linux/Mac
  cp .env.example .env
- Edit .env and fill in your configuration:
  - Reddit API credentials (required for Reddit sources):
    - Visit https://www.reddit.com/prefs/apps to create a Reddit application
    - Add your REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET
    - Set REDDIT_USER_AGENT to a descriptive name for your app (e.g., "AI_News_Crawler/1.0")
  - Optional configurations:
    - DATABASE_PATH: Path to SQLite database (default: data/news_crawler.db)
    - OUTPUT_DIR: Directory for output files (default: output/)
    - LOG_LEVEL: Logging level (default: INFO)
    - REQUEST_TIMEOUT: Request timeout in seconds (default: 30)
    - REQUEST_DELAY: Delay between requests in seconds (default: 1.0)
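The optional settings and their defaults could be read from the environment roughly as follows. This is a minimal sketch using only the standard library. The CrawlerSettings dataclass and the load_settings helper are hypothetical names used for illustration; the project's own config.py may load .env differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlerSettings:
    """Hypothetical container for the optional settings listed above."""
    database_path: str
    output_dir: str
    log_level: str
    request_timeout: int
    request_delay: float


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    return CrawlerSettings(
        database_path=env.get("DATABASE_PATH", "data/news_crawler.db"),
        output_dir=env.get("OUTPUT_DIR", "output/"),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
        request_delay=float(env.get("REQUEST_DELAY", "1.0")),
    )
```

Passing a plain dict instead of os.environ makes the defaults easy to check without touching the real environment.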
python src/main.py --list-sources

This should display all configured sources organized by tier.
Scrape a specific tier:
python src/main.py --manual --tier 1

Scrape a specific source:
python src/main.py --manual --source "TechCrunch"

Test scraping without saving to database:
python src/main.py --test --tier 1

Start the automated scheduler:
python src/main.py --auto

The scheduler will run continuously and scrape sources at scheduled times:
- Tier 1: Every 6 hours (0:00, 6:00, 12:00, 18:00 UTC)
- Tier 2: Every 12 hours (8:00, 20:00 UTC)
- Tier 3: Twice daily (9:00, 17:00 UTC)
- Daily report: Once a day (22:00 UTC)
Press Ctrl+C to stop the scheduler.
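To see when a given job would fire next, the schedule above can be modeled as a table of UTC hours. This is a minimal sketch for illustration only. The SCHEDULE_UTC table and the next_run helper are hypothetical and do not reflect how src/utils/scheduler.py is implemented.

```python
from datetime import datetime, timedelta, timezone

# Run hours in UTC, copied from the schedule above.
SCHEDULE_UTC = {
    "tier_1": (0, 6, 12, 18),
    "tier_2": (8, 20),
    "tier_3": (9, 17),
    "daily_report": (22,),
}


def next_run(job, now):
    """Return the first scheduled time for `job` strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Check today's slots, then tomorrow's, and keep the earliest one still ahead.
    candidates = [
        midnight + timedelta(days=day, hours=hour)
        for day in (0, 1)
        for hour in SCHEDULE_UTC[job]
    ]
    return min(t for t in candidates if t > now)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for job in SCHEDULE_UTC:
        print(f"{job}: next run at {next_run(job, now):%Y-%m-%d %H:%M} UTC")
```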
Generate a daily report:
python src/main.py --report --date 2024-11-28

Or generate a report for today:
python src/main.py --report

python src/main.py --list-sources

Displays all configured sources organized by tier, showing:
- Source name
- Source type (RSS, Reddit, API)
- Base URL
- Whether a scraper is available
Scrape by Tier:
python src/main.py --manual --tier 1
python src/main.py --manual --tier 2
python src/main.py --manual --tier 3

Scrape Specific Source:
python src/main.py --manual --source "TechCrunch"
python src/main.py --manual --source "Hacker News"

With Verbose Logging:
python src/main.py --manual --tier 1 --verbose

Test scraping without saving to database:
python src/main.py --test --tier 1
python src/main.py --test --source "The Verge"

Start automated scheduler:
python src/main.py --auto

The scheduler runs in the background and will:
- Scrape sources at scheduled intervals
- Save articles to the database
- Generate daily reports
- Log all activities
Today's Report:
python src/main.py --report

Specific Date:
python src/main.py --report --date 2024-11-28

Reports are saved as both JSON and HTML files in the output/ directory.
from config import DATABASE_PATH, OUTPUT_DIR, LOG_LEVEL
from src.config.sources import ALL_SOURCES, TIER_1_SOURCES
from src.config.keywords import HIGH_VALUE_KEYWORDS, MEDIUM_VALUE_KEYWORDS

# View all configured sources
print(f"Total sources configured: {len(ALL_SOURCES)}")
for source in ALL_SOURCES:
    print(f"- {source['name']} (Tier {source['priority']}, Type: {source['type']})")

# Access high-value keywords
print(f"\nHigh-value keywords: {len(HIGH_VALUE_KEYWORDS)} keywords")
print(HIGH_VALUE_KEYWORDS[:5])  # Print first 5

from src.scrapers import TechCrunchScraper
from src.utils.storage import init_database, insert_article
from src.utils.filters import calculate_video_score

# Initialize database
init_database("data/news_crawler.db")

# Create scraper instance
scraper = TechCrunchScraper()

# Fetch and parse
html = scraper.get_page(scraper.scrape_url)
soup = scraper.parse_html(html)

# Extract articles
articles = scraper.extract_articles(soup)

# Process and save: tag each article with its source, score it, then store it
for article in articles:
    article['source'] = 'TechCrunch'
    score = calculate_video_score(article)
    article['score'] = score
    insert_article("data/news_crawler.db", article)

from src.utils.scheduler import ScraperScheduler
from config import DATABASE_PATH, OUTPUT_DIR

# Create scheduler
scheduler = ScraperScheduler(db_path=DATABASE_PATH, output_dir=OUTPUT_DIR)

# Start automated scheduling
scheduler.start()

# ... scheduler runs in background ...

# Stop scheduler
scheduler.stop()

Create a .env file in the project root with the following variables:
# Reddit API Configuration (Required for Reddit sources)
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=AI_News_Crawler/1.0
# Database Configuration
DATABASE_PATH=data/news_crawler.db
# Output Configuration
OUTPUT_DIR=output
# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=logs/crawler.log
# Request Configuration
REQUEST_TIMEOUT=30
REQUEST_DELAY=1.0
MAX_RETRIES=3
# User Agent
USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_REQUESTS_PER_MINUTE=60

Sources are configured in src/config/sources.py. Each source includes:
- name: Source name
- base_url: Base URL of the source
- rss_feeds: List of RSS feed URLs (for RSS sources)
- type: Source type (rss, reddit, api)
- priority: Tier number (1, 2, or 3)
- rate_limit: Rate limiting delay in seconds
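As an illustration of the source fields described above, an entry might look like the dict below. The field names come from this README, but every value is a placeholder rather than a copy from src/config/sources.py. The validate_source helper, and the choice of which fields to treat as required, are assumptions made for this sketch.

```python
# Hypothetical entry: field names follow the README, values are placeholders.
EXAMPLE_SOURCE = {
    "name": "Example Tech Blog",
    "base_url": "https://example.com",
    "rss_feeds": ["https://example.com/feed.xml"],
    "type": "rss",
    "priority": 1,
    "rate_limit": 2.0,
}

# Assumption: these fields are always present; rss_feeds only for RSS sources.
REQUIRED_FIELDS = ("name", "base_url", "type", "priority", "rate_limit")
VALID_TYPES = ("rss", "reddit", "api")


def validate_source(source):
    """Return a list of problems found in a source dict (empty if it looks valid)."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in source]
    if source.get("type") not in VALID_TYPES:
        problems.append(f"unknown type: {source.get('type')!r}")
    if source.get("priority") not in (1, 2, 3):
        problems.append(f"priority must be 1, 2, or 3, got {source.get('priority')!r}")
    if source.get("type") == "rss" and not source.get("rss_feeds"):
        problems.append("rss sources need at least one entry in rss_feeds")
    return problems
```

A check like this could run over ALL_SOURCES at startup to catch a mistyped tier or a missing feed URL before any request is made.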
Keywords are configured in src/config/keywords.py:
- HIGH_VALUE_KEYWORDS: Critical AI/ML topics (60+ keywords)
- MEDIUM_VALUE_KEYWORDS: Important tech topics (40+ keywords)
The project follows a modular architecture:
AI_WebScrapper/
├── src/ # Main source code
│ ├── config/ # Configuration modules
│ │ ├── sources.py # News source configurations
│ │ └── keywords.py # Keyword lists for filtering
│ ├── scrapers/ # Web scraping modules
│ │ ├── base_scraper.py # Base scraper class
│ │ ├── techcrunch_scraper.py
│ │ ├── venturebeat_scraper.py
│ │ └── ... # Other scraper implementations
│ ├── utils/ # Utility functions
│ │ ├── storage.py # Database operations
│ │ ├── filters.py # Scoring and filtering
│ │ ├── deduplicator.py # Duplicate detection
│ │ ├── scheduler.py # Automated scheduling
│ │ └── reporter.py # Report generation
│ └── main.py # CLI entry point
├── data/ # Data directory (database files)
├── output/ # Output directory (reports, JSON)
├── logs/ # Log files
├── tests/ # Test suite
├── docs/ # Documentation
│ ├── ARCHITECTURE.md # System architecture details
│ ├── ADDING_SOURCES.md # Guide for adding new sources
│ └── API.md # API reference
├── config.py # Global configuration
├── requirements.txt # Python dependencies
├── setup.py # Package installation
├── .env.example # Environment variables template
└── README.md # This file
For detailed architecture information, see docs/ARCHITECTURE.md.
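The tree lists deduplicator.py for duplicate detection, which this README describes as URL normalization plus title similarity. The sketch below shows one way those two checks could work using only the standard library. The function names, the list of tracking parameters, the 0.9 threshold, and the "url" and "title" article keys are all assumptions, not the project's actual implementation.

```python
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed tracking parameters to drop; extend as needed.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid", "ref"}


def normalize_url(url):
    """Lowercase scheme and host, drop 'www.', tracking params, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_PARAMS and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(sorted(query)), ""))


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two titles, ignoring case and extra whitespace."""
    def clean(s):
        return " ".join(s.lower().split())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


def is_duplicate(article, seen, threshold=0.9):
    """True if `article` matches any item in `seen` by normalized URL or near-identical title."""
    url = normalize_url(article["url"])
    return any(
        url == normalize_url(other["url"])
        or title_similarity(article["title"], other["title"]) >= threshold
        for other in seen
    )
```

Comparing every new article against every stored one is quadratic, so a real implementation would likely index normalized URLs and only fall back to title comparison for recent articles.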
The crawler is configured with 15 news sources organized into three tiers:
- The Verge
- TechCrunch
- Ars Technica
- Wired
- MIT Technology Review
- The Guardian - Technology
- Reuters - Technology
- BBC Technology
- Hacker News (API)
- Reddit - r/MachineLearning
- Reddit - r/artificial
- Reddit - r/singularity
- IEEE Spectrum
- VentureBeat AI
- ZDNet AI
Articles are scored on a 0-10 scale based on:
- Keyword Match (3.0 points): Contains high-value AI/ML keywords
- Source Authority (2.0 points): Based on source tier
- Engagement (2.0 points): Normalized upvotes/comments
- Recency (1.5 points): Published within 24/48 hours
- Title Quality (1.5 points): Numbers, action words, controversy terms
Articles with scores ≥ 7.5 are considered high-priority.
For detailed scoring information, see docs/ARCHITECTURE.md.
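The weighting above can be sketched as a sum of capped components. The maximum points per component and the 7.5 threshold come from this README. How each component's fraction is derived is not specified here, so the recency rule below (full credit within 24 hours, half within 48) and the combine_score helper are assumptions and not the logic of calculate_video_score.

```python
from datetime import datetime, timedelta, timezone

# Maximum points per component, taken from the list above (they sum to 10.0).
WEIGHTS = {
    "keyword": 3.0,
    "authority": 2.0,
    "engagement": 2.0,
    "recency": 1.5,
    "title": 1.5,
}
HIGH_PRIORITY_THRESHOLD = 7.5


def recency_fraction(published, now):
    """Assumed rule: full credit within 24 hours, half credit within 48 hours, else none."""
    age = now - published
    if age <= timedelta(hours=24):
        return 1.0
    if age <= timedelta(hours=48):
        return 0.5
    return 0.0


def combine_score(fractions):
    """Weight per-component fractions (each clamped to [0, 1]) into a 0-10 score."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        fraction = min(max(fractions.get(name, 0.0), 0.0), 1.0)
        total += weight * fraction
    return round(total, 2)


def is_high_priority(score):
    """True when the score meets the high-priority threshold from the README."""
    return score >= HIGH_PRIORITY_THRESHOLD
```

For example, an article with full keyword, authority, and recency credit but only half credit for engagement and title quality scores 3.0 + 2.0 + 1.0 + 1.5 + 0.75 = 8.25, which clears the high-priority threshold.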
Problem: praw.exceptions.PRAWException or authentication errors
Solution:
- Verify your Reddit API credentials in .env
- Ensure REDDIT_USER_AGENT is set correctly
- Check that your Reddit app has the correct permissions
- Visit https://www.reddit.com/prefs/apps to verify your app settings
Problem: sqlite3.OperationalError: database is locked
Solution:
- Ensure only one instance of the crawler is running
- Close any database viewers or tools accessing the database
- Wait a few seconds and retry
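The "wait and retry" step can also be handled at the connection level, because Python's sqlite3 module accepts a timeout that controls how long it waits for a lock to clear before raising the error. The sketch below is a hedged illustration. The open_connection helper is hypothetical, and whether src/utils/storage.py already sets a timeout is not stated in this README.

```python
import sqlite3


def open_connection(db_path, busy_timeout_s=10.0):
    """Open a SQLite connection that waits up to `busy_timeout_s` seconds for a lock to clear."""
    # `timeout` is how long sqlite3 waits on a locked database before raising
    # sqlite3.OperationalError ("database is locked"). The module default is 5 seconds.
    return sqlite3.connect(db_path, timeout=busy_timeout_s)
```

A longer timeout only helps with brief contention. If a second crawler instance or a database viewer holds the lock for long periods, closing it remains the fix.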
Problem: Scraper runs but extracts 0 articles
Solution:
- Check if the website structure has changed (inspect HTML)
- Verify the scraper's CSS selectors are still valid
- Enable verbose logging: --verbose
- Check logs in logs/crawler.log
Problem: ModuleNotFoundError or import errors
Solution:
- Ensure virtual environment is activated
- Reinstall dependencies: pip install -r requirements.txt
- Verify Python version: python --version (should be 3.8+)
Problem: Too many requests errors
Solution:
- Increase REQUEST_DELAY in .env
- Enable RATE_LIMIT_ENABLED=true
- Reduce scraping frequency in scheduler

- Check the logs in logs/crawler.log
- Enable verbose logging: --verbose
- Review docs/ARCHITECTURE.md for system details
- Check docs/API.md for function reference
- Open an issue on GitHub with:
  - Error message
  - Steps to reproduce
  - Log file excerpt
  - Python version and OS
We welcome contributions! Here's how you can help:
See docs/ADDING_SOURCES.md for a complete guide on adding new scrapers.
- Check if the issue already exists
- Create a new issue with:
  - Clear description
  - Steps to reproduce
  - Expected vs actual behavior
  - Environment details (OS, Python version)
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes
- Add tests if applicable
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
- Follow PEP 8 style guide
- Use type hints where possible
- Add docstrings to functions and classes
- Write tests for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- BeautifulSoup for HTML parsing
- PRAW for Reddit API access
- All the news sources for providing RSS feeds and APIs
- Architecture Overview - System design and data flow
- Adding Sources Guide - How to add new scrapers
- API Reference - Complete function reference
- Changelog - Version history
Made with ❤️ for AI news enthusiasts