flexsw917/AI_webscrapper

AI News WebCrawler

An intelligent web crawler that aggregates and processes AI-related news from multiple high-quality sources. The crawler monitors RSS feeds, Reddit submissions, and news APIs to collect relevant AI and technology articles, and scores each article for its potential as video content.

🚀 Features

  • Multi-source aggregation: Supports RSS feeds, Reddit API, and web scraping
  • Tiered source organization: Sources organized by reliability and quality (Tier 1-3)
  • Intelligent scoring: Articles scored based on keywords, source authority, engagement, recency, and title quality
  • Automated scheduling: Built-in support for scheduled crawling tasks
  • Data persistence: SQLite database for storing collected articles
  • Deduplication: Smart duplicate detection using URL normalization and title similarity
  • Daily reports: Automated HTML and JSON report generation
  • Configurable architecture: Easy-to-modify configuration system
  • CLI interface: User-friendly command-line interface with rich output support
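
The deduplication feature is described above only as "URL normalization and title similarity". A minimal sketch of that idea, using only the standard library, might look like the following. The function names, the list of tracking parameters, the `url` and `title` keys, and the 0.9 threshold are all assumptions made for illustration; they are not the actual API of `src/utils/deduplicator.py`.

```python
from difflib import SequenceMatcher
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters treated as tracking noise.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content", "ref"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host; drop fragment, tracking params and trailing slash."""
    parts = urlsplit(url.strip())
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() not in TRACKING_PARAMS]
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(query)), ""))


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two case-folded titles."""
    return SequenceMatcher(None, a.casefold().strip(), b.casefold().strip()).ratio()


def is_duplicate(article: dict, seen: list, threshold: float = 0.9) -> bool:
    """True if `article` matches any entry in `seen` by URL or by near-identical title."""
    url = normalize_url(article["url"])
    for other in seen:
        if normalize_url(other["url"]) == url:
            return True
        if title_similarity(article["title"], other["title"]) >= threshold:
            return True
    return False
```

Comparing every new article against every stored one is quadratic; a real implementation would more likely index normalized URLs in the database and run the title comparison only against recent articles.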

🛠️ Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package installer)
  • Git (for cloning the repository)

Step-by-Step Installation

Step 1: Clone the Repository

git clone <repository-url>
cd AI_webscrapper

Step 2: Create a Virtual Environment (Recommended)

Windows:

python -m venv venv
venv\Scripts\activate

Linux/Mac:

python3 -m venv venv
source venv/bin/activate

Step 3: Install Dependencies

pip install -r requirements.txt

Or install as a package:

pip install -e .

Step 4: Configure Environment Variables

  1. Copy the example environment file:

    # Windows
    copy .env.example .env
    
    # Linux/Mac
    cp .env.example .env
  2. Edit .env and fill in your configuration:

    • Reddit API credentials (required for Reddit sources):

      • Visit https://www.reddit.com/prefs/apps to create a Reddit application
      • Add your REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET
      • Set REDDIT_USER_AGENT to a descriptive name for your app (e.g., "AI_News_Crawler/1.0")
    • Optional configurations:

      • DATABASE_PATH: Path to SQLite database (default: data/news_crawler.db)
      • OUTPUT_DIR: Directory for output files (default: output/)
      • LOG_LEVEL: Logging level (default: INFO)
      • REQUEST_TIMEOUT: Request timeout in seconds (default: 30)
      • REQUEST_DELAY: Delay between requests in seconds (default: 1.0)
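
To illustrate how the optional settings and their defaults fit together, here is a minimal sketch that reads them from the environment. The `load_settings` function and its dictionary keys are hypothetical, not the contents of the project's `config.py`. The sketch also assumes the values from `.env` have already been loaded into the process environment (for example, by a dotenv-style loader).

```python
import os


def load_settings(env=None) -> dict:
    """Read crawler settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "database_path": env.get("DATABASE_PATH", "data/news_crawler.db"),
        "output_dir": env.get("OUTPUT_DIR", "output/"),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        # Numeric settings arrive as strings and must be converted explicitly.
        "request_timeout": int(env.get("REQUEST_TIMEOUT", "30")),
        "request_delay": float(env.get("REQUEST_DELAY", "1.0")),
    }


if __name__ == "__main__":
    for key, value in load_settings().items():
        print(f"{key} = {value!r}")
```

Passing a plain dictionary as `env` makes the function easy to exercise without touching the real environment.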

Step 5: Verify Installation

python src/main.py --list-sources

This should display all configured sources organized by tier.

🚀 Quick Start

Manual Scraping

Scrape a specific tier:

python src/main.py --manual --tier 1

Scrape a specific source:

python src/main.py --manual --source "TechCrunch"

Test Mode (No Database Saving)

Test scraping without saving to database:

python src/main.py --test --tier 1

Automated Scheduling

Start the automated scheduler:

python src/main.py --auto

The scheduler will run continuously and scrape sources at scheduled times:

  • Tier 1: Every 6 hours (0:00, 6:00, 12:00, 18:00 UTC)
  • Tier 2: Every 12 hours (8:00, 20:00 UTC)
  • Tier 3: Twice daily (9:00, 17:00 UTC)
  • Daily Report: Daily (22:00 UTC)

Press Ctrl+C to stop the scheduler.
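
As a worked example of the schedule above, the following sketch computes when a given job would next fire. The UTC hours are copied from the list; the job names and the `next_run` helper are assumptions for illustration and say nothing about how `ScraperScheduler` is implemented internally.

```python
from datetime import datetime, timedelta, timezone

# UTC hours taken from the schedule listed above; the job names are hypothetical.
SCHEDULE_UTC_HOURS = {
    "tier_1": [0, 6, 12, 18],
    "tier_2": [8, 20],
    "tier_3": [9, 17],
    "daily_report": [22],
}


def next_run(job: str, now: datetime) -> datetime:
    """Return the first scheduled time strictly after `now` (a timezone-aware datetime)."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # Check the remaining slots today, then the slots of the following day.
    for day_offset in (0, 1):
        for hour in SCHEDULE_UTC_HOURS[job]:
            candidate = midnight + timedelta(days=day_offset, hours=hour)
            if candidate > now:
                return candidate
    raise AssertionError("unreachable: every job has at least one slot per day")


if __name__ == "__main__":
    current = datetime.now(timezone.utc)
    for name in SCHEDULE_UTC_HOURS:
        print(name, "->", next_run(name, current).isoformat())
```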

Generate Reports

Generate a daily report:

python src/main.py --report --date 2024-11-28

Or generate a report for today:

python src/main.py --report

📖 Usage

CLI Commands

List All Sources

python src/main.py --list-sources

Displays all configured sources organized by tier, showing:

  • Source name
  • Source type (RSS, Reddit, API)
  • Base URL
  • Whether a scraper is available

Manual Scraping

Scrape by Tier:

python src/main.py --manual --tier 1
python src/main.py --manual --tier 2
python src/main.py --manual --tier 3

Scrape Specific Source:

python src/main.py --manual --source "TechCrunch"
python src/main.py --manual --source "Hacker News"

With Verbose Logging:

python src/main.py --manual --tier 1 --verbose

Test Mode

Test scraping without saving to database:

python src/main.py --test --tier 1
python src/main.py --test --source "The Verge"

Automated Scheduling

Start automated scheduler:

python src/main.py --auto

While it is running, the scheduler will:

  • Scrape sources at scheduled intervals
  • Save articles to the database
  • Generate daily reports
  • Log all activities

Generate Reports

Today's Report:

python src/main.py --report

Specific Date:

python src/main.py --report --date 2024-11-28

Reports are saved as both JSON and HTML files in the output/ directory.

Programmatic Usage

Access Configuration

from config import DATABASE_PATH, OUTPUT_DIR, LOG_LEVEL
from src.config.sources import ALL_SOURCES, TIER_1_SOURCES
from src.config.keywords import HIGH_VALUE_KEYWORDS, MEDIUM_VALUE_KEYWORDS

# View all configured sources
print(f"Total sources configured: {len(ALL_SOURCES)}")
for source in ALL_SOURCES:
    print(f"- {source['name']} (Tier {source['priority']}, Type: {source['type']})")

# Access high-value keywords
print(f"\nHigh-value keywords: {len(HIGH_VALUE_KEYWORDS)} keywords")
print(HIGH_VALUE_KEYWORDS[:5])  # Print first 5

Using Scrapers Directly

from src.scrapers import TechCrunchScraper
from src.utils.storage import init_database, insert_article
from src.utils.filters import calculate_video_score

# Initialize database
init_database("data/news_crawler.db")

# Create scraper instance
scraper = TechCrunchScraper()

# Fetch and parse
html = scraper.get_page(scraper.scrape_url)
soup = scraper.parse_html(html)

# Extract articles
articles = scraper.extract_articles(soup)

# Process and save
for article in articles:
    article['source'] = 'TechCrunch'
    score = calculate_video_score(article)
    article['score'] = score
    insert_article("data/news_crawler.db", article)

Using the Scheduler

from src.utils.scheduler import ScraperScheduler
from config import DATABASE_PATH, OUTPUT_DIR

# Create scheduler
scheduler = ScraperScheduler(db_path=DATABASE_PATH, output_dir=OUTPUT_DIR)

# Start automated scheduling
scheduler.start()

# ... scheduler runs in background ...

# Stop scheduler
scheduler.stop()

⚙️ Configuration

Environment Variables

Create a .env file in the project root with the following variables:

# Reddit API Configuration (Required for Reddit sources)
REDDIT_CLIENT_ID=your_client_id_here
REDDIT_CLIENT_SECRET=your_client_secret_here
REDDIT_USER_AGENT=AI_News_Crawler/1.0

# Database Configuration
DATABASE_PATH=data/news_crawler.db

# Output Configuration
OUTPUT_DIR=output

# Logging Configuration
LOG_LEVEL=INFO
LOG_FILE=logs/crawler.log

# Request Configuration
REQUEST_TIMEOUT=30
REQUEST_DELAY=1.0
MAX_RETRIES=3

# User Agent
USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36

# Rate Limiting
RATE_LIMIT_ENABLED=true
RATE_LIMIT_REQUESTS_PER_MINUTE=60

Source Configuration

Sources are configured in src/config/sources.py. Each source includes:

  • name: Source name
  • base_url: Base URL of the source
  • rss_feeds: List of RSS feed URLs (for RSS sources)
  • type: Source type (rss, reddit, api)
  • priority: Tier number (1, 2, or 3)
  • rate_limit: Rate limiting delay in seconds
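
A source entry with the fields listed above might look like the sketch below. The entry itself is a placeholder (the name and URLs are not a real configured source), and `validate_source` is a hypothetical helper, not part of the project. The sketch assumes entries are plain dictionaries, as the programmatic usage example suggests.

```python
# Hypothetical entry; the name and URLs are placeholders.
EXAMPLE_SOURCE = {
    "name": "Example Tech Blog",
    "base_url": "https://example.com",
    "rss_feeds": ["https://example.com/feed.xml"],
    "type": "rss",
    "priority": 2,       # tier number
    "rate_limit": 2.0,   # seconds between requests
}

REQUIRED_FIELDS = {"name", "base_url", "type", "priority", "rate_limit"}
VALID_TYPES = {"rss", "reddit", "api"}


def validate_source(source: dict) -> list:
    """Return a list of problems; an empty list means the entry looks well-formed."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(source))]
    if source.get("type") not in VALID_TYPES:
        problems.append(f"unknown type: {source.get('type')!r}")
    if source.get("priority") not in (1, 2, 3):
        problems.append(f"priority must be 1, 2 or 3, got {source.get('priority')!r}")
    if source.get("type") == "rss" and not source.get("rss_feeds"):
        problems.append("rss source has no rss_feeds")
    return problems


if __name__ == "__main__":
    print(validate_source(EXAMPLE_SOURCE) or "entry looks valid")
```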

Keyword Configuration

Keywords are configured in src/config/keywords.py:

  • HIGH_VALUE_KEYWORDS: Critical AI/ML topics (60+ keywords)
  • MEDIUM_VALUE_KEYWORDS: Important tech topics (40+ keywords)

🏗️ Architecture

The project follows a modular architecture:

AI_WebScrapper/
├── src/                        # Main source code
│   ├── config/                 # Configuration modules
│   │   ├── sources.py         # News source configurations
│   │   └── keywords.py        # Keyword lists for filtering
│   ├── scrapers/              # Web scraping modules
│   │   ├── base_scraper.py    # Base scraper class
│   │   ├── techcrunch_scraper.py
│   │   ├── venturebeat_scraper.py
│   │   └── ...                # Other scraper implementations
│   ├── utils/                 # Utility functions
│   │   ├── storage.py         # Database operations
│   │   ├── filters.py         # Scoring and filtering
│   │   ├── deduplicator.py    # Duplicate detection
│   │   ├── scheduler.py       # Automated scheduling
│   │   └── reporter.py        # Report generation
│   └── main.py                # CLI entry point
├── data/                      # Data directory (database files)
├── output/                    # Output directory (reports, JSON)
├── logs/                      # Log files
├── tests/                     # Test suite
├── docs/                      # Documentation
│   ├── ARCHITECTURE.md        # System architecture details
│   ├── ADDING_SOURCES.md      # Guide for adding new sources
│   └── API.md                 # API reference
├── config.py                  # Global configuration
├── requirements.txt           # Python dependencies
├── setup.py                   # Package installation
├── .env.example               # Environment variables template
└── README.md                  # This file

For detailed architecture information, see docs/ARCHITECTURE.md.

📰 News Sources

The crawler is configured with 15 news sources organized into three tiers:

Tier 1: Premium High-Quality Sources (3 sources)

  • The Verge
  • TechCrunch
  • Ars Technica

Tier 2: Reliable Mainstream Sources (5 sources)

  • Wired
  • MIT Technology Review
  • The Guardian - Technology
  • Reuters - Technology
  • BBC Technology

Tier 3: Specialized Tech/AI Sources (7 sources)

  • Hacker News (API)
  • Reddit - r/MachineLearning
  • Reddit - r/artificial
  • Reddit - r/singularity
  • IEEE Spectrum
  • VentureBeat AI
  • ZDNet AI

🔍 Scoring Algorithm

Articles are scored on a 0-10 scale based on:

  1. Keyword Match (3.0 points): Contains high-value AI/ML keywords
  2. Source Authority (2.0 points): Based on source tier
  3. Engagement (2.0 points): Normalized upvotes/comments
  4. Recency (1.5 points): Published within 24/48 hours
  5. Title Quality (1.5 points): Numbers, action words, controversy terms

Articles with scores ≥ 7.5 are considered high-priority.

For detailed scoring information, see docs/ARCHITECTURE.md.
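
The weighting above can be sketched as a weighted sum of per-component fractions. Only the maximum points per component and the 7.5 threshold come from this README. The step function for recency, the tier-to-authority mapping, and all function names are assumptions for illustration; the real logic lives in `calculate_video_score` and may differ.

```python
from datetime import datetime, timedelta, timezone

# Maximum points per component, as listed above (they sum to 10.0).
WEIGHTS = {"keywords": 3.0, "authority": 2.0, "engagement": 2.0, "recency": 1.5, "title": 1.5}
HIGH_PRIORITY_THRESHOLD = 7.5


def combine(components: dict) -> float:
    """Weighted sum; each component is a fraction in [0, 1], missing ones count as 0."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        fraction = min(max(components.get(name, 0.0), 0.0), 1.0)
        total += weight * fraction
    return round(total, 2)


def recency_fraction(published: datetime, now: datetime) -> float:
    """Assumed step function: full credit within 24 h, half within 48 h, none after."""
    age = now - published
    if age <= timedelta(hours=24):
        return 1.0
    if age <= timedelta(hours=48):
        return 0.5
    return 0.0


def authority_fraction(tier: int) -> float:
    """Assumed mapping from source tier to authority credit."""
    return {1: 1.0, 2: 0.7, 3: 0.4}.get(tier, 0.0)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    score = combine({
        "keywords": 1.0,
        "authority": authority_fraction(1),
        "engagement": 0.5,
        "recency": recency_fraction(now - timedelta(hours=3), now),
        "title": 0.5,
    })
    print(score, "high priority" if score >= HIGH_PRIORITY_THRESHOLD else "normal")
```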

🐛 Troubleshooting

Common Issues

1. Reddit API Errors

Problem: praw.exceptions.PRAWException or authentication errors

Solution:

  • Verify your Reddit API credentials in .env
  • Ensure REDDIT_USER_AGENT is set correctly
  • Check that your Reddit app has the correct permissions
  • Visit https://www.reddit.com/prefs/apps to verify your app settings

2. Database Locked Errors

Problem: sqlite3.OperationalError: database is locked

Solution:

  • Ensure only one instance of the crawler is running
  • Close any database viewers or tools accessing the database
  • Wait a few seconds and retry
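
If lock errors persist, the "wait and retry" advice can be automated. The following is a minimal sketch using the standard `sqlite3` module: the `timeout` argument makes SQLite itself wait for a lock to clear, and a small retry loop handles the remaining cases. `execute_with_retry` is a hypothetical helper, not a function from `src/utils/storage.py`.

```python
import sqlite3
import time


def execute_with_retry(db_path, sql, params=(), retries=3, busy_timeout=5.0):
    """Run one write statement, retrying briefly if the database is locked."""
    for attempt in range(retries):
        # `timeout` is how long SQLite waits for a lock before raising.
        conn = sqlite3.connect(db_path, timeout=busy_timeout)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return
        except sqlite3.OperationalError as exc:
            # Re-raise anything that is not a lock error, or the final failure.
            if "locked" not in str(exc) or attempt == retries - 1:
                raise
            time.sleep(0.5 * (attempt + 1))
        finally:
            conn.close()
```

Retrying only helps with short-lived contention; if a second crawler instance or a database viewer holds the file open for long periods, closing it remains the real fix.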

3. No Articles Extracted

Problem: Scraper runs but extracts 0 articles

Solution:

  • Check if the website structure has changed (inspect HTML)
  • Verify the scraper's CSS selectors are still valid
  • Enable verbose logging: --verbose
  • Check logs in logs/crawler.log

4. Import Errors

Problem: ModuleNotFoundError or import errors

Solution:

  • Ensure the virtual environment is activated
  • Reinstall dependencies: pip install -r requirements.txt
  • Verify Python version: python --version (should be 3.8+)

5. Rate Limiting / 429 Errors

Problem: Too many requests errors

Solution:

  • Increase REQUEST_DELAY in .env
  • Enable RATE_LIMIT_ENABLED=true
  • Reduce scraping frequency in scheduler
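
To show what a request delay and a retry backoff do in practice, here is a minimal standard-library sketch. Both `MinIntervalLimiter` and `backoff_delay` are hypothetical names; the project's own rate limiting, controlled by `REQUEST_DELAY` and `RATE_LIMIT_ENABLED`, may work differently.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `delay` seconds pass between consecutive calls to wait()."""

    def __init__(self, delay: float, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay - (now - self._last))
            if pause:
                self._sleep(pause)
        self._last = now + pause
        return pause


def backoff_delay(base_delay: float, attempt: int, cap: float = 60.0) -> float:
    """Exponential backoff after a 429 response: base * 2**attempt, capped at `cap`."""
    return min(cap, base_delay * (2 ** attempt))


if __name__ == "__main__":
    limiter = MinIntervalLimiter(delay=1.0)
    for i in range(3):
        slept = limiter.wait()
        print(f"request {i}: slept {slept:.2f}s")
```

Calling `limiter.wait()` before each request spaces requests out, while `backoff_delay` gives progressively longer pauses when the server keeps answering with 429.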

Getting Help

  1. Check the logs in logs/crawler.log
  2. Enable verbose logging: --verbose
  3. Review docs/ARCHITECTURE.md for system details
  4. Check docs/API.md for function reference
  5. Open an issue on GitHub with:
    • Error message
    • Steps to reproduce
    • Log file excerpt
    • Python version and OS

🤝 Contributing

We welcome contributions! Here's how you can help:

Adding New Sources

See docs/ADDING_SOURCES.md for a complete guide on adding new scrapers.

Reporting Issues

  1. Check if the issue already exists
  2. Create a new issue with:
    • Clear description
    • Steps to reproduce
    • Expected vs actual behavior
    • Environment details (OS, Python version)

Submitting Pull Requests

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes
  4. Add tests if applicable
  5. Commit your changes: git commit -m 'Add amazing feature'
  6. Push to the branch: git push origin feature/amazing-feature
  7. Open a Pull Request

Code Style

  • Follow PEP 8 style guide
  • Use type hints where possible
  • Add docstrings to functions and classes
  • Write tests for new features

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • BeautifulSoup for HTML parsing
  • PRAW for Reddit API access
  • All the news sources for providing RSS feeds and APIs

Made with ❤️ for AI news enthusiasts
