πŸ•ΈοΈ Scrape & Classify – Web Data Extraction Pipeline

A modular Python pipeline for extracting, cleaning, and classifying web data with requests, BeautifulSoup, and scikit-learn. Designed for smart web prospecting, offline research, and scalable automation.


## πŸ“Œ Features

- πŸ” **Web Scraping**: Collects structured data from target websites using `requests` and `BeautifulSoup`.
- 🧹 **Data Cleaning**: Standardizes and prepares text for analysis using `pandas`, `re`, and `unicodedata`.
- 🧠 **ML Classification**: Predicts categories or relevance using a trained scikit-learn classifier.
- πŸ—ƒοΈ **Export Options**: Saves results to `.csv`, `.json`, and `.txt` formats for integration and sharing.
- πŸ” **Offline-First**: Fully operable without internet access after initial setup; no external API dependencies.
- πŸ“ **Modular Design**: Easy to adapt, reuse, or extend in parts or as a full pipeline.

## πŸš€ Use Cases

- Local business discovery and qualification
- Lead generation and client research
- Academic data collection and preprocessing
- Custom classifiers for specific domains or keywords

## 🧰 Stack

| Function | Tools Used |
| --- | --- |
| Scraping | `requests`, `BeautifulSoup` |
| Cleaning | `pandas`, `re`, `unicodedata` |
| ML Classification | `scikit-learn`, `joblib` |
| Exporting | `csv`, `json`, plain-text handling |
| Logging & CLI | `argparse`, `logging`, `.env` |

πŸ› οΈ How It Works

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the scraper (customize inside config or via CLI)
python scrape.py --target "https://example.com" --output data/raw.csv

# 3. Clean and preprocess the scraped data
python clean.py --input data/raw.csv --output data/clean.csv

# 4. Classify the cleaned data using the trained model
python classify.py --input data/clean.csv --model models/classifier.pkl --output results/predictions.csv

Each step is independent and can be used modularly.
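A model file like `models/classifier.pkl` could be produced and reloaded as sketched below. The training data and the TF-IDF + logistic-regression pipeline are illustrative assumptions; the repository's actual model may differ:

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real run would use labeled scraped data.
texts = [
    "artisan bread pastries espresso",
    "fresh cakes and coffee",
    "cloud software machine learning",
    "saas platform api integrations",
]
labels = ["food", "food", "tech", "tech"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# Persist and reload, mirroring what classify.py would consume.
joblib.dump(model, "classifier.pkl")
reloaded = joblib.load("classifier.pkl")
print(reloaded.predict(["wholesale bakery with fresh bread"]))
```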


## πŸ“Š Example Output

```csv
name,website,description,category,score
"ACME Bakery","acme.com","...","food",0.92
"FutureTech Solutions","futuretech.ai","...","tech",0.88
...
```
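Downstream, the predictions file can be consumed with `pandas`. This sketch stands in `io.StringIO` for `results/predictions.csv` and assumes the column names shown above; the 0.9 score threshold is an arbitrary example:

```python
import io

import pandas as pd

# Stand-in for results/predictions.csv, using the columns shown above.
csv_text = """name,website,category,score
ACME Bakery,acme.com,food,0.92
FutureTech Solutions,futuretech.ai,tech,0.88
"""

df = pd.read_csv(io.StringIO(csv_text))
# Keep only high-confidence predictions.
confident = df[df["score"] >= 0.9]
print(confident["name"].tolist())  # ['ACME Bakery']
```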

## πŸ“‚ Folder Structure

```text
scrape_classify_pipeline/
β”‚
β”œβ”€β”€ data/                 # Raw and processed data
β”œβ”€β”€ models/               # Trained models (.pkl)
β”œβ”€β”€ results/              # Classification outputs
β”œβ”€β”€ scrape.py             # Web scraping logic
β”œβ”€β”€ clean.py              # Data cleaning script
β”œβ”€β”€ classify.py           # ML classification pipeline
β”œβ”€β”€ utils/                # Reusable utility functions
β”œβ”€β”€ config.env            # Environment variables
└── README.md             # Project overview
```

## πŸ“¦ Dependencies

- Python 3.8+
- pandas
- beautifulsoup4
- scikit-learn
- requests
- joblib

## πŸ”’ Privacy-First Design

This pipeline does not rely on cloud APIs or send data externally. Everything runs locally, making it ideal for private research or restricted environments.


## πŸ“Œ License

Licensed under the MIT License – feel free to modify and use for personal or commercial projects.


## πŸ™Œ Author

Jose Daniel Soto πŸ“§ Email | 🌐 GitHub | πŸ”— LinkedIn
