
Web Scraper Agent

Table of Contents

  • Overview
  • Technical Architecture
  • Installation
  • Quick Start
  • Configuration
  • Directory Tree

Overview

web-scraper-agent is a modular, prompt-driven scraping system built on top of Pydantic-AI and Groq-hosted LLaMA models (e.g. LLaMA 4 Scout). It extracts structured data from arbitrary websites using nothing more than the page HTML and a user-defined schema.

It’s perfect for developers and researchers who want:

  • Structured data from any web page, via prompt + schema only
  • A clean interface to define and validate schemas using Pydantic
  • Seamless integration with Groq LLaMA, OpenAI, or local backends
  • Simple plug-and-play examples for scraping products, blog posts, and more

✨ Think of it as ChatGPT meets BeautifulSoup — but with strict output validation.

Technical Architecture

Layer              Purpose                                              File
Scraping Tool      Downloads and cleans up web page HTML                scraper.py > fetch_html_text()
Schema Wrapper     Defines output structure with Pydantic models        examples/
Prompt Execution   Uses Groq + Pydantic-AI to extract structured data   scraper.py > scrape()
Configuration      Central config for model/backend/runtime params      config.yaml
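
For illustration, here is a minimal sketch of what the Scraping Tool layer does, assuming a requests + BeautifulSoup implementation (the actual fetch_html_text() in scraper.py may differ):

import requests
from bs4 import BeautifulSoup

def fetch_html_text(url: str, timeout: int = 30) -> str:
    """Download a page and reduce it to visible text for the LLM."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop non-visible elements so the prompt stays small
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)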

Installation

web-scraper-agent works with Python 3.10 – 3.12

Using uv (recommended)

uv is a fast, Rust-powered dependency manager and virtual-environment tool.

# 1 - Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2 - Set up and activate virtual environment
uv venv .venv
source .venv/bin/activate

# 3 - Lock and install dependencies
uv lock
uv sync

Using pip / venv

python -m venv .venv
source .venv/bin/activate       # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Groq users: Don’t forget to export your Groq key before running:

export GROQ_API_KEY="sk-..."
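
A quick way to confirm the key is visible to your Python process before a run:

import os

assert os.environ.get("GROQ_API_KEY"), "GROQ_API_KEY is not set"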

Quick Start

Here's how to use the agent with any schema and prompt.

from scraper import scrape
from pydantic import BaseModel
from typing import List

class Product(BaseModel):
    brand: str
    name: str
    price: str | None
    stars: float | None
    reviews: int | None

class Results(BaseModel):
    products: List[Product]

prompt_template = """
Extract product listings from this HTML.

Return JSON in this format:
{
  "products": [
    {
      "brand": "Brand name",
      "name": "Full title",
      "price": "EUR 29.99",
      "stars": 4.5,
      "reviews": 123
    }
  ]
}

HTML:
{html}
"""

url = "https://www.amazon.de/s?k=wireless+headphones"
data = scrape(url, prompt_template, Results)

# ✅ Expected output:
# products = [
#   Product(brand="Sony", name="WH-CH520...", price="EUR 32.99", stars=4.5, reviews=29694),
#   Product(brand="Anker", name="Q20i...", price="EUR 29.99", stars=4.6, reviews=31020),
#   ...
# ]

print(data.model_dump_json(indent=2))
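
The same pattern works for any schema. Continuing the imports from the snippet above, a blog extractor might look like this (field names here are illustrative, not taken from examples/extract_blog.py):

class BlogPost(BaseModel):
    title: str
    author: str | None
    published: str | None

class BlogResults(BaseModel):
    posts: List[BlogPost]

blog_prompt = """
Extract blog posts from this HTML.

Return JSON in this format:
{"posts": [{"title": "...", "author": "...", "published": "..."}]}

HTML:
{html}
"""

posts = scrape("https://example.com/blog", blog_prompt, BlogResults)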

Configuration (config.yaml)

Control your model, generation parameters, and backend settings from a single file:

# config.yaml
model: "meta-llama/llama-4-scout-17b-16e-instruct"
temperature: 0.7
max_tokens: 1024
base_url: "https://api.groq.com/openai/v1"
timeout: 30
retry_count: 3

GROQ_API_KEY is pulled from environment variables by default.
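
As a rough sketch of how these settings can be consumed (PyYAML is assumed; the actual loading code in scraper.py may differ):

import os
import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Generation parameters come from config.yaml;
# the API key comes from the environment.
model_name = config["model"]
api_key = os.environ["GROQ_API_KEY"]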

Directory Tree

web-scraper-agent/
├── scraper.py                  # Core logic: fetch HTML, run model, validate schema
├── config.yaml                 # Groq model + runtime settings
├── requirements.txt            # pip-compatible dependencies
├── examples/
│   ├── extract_product.py      # Amazon product scraper
│   ├── extract_blog.py         # Blog/news scraper
│   └── extract_jobs.py         # Job listings or LinkedIn data
└── README.md                   # This file
