content-extraction

Here are 202 public repositories matching this topic...

firecrawl / firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

mcp web-crawler web-scraping data-collection batch-processing content-extraction search-api claude llm-tools firecrawl model-context-protocol mcp-server firecrawl-ai javascript-rendering

Updated Mar 23, 2026
JavaScript

vakra-dev / reader

Open-source, production-grade web scraping engine built for LLMs. Scrape and crawl the entire web, clean markdown, ready for your agents.

Updated Feb 2, 2026
TypeScript

graphlit / graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

web-crawler web-scraping data-collection content-extraction search-api claude unstructured-data content-ingestion llm-tools model-context-protocol mcp-server

Updated Jan 12, 2026
TypeScript

currentslab / extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package.

python machine-learning text-mining news web-scraping webscraping news-articles news-extractor content-extraction news-extraction text-cleaning date-extraction author-extraction

Updated May 19, 2025
HTML

teng-lin / agent-fetch

Full-content web fetcher for AI agents — Chrome TLS fingerprinting, browser impersonation, and multi-strategy article extraction

nodejs typescript html-to-markdown web-scraping readability fetcher content-extraction ai-agents tls-fingerprint anti-bot-detection httpcloak

Updated Mar 15, 2026
TypeScript

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

crawler cheerio mcp web-crawler duckduckgo web-scraper web-scraping google-search content-extraction duckduckgo-search web-search ai-assistant ai-tools web-content mcp-server web-search-agent

Updated Feb 13, 2026
JavaScript

mvasilkov / readability2

Readability2 converts HTML to plain text.

javascript html readability plaintext content-extraction

Updated Dec 12, 2018
TypeScript

tuffstuff9 / nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

nextjs content-extraction pdf-parsing react-pdf pdf-parser pdf2json filepond pdf-upload pdf-parse nextjs-pdf-parser nextjs-pdf react-pdf-parser nextjs-pdf-parse nextjs-pdf-parsing

Updated Dec 8, 2023
TypeScript

blessonism / openclaw-skills

A collection of OpenClaw Agent Skills — search, analysis, content extraction, and more.

search skills content-extraction github-explorer ai-agent multi-source-search openclaw

Updated Mar 17, 2026
Python

gregors / boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

news webscraping content-extraction boilerpipe boilerpipe-algorithm

Updated Feb 21, 2021
Ruby

oiwn / dom-content-extraction

DOM Based Content Extraction via Text Density

rust scraping web-crawling content-extraction dom-based

Updated Sep 23, 2025
Rust

nikitautiu / learnhtml

Web content extraction using machine learning

html deep-learning content-extraction

Updated Mar 3, 2021
HTML

spences10 / mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

mcp documentation-tool text-extraction web-scraping content-extraction web-content jinaai llm-tools model-context-protocol

Updated Apr 5, 2025
JavaScript

k-kolomeitsev / agent-browser-workspace

Local browser toolkit for AI agents: deep research and browser-use automation with local Chrome (CDP) + Playwright. Flexible, extensible scripts for web navigation, extraction, and workflow automation, built for reproducible research and agent-driven browsing.

Updated Mar 14, 2026
JavaScript

mavam / pi-web-providers

Configurable web access extension for pi that routes search, contents, answers, and research across Claude, Codex, Exa, Gemini, Parallel, and Valyu providers.