content-extraction

Here are 248 public repositories matching this topic...

firecrawl / firecrawl-mcp-server

🔥 Official Firecrawl MCP Server - Adds powerful web scraping and search to Cursor, Claude and any other LLM clients.

mcp web-crawler web-scraping data-collection batch-processing content-extraction search-api claude llm-tools firecrawl model-context-protocol mcp-server firecrawl-ai javascript-rendering

Updated Jun 30, 2026
JavaScript

vakra-dev / reader

Open source web infrastructure for AI. Scrape, crawl, and automate the web, with clean markdown and browser sessions ready for your agents.

Updated Jul 1, 2026
TypeScript

graphlit / graphlit-mcp-server

Model Context Protocol (MCP) Server for Graphlit Platform

web-crawler web-scraping data-collection content-extraction search-api claude unstructured-data content-ingestion llm-tools model-context-protocol mcp-server

Updated Jan 12, 2026
TypeScript

currentslab / extractnet

A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

python machine-learning text-mining news web-scraping webscraping news-articles news-extractor content-extraction news-extraction text-cleaning date-extraction author-extraction

Updated May 19, 2025
HTML

teng-lin / agent-fetch

Full-content web fetcher for AI agents — Chrome TLS fingerprinting, browser impersonation, and multi-strategy article extraction

nodejs typescript html-to-markdown web-scraping readability fetcher content-extraction ai-agents tls-fingerprint anti-bot-detection httpcloak

Updated Mar 15, 2026
TypeScript

pinkpixel-dev / web-scout-mcp

A powerful MCP server extension providing web search and content extraction capabilities. Integrates DuckDuckGo search functionality and URL content extraction into your MCP environment, enabling AI assistants to search the web and extract webpage content programmatically.

crawler cheerio mcp web-crawler duckduckgo web-scraper web-scraping google-search content-extraction duckduckgo-search web-search ai-assistant ai-tools web-content mcp-server web-search-agent

Updated Jun 4, 2026
JavaScript

mvasilkov / readability2

Readability2 converts HTML to plain text.

javascript html readability plaintext content-extraction

Updated Dec 12, 2018
TypeScript

mavam / pi-web-providers

Configurable web access extension for pi that routes search, contents, answers, and research across Claude, Codex, Exa, Gemini, Parallel, and Valyu providers.

typescript gemini content-extraction codex claude web-search exa web-research coding-agent valyu pi-extension pi-package parallel-web

Updated Jun 21, 2026
TypeScript

tuffstuff9 / nextjs-pdf-parser

Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

nextjs content-extraction pdf-parsing react-pdf pdf-parser pdf2json filepond pdf-upload pdf-parse nextjs-pdf-parser nextjs-pdf react-pdf-parser nextjs-pdf-parse nextjs-pdf-parsing

Updated Dec 8, 2023
TypeScript

blessonism / openclaw-skills

A collection of OpenClaw Agent Skills — search, analysis, content extraction, and more.

search skills content-extraction github-explorer ai-agent multi-source-search openclaw

Updated Mar 17, 2026
Python

k-kolomeitsev / agent-browser-workspace

Local browser toolkit for AI agents: deep research and browser-use automation with local Chrome (CDP) + Playwright. Flexible, extensible scripts for web navigation, extraction, and workflow automation, built for reproducible research and agent-driven browsing.

Updated Mar 14, 2026
JavaScript

youdotcom-oss / agent-skills

Agent Skills for integrating You.com capabilities into agentic workflows and AI development tools - guided integrations for Claude, OpenAI, Vercel AI SDK, and Teams.ai

Updated Jun 9, 2026
TypeScript

oiwn / dom-content-extraction

DOM Based Content Extraction via Text Density

rust scraping web-crawling content-extraction dom-based

Updated Jun 29, 2026
Rust

developer0hye / anytomd-rs

Pure Rust document-to-Markdown converter for LLM workflows (DOCX, PPTX, XLSX, HTML, CSV, JSON, XML, images).

html markdown rust converter json csv xml xlsx docx pptx text-processing content-extraction image-extraction llm anytomd

Updated Jun 7, 2026
Rust

gregors / boilerpipe-ruby

Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

news webscraping content-extraction boilerpipe boilerpipe-algorithm

Updated Feb 21, 2021
Ruby

Murrough-Foley / rs-trafilatura

Fast, accurate web content extraction in Rust. ML page-type classification, per-type extraction, confidence scoring. F1=0.966 on ScrapingHub (#1), F1=0.859 across 2,008 annotated pages (1,497 development + 511 held-out test)

nlp rust machine-learning web-scraping content-extraction search-engine-optimization trafilatura

Updated Apr 3, 2026
Rust

nikitautiu / learnhtml

Web content extraction using machine learning

html deep-learning content-extraction

Updated Mar 3, 2021
HTML

zoharbabin / web-researcher-mcp

An AI research assistant that searches the web, cites real sources, and stays honest. Works with Claude, Cursor, and any MCP client.

Updated Jul 1, 2026
Go

spences10 / mcp-jinaai-reader

🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

mcp documentation-tool text-extraction web-scraping content-extraction web-content jinaai llm-tools model-context-protocol

Updated Apr 5, 2025
JavaScript

mukul975 / mcp-web-scrape

🚀 mcp-web-scrape: Clean, cache-aware web content fetcher for AI agents. Fetch any URL → extract readable content → return Markdown/JSON with citations. ⚡ Fast caching, 🤝 robots.txt compliant, 📝 Markdown-ready output, works with ChatGPT/Claude Desktop.

nodejs api agent markdown scraper typescript ai mcp cache web-crawler sse citations web-scraping stdio content-extraction claude llm chatgpt model-context-protocol

Updated Feb 16, 2026
TypeScript

