Open-source web retrieval & research agent built for AI agents.
agentfetch is a free, local alternative to Firecrawl, Exa, Parallel.ai, and Tavily. It fetches any webpage, crawls any site, searches the web, and researches any topic — returning clean markdown and structured reports that AI agents can consume directly.
Works with LangChain, LlamaIndex, CrewAI, AutoGen, Claude MCP, OpenAI function calling, Gemini, Groq, and plain REST. No vendor lock-in, no API keys required.
Installation requires no PyPI account, API token, or sign-up; the package installs directly from GitHub.
What makes it different
Research Agent — Tavily-style deep research: auto-decomposes questions into sub-queries, searches multiple engines, gathers full content, and synthesizes a structured report with citations via Ollama or Claude. Supports iterative follow-up (depth="deep"), structured output schemas, and four citation formats (numbered, MLA, APA, Chicago).
Smart Mode Router — detects JavaScript-heavy SPAs (Next.js, Nuxt, React) and falls back to Playwright headless browser automatically. Static pages use direct HTTP.
5-layer extraction pipeline — trafilatura → newspaper3k → readability-lxml → BeautifulSoup → plain text. Best-effort extraction from any HTML.
Never raises exceptions — always returns structured FetchResult with confidence scores, error fields, and injection detection. Agents can trust the output.
Post-extraction quality check — detects SPA shell text, low prose ratio, missing sentence structure, and downgrades confidence so agents can retry with browser engine.
In-memory LRU cache — deduplicates repeated URL fetches within a session. Configurable size/TTL via env vars. No Redis required.
Information saturation crawling — no arbitrary depth limits. CrawlStopper detects vocabulary saturation and content redundancy, stopping when enough data is gathered.
Persistent crawl store — SQLite-backed job persistence when Redis is not configured. Results survive process restarts.
Prompt injection firewall — 13 patterns detected and redacted to [REDACTED BY AGENTFETCH].
Cloudflare bypass — optional curl_cffi integration with 12 TLS fingerprint profiles (Chrome 99–124, Safari 15/17) and auto-rotation.
Browser stealth — optional playwright-stealth integration for advanced anti-detection (WebGL vendor, canvas fingerprint, navigator.webdriver removal). Enabled by default.
Robots.txt compliance — optional async parser with caching, crawl-delay, and sitemap discovery.
Proxy rotation — round-robin or random proxy pools with automatic failure tracking.
Local LLM extraction — optional Ollama integration for structured data extraction without API costs.
Redis-backed job queue — horizontal scaling for crawl operations with background workers.
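The 5-layer extraction pipeline in the list above amounts to a fallback chain: each extractor is tried in order, and the first one that yields enough text wins. The sketch below illustrates that idea only and is not agentfetch's actual code. The names `extract_with_fallback` and `plain_text`, the `min_chars` threshold, and the stdlib-only last layer are assumptions made for illustration; in practice the earlier layers would wrap trafilatura, newspaper3k, and the other libraries named above.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Last-resort layer: collect text nodes, skipping <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def plain_text(html):
    """Hypothetical final layer: strip tags and join the remaining text."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def extract_with_fallback(html, layers, min_chars=20):
    """Try each (name, extractor) pair in order; return (layer_name, text).

    A layer that raises or returns too little text hands over to the next
    one. If every layer fails, ("none", "") is returned instead of raising.
    """
    for name, extractor in layers:
        try:
            text = extractor(html)
        except Exception:
            continue
        if text and len(text.strip()) >= min_chars:
            return name, text.strip()
    return "none", ""
```

Returning the name of the layer that succeeded lets a caller lower its confidence when only the last-resort layer produced output.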
Tools
| Tool | Description |
| --- | --- |
| agent_scrape | Fetch any URL; auto-detects browser need. Supports ScrapeConfig (wait_for selectors, tag filtering, citation markers, proxies, JA3 profile). |
| agent_crawl | Recursive crawl with information saturation stopping, robots.txt compliance, deduplication. |
| agent_search | Web search via SearXNG, DuckDuckGo, Google, or Bing with optional result scraping. |
| agent_extract | Structured data extraction by JSON schema via Ollama, Anthropic Claude, or CSS fallback. |
| agent_map | Discover all URLs on a website via sitemap.xml and BFS crawling. |
| agent_status | Poll crawl job progress (in-memory or Redis). |
| agent_research | Research a topic: decomposes into sub-queries, searches multi-engine, gathers content, synthesizes a structured report with citations via LLM. Supports deep iterative follow-up and structured output schemas. |
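The information saturation stopping used by agent_crawl can be pictured as tracking how many previously unseen words each new page contributes. The sketch below assumes a simple new-word ratio. The `SaturationStopper` class, its thresholds, and its tokenizer are hypothetical and do not describe how CrawlStopper is actually implemented.

```python
import re


class SaturationStopper:
    """Signal a stop once several pages in a row add almost no new vocabulary."""

    def __init__(self, min_new_ratio=0.05, patience=2):
        self.min_new_ratio = min_new_ratio  # below this, a page counts as stale
        self.patience = patience            # consecutive stale pages before stopping
        self.vocabulary = set()
        self._stale_pages = 0

    def add_page(self, text):
        """Record a page and return the fraction of its words that were new."""
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        new_ratio = len(words - self.vocabulary) / len(words) if words else 0.0
        self.vocabulary |= words
        if new_ratio < self.min_new_ratio:
            self._stale_pages += 1
        else:
            self._stale_pages = 0  # fresh content resets the streak
        return new_ratio

    def should_stop(self):
        return self._stale_pages >= self.patience
```

A crawler loop would call `add_page` after each fetch and check `should_stop` before dequeuing the next URL, which replaces a fixed depth limit with a content-driven one.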
Library API
| Function | Description |
| --- | --- |
| smart_fetch(url, config=) | Fetch a single URL; auto-detects browser need. Returns FetchResult. |
| search_fetch(query, sources=, max_results=) | Search and optionally scrape results. Returns SearchResult. |
| parallel_search(query, sources=, max_results=) | Search engine results without scraping. Returns tuple[list[EngineResult], list[str], dict[str, str]]. |
| smart_research(prompt, config=) | Research a topic: decomposes query, gathers sources, synthesizes report with citations via LLM. Returns ResearchResult. |
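The tuple returned by parallel_search plausibly corresponds to results, the engines that answered, and per-engine error messages; that reading is an assumption based on the type signature, not something the source states. Under that assumption, a multi-engine search that tolerates individual failures could be sketched as follows. `search_all` and the two fake engines are hypothetical stand-ins, not part of agentfetch.

```python
import asyncio


async def search_all(query, engines, max_results=5):
    """Query every engine concurrently and never raise on a single failure.

    `engines` maps an engine name to an async callable (query, max_results).
    Returns (results, sources_used, errors).
    """
    names = list(engines)
    outcomes = await asyncio.gather(
        *(engines[name](query, max_results) for name in names),
        return_exceptions=True,  # failures come back as values, not raises
    )
    results, sources_used, errors = [], [], {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, Exception):
            errors[name] = str(outcome)
        elif outcome:
            results.extend(outcome[:max_results])
            sources_used.append(name)
    return results, sources_used, errors


# Hypothetical stand-in engines, used only to exercise the sketch.
async def fake_duckduckgo(query, max_results):
    return [f"https://example.com/ddg/{i}" for i in range(max_results)]


async def fake_google(query, max_results):
    raise RuntimeError("rate limited (429)")


if __name__ == "__main__":
    engines = {"duckduckgo": fake_duckduckgo, "google": fake_google}
    print(asyncio.run(search_all("latest AI news", engines, max_results=2)))
```

Collecting errors in a dictionary rather than raising keeps one rate-limited engine from discarding the results of the others.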
Quickstart
LangChain
from agentfetch.integrations.langchain.tools import AgentFetchTools
tools = AgentFetchTools  # Use with any LangChain agent
MCP (Claude Desktop, Cursor, etc.)
pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch-mcp
# configure in Claude Desktop or any MCP host
REST API
pip install git+https://github.com/SID1ART/agentfetch.git
agentfetch serve
# Scrape
curl -X POST http://localhost:8080/agent_scrape \
  -d '{"url": "https://example.com"}'

# Research (async: returns a job ID, poll for the result)
curl -X POST http://localhost:8080/agent_research \
  -d '{"prompt": "Latest AI developments", "max_sources": 10}'
# → {"request_id":"abc123","status":"pending",...}

# Poll research result
curl http://localhost:8080/agent_research/abc123
# → {"request_id":"abc123","status":"complete","answer":"# Report...","sources":[...]}

# Research with streaming (SSE)
curl -X POST http://localhost:8080/agent_research/stream \
  -d '{"prompt": "Compare OpenAI and Anthropic pricing", "citation_format": "apa"}'
# → event: progress, event: result (full report)
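The submit-then-poll flow for agent_research can also be driven from a script. The sketch below uses only the standard library and the endpoint path and `status` values shown in the curl examples. Treating any status other than "pending" as final is an assumption, as are the helper names `get_job` and `wait_for_report` and the default interval and timeout.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"  # same host and port as the curl examples


def get_job(request_id):
    """GET /agent_research/<request_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/agent_research/{request_id}") as resp:
        return json.load(resp)


def wait_for_report(request_id, fetch=get_job, interval=2.0, timeout=300.0,
                    sleep=time.sleep):
    """Poll until the job is no longer "pending", or raise TimeoutError.

    `fetch` and `sleep` are injectable so the loop can be tested offline.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch(request_id)
        if job.get("status") != "pending":
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"research job {request_id} is still pending")
        sleep(interval)
```

With the server running, `wait_for_report("abc123")["answer"]` would hold the report once the job completes.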
Python library
import asyncio
from agentfetch import smart_fetch, search_fetch
from agentfetch.core.schema import ScrapeConfig

# Fetch a single URL
result = asyncio.run(smart_fetch(
    "https://en.wikipedia.org/wiki/Obsession_(2025_film)",
    config=ScrapeConfig(
        wait_for=".main-content",
        exclude_tags=["nav", "footer"],
        citation_links=True,
    )
))
print(result.content)    # clean markdown
print(result.citations)  # [1], [2] URLs

# Search with multiple engines
sr = asyncio.run(search_fetch(
    "latest AI news",
    sources=["duckduckgo", "google", "bing"],
    max_results=5,
))
print(sr.results)       # list[FetchResult]
print(sr.errors)        # per-engine errors, e.g. {"google": "rate limited (429)"}
print(sr.sources_used)  # engines that returned results

# Research a topic (uses Ollama or Claude for query decomposition and synthesis)
from agentfetch import smart_research, ResearchConfig
report = asyncio.run(smart_research(
    "Compare pricing of OpenAI, Anthropic, and Google AI APIs",
    config=ResearchConfig(
        max_sources=15,
        citation_format="apa",
        depth="deep",
    )
))
print(report.answer)         # Comprehensive report with [Author, Year] citations
print(report.sources)        # list of ResearchSource with full content
print(report.response_time)  # e.g. 12.34s
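Because fetches are described as returning a FetchResult with confidence scores and error fields instead of raising, a caller can branch on those fields and escalate to a browser-based fetch when a result looks weak. The sketch below shows that pattern with stand-ins only. `PageResult` is a hypothetical substitute for FetchResult, the field names `confidence` and `error` are assumed, and the two fetcher callables are placeholders for whatever static and browser fetch paths are available.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    """Stand-in for FetchResult; `confidence` and `error` are assumed names."""
    content: str
    confidence: float
    error: Optional[str] = None


async def fetch_with_escalation(url, fetch_static, fetch_browser, min_confidence=0.5):
    """Use the static result if it looks trustworthy, otherwise try the browser."""
    first = await fetch_static(url)
    if first.error is None and first.confidence >= min_confidence:
        return first
    second = await fetch_browser(url)
    # Keep whichever attempt scored higher, so a failed retry never makes things worse.
    return second if second.confidence > first.confidence else first


# Hypothetical fetchers used only to exercise the sketch.
async def spa_shell(url):
    return PageResult(content="Loading...", confidence=0.1)


async def rendered(url):
    return PageResult(content="# Real article", confidence=0.9)


if __name__ == "__main__":
    page = asyncio.run(fetch_with_escalation("https://example.com", spa_shell, rendered))
    print(page.content)
```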